{"title": "A Screening Rule for l1-Regularized Ising Model Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 720, "page_last": 731, "abstract": "We discover a screening rule for l1-regularized Ising model estimation. The simple closed-form screening rule is a necessary and sufficient condition for exactly recovering the blockwise structure of a solution under any given regularization parameters. With enough sparsity, the screening rule can be combined with various optimization procedures to deliver solutions efficiently in practice. The screening rule is especially suitable for large-scale exploratory data analysis, where the number of variables in the dataset can be thousands while we are only interested in the relationship among a handful of variables within moderate-size clusters for interpretability. Experimental results on various datasets demonstrate the efficiency and insights gained from the introduction of the screening rule.", "full_text": "A Screening Rule for ℓ1-Regularized Ising Model Estimation

Zhaobin Kuang1, Sinong Geng2, David Page3
University of Wisconsin
zkuang@wisc.edu1, sgeng2@wisc.edu2, page@biostat.wisc.edu3

Abstract

We discover a screening rule for ℓ1-regularized Ising model estimation. The simple closed-form screening rule is a necessary and sufficient condition for exactly recovering the blockwise structure of a solution under any given regularization parameters. With enough sparsity, the screening rule can be combined with various optimization procedures to deliver solutions efficiently in practice. The screening rule is especially suitable for large-scale exploratory data analysis, where the number of variables in the dataset can be thousands while we are only interested in the relationship among a handful of variables within moderate-size clusters for interpretability.
Experimental results on various datasets demonstrate the efficiency and insights gained from the introduction of the screening rule.

1 Introduction

While the field of statistical learning with sparsity [Hastie et al., 2015] has been steadily rising to prominence ever since the introduction of the lasso (least absolute shrinkage and selection operator) at the end of the last century [Tibshirani, 1996], it was not until the recent decade that various screening rules debuted to further equip the ever-evolving optimization arsenals for some of the most fundamental problems in sparse learning, such as ℓ1-regularized generalized linear models (GLMs, Friedman et al. 2010) and inverse covariance matrix estimation [Friedman et al., 2008]. Screening rules, usually in the form of an analytic formula or an optimization procedure that is extremely fast to solve, can accelerate learning drastically by leveraging the inherent sparsity of many high-dimensional problems. Generally speaking, screening rules can identify a significant portion of the zero components of an optimal solution beforehand at the cost of minimal computational overhead, and hence substantially reduce the dimension of the parameterization, which makes possible efficient computation for large-scale sparse learning problems.

Pioneered by Ghaoui et al. 2010, various screening rules have emerged to speed up learning for generative models (e.g., Gaussian graphical models) as well as for discriminative models (e.g., GLMs), and for continuous variables (e.g., the lasso) as well as for discrete variables (e.g., logistic regression, support vector machines).
Table 1 summarizes some of the iconic work in the literature, where, to the best of our knowledge, screening rules for generative models with discrete variables are still notably absent.

Contrasted with this notable absence is the ever stronger craving in the big data era for scaling up the learning of generative models with discrete variables, especially in a blockwise structure identification setting. For example, in gene mutation analysis [Wan et al., 2015, 2016], among tens of thousands of sparse binary variables representing mutations of genes, we are interested in identifying a handful of mutated genes that are connected into various blocks and exert synergistic effects on the cancer. While a sparse Ising model is a desirable choice, for such an application the scalability of the model could fail due to the innate NP-hardness [Karger and Srebro, 2001] of inference, and hence of maximum likelihood learning, owing to the partition function. To date, even with modern approximation techniques, a typical application with sparse discrete graphical models usually involves only hundreds of variables [Viallon et al., 2014, Barber et al., 2015, Vuffray et al., 2016].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Table 1: Screening rules in the literature at a glance

                      Discriminative Models                         Generative Models
Continuous Variables  Ghaoui et al. 2010, Tibshirani et al. 2012,   Banerjee et al. 2008, Honorio and Samaras 2010,
                      Liu et al. 2013, Wang et al. 2013,            Witten et al. 2011, Mazumder and Hastie 2012,
                      Fercoq et al. 2015, Xiang et al. 2016,        Danaher et al. 2014, Luo et al. 2014,
                      Lee et al. 2017                               Yang et al. 2015
Discrete Variables    Ghaoui et al. 2010, Tibshirani et al. 2012,   ?
                      Wang et al. 2014, Ndiaye et al. 2015

Between the need for the scalability of high-dimensional Ising models and the absence of screening rules that are deemed crucial to accelerated and scalable learning, we have a technical gap to bridge: can we identify screening rules that can speed up the learning of ℓ1-regularized Ising models? The major contribution of this paper is to give an affirmative answer to this question. Specifically, we show the following.

• The screening rule is a simple closed-form formula that is a necessary and sufficient condition for exact blockwise structure recovery of the solution with a given regularization parameter. Upon the identification of blockwise structures, different blocks of variables can be considered as different Ising models and can be solved separately. The various blocks can even be solved in parallel to attain further efficiency. Empirical results on both simulated and real-world datasets demonstrate the tremendous efficiency, scalability, and insights gained from the introduction of the screening rule.
Efficient learning of ℓ1-regularized Ising models from thousands of variables on a single machine is hence readily attainable.

• As an initial attempt to fill in the vacancy illustrated in Table 1, our work is instructive to further exploration of screening rules for other graphical models with discrete random variables, and to combining screening rules with various optimization methods to facilitate better learning. Furthermore, compared with its Gaussian counterpart, where screening rules are available (Table 1) and learning is scalable [Hsieh et al., 2013], the proposed screening rule is especially valuable and desperately needed to address the more challenging learning problem of sparse Ising models.

We defer all the proofs in the paper to the supplement and focus on providing intuition and interpretation of the technical results in the paper.

2 Notation and Background

2.1 Ising Models

Let X = [X1, X2, ..., Xp]^⊤ be a p × 1 binary random vector, with Xi ∈ {−1, 1} and i ∈ {1, 2, ..., p} ≜ V. Let there be a dataset X with n independent and identically distributed samples of X, denoted as X = {x^(1), x^(2), ..., x^(n)}. Here, x^(k) is a p × 1 vector of assignments that realizes X, where k ∈ {1, 2, ..., n}. We further use x_i^(k) to denote the ith component of the kth sample in the dataset. Let θ ∈ Θ be a p × p symmetric matrix whose diagonal entries are zeros. An Ising model [Wan et al., 2016] with the parameterization θ is:

    P_θ(x) = (1 / Z(θ)) exp( Σ_{i=1}^{p−1} Σ_{j>i} θ_ij x_i x_j ),   (1)

where θ_ij represents the component of θ at the ith row and the jth column, and x_i and x_j represent the ith and the jth components of x, respectively.
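As a concrete illustration of (1) (our sketch, not code from the paper; the function names are ours), the distribution can be evaluated by brute force for small p by enumerating all 2^p spin configurations to obtain the partition function, which also makes plain why exact likelihood computation is intractable for large p:

```python
import itertools
import numpy as np

def log_partition(theta):
    """Brute-force log Z(theta): sum over all 2^p configurations in {-1,1}^p.
    Exponential in p, which is why exact maximum likelihood for Ising models is hard."""
    U = np.triu(theta, k=1)  # keep each pair (i < j) exactly once
    vals = [np.exp(np.asarray(x) @ U @ np.asarray(x))
            for x in itertools.product([-1, 1], repeat=theta.shape[0])]
    return np.log(np.sum(vals))

def ising_prob(theta, x):
    """P_theta(x) as in equation (1), with theta a symmetric zero-diagonal matrix."""
    U = np.triu(theta, k=1)
    return np.exp(np.asarray(x) @ U @ np.asarray(x) - log_partition(theta))
```

Since x^⊤ triu(θ) x = Σ_{i<j} θ_ij x_i x_j, the exponent matches (1) term by term.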
Z(θ) is a normalization constant, the partition function, which ensures that the probabilities sum to one. The partition function is given as Z(θ) = Σ_{x∈{−1,1}^p} exp( Σ_{i=1}^{p−1} Σ_{j>i} θ_ij x_i x_j ). Note that for ease of presentation, we consider Ising models with only pairwise interactions/potentials here. Generalization to Ising models with unary potentials is given in Section 6.

2.2 Graphical Interpretation

With the notion of the probability given by an Ising model in (1), estimating an ℓ1-regularized Ising model is defined as finding θ̂, the penalized maximum likelihood estimator (MLE) under the lasso penalty:

    θ̂ = argmax_θ (1/n) Σ_{k=1}^n log P_θ(x^(k)) − (λ/2) ‖θ‖₁
      = argmin_θ −(1/n) Σ_{k=1}^n Σ_{i=1}^{p−1} Σ_{j>i} θ_ij x_i^(k) x_j^(k) + A(θ) + (λ/2) ‖θ‖₁.   (2)

Here, A(θ) = log Z(θ) is the log-partition function; ‖θ‖₁ = Σ_{i=1}^p Σ_{j=1}^p |θ_ij| is the lasso penalty that encourages a sparse parameterization. λ ≥ 0 is a given regularization parameter. Using λ/2 is suggestive of the symmetry of θ, so that (λ/2)‖θ‖₁ = λ Σ_{i=1}^{p−1} Σ_{j>i} |θ_ij|, which echoes the summations in the negative log-likelihood function. Note that θ corresponds to the adjacency matrix constructed with the p components of X as nodes, and θ_ij ≠ 0 indicates that there is an edge between Xi and Xj. We further denote a partition of V into L blocks as {C1, C2, ..., CL}, where Cl, Cl′ ⊆ V, Cl ∩ Cl′ = ∅ for l ≠ l′, and ∪_{l=1}^L Cl = V, for all l, l′ ∈ {1, 2, ..., L}. Without loss of generality, we assume that the nodes in different blocks are ordered such that if i ∈ Cl, j ∈ Cl′, and l < l′, then i < j.

2.3 Blockwise Solutions

We introduce the definition of a blockwise parameterization:

Definition 1. We call θ blockwise with respect to the partition {C1, C2, ..., CL} if ∀l and l′ ∈ {1, 2, ..., L}, where l ≠ l′, and ∀i ∈ Cl, ∀j ∈ Cl′, we have θ_ij = 0.

When θ is blockwise, we can represent θ in a block diagonal fashion:

    θ = diag(θ1, θ2, ..., θL),   (3)

where θ1, θ2, ..., and θL are symmetric matrices that correspond to C1, C2, ..., and CL, respectively. Note that if we can identify the blockwise structure of θ̂ in advance, we can solve each block independently (see A.1). Since the size of each block could be much smaller than the size of the original problem, each block could be much easier to learn compared with the original problem. Therefore, efficient identification of blockwise structure could lead to substantial speedup in learning.

3 The Screening Rule

3.1 Main Results

The preparation in Section 2 leads to the discovery of the following strikingly simple screening rule, presented in Theorem 1.

Theorem 1. Let a partition of V, {C1, C2, ..., CL}, be given. Let the dataset X = {x^(1), x^(2), ..., x^(n)} be given. Define E_X X_i X_j = (1/n) Σ_{k=1}^n x_i^(k) x_j^(k). A necessary and sufficient condition for θ̂ to be blockwise with respect to the given partition is that

    |E_X X_i X_j| ≤ λ,   (4)

for all l and l′ ∈ {1, 2, ..., L}, where l ≠ l′, and for all i ∈ Cl, j ∈ Cl′.

In terms of exact blockwise structure identification, Theorem 1 provides a foolproof (necessary and sufficient) and yet easily checkable result, obtained by comparing the absolute second empirical moments |E_X X_i X_j| with the regularization parameter λ. We also notice the remarkable similarity between the proposed screening rule and the screening rule for Gaussian graphical model blockwise structure identification in Witten et al. 2011, Mazumder and Hastie 2012. In the Gaussian case, the screening rule can be attained by simply replacing the second empirical moment matrix in (4) with the sample covariance matrix. While the exact solution in the Gaussian case can be computed in polynomial time, estimating an Ising model via maximum likelihood in general is NP-hard.

Algorithm 1 Blockwise Minimization
1: Input: dataset X, regularization parameter λ.
2: Output: θ̂.
3: ∀i, j ∈ V such that j > i, compute the second empirical moments E_X X_i X_j.
4: Identify the partition {C1, C2, ..., CL} using the second empirical moments from the previous step, according to Witten et al. [2011], Mazumder and Hastie [2012].
5: ∀l ∈ {1, 2, ..., L}, perform blockwise optimization over Cl for θ̂l.
6: Assemble the θ̂l according to (3) for θ̂.
7: Return θ̂.
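To make the rule concrete, the following is a minimal sketch (ours, not the paper's implementation; the function name is an assumption) of the check in (4) combined with a connected-component search, which recovers the block partition from the thresholded moment matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def screen_blocks(X, lam):
    """Screening rule of Theorem 1: nodes i and j can interact in the
    l1-regularized Ising MLE only if |E_X[X_i X_j]| > lam; the blocks are
    the connected components of the surviving edges."""
    n = X.shape[0]
    M = np.abs(X.T @ X) / n          # second empirical moments |E_X[X_i X_j]|
    np.fill_diagonal(M, 0.0)
    adj = csr_matrix(M > lam)        # keep only edges that pass the threshold
    n_blocks, labels = connected_components(adj, directed=False)
    return [np.flatnonzero(labels == b) for b in range(n_blocks)]
```

Each returned index set can then be handed to a separate (possibly parallel) blockwise solver, as in Algorithm 1.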
However, as a consequence of applying the screening rule, the blockwise structure of an ℓ1-regularized Ising model can be determined as easily as the blockwise structure of a Gaussian graphical model, despite the fact that within each block, exact learning of a sparse Ising model could still be challenging.

Furthermore, the screening rule also provides us a principled approach to leveraging sparsity for efficiency gains: by increasing λ, the nodes of the Ising model are shattered into smaller and smaller blocks, according to the screening rule. Solving many Ising models over small blocks of variables is amenable to both efficient estimation algorithms and parallelism.

3.2 Regularization Parameters

The screening rule also has a significant implication for the range of regularization parameters in which θ̂ ≠ 0. Specifically, we have the following theorem.

Theorem 2. Let the dataset X = {x^(1), x^(2), ..., x^(n)} be given, and let λ = λmax represent the smallest regularization parameter such that θ̂ = 0 in (2). Then λmax = max_{i,j∈V, i≠j} |E_X X_i X_j| ≤ 1.

With λmax, one can decide the range of regularization parameters, [0, λmax], that generates graphs with nonempty edge sets, which is an important first step for pathwise optimization algorithms (a.k.a. homotopy algorithms) that learn the solutions to the problem under a range of λ's. Furthermore, the fact that λmax ≤ 1 for any given dataset X suggests that comparison across different networks generated by different datasets is comprehensible. Finally, in Section 4, λmax will also help to
Finally, in Section 4, \u03bbmax will also help to\nestablish the connection between the screening rule for exact learning and some of the popular inexact\n(alternative) learning algorithms in the literature.\n\n3.3 Fully Disconnected Nodes\n\nAnother consequence of the screening rule is the necessary and suf\ufb01cient condition that determines\nthe regularization parameter with which a node is fully disconnected from the remaining nodes:\n\nCorollary 1. Let the dataset X =(cid:8)x(1), x(2),\u00b7\u00b7\u00b7 , x(n)(cid:9) be given. Xi is fully disconnected from\n\nthe remaining nodes in \u02c6\u03b8, where i \u2208 V (i.e., \u02c6\u03b8ij = \u02c6\u03b8ji = 0, \u2200j \u2208 V \\ {i}), if and only if\n\u03bb \u2265 maxj\u2208V \\{i}|EXXiXj|.\nIn high-dimensional exploratory data analysis, it is usually the case that most of the variables are\nfully disconnected [Danaher et al., 2014, Wan et al., 2016]. In this scenario, Corollary 1 provides a\nregularization parameter threshold with which we can identify exactly the subset of fully disconnected\nnodes. Since we can choose a threshold large enough to make any nodes fully disconnected, we can\ndiscard a signi\ufb01cant portion of the variables ef\ufb01ciently and \ufb02exibly at will with exact optimization\nguarantees due to Corollary 1. By discarding the large portion of fully disconnected variables, the\nlearning algorithm can focus on only a moderate number of connected variables, which potentially\nresults in a substantial ef\ufb01ciency gain.\n\n3.4 Blockwise Minimization\n\nWe conclude this section by providing the blockwise minimization algorithm in Algorithm 1 due\nto the screening rule. 
Note that both the second empirical moments and the partition of V in the algorithm can be computed in O(p²) operations [Witten et al., 2011, Mazumder and Hastie, 2012]. On the contrary, the complexity of the exact optimization of a block of variables grows exponentially with respect to the maximal clique size of that block. Therefore, by encouraging enough sparsity, the blockwise minimization due to the screening rule can provide remarkable speedup by not only shrinking the size of the blocks in general but also potentially reducing the size of cliques within each block via eliminating enough edges.

4 Applications to Inexact (Alternative) Methods

We now discuss the interplay between the screening rule and two popular inexact (alternative) estimation methods: node-wise (NW) logistic regression [Wainwright et al., 2006, Ravikumar et al., 2010] and the pseudolikelihood (PL) method [Höfling and Tibshirani, 2009]. In what follows, we use θ̂NW and θ̂PL to denote the solutions given by the node-wise logistic regression method and the pseudolikelihood method, respectively. NW can be considered as an asymmetric pseudolikelihood method (i.e., ∃i, j ∈ V such that i ≠ j and θ̂NW_ij ≠ θ̂NW_ji), while PL is a pseudolikelihood method that is similar to NW but imposes additional symmetry constraints on the parameterization (i.e., ∀i, j ∈ V where i ≠ j, we have θ̂PL_ij = θ̂PL_ji).

Our incorporation of the screening rule into the inexact methods is straightforward: after using the screening rule to identify different blocks in the solution, we use inexact methods to solve each block for the solution. As shown in Section 3, when combined with exact optimization, the screening rule is foolproof for blockwise structure identification.
However, in general, when combined with inexact methods, the proposed screening rule is no longer foolproof, because the screening rule is derived from the exact problem in (2) instead of approximate problems such as NW and PL. We provide a toy example in A.6 to illustrate mistakes made by the screening rule when combined with inexact methods. Nonetheless, as we will show in this section, NW and PL are deeply connected to the screening rule, and when given a large enough regularization parameter, the application of the screening rule to NW and PL can be lossless in practice (see Section 5). Therefore, when applied to NW and PL, the proposed screening rule can be considered a strong rule (i.e., a rule that is not foolproof but rarely makes mistakes), and an optimal solution can be safeguarded by adjusting the screened solution to optimality based on the KKT conditions of the inexact problem [Tibshirani et al., 2012].

4.1 Node-wise (NW) Logistic Regression and the Pseudolikelihood (PL) Method

In NW, for each i ∈ V, we consider the conditional probability of Xi given X\i, where X\i = {Xt | t ∈ V \ {i}}.
This is equivalent to solving p separate ℓ1-regularized logistic regression problems, i.e., ∀i ∈ V:

    θ̂NW_\i = argmin_{θ_\i} (1/n) Σ_{k=1}^n [ −y_i^(k) η_\i^(k) + log(1 + exp(η_\i^(k))) ] + λ ‖θ_\i‖₁,   (5)

where η_\i^(k) = θ_\i^⊤ (2 x_\i^(k)), y_i^(k) = 1 represents a successful event x_i^(k) = 1, y_i^(k) = 0 represents an unsuccessful event x_i^(k) = −1, and

    θ_\i = [θ_i1, θ_i2, ..., θ_i(i−1), θ_i(i+1), ..., θ_ip]^⊤,
    x_\i^(k) = [x_1^(k), x_2^(k), ..., x_{i−1}^(k), x_{i+1}^(k), ..., x_p^(k)]^⊤.

Note that θ̂NW constructed from the θ̂NW_\i is asymmetric, and ad hoc post-processing techniques are used to generate a symmetric estimation, such as setting each pair of elements of θ̂NW in symmetric positions to the one with the larger (or smaller) absolute value.

On the other hand, PL can be considered as solving all p ℓ1-regularized logistic regression problems in (5) jointly, with symmetry constraints over the parameterization [Geng et al., 2017]:

    θ̂PL = argmin_{θ∈Θ} (1/n) Σ_{k=1}^n Σ_{i=1}^p [ −y_i^(k) ξ_i^(k) + log(1 + exp(ξ_i^(k))) ] + (λ/2) ‖θ‖₁,   (6)

where ξ_i^(k) = Σ_{j∈V\{i}} 2 θ_{min{i,j},max{i,j}} x_j^(k). That is to say, if i < j, then θ_{min{i,j},max{i,j}} = θ_ij; if i > j, then θ_{min{i,j},max{i,j}} = θ_ji. Recall that Θ in (6), defined in Section 2.1, represents the space of symmetric matrices whose diagonal entries are zeros.

4.2 Regularization Parameters in NW and PL

Since the blockwise structure of a solution is given by the screening rule under a fixed regularization parameter, the ranges of regularization parameters under which NW and PL can return nonzero solutions need to be linked to the range [0, λmax] of the exact problem. Theorem 3 and Theorem 4 establish such relationships for NW and PL, respectively.

Theorem 3. Let the dataset X = {x^(1), x^(2), ..., x^(n)} be given, and let λ = λNW_max represent the smallest regularization parameter such that θ̂NW_\i = 0 in (5), ∀i ∈ V. Then λNW_max = λmax.

Theorem 4. Let the dataset X = {x^(1), x^(2), ..., x^(n)} be given, and let λ = λPL_max represent the smallest regularization parameter such that θ̂PL = 0 in (6). Then λPL_max = 2λmax.

Let λ be the regularization parameter used in the exact problem. A strategy is to set the corresponding λNW = λ when using NW and λPL = 2λ when using PL, based on the ranges of regularization parameters given in Theorem 3 and Theorem 4 for NW and PL. Since the magnitude of the regularization parameter is suggestive of the magnitude of the gradient of the unregularized objective, the proposed strategy leverages the fact that the magnitudes of the gradients of the unregularized objectives for NW and PL are roughly the same as, and roughly twice as large as, that of the unregularized exact objective, respectively.

This observation has been made in the literature on binary pairwise Markov networks [Höfling and Tibshirani, 2009, Viallon et al., 2014].
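As an informal numerical check of Theorem 3 (our sketch, not the paper's code; function names are assumptions): by the KKT conditions, the gradient of the unregularized NW objective in (5) at θ_\i = 0 reduces to −E_X[X_i X_\i] (since 1/2 − y_i^(k) = −x_i^(k)/2 and the features are 2x_\i^(k)), so the largest absolute gradient entry over all nodes equals λmax.

```python
import numpy as np

def nw_grad_at_zero(X, i):
    """Gradient of the unregularized node-wise logistic loss in (5) at theta_\\i = 0.
    With y = (x_i + 1)/2, eta = 0, and features 2 * x_\\i, this reduces to
    -(1/n) * sum_k x_i^(k) x_\\i^(k) = -E_X[X_i X_\\i]."""
    n, p = X.shape
    mask = np.arange(p) != i
    y = (X[:, i] + 1) / 2
    return (0.5 - y) @ (2 * X[:, mask]) / n  # sigmoid(0) = 1/2

def lambda_nw_max(X):
    """Smallest lambda making every node-wise solution zero (Theorem 3)."""
    return max(np.abs(nw_grad_at_zero(X, i)).max() for i in range(X.shape[1]))
```

On any dataset, `lambda_nw_max(X)` coincides with the λmax of Theorem 2 computed from the moment matrix.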
Here, by Theorem 3 and Theorem 4, we demonstrate that this relationship is exactly true if the optimal parameterization is zero. Höfling and Tibshirani 2009 even further exploit this observation in PL for exact optimization. Their procedure can be viewed as iteratively solving adjusted PL problems regularized by λPL = 2λ in order to obtain an exact solution regularized by λ. The close quantitative correspondence between the derivatives of the inexact objectives and that of the exact objective also provides insights into why combining the screening rule with inexact methods does not lose much in practice.

4.3 Preservation of Full Disconnectedness

While the screening rule is not foolproof when combined with NW and PL, it turns out that in terms of identifying fully disconnected nodes, the necessary and sufficient condition in Corollary 1 can be preserved when applying NW with caution, as shown in the following.

Theorem 5. Let the dataset X = {x^(1), x^(2), ..., x^(n)} be given. Let θ̂NW_min ∈ Θ denote a symmetric matrix derived from θ̂NW by setting each pair of elements of θ̂NW in symmetric positions to the one with the smaller absolute value. A sufficient condition for Xi to be fully disconnected from the remaining nodes in θ̂NW_min, where i ∈ V, is that λNW ≥ max_{j∈V\{i}} |E_X X_i X_j|. Furthermore, when θ̂NW_\i = 0, the sufficient condition is also necessary.

In practice, the utility of Theorem 5 is to provide us a lower bound for λ above which we can fully disconnect Xi (sufficiency). Moreover, if θ̂NW_\i = 0 also happens to be true, which is easily verifiable, we can conclude that such a lower bound is tight (necessity).

5 Experiments

Experiments are conducted on both synthetic data and real-world data.
We will focus on efficiency in Section 5.1 and discuss support recovery performance in Section 5.2. We consider three synthetic networks (Table 2) with 20, 35, and 50 blocks of 20-node, 35-node, and 50-node subnetworks, respectively. To demonstrate the estimation of networks with unbalanced-size subnetworks, we also consider a 46-block network with power-law-degree-distributed subnetworks of sizes ranging from 5 to 50. Within each network, each subnetwork is generated according to a power law degree distribution, which mimics the structure of a biological network and is believed to be more challenging to recover compared with other, less complicated structures [Chen and Sharp, 2004, Peng et al., 2009, Danaher et al., 2014]. Each edge of each network is associated with a weight first sampled from a standard normal distribution, and then increased or decreased by 0.2 to further deviate from zero. For each network, 1600 samples are generated via Gibbs sampling within each subnetwork. Experiments on exact optimization are reported in B.2.

Figure 1: Runtime of pathwise optimization on the networks in Table 2 ((a) Network 1, (b) Network 2, (c) Network 3, (d) Network 4). Runtime plotted is the median runtime over five trials. The experiments for the baseline method PL without screening cannot be fully conducted on larger networks due to high memory cost. NW: node-wise logistic regression without screening; NW+screen: node-wise logistic regression with screening; PL: pseudolikelihood without screening; PL+screen: pseudolikelihood with screening.

5.1 Pathwise Optimization

Pathwise optimization aims to compute solutions over a range of different λ's.
Formally, we denote the set of λ's used in (2) as Λ = {λ1, λ2, ..., λτ}, and without loss of generality, we assume that λ1 < λ2 < ··· < λτ.

The introduction of the screening rule provides us insightful heuristics for the determination of Λ. We start by choosing a λ1 that reflects the sparse blockwise structural assumption on the data. To achieve sparsity and avoid densely connected structures, we assume that the number of edges in the ground-truth network is O(p). This assumption coincides with networks generated according to a power law degree distribution and hence is a faithful representation of the prior knowledge stemming from many biological problems. As a heuristic, we relax and apply the screening rule in (4) on each of the p(p−1)/2 second empirical moments and choose λ1 such that the number of absolute second empirical moments that are greater than λ1 is about p log p. Given a λ1 chosen this way, one can check how many blocks θ̂(λ1) has by the screening rule. To encourage blockwise structures, we magnify λ1 via λ1 ← 1.05λ1 until the current θ̂(λ1) has more than one block. We then choose λτ such that the number of absolute second empirical moments that are greater than λτ is about p. In our experiments, we use an evenly spaced Λ with τ = 25.

To estimate the networks in Table 2, we implement both NW and PL, with and without screening, using glmnet [Friedman et al., 2010] in R as a building block for logistic regression, according to Ravikumar et al. 2010 and Geng et al. 2017. To generate a symmetric parameterization for NW, we set each pair of elements of θ̂NW in symmetric positions to the element with the larger absolute value. Given Λ, we screen only at λ1 to identify the various blocks.
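The grid-selection heuristic above can be sketched as follows (our simplification, not the paper's code; the 1.05x magnification loop, which requires a block counter, is omitted, and the function name is ours):

```python
import numpy as np

def lambda_grid(X, tau=25):
    """Heuristic grid: lambda_1 keeps roughly p*log(p) absolute second empirical
    moments above threshold, lambda_tau keeps roughly p, with tau evenly spaced
    values in between."""
    n, p = X.shape
    M = np.abs(X.T @ X) / n
    iu = np.triu_indices(p, k=1)
    m = np.sort(M[iu])[::-1]                       # p(p-1)/2 moments, descending
    k1 = min(int(np.ceil(p * np.log(p))), m.size - 1)
    k_tau = min(p, m.size - 1)
    lam1, lam_tau = m[k1], m[k_tau]                # ~k1 and ~k_tau moments exceed these
    return np.linspace(lam1, lam_tau, tau)
```

Since p log p ≥ p, the grid is nondecreasing from the densest (λ1) to the sparsest (λτ) setting.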
Each block is then solved separately in a pathwise fashion under Λ without further screening. The rationale for performing only one screening is that starting from a λ1 chosen in the aforementioned way has provided us a sparse blockwise structure that sets a significant portion of the parameterization to zeros; further screening over larger λ's hence does not necessarily offer more efficiency gain.

Table 2: Summary of the four synthetic networks used in the experiments. indx represents the index of each network. #blk represents the number of blocks each network has. #nd/blk represents the number of nodes each block has. TL#nd represents the total number of nodes each network has.

indx  #blk  #nd/blk  TL#nd
1     20    20       400
2     35    35       1225
3     50    50       2500
4     46    5-50     1265

Figure 1 summarizes the runtime of pathwise optimization on the four synthetic networks in Table 2. The experiments are conducted on a PowerEdge R720 server with two Intel(R) Xeon(R) E5-2620 CPUs and 128GB RAM. As many as 24 threads can be run in parallel. For robustness, each runtime reported is the median runtime over five trials. When the sample size is less than 1600, each trial uses a subset of samples (subsamples) that are randomly drawn from the original datasets without replacement. As illustrated in Figure 1, the efficiency gain due to the screening rule is self-evident. Both NW and PL benefit substantially from the application of the screening rule. The speedup is more apparent with the increase of sample size as well as the increase of the dimension of the data. In our experiments, we observe that even with arguably the state-of-the-art implementation [Geng et al., 2017], PL without screening still has a significantly larger memory footprint compared with that of NW. Therefore, the experiments for PL without screening are not fully conducted in Figures 1b, 1c, and 1d for networks with thousands of nodes. On the contrary, PL with the screening rule has a memory footprint comparable with that of NW. Furthermore, as shown in Figure 1, after applying the screening rule, PL also has a runtime similar to that of NW. This phenomenon demonstrates the utility of the screening rule for effectively reducing the memory footprint of PL, making PL readily available for large-scale problems.

Figure 2: Model selection performance ((a) edge recovery AUC, (b) model selection runtime). Mix: provide PL+screen with the regularization parameter chosen by the model selection of NW+screen. Other legend labels are the same as in Figure 1.

5.2 Model Selection

Our next experiment performs model selection by choosing an appropriate λ from the regularization parameter set Λ. We leverage the Stability Approach to Regularization Selection (StARS, Liu et al. 2010) for this task. In a nutshell, StARS learns a set of various models, denoted as M, over Λ using many subsamples that are drawn randomly from the original dataset without replacement. It then picks a λ* ∈ Λ that strikes the best balance between network sparsity and edge selection stability among the models in M. After the determination of λ*, it is used on the entire original dataset to learn a model, which we compare with the ground truth model to calculate its support recovery Area Under Curve (AUC).
Implementation details of model selection are provided in Supplement B.1.
In Figure 2, we summarize the experimental results of model selection, where 24 subsamples are used for pathwise optimization in parallel to construct M. In Figure 2a, NW with and without screening achieve the same high AUC values over all four networks, while the application of the screening rule to NW provides roughly a 2x speedup, according to Figure 2b. The two variants of NW share the same AUC value because the model selection procedure chooses the same λ* for both and, even more importantly, because under that λ* the screening rule perfectly identifies the blockwise structure of the parameterization.
Due to high memory cost, the model selection for PL without screening (green bars in Figure 2) is omitted for some networks. To control the memory footprint, the model selection for PL with screening (golden bars in Figure 2) also needs to be carried out carefully, by avoiding small λ's in Λ that correspond to dense structures in M during estimation from subsamples. While avoiding dense structures makes PL with screening the fastest of all methods (Figure 2b), it comes at the cost of the least accurate (though still reasonably effective) support recovery performance (Figure 2a). To improve the accuracy of this approach, we also leverage the connection between NW and PL by substituting 2λ*_NW for the regularization parameter resulting from the model selection of PL, where λ*_NW is the regularization parameter selected for NW. This strategy yields better support recovery performance (purple bars in Figure 2a).

5.3 Real-World Data

Our real-world data experiment applies NW with and without screening to a gene mutation dataset collected from 178 lung squamous cell carcinoma samples [Weinstein et al., 2013].
Each sample contains 13,665 binary variables representing the mutation statuses of various genes. For ease of interpretation, we keep genes whose mutation rates are at least 10% across all samples, yielding a subset of 145 genes in total. We use the model selection procedure introduced in Section 5.2 to determine a λ*_NW with which we learn the gene mutation network whose connected components are shown in Figure 3. For model selection, other than the configuration in Supplement B.1, we choose τ = 25; 384 trials are run in parallel using all 24 threads. We also choose λ1 such that about 2p log(p) absolute second empirical moments are greater than λ1, and λτ such that about 0.25p absolute second empirical moments are greater than λτ.

[Figure 3: Connected components learned from lung squamous cell carcinoma mutation data. The displayed components connect the genes ALPK2, UNC13C, KIAA1109, STAB2, FN1, PLXNA4, USP34, CDH9, DYNC1H1, ASTN2, FBN2, ADAMTS20, MYH4, BAI3, VCAN, SYNE2, WDR17, PTPRT, COL12A1, PDE4DIP, ELTD1, HRNR, TMEM132D, ZNF804A, NRXN1, VPS13B, RIMS2, FAT1, COL6A6, SCN1A, ROS1, TPR, MAGEC1, ZNF676, ANKRD30A, UNC5D, THSD7B, CNTNAP2, MYH1, and C20orf26. Genes in red are (lung) cancer and other disease related genes [Uhlén et al., 2015]. Mutation data are extracted via the TCGA2STAT package [Wan et al., 2015] in R, and the figure is rendered by Cytoscape.]

In our experiment, NW with and without screening select the same λ*_NW and generate the same network. Since the dataset in question has a lower dimension and a smaller sample size compared with the synthetic data, NW without screening is adequately efficient. Nonetheless, with screening NW is still roughly 20% faster.
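The path endpoints above are chosen from counts of absolute second empirical moments, which are also the statistics the screening rule thresholds. The sketch below is a hypothetical illustration with made-up helper names, not the authors' code: it picks such a threshold and reads off the blockwise structure as the connected components of the screened graph.

```python
import numpy as np

def screen_blocks(data, lam):
    """Screening sketch: threshold the absolute second empirical moments at
    lam and return the connected components (blocks) of the resulting graph."""
    n, p = data.shape
    m = np.abs(data.T @ data / n)        # |second empirical moments|
    np.fill_diagonal(m, 0.0)
    adj = m > lam                        # edges surviving the screen
    blocks, seen = [], set()
    for s in range(p):                   # connected components via BFS
        if s in seen:
            continue
        comp, frontier = {s}, [s]
        while frontier:
            u = frontier.pop()
            for v in np.flatnonzero(adj[u]):
                if v not in comp:
                    comp.add(int(v))
                    frontier.append(int(v))
        seen |= comp
        blocks.append(sorted(comp))
    return blocks

def lambda_from_count(data, k):
    """Pick lambda so that about k absolute second empirical moments
    (counting each off-diagonal pair once) exceed it, mirroring the
    path-endpoint heuristic in the text."""
    n, p = data.shape
    m = np.abs(data.T @ data / n)
    vals = np.sort(m[np.triu_indices(p, k=1)])[::-1]
    return vals[min(k, len(vals) - 1)]
```

With, say, `lam1 = lambda_from_count(X, int(2 * p * np.log(p)))`, each block returned by `screen_blocks(X, lam1)` can then be estimated independently in a pathwise fashion.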
This phenomenon once again indicates that in practice the screening rule can perfectly identify the blockwise sparsity pattern in the parameterization and deliver a significant efficiency gain. The genes in red in Figure 3 represent (lung) cancer and other disease related genes, which are scattered across the seven subnetworks discovered by the algorithm. In our experiment, we also notice that all the weights on the edges are positive. This is consistent with the biological belief that associated genes tend to mutate together to cause cancer.

6 Generalization

With unary potentials, the ℓ1-regularized MLE for the Ising model is defined as:

\hat{\theta} = \arg\min_{\theta} \; -\frac{1}{n}\sum_{k=1}^{n}\left(\sum_{i=1}^{p}\theta_{ii}x_i^{(k)} + \sum_{i=1}^{p-1}\sum_{j>i}\theta_{ij}x_i^{(k)}x_j^{(k)}\right) + A(\theta) + \frac{\lambda}{2}\|\theta\|_{1,\mathrm{off}},   (7)

where \|\theta\|_{1,\mathrm{off}} = \sum_{i=1}^{p}\sum_{j \neq i}|\theta_{ij}|. Note that the unary potentials are not penalized, which is a common practice [Wainwright et al., 2006, Höfling and Tibshirani, 2009, Ravikumar et al., 2010, Viallon et al., 2014] to ensure a hierarchical parameterization. The screening rule here is to replace (4) in Theorem 3 with:

\left|\mathbb{E}_{X}X_iX_j - \mathbb{E}_{X}X_i\,\mathbb{E}_{X}X_j\right| \leq \lambda.   (8)

Exhaustive justification, interpretation, and experiments are provided in Supplement C.

7 Conclusion

We have proposed a screening rule for ℓ1-regularized Ising model estimation. The simple closed-form screening rule is a necessary and sufficient condition for exact blockwise structural identification. Experimental results suggest that the proposed screening rule can provide drastic speedups for learning when combined with various optimization algorithms.
Future directions include deriving screening rules for more general undirected graphical models [Liu et al., 2012, 2014a,b, 2016, Liu, 2014], and deriving screening rules for other inexact optimization algorithms [Liu and Page, 2013]. Further theoretical justification regarding the conditions under which the screening rule can be combined with inexact algorithms to recover block structures losslessly is also desirable.

Acknowledgment: The authors would like to gratefully acknowledge the NIH BD2K Initiative grant U54 AI117924 and the NIGMS grant 2RO1 GM097618.

References

O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485–516, 2008.

R. F. Barber, M. Drton, et al. High-dimensional Ising model selection with Bayesian information criteria. Electronic Journal of Statistics, 9(1):567–607, 2015.

H. Chen and B. M. Sharp. Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics, 5(1):147, 2004.

P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):373–397, 2014.

O. Fercoq, A. Gramfort, and J. Salmon. Mind the duality gap: Safer rules for the lasso. In Proceedings of the 32nd International Conference on Machine Learning, pages 333–342, 2015.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

S. Geng, Z. Kuang, and D. Page.
An efficient pseudo-likelihood method for sparse binary pairwise Markov network estimation. arXiv Preprint, 2017.

L. E. Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination for the lasso and sparse supervised learning problems. arXiv Preprint, 2010.

T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

H. Höfling and R. Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10(Apr):883–906, 2009.

J. Honorio and D. Samaras. Multi-task learning of Gaussian graphical models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 447–454, 2010.

C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pages 3165–3173, 2013.

D. Karger and N. Srebro. Learning Markov networks: Maximum bounded tree-width graphs. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 392–401. Society for Industrial and Applied Mathematics, 2001.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

S. Lee, N. Gornitz, E. P. Xing, D. Heckerman, and C. Lippert. Ensembles of lasso screening rules. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

H. Liu, K. Roeder, and L. Wasserman. Stability approach to regularization selection (StARS) for high dimensional graphical models. In Advances in Neural Information Processing Systems, pages 1432–1440, 2010.

J. Liu. Statistical Methods for Genome-wide Association Studies and Personalized Medicine. PhD thesis, The University of Wisconsin-Madison, 2014.

J. Liu and D. Page.
Structure learning of undirected graphical models with contrastive divergence. ICML 2013 Workshop on Structured Learning: Inferring Graphs from Structured and Unstructured Inputs, 2013.

J. Liu, P. Peissig, C. Zhang, E. Burnside, C. McCarty, and D. Page. Graphical-model based multiple testing under dependence, with applications to genome-wide association studies. In Uncertainty in Artificial Intelligence, volume 2012, page 511. NIH Public Access, 2012.

J. Liu, Z. Zhao, J. Wang, and J. Ye. Safe screening with variational inequalities and its application to lasso. arXiv Preprint arXiv:1307.7577, 2013.

J. Liu, C. Zhang, E. Burnside, and D. Page. Learning heterogeneous hidden Markov random fields. In Artificial Intelligence and Statistics, pages 576–584, 2014a.

J. Liu, C. Zhang, E. Burnside, and D. Page. Multiple testing under dependence via semiparametric graphical models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 955–963, 2014b.

J. Liu, C. Zhang, D. Page, et al. Multiple testing under dependence via graphical models. The Annals of Applied Statistics, 10(3):1699–1724, 2016.

P.-L. Loh, M. J. Wainwright, et al. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In Advances in Neural Information Processing Systems, pages 2096–2104, 2012.

P.-L. Loh, M. J. Wainwright, et al. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. The Annals of Statistics, 41(6):3022–3049, 2013.

S. Luo, R. Song, and D. Witten. Sure screening for Gaussian graphical models. arXiv Preprint arXiv:1407.7819, 2014.

R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 13(Mar):781–794, 2012.

E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon.
Gap safe screening rules for sparse multi-task and multi-class models. In Advances in Neural Information Processing Systems, pages 811–819, 2015.

J. Pena and R. Tibshirani. Lecture notes in Machine Learning 10-725/Statistics 36-725: Convex Optimization (Fall 2016), 2016.

J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.

P. Ravikumar, M. J. Wainwright, J. D. Lafferty, et al. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):245–266, 2012.

M. Uhlén, L. Fagerberg, B. M. Hallström, C. Lindskog, P. Oksvold, A. Mardinoglu, Å. Sivertsson, C. Kampf, E. Sjöstedt, A. Asplund, et al. Tissue-based map of the human proteome. Science, 347(6220):1260419, 2015.

V. Viallon, O. Banerjee, E. Jougla, G. Rey, and J. Coste. Empirical comparison study of approximate methods for structure selection in binary graphical models. Biometrical Journal, 56(2):307–331, 2014.

M. Vuffray, S. Misra, A. Lokhov, and M. Chertkov. Interaction screening: Efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, pages 2595–2603, 2016.

M. J. Wainwright, J. D. Lafferty, and P. K. Ravikumar. High-dimensional graphical model selection using ℓ1-regularized logistic regression.
In Advances in Neural Information Processing Systems, pages 1465–1472, 2006.

Y.-W. Wan, G. I. Allen, and Z. Liu. TCGA2STAT: Simple TCGA data access for integrated statistical analysis in R. Bioinformatics, page btv677, 2015.

Y.-W. Wan, G. I. Allen, Y. Baker, E. Yang, P. Ravikumar, M. Anderson, and Z. Liu. XMRF: An R package to fit Markov networks to high-throughput genetics data. BMC Systems Biology, 10(3):69, 2016.

J. Wang, J. Zhou, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pages 1070–1078, 2013.

J. Wang, J. Zhou, J. Liu, P. Wonka, and J. Ye. A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pages 1053–1061, 2014.

J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, C. G. A. R. Network, et al. The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10):1113–1120, 2013.

D. M. Witten, J. H. Friedman, and N. Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4):892–900, 2011.

Z. J. Xiang, Y. Wang, and P. J. Ramadge. Screening tests for lasso problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2568185.

S. Yang, Z. Lu, X. Shen, P. Wonka, and J. Ye. Fused multiple graphical lasso. SIAM Journal on Optimization, 25(2):916–943, 2015.