{"title": "Smooth-projected Neighborhood Pursuit for High-dimensional Nonparanormal Graph Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 162, "page_last": 170, "abstract": null, "full_text": "Smooth-projected Neighborhood Pursuit for\nHigh-dimensional Nonparanormal Graph Estimation\n\nKathryn Roeder\nDepartment of Statistics\nCarnegie Mellon University\n\nTuo Zhao\nDepartment of Computer Science\nJohns Hopkins University\n\nHan Liu\nDepartment of Operations Research and Financial Engineering\nPrinceton University\n\nAbstract\nMany statistical methods gain robustness and exibility by sacricing convenient\ncomputational structure. In this paper, we illustrate this fundamental tradeoff by\nstudying a semiparametric graphical model estimation problem. We explain how\nnew computational techniques help to solve this type of problem. In particularly,\nwe propose a smooth-projected neighborhood pursuit method for efciently estimating high dimensional nonparanormal graphs with theoretical guarantees. Besides new computational and theoretical analysis, we also provide an alternative\nview to analyze the tradeoff between computational efciency and statistical error\nunder a smoothing optimization framework. We also report experimental results\non text and stock datasets.\n\n1\n\nIntroduction\n\nGraphical models provide a powerful modeling framework for exploring the relationships among\na large number of variables [21, 7, 3]. In particular, a d-dimensional random vector X =\n(X1 , ..., Xd )T can be represented as an undirected graph G = (V, E), where V contains nodes\ncorresponding to the d variables in X, and the edge set E describes the conditional independence\nrelationship among X1 , ..., Xd . We say the distribution of X is Markov to G if Xi is independent\nof Xj given X\\{i,j} for all (i, j) E, where X\\{i,j} = (Xk k = i, j). 
While the graph G is often assumed given, here we want to estimate it from data.

Many existing methods for high-dimensional graph estimation assume that the random vector X follows a Gaussian distribution X ~ N(μ, Σ). Under this assumption, the graph estimation problem can be solved by estimating the sparsity pattern of the precision matrix Ω = Σ^{-1} [5]. There are two major approaches to learning high-dimensional Gaussian graphical models: (i) the graphical lasso [1, 22, 6] and (ii) neighborhood pursuit [14]. The graphical lasso maximizes the ℓ1-penalized Gaussian likelihood and simultaneously estimates the precision matrix Ω and the graph G. In contrast, the neighborhood pursuit method maximizes the ℓ1-penalized pseudo-likelihood and estimates only the graph structure G. Both methods are consistent in graph recovery for correctly specified models; however, the two methods have been observed to behave differently on real datasets in practical applications. Theoretically, [18] show that the neighborhood pursuit approach has better sample complexity in graph recovery than the graphical lasso. Scalable software packages such as glasso and huge have been developed to implement these algorithms [24, 6].

To relax the restrictive normality assumption, [11] propose the semiparametric nonparanormal model. More specifically, they assume there exists a set of monotone transformations f = (f_j)_{j=1}^d such that the transformed random vector f(X) = (f_1(X_1), ..., f_d(X_d))^T follows a Gaussian distribution, i.e., f(X) ~ N(0, Σ). [11] show that for the nonparanormal distribution, the graph G is also encoded by the sparsity pattern of Ω = Σ^{-1}. To estimate Ω, [10] propose a rank-based estimator named the nonparanormal SKEPTIC. Their main idea is to calculate a rank-correlation matrix (based either on Spearman's rho or Kendall's tau) and plug the estimated correlation matrix into the graphical lasso to estimate Ω and the graph G.
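To make the rank-based plug-in concrete, here is a minimal sketch (not the authors' code) of a Kendall's-tau-based correlation estimate. The sin(π·τ/2) transform is the standard SKEPTIC correction from [10]; the explicit O(n²) pair loop is for clarity only, and the function names are ours.

```python
import numpy as np
from itertools import combinations

def kendall_tau_matrix(X):
    """Pairwise Kendall's tau of the columns of X (n samples x d variables)."""
    n, d = X.shape
    T = np.eye(d)
    for j, k in combinations(range(d), 2):
        s = 0.0
        for i, i2 in combinations(range(n), 2):
            s += np.sign(X[i, j] - X[i2, j]) * np.sign(X[i, k] - X[i2, k])
        T[j, k] = T[k, j] = 2.0 * s / (n * (n - 1))
    return T

def skeptic_correlation(X):
    """Plug-in correlation estimate S_jk = sin(pi/2 * tau_jk); possibly indefinite."""
    S = np.sin(0.5 * np.pi * kendall_tau_matrix(X))
    np.fill_diagonal(S, 1.0)
    return S
```

Because it depends only on ranks of pairwise comparisons, the estimate is invariant under any strictly monotone marginal transformation, which is exactly the property the nonparanormal model exploits.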
Such a procedure has been proven to achieve the same parametric rates of convergence as the graphical lasso [10]. Extending the nonparanormal SKEPTIC estimator to the neighborhood pursuit approach, however, is still an open problem. The main challenge is that the rank-based correlation matrix estimate is not guaranteed to be positive semidefinite, which leads to a non-convex formulation with more challenging computational and theoretical analysis.

In this paper, we propose a novel smooth-projected neighborhood pursuit method to handle this challenge. More specifically, we project the possibly indefinite nonparanormal SKEPTIC estimator onto the cone of positive semidefinite matrices with respect to a smoothed elementwise ℓ∞-norm. Our proposed projection is closely related to the dual smoothing approach in [17]. We provide both computational and theoretical analysis of the resulting procedure. Computationally, unlike the ordinary elementwise ℓ∞-norm, the proposed smoothed elementwise ℓ∞-norm has nice computational properties, so we can develop an efficient fast proximal gradient solver with a provable convergence rate O(1/√ε), where ε is the desired accuracy of the objective value [16]. Theoretically, we provide sufficient conditions that guarantee the consistency of the neighborhood-pursuit-based nonparanormal graph estimation. Besides these new computational and theoretical analyses, we also provide an alternative view of the tradeoff between computational efficiency and statistical error under the dual smoothing optimization framework. The existing literature [17, 4] treats the dual smoothing approach as a tradeoff between computational efficiency and approximation error; consequently, the smoothness must be controlled to avoid a large approximation error, which results in a slower rate (O(1/ε) vs. O(1/√ε)). In contrast, we analyze this tradeoff by directly considering the statistical error.
We show that the final estimator obtained by our smoothing approach simultaneously preserves good statistical properties and enjoys computational efficiency.

The rest of this paper is organized as follows: Section 2 reviews the nonparanormal SKEPTIC of [10]; Section 3 introduces the smooth-projected neighborhood pursuit and derives the fast proximal gradient algorithm; Section 4 explores the statistical properties of the procedure; Sections 5 and 6 present results on both simulated and real datasets.

2 Background

We start with some notation. Let A = [A_jk] ∈ R^{d×d} and B = [B_jk] ∈ R^{d×d} be two symmetric matrices, and v = (v_1, ..., v_d)^T ∈ R^d. The notation λ_min(A) and λ_max(A) denotes the smallest and largest eigenvalues of A, and <A, B> = tr(A^T B) denotes the inner product of A and B. We define the vector norms

||v||_1 = Σ_{j=1}^d |v_j|,  ||v||_2^2 = Σ_{j=1}^d v_j^2,  ||v||_∞ = max_{1≤j≤d} |v_j|,

the matrix operator norms

||A||_1 = max_{1≤k≤d} Σ_{j=1}^d |A_jk|,  ||A||_∞ = max_{1≤j≤d} Σ_{k=1}^d |A_jk|,  ||A||_2 = max{|λ_max(A)|, |λ_min(A)|},

and the elementwise norms

|||A|||_1 = Σ_{1≤j,k≤d} |A_jk|,  |||A|||_∞ = max_{1≤j,k≤d} |A_jk|,  ||A||_F^2 = Σ_{1≤j,k≤d} |A_jk|^2.

We denote by v_{\j} = (v_1, ..., v_{j-1}, v_{j+1}, ..., v_d)^T ∈ R^{d-1} the subvector of v with the j-th entry removed. Similarly, A_{\i,\j} denotes the submatrix of A with the i-th row and j-th column removed, and A_{i,\j} denotes the i-th row of A with its j-th entry removed. If I is a set of indices, then A_{II} denotes the submatrix of A obtained by extracting the entries with both row and column indices in I.

2.1 The Nonparanormal SKEPTIC

The nonparanormal distribution extends the Gaussian distribution (parametric normal) by separately modeling the marginal distributions and the conditional independence structure.

Definition 2.1 (Nonparanormal). Let f = {f_1, ..., f_d} be a collection of non-decreasing univariate functions and Σ ∈ R^{d×d} be a correlation matrix with diag(Σ) = 1.
We say a d-dimensional random variable X = (X_1, ..., X_d)^T follows a nonparanormal distribution, denoted X ~ NPN_d(f, Σ), if

f(X) = (f_1(X_1), ..., f_d(X_d))^T ~ N(0, Σ).  (2.1)

For continuous distributions, [11] prove that the nonparanormal family is equivalent to the Gaussian copula family [9, 19]. As with Gaussian graphical models, the conditional independence graph of a nonparanormal graphical model is encoded by the sparsity pattern of Ω = Σ^{-1}.

[10] propose a rank-based method, the nonparanormal SKEPTIC, for estimating the correlation matrix. It uses Spearman's rho and Kendall's tau statistics to estimate the unknown correlation matrix directly, avoiding explicit calculation of the marginal transformation functions {f_j}_{j=1}^d, and it achieves the optimal parametric rates of convergence. More specifically, let x^i = (x^i_1, ..., x^i_d)^T, where x^1, ..., x^n are n independent observations of X, let r^i_j denote the rank of x^i_j among x^1_j, ..., x^n_j, and let r̄_j = (n+1)/2 be the average rank. Spearman's rho and Kendall's tau are then defined as (2.2) and (2.3):

Spearman's rho:  ρ̂_jk = Σ_{i=1}^n (r^i_j − r̄_j)(r^i_k − r̄_k) / sqrt( Σ_{i=1}^n (r^i_j − r̄_j)^2 · Σ_{i=1}^n (r^i_k − r̄_k)^2 ),  (2.2)

Kendall's tau:  τ̂_jk = (2 / (n(n−1))) Σ_{i<i'} sign(x^i_j − x^{i'}_j) sign(x^i_k − x^{i'}_k).  (2.3)

The SKEPTIC estimator Ŝ is obtained by the transformations Ŝ_jk = 2 sin(π ρ̂_jk / 6) or Ŝ_jk = sin(π τ̂_jk / 2) for j ≠ k, with Ŝ_jj = 1; it concentrates around Σ at the parametric rate (Lemma 2.2) but is not guaranteed to be positive semidefinite.

3 Smooth-projected Neighborhood Pursuit

To restore positive semidefiniteness before running the neighborhood pursuit, we consider projecting Ŝ onto the positive semidefinite cone under the elementwise ℓ∞-norm:

S̃ = argmin_S |||Ŝ − S|||_∞  s.t.  S ⪰ 0.  (3.4)

3.1 Smoothed Elementwise ℓ∞-norm

Since the elementwise ℓ∞-norm is non-smooth, we replace it by the smoothed surrogate

|||A|||_∞^μ = max_{|||U|||_1 ≤ 1} <U, A> − (μ/2) ||U||_F^2,  (3.6)

where μ > 0 is the smoothing parameter. The first term in (3.6) is the well-known Fenchel dual representation, and the second term is the proximity function of U. We call |||A|||_∞^μ the smoothed elementwise ℓ∞-norm. The next lemma characterizes the solution to (3.6).

Lemma 3.1. Equation (3.6) has a closed-form solution U with

U_jk = sign(A_jk) · max( |A_jk|/μ − λ, 0 ),  (3.7)

where λ is the minimum non-negative constant such that |||U|||_1 ≤ 1.

The proof of Lemma 3.1 is provided in the supplementary materials. A naive algorithm for calculating λ sorts the entries of the matrix and has expected computational complexity O(d² log d).
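As a concrete illustration of Lemma 3.1 (a hypothetical sketch, not the authors' implementation): computing U amounts to a Euclidean projection of A/μ onto the elementwise ℓ1 unit ball, and the sort-based search below is one standard way to find the minimal λ in (3.7).

```python
import numpy as np

def smoothed_sup_norm_argmax(A, mu):
    """Closed-form maximizer U of <U, A> - (mu/2)||U||_F^2 over |||U|||_1 <= 1,
    i.e. the Euclidean projection of A/mu onto the elementwise l1 ball (Lemma 3.1)."""
    v = (A / mu).ravel()
    a = np.abs(v)
    if a.sum() <= 1.0:                    # already feasible: lambda = 0
        return v.reshape(A.shape)
    # sort-based search for the minimal lambda with sum(max(a - lambda, 0)) = 1
    u = np.sort(a)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, a.size + 1) > (css - 1.0))[0][-1]
    lam = (css[rho] - 1.0) / (rho + 1.0)
    U = np.sign(v) * np.maximum(a - lam, 0.0)
    return U.reshape(A.shape)
```

The full sort is what gives the O(d² log d) cost quoted above; the O(d²) refinement mentioned next avoids sorting all entries.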
See the supplementary materials for a more efficient algorithm with expected computational complexity O(d²). The smoothed elementwise ℓ∞-norm is convex and smooth, and its gradient is easy to evaluate using (3.7): taking A = Ŝ − S, we have

∂|||Ŝ − S|||_∞^μ / ∂S = [ ∂|||Ŝ − S|||_∞^μ / ∂(Ŝ − S) ] · [ ∂(Ŝ − S) / ∂S ] = −U.  (3.8)

Recall that U is a soft-thresholding function of Ŝ − S; it is therefore Lipschitz continuous in S with Lipschitz constant 1/μ. Since the smoothed elementwise ℓ∞-norm has these nice computational properties, with a controllable loss in accuracy (see more details in the next section), we focus on the alternative optimization problem

S̄ = argmin_S |||Ŝ − S|||_∞^μ  s.t.  S ⪰ 0.  (3.9)

3.2 Fast Proximal Gradient Algorithm

Many existing fast proximal gradient solvers focus on unconstrained problems [4] and are usually derived based on [17] or [2]. In (3.9), we face a minimization problem with a minimum-eigenvalue constraint. To handle the constraint, we derive the following fast proximal gradient algorithm based on [16]. The algorithm maintains two sequences of auxiliary variables M^(t) and W^(t) with M^(0) = W^(0) = S^(0), and a sequence of weights θ_t = 2/(1 + t), t = 1, 2, .... It converts the smoothed elementwise ℓ∞-norm minimization into a sequence of Frobenius-norm minimizations, which involve the projection problem

Π_+(A) = argmin_B ||B − A||_F^2  s.t.  B ⪰ 0,  (3.10)

where A ∈ R^{d×d} is a symmetric matrix; Π_+(A) is the projection of A onto the cone of positive semidefinite matrices under the Frobenius norm. Π_+(A) has a closed-form solution, as shown in the following lemma.

Lemma 3.2. Suppose A has the eigenvalue decomposition A = Σ_{j=1}^d λ_j v_j v_j^T, where the λ_j are the eigenvalues and the v_j are the corresponding eigenvectors. Let λ̄_j = max{λ_j, 0} for j = 1, ..., d. Then Π_+(A) = Σ_{j=1}^d λ̄_j v_j v_j^T.

The proof of Lemma 3.2 is shown in the supplementary materials.
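Lemma 3.2 translates directly into code. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def project_psd(A):
    """Frobenius projection of a symmetric matrix onto the PSD cone (Lemma 3.2):
    keep the eigenvectors, clip negative eigenvalues at zero."""
    w, V = np.linalg.eigh(A)                 # A = V diag(w) V^T, w ascending
    return (V * np.maximum(w, 0.0)) @ V.T
```

For example, [[1, 2], [2, 1]] has eigenvalues 3 and -1, so its projection keeps only the rank-one component 3·vv^T with v = (1, 1)/√2, giving the constant matrix of 1.5's.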
Now consider the t-th iteration of the algorithm. We first compute the auxiliary variable

M^(t) = (1 − θ_t) S^(t−1) + θ_t W^(t−1).  (3.11)

We then evaluate the gradient

G^(t) = ∂|||Ŝ − M^(t)|||_∞^μ / ∂M^(t),  with  G^(t)_jk = −sign(Ŝ_jk − M^(t)_jk) · max( |Ŝ_jk − M^(t)_jk|/μ − λ, 0 ).  (3.12)

We consider the quadratic approximation

Q(W, W^(t−1), μ) = |||Ŝ − W^(t−1)|||_∞^μ + <G^(t), W − W^(t−1)> + (θ_t / (2μ)) ||W − W^(t−1)||_F^2.  (3.13)

After simple manipulations, the fast proximal gradient step takes the form

W^(t) = argmin_{W ⪰ 0} Q(W, W^(t−1), μ) = Π_+( W^(t−1) − (μ/θ_t) G^(t) ),  (3.14)

where μ acts as a step size. We finally compute S^(t) for the t-th iteration as

S^(t) = (1 − θ_t) S^(t−1) + θ_t W^(t).  (3.15)

The following theorem establishes the convergence rate of our fast proximal gradient algorithm.

Theorem 3.3. To reach the desired accuracy ε, i.e., |||Ŝ − S^(t)|||_∞^μ − |||Ŝ − S̄|||_∞^μ < ε, the number of iterations needs to be at most t = sqrt( 2 ||S^(0) − S̄||_F^2 / (με) ) − 1 = O( sqrt(1/(με)) ).

The proof of Theorem 3.3 is provided in the supplementary materials. Theorem 3.3 guarantees that our derived algorithm achieves the optimal rate of convergence for minimizing (3.9). Unlike existing analyses, the novelty of our work comes from directly analyzing the tradeoff between computational efficiency and statistical error. Although (3.9) is not the same as the original projection problem (3.4), our analysis shows that by choosing a suitable smoothing parameter μ, S̄ concentrates around Σ at a rate similar to Lemma 2.2 in high dimensions.

4 Statistical Properties

The following theorem establishes the concentration property of S̄ under the elementwise ℓ∞-norm.

Theorem 4.1. Given the nonparanormal SKEPTIC estimator Ŝ, for any large enough n, any δ > 0, and μ ≤ 4δ, the optimum S̄ of (3.9) satisfies

P( |||S̄ − Σ|||_∞ ≤ 18πδ ) ≥ 1 − d² exp(−nδ²).  (4.1)

The proof of Theorem 4.1 is provided in the supplementary materials.
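Putting the pieces together, the iterations (3.11)-(3.15) can be sketched as below. This is an illustrative reimplementation under our reading of the updates (gradient −U, step size μ/θ_t, a PSD starting point), not the authors' released code.

```python
import numpy as np

def l1_ball_project(v):
    """Euclidean projection of a vector onto the unit l1 ball (sort-based)."""
    a = np.abs(v)
    if a.sum() <= 1.0:
        return v.copy()
    u = np.sort(a)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, a.size + 1) > (css - 1.0))[0][-1]
    lam = (css[rho] - 1.0) / (rho + 1.0)
    return np.sign(v) * np.maximum(a - lam, 0.0)

def smoothed_norm_and_grad_arg(A, mu):
    """Return (|||A|||_inf^mu, U) with U the maximizer in (3.6)."""
    U = l1_ball_project((A / mu).ravel()).reshape(A.shape)
    return (U * A).sum() - 0.5 * mu * (U ** 2).sum(), U

def project_psd(A):
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

def smooth_projected_correlation(S_hat, mu=0.5, iters=500):
    """Sketch of (3.11)-(3.15): minimize |||S_hat - S|||_inf^mu over PSD S."""
    S = W = project_psd(S_hat)                    # feasible start
    for t in range(1, iters + 1):
        theta = 2.0 / (t + 1.0)
        M = (1.0 - theta) * S + theta * W         # (3.11)
        _, U = smoothed_norm_and_grad_arg(S_hat - M, mu)
        G = -U                                    # gradient at M, cf. (3.8)/(3.12)
        W = project_psd(W - (mu / theta) * G)     # (3.14)
        S = (1.0 - theta) * S + theta * W         # (3.15)
    return S
```

Since every W^(t) is projected and S^(t) is a convex combination of PSD matrices, all iterates stay feasible, and the accelerated scheme drives the smoothed objective toward its constrained minimum.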
From a non-asymptotic perspective, Theorem 4.1 implies that we can choose a reasonably large μ to gain computational efficiency without losing statistical efficiency. We now show that, combined with our proposed projection approach, the neighborhood pursuit asymptotically recovers the true neighborhood of each node under the following irrepresentable condition.

Assumption 1 (Irrepresentable Condition). Recall that I_j and J_j denote the true neighborhood and non-neighborhood of node j, respectively. There exist α ∈ (0, 1), ψ_min > 0, and 1 ≤ ψ < ∞ such that for j = 1, ..., d, the following conditions hold:

C.1  || Σ_{J_j I_j} (Σ_{I_j I_j})^{-1} ||_∞ ≤ 1 − α;  (4.2)

C.2  λ_min(Σ_{I_j I_j}) ≥ ψ_min,  || (Σ_{I_j I_j})^{-1} ||_∞ ≤ ψ.  (4.3)

The irrepresentable condition has been extensively studied in the existing literature [23, 26, 20]. We use this condition mainly to illustrate that the smoothed elementwise ℓ∞-norm projection has a theoretical guarantee. Our proposed approach can also be combined with other graph estimation methods, such as [25], in which the irrepresentable condition can be relaxed.

Theorem 4.2 (Graph Recovery). Let θ = min{ |B*_jk| : G_jk ≠ 0 }, where B* is the population regression coefficient matrix, and assume Σ satisfies Conditions C.1 and C.2. Let s_j = |I_j| < n, and choose the smoothing parameter μ sufficiently small relative to λ and ψ. Then there exist positive universal constants c_0 and c_1 such that

P( Ĵ_j = J_j, Î_j = I_j ) ≥ 1 − s_j² exp( −c_1 n λ² / (4 s_j²) ) − s_j² exp( −c_1 n λ² / s_j² ) − (d − s_j) s_j exp( −c_1 n λ² / s_j² ) − d exp( −c_1 n μ² ),

where λ satisfies c_0 sqrt(log d / n) ≤ λ ≤ min{ 1, θ/2, (1−α)/26, α/(26(ψ+1)), 1/(14ψ²), 1/(14ψ) }.
Let s = max_{1≤j≤d} s_j. Under the same conditions as in Theorem 4.2, we have P(Ĝ = G) → 1 if the following conditions hold: (C.3) α, ψ, and ψ_min are constants that do not scale with the triple (n, d, s); (C.4) the triple (n, d, s) scales as s²(log d + log s)/n → 0 and s² log d/(θ² n) → 0; (C.5) λ scales with (n, d, s) as λ/θ → 0 and s² log d/(λ² n) → 0.

The proof of Corollary 4.3 is provided in the supplementary materials. Corollary 4.3 implies that we can asymptotically recover the underlying graph structure.

5 Numerical Simulations

From a practical perspective, [10] recommend using Kendall's tau because its performance is more robust to outliers than Spearman's rho. Therefore, in the following experiments we use Kendall's tau to demonstrate the efficacy of our smooth-projected neighborhood pursuit method. In our numerical simulations, we use four different graphs with 200 nodes (d = 200): Erdős-Rényi, cluster, chain, and scale-free. We then generate data from the normal distribution with the graph structures above, and we adopt the power function g(t) = sign(t)|t|^4 to convert the Gaussian data to nonparanormal data. For more details about the graph and covariance matrix generation, please refer to the supplementary materials. We use ROC curves to evaluate the graph recovery performance. Because d ≫ n, the full solution paths cannot be obtained, so we restrict the range of false positive rates to [0, 0.3] for computational convenience.

5.1 Our Proposed Method vs. Heuristic Approaches

We first demonstrate the efficacy of the proposed smoothed elementwise ℓ∞-norm projection. We sampled 100 observations from a 200-dimensional normal distribution N(0, I_200).
We study the empirical performance of our fast proximal gradient algorithm with different smoothing parameters (μ = 1, 0.5, 0.25, 0.1). The results presented in Figure 1 are averaged over 50 replications. Figure 1(a) shows the original objective value |||S̃ − S^(t)|||_∞ versus the number of iterations. Compared with the smaller μ's, μ = 1 makes the algorithm more efficient but less accurate with respect to minimizing (3.4), due to a larger approximation error. However, Figure 1(b) shows that the estimation error |||Σ − S^(t)|||_∞ with μ = 1 is similar to that of the smaller μ's. Thus computational efficiency is attained with almost no loss of statistical efficiency.

Figure 1: The empirical performance using different smoothing parameters: (a) objective value |||S̃ − S^(t)|||_∞ and (b) estimation error |||Σ − S^(t)|||_∞ versus the number of iterations. μ = 1 performs similarly to the smaller μ's in terms of estimation error.

We further compare the graph recovery performance of our proposed method with the indefinite nonparanormal SKEPTIC estimator and two heuristic approaches (truncation and perturbation). The average ROC curves over 100 replications are presented in Figure 2. Directly plugging the indefinite nonparanormal SKEPTIC estimator into the neighborhood pursuit yields the worst performance: the ROC curves drop drastically due to non-convexity. The truncation estimator performs better than the indefinite nonparanormal SKEPTIC estimator, but worse than the perturbation estimator and our proposed estimator. The perturbation estimator is a serious competitor and achieves performance similar to our proposed estimator on the scale-free graph; however, the perturbation approach lacks theoretical justification, and its concentration properties in high dimensions are still largely unknown. For the other three graphs, our proposed estimator slightly outperforms the perturbation estimator.
In summary, our simulations show that our proposed projection approach provides a computationally tractable solution and achieves the best graph recovery performance.

5.2 Our Proposed Method vs. Naive Neighborhood Pursuit

This subsection parallels the numerical studies in [10]: we compare our proposed method with the naive neighborhood pursuit, which directly plugs the Pearson correlation estimator into the

Figure 2: Average ROC curves of the neighborhood pursuit combined with different correlation estimators, on (a) the Erdős-Rényi, (b) cluster, (c) chain, and (d) scale-free graphs. "SKEPTIC" represents the indefinite nonparanormal SKEPTIC estimator.
"Truncation" and "Perturbation" represent the two heuristic approaches, and "Projection" represents our proposed projection approach.

neighborhood pursuit approach. The main difference is that our experiment is conducted in the d ≫ n setting. The average ROC curves over 100 replications are presented in Figure 3. As can be seen, the smooth-projected neighborhood pursuit uniformly outperforms the naive neighborhood pursuit on all four graphs.

Figure 3: Average ROC curves of the neighborhood pursuit combined with different correlation estimators, on (a) the Erdős-Rényi, (b) cluster, (c) chain, and (d) scale-free graphs. "SNP" represents our proposed estimator and "NNP" the Pearson estimator; SNP uniformly outperforms NNP on all four graphs.

6 Real Data Analysis

We further present three real data experiments to demonstrate the superiority of nonparanormal graphical models over Gaussian graphical models. Throughout this section, we use the stability graph estimator [15, 12], which proceeds as follows: (1) calculate the solution path using all samples and choose the regularization parameter at a given sparsity level; (2) randomly choose 100γ% of all samples without replacement (for a subsampling ratio γ) and re-estimate the graph with the regularization parameter chosen in (1); (3) repeat (2) 500 times and retain the edges that appear with frequency no less than 95%. To ease interpretability and visualization, we set the pair (sparsity level, subsampling ratio) to (0.04, 0.1) for the topic modeling data and (0.1, 0.5) for the stock market data. We fine-tune the sparsity level so that both estimated graphs have approximately the same number of edges.

6.1 Topic Graph

The topic graph, first used in [3] to illustrate the efficacy of correlated topic modeling, is based on a hierarchical Bayesian model for abstracting K topics that occur in a collection of documents (a corpus).
By applying a variational EM algorithm, we can estimate the topic proportions of each document, which can be viewed as a mixed-membership model; the topic proportions of each document lie in a K-dimensional simplex. [3] assume that the topic proportions approximately follow a normal distribution after a log-transformation. Here we are interested in visualizing the relationships among the topics using a topic graph: nodes represent individual topics, and neighboring nodes represent highly related topics. The whole corpus used by [3] contains 16,351 documents with 19,088 unique terms. [3] set K = 100 and fit a topic model to the articles published in Science from 1990 to 1999. However, when we perform a Kolmogorov-Smirnov test for each topic, we find that some topics strongly violate the normality assumption (please refer to the supplementary materials). This motivates our choice of the smooth-projected neighborhood pursuit approach. The estimated topic graphs are shown in Figure 4, where the clustering information can be read directly from the graphs.¹

¹ Each topic is labeled with its most frequent word. For more details about the topic summaries, please refer to http://www.cs.cmu.edu/~lemur/science/topics.html.

The smooth-projected neighborhood pursuit generates 6 mid-size modules and 6 small modules, while the naive neighborhood pursuit generates 1 large module, 2 mid-size modules, and 6 small modules. The nonparanormal approach clearly discovers more refined structures and improves the interpretability of the obtained graph. Here we provide a few examples: (1) topics closely related to climate change in Antarctica are clustered in the same module, such as "ice-68", "ozone-23", and "carbon-64"; (2) topics closely related to environmental ecology are clustered in the same module, such as "monkey-21", "science-4", "environmental-67", "species-86", etc.
(3) Topics closely related to modern physics are clustered in the same module, such as "quantum-29", "magnetic-55", "pressure-92", "solar-62", etc. In contrast, the naive neighborhood pursuit mixes all these topics together and clusters them in one large module.

(a) Our Proposed Method  (b) Naive Neighborhood Pursuit

Figure 4: Two topic graphs showing the dramatic topological difference: the smooth-projected neighborhood pursuit generates 6 mid-size modules and 6 small modules, while the naive neighborhood pursuit generates 1 large module, 2 mid-size modules, and 6 small modules.

6.2 S&P 500 Stock Market Graph

We acquired closing prices for all S&P 500 stocks for every day on which the market was open between January 1, 2003 and January 1, 2005, giving 504 samples of the 452 stocks. The dataset is transformed by calculating the log-ratio of the price at time t to the price at time t − 1, and further standardized by subtracting the mean and adjusting the variance to one. By examining the data points (see the supplementary materials), we see that a large number of potential outliers exist, which may affect the quality of the estimated graph. Since the nonparanormal SKEPTIC estimator is rank-based, it is more robust to outliers than the Pearson correlation estimator. The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors. We present the obtained graphs in Figure 5, with nodes colored according to the GICS sector of the corresponding stock. Stocks from the same GICS sector are expected to cluster with each other. We highlight several densely connected modules in the nonparanormal graph; the color coding shows that the nodes within each dense module belong to the same market sector. In contrast, these modules are very sparse in the Gaussian graph.
In particular, many of the blue nodes appear as isolated nodes, which would mean that the stocks they represent are (both marginally and conditionally) independent of all the others; this is contrary to common beliefs. Overall, the smooth-projected neighborhood pursuit tends to generate more refined structures that reveal more meaningful relationships than the naive neighborhood pursuit.

(a) Our Proposed Method  (b) Naive Neighborhood Pursuit

Figure 5: Stock networks.
A densely connected module found in the smooth-projected neighborhood pursuit graph is much sparser in the corresponding naive neighborhood pursuit graph; the coloring shows that all nodes in this module belong to the same market sector.

Moreover, with a subsampling ratio of 0.5, the sample size (n = 252) is smaller than the dimension (d = 452), and the nonparanormal SKEPTIC estimator is indefinite. With our proposed positive semidefinite projection, we can exploit the convexity of the problem and obtain a high-quality graph estimate without being trapped in a local optimum. The smoothing parameter for the projection here is μ = 0.3 · sqrt(4 log d / n).

References

[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research, 9:485-516, 2008.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183-202, 2009.
[3] D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1:17-35, 2007.
[4] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. A smoothing proximal gradient method for general structured sparse regression. Annals of Applied Statistics, 2012. To appear.
[5] A. Dempster. Covariance selection. Biometrics, 28:157-175, 1972.
[6] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1:302-332, 2007.
[7] J. Honorio, L. Ortiz, D. Samaras, N. Paragios, and R. Goldstein. Sparse and locally constant Gaussian graphical models. Advances in Neural Information Processing Systems, pages 745-753, 2009.
[8] C. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neural Information Processing Systems, pages 2330-2338, 2011.
[9] C. Klaassen and J. Wellner.
Efficient estimation in the bivariate normal copula model: normal margins are least favourable. Bernoulli, 3(1):55-77, 1997.
[10] H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. Annals of Statistics, 2012. To appear.
[11] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: semiparametric estimation of high-dimensional undirected graphs. Journal of Machine Learning Research, 10:2295-2328, 2009.
[12] H. Liu, K. Roeder, and L. Wasserman. Stability approach to regularization selection for high-dimensional graphical models. Advances in Neural Information Processing Systems, 2010.
[13] R. Mazumder and T. Hastie. The graphical lasso: new insights and alternatives. Technical report, Department of Statistics, Stanford University, 2011.
[14] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436-1462, 2006.
[15] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society, Series B, 72(4):417-473, 2010.
[16] Y. Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i Mat. Metody, 24:509-517, 1988.
[17] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127-152, 2005.
[18] P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935-980, 2011.
[19] H. Tsukahara. Semiparametric estimation in copula models. Canadian Journal of Statistics, 33:357-375, 2005.
[20] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55:2183-2201, 2009.
[21] A. Wille, P. Zimmermann, E. Vranová, A. Fürholz, O. Laule, S. Bleuler, L. Hennig, A. Prelić, P. von Rohr, L. Thiele, E.
Zitzler, W. Gruissem, and P. Bühlmann. Sparse graphical Gaussian modelling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5:R92, 2004.
[22] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.
[23] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.
[24] T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman. The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 2012. To appear.
[25] S. Zhou, S. van de Geer, and P. Bühlmann. Adaptive lasso for high dimensional regression and Gaussian graphical modeling. Technical report, ETH Zürich, 2009.
[26] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418-1429, 2006.