{"title": "Fast structure learning with modular regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 15593, "page_last": 15603, "abstract": "Estimating graphical model structure from high-dimensional and undersampled data is a fundamental problem in many scientific fields.\nExisting approaches, such as GLASSO, latent variable GLASSO, and latent tree models, suffer from high computational complexity and may impose unrealistic sparsity priors in some cases.\nWe introduce a novel method that leverages a newly discovered connection between information-theoretic measures and structured latent factor models to derive an optimization objective which encourages modular structures where each observed variable has a single latent parent.\nThe proposed method has linear stepwise computational complexity w.r.t. the number of observed variables.\nOur experiments on synthetic data demonstrate that our approach is the only method that recovers modular structure better as the dimensionality increases. We also use our approach for estimating covariance structure for a number of real-world datasets and show that it consistently outperforms state-of-the-art estimators at a fraction of the computational cost. 
Finally, we apply the proposed method to high-resolution fMRI data (with more than 10^5 voxels) and show that it is capable of extracting meaningful patterns.", "full_text": "Fast structure learning with modular regularization\n\nGreg Ver Steeg\n\nInformation Sciences Institute\n\nUniversity of Southern California\n\nMarina del Rey, CA 90292\n\ngregv@isi.edu\n\nDaniel Moyer\n\nInformation Sciences Institute\n\nUniversity of Southern California\n\nMarina del Rey, CA 90292\n\nmoyerd@usc.edu\n\nHrayr Harutyunyan\n\nInformation Sciences Institute\n\nUniversity of Southern California\n\nMarina del Rey, CA 90292\n\nhrayrh@isi.edu\n\nAram Galstyan\n\nInformation Sciences Institute\n\nUniversity of Southern California\n\nMarina del Rey, CA 90292\n\ngalstyan@isi.edu\n\nAbstract\n\nEstimating graphical model structure from high-dimensional and undersampled\ndata is a fundamental problem in many scienti\ufb01c \ufb01elds. Existing approaches, such\nas GLASSO, latent variable GLASSO, and latent tree models, suffer from high\ncomputational complexity and may impose unrealistic sparsity priors in some\ncases. We introduce a novel method that leverages a newly discovered connection\nbetween information-theoretic measures and structured latent factor models to\nderive an optimization objective which encourages modular structures where each\nobserved variable has a single latent parent. The proposed method has linear\nstepwise computational complexity w.r.t. the number of observed variables. Our\nexperiments on synthetic data demonstrate that our approach is the only method\nthat recovers modular structure better as the dimensionality increases. We also\nuse our approach for estimating covariance structure for a number of real-world\ndatasets and show that it consistently outperforms state-of-the-art estimators at\na fraction of the computational cost. 
Finally, we apply the proposed method to high-resolution fMRI data (with more than 10^5 voxels) and show that it is capable of extracting meaningful patterns.\n\n1 Introduction\n\nThe ability to recover the true relationships among many variables directly from data is a holy grail in many scientific domains, including neuroscience, computational biology, and finance. Unfortunately, the problem is challenging in high-dimensional and undersampled regimes due to the curse of dimensionality. Existing methods try to address the challenge by making certain assumptions about the structure of the solution. For instance, graphical LASSO, or GLASSO [1], imposes sparsity constraints on the inverse covariance matrix. While GLASSO performs well for certain undersampled problems, its computational complexity is cubic in the number of variables, making it impractical for even moderately sized problems. One can improve the scalability by imposing even stronger sparsity constraints, but this approach fails for many real-world datasets that do not have ultra-sparse structure. Other methods such as latent variable graphical LASSO (LVGLASSO) [2] and latent tree modeling methods [3] suffer from high computational complexity as well, whereas approaches like PCA, ICA, or factor analysis have better time complexity but perform very poorly in undersampled regimes.\n\nIn this work we introduce a novel latent factor modeling approach for estimating multivariate Gaussian distributions. The proposed method \u2013 linear Correlation Explanation or linear CorEx \u2013 searches for\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n
[Figure 1: Unconstrained and modular latent factor models. Panel (a): an unconstrained latent factor model with latent factors Z1, . . . , Zm and observed variables X1, . . . , Xp, characterized (for any distribution) by TC(X | Z) + TC(Z) = 0. Panel (b): a modular latent factor model, characterized (for Gaussians) by TC(X | Z) + TC(Z) = 0 and \u2200i, TC(Z | Xi) = 0. Both models admit equivalent information-theoretic characterizations (see Prop. 2.1 and Thm. 2.1 respectively).]\n\nindependent latent factors that explain all correlations between observed variables, while also biasing the model selection towards modular latent factor models \u2013 directed latent factor graphical models where each observed variable has a single latent variable as its only parent. Biasing towards modular latent factor models corresponds to preferring models for which the covariance matrix of observed variables is block-diagonal, with each block being a diagonal plus rank-one matrix. This modular inductive prior is appropriate for many real-world datasets, such as stock market, magnetic resonance imaging, and gene expression data, where one expects that variables can be divided into clusters, with each cluster being governed by a few latent factors and latent factors of different clusters being close to independent. Additionally, modular latent factors are easy to interpret and are popular for exploratory analysis in social science and biology [4]. Furthermore, we provide evidence that learning the graphical structure of modular latent factor models with a fixed number of latent factors gets easier as the number of observed variables increases \u2013 an effect which we call the blessing of dimensionality.\n\nWe derive the method by noticing that certain classes of graphical models correspond to global optima of information-theoretic functionals. The information-theoretic optimization objective for learning unconstrained latent factor models is shown in Fig. 1a. 
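The constraints in Fig. 1 are written in terms of total correlation, TC, which for multivariate Gaussians has a closed form: TC(X) = sum_i H(X_i) - H(X) = 1/2 (sum_i log Sigma_ii - log det Sigma). A minimal numpy sketch of this quantity (the function name is ours, not from the paper's code):

```python
import numpy as np

def gaussian_tc(cov):
    """Total correlation TC(X) = sum_i H(X_i) - H(X) in nats, for X ~ N(0, cov).
    The 0.5*log(2*pi*e) terms in the marginal and joint entropies cancel."""
    cov = np.asarray(cov, dtype=float)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - np.linalg.slogdet(cov)[1])
```

For a diagonal covariance (independent variables) this is zero; any off-diagonal correlation makes it strictly positive.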
We add an extra regularization term that encourages the learned model to have modular latent factors (shown in Fig. 1b). The resulting objective is trained using gradient descent, each iteration of which has linear time and memory complexity in the number of observed variables p, assuming the number of latent factors is constant. We conduct experiments on synthetic data and demonstrate that the proposed method is the only one that exhibits a blessing of dimensionality when data comes from a modular (or approximately modular) latent factor model. Based on extensive evaluations on synthetic as well as over fifty real-world datasets, we observe that our approach handily outperforms other methods in covariance estimation, with the largest margins on high-dimensional, undersampled datasets. Finally, we demonstrate the scalability of linear CorEx by applying it to high-resolution fMRI data (with more than 100K voxels), and show that the method finds interpretable structures.\n\n2 Learning structured models\n\nNotation. Let X \u2261 X1:p \u2261 (X1, X2, . . . , Xp) denote a vector of p observed variables, and let Z \u2261 Z1:m \u2261 (Z1, Z2, . . . , Zm) denote a vector of m latent variables. Instances of X and Z are denoted in lowercase, with x = (x1, . . . , xp) and z = (z1, . . . , zm) respectively. Throughout the paper we refer to several information-theoretic concepts, such as differential entropy: H(X) = -E[log p(x)], mutual information: I(X; Y) = H(X) + H(Y) - H(X, Y), multivariate mutual information, historically called total correlation [5]: TC(X) = \u03a3_{i=1}^p H(Xi) - H(X), and their conditional variants, such as H(X|Z) = E_z[H(X|Z = z)], TC(X|Z) = E_z[TC(X|Z = z)]. Please refer to Cover and Thomas [6] for more information on these quantities.\n\nConsider the latent factor model shown in Fig. 1a, which we call the unconstrained latent factor model. In such models, the latent factors explain dependencies present in X, since X1, . . . 
, Xp are conditionally independent given Z. Thus, learning such graphical models gives us meaningful latent factors. Typically, to learn such a graphical model we would parameterize the space of models with the desired form and then try to maximize the likelihood of the data under the model. An alternative way, the one that we use in this paper, is to notice that some types of directed graphical models can be expressed succinctly in terms of information-theoretic constraints on the joint density function.\n\nIn particular, the following proposition provides an information-theoretic characterization of the unconstrained latent factor model shown in Fig. 1a.\n\nProposition 2.1. The random variables X and Z are described by a directed graphical model where the parents of X are in Z and the Z's are independent if and only if TC(X|Z) + TC(Z) = 0.\n\nThe proof is presented in Sec. A.1. One important consequence is that this information-theoretic characterization gives us a way to select models that are "close" to the unconstrained latent factor model. In fact, let us parametrize pW(z|x) with a set of parameters W \u2208 \ud835\udcb2 and get a family of joint distributions P = {pW(x, z) = p(x) pW(z|x) : W \u2208 \ud835\udcb2}. By taking pW*(x, z) \u2208 argmin_{pW(x,z) \u2208 P} TC(Z) + TC(X|Z) we select a joint distribution that comes as close as possible to satisfying the conditional independence statements corresponding to the unconstrained latent factor model. If for pW*(x, z) we have TC(Z) + TC(X|Z) = 0, then by Prop. 2.1 we have a model where latent variables are independent and explain all dependencies between observed variables. Next, we define modular latent factor models (shown in Fig. 1b) and bias the learning of unconstrained latent factor models towards selecting modular structures.\n\nDefinition 2.1. 
A joint distribution p(x, z) with p observed variables X1:p and m hidden variables Z1:m is called a modular latent factor model if it factorizes in the following way: for all x, z: p(x, z) = \u03a0_{i=1}^p p(xi | z_{\u03c0i}) \u03a0_{j=1}^m p(zj), with \u03c0i \u2208 {1, 2, . . . , m}.\n\nThe motivation behind encouraging modular structures is two-fold. First, modular factor models are easy to interpret by grouping the observed variables according to their latent parent. Second, modular structures are good candidates for beating the curse of dimensionality. Imagine increasing the number of observed variables while keeping the number of latent factors fixed. Intuitively, we bring in more information about the latent variables, which should help us to recover the structure better. We get another hint of this when we apply a technique from Wang et al. [7] to lower bound the sample complexity of recovering the structure of a Gaussian modular latent factor model. We establish that the lower bound decreases as we increase p keeping m fixed (refer to Sec. C for more details). For more general models such as Markov random fields, the sample complexity grows like log p [7].\n\nTo give an equivalent information-theoretic characterization of modular latent factor models, hereafter we focus our analysis on multivariate Gaussian distributions.\n\nTheorem 2.1. A multivariate Gaussian distribution p(x, z) is a modular latent factor model if and only if TC(X|Z) + TC(Z) = 0 and \u2200i, TC(Z|Xi) = 0.\n\nThe proof is presented in Sec. A.2. Besides characterizing modular latent factor models, this theorem gives us an information-theoretic criterion for selecting more modular joint distributions. The next section describes the proposed method, which uses this theorem to bias the model selection procedure towards modular solutions.\n\n3 Linear CorEx\n\nWe sketch the main steps of the derivation here while providing the complete derivation in Sec. 
B. The first step is to define the family of joint distributions we are searching over by parametrizing pW(z|x). If X1:p is Gaussian, then we can ensure that X1:p, Z1:m are jointly Gaussian by parametrizing pW(zj|x) = N(wj^T x, \u03b7j^2), wj \u2208 R^p, j = 1..m, or equivalently by z = Wx + \u03b5 with W \u2208 R^{m\u00d7p}, \u03b5 ~ N(0, diag(\u03b71^2, . . . , \u03b7m^2)). W.l.o.g. we assume the data is standardized so that E[Xi] = 0, E[Xi^2] = 1. Motivated by Thm. 2.1, we will start with the following optimization problem:\n\nminimize_W TC(X|Z) + TC(Z) + \u03a3_{i=1}^p Qi,   (1)\n\nwhere the Qi are regularization terms for encouraging modular solutions (i.e., encouraging solutions with smaller values of TC(Z|Xi)). We will later specify this regularizer as a non-negative quantity that goes to zero in the case of exactly modular latent factor models. After some calculations for Gaussian random variables and neglecting some constants, the objective simplifies as follows:\n\nminimize_W \u03a3_{i=1}^p (1/2 log E[(Xi - \u03bc_{Xi|Z})^2] + Qi) + \u03a3_{j=1}^m 1/2 log E[Zj^2],   (2)\n\nwhere \u03bc_{Xi|Z} = E[Xi|Z]. For Gaussians, calculating \u03bc_{Xi|Z} requires a computationally undesirable matrix inversion. Instead, we will select Qi to eliminate this term while also encouraging modular structure. According to Thm. 2.1, modular models obey TC(Z|Xi) = 0, which implies that p(xi|z) = p(xi)/p(z) \u03a0_j p(zj|xi). Let \u03bd_{Xi|Z} be the conditional mean of Xi given Z under such a factorization. Then we have\n\n\u03bd_{Xi|Z} = 1/(1 + ri) \u03a3_{j=1}^m Zj Bj,i / sqrt(E[Zj^2]), with Rj,i = E[Xi Zj] / sqrt(E[Xi^2] E[Zj^2]), Bj,i = Rj,i / (1 - Rj,i^2), ri = \u03a3_{j=1}^m Rj,i Bj,i.\n\nIf we let\n\nQi = 1/2 log( E[(Xi - \u03bd_{Xi|Z})^2] / E[(Xi - \u03bc_{Xi|Z})^2] ) = 1/2 log( 1 + E[(\u03bc_{Xi|Z} - \u03bd_{Xi|Z})^2] / E[(Xi - \u03bc_{Xi|Z})^2] ) \u2265 0,\n\nthen we can see that this regularizer is always non-negative and is zero exactly for modular latent factor models (when \u03bc_{Xi|Z} = \u03bd_{Xi|Z}). The final objective simplifies to the following:\n\nminimize_W \u03a3_{i=1}^p 1/2 log E[(Xi - \u03bd_{Xi|Z})^2] + \u03a3_{j=1}^m 1/2 log E[Zj^2].   (3)\n\nThis objective depends only on pairwise statistics and requires no matrix inversion. The global minimum is achieved for modular latent factor models. The next step is to approximate the expectations in the objective (3) with empirical means and optimize it with respect to the parameters W. After training the method we can interpret \u03c0\u0302i \u2208 argmax_j I(Zj; Xi) as the parent of variable Xi. Additionally, we can estimate the covariance matrix of X in the following way:\n\n\u03a3\u0302_{i,\u2113\u2260i} = (B^T B)_{i,\u2113} / ((1 + ri)(1 + r\u2113)), \u03a3\u0302_{i,i} = 1.   (4)\n\nAlgorithm 1 Linear CorEx. Implementation is available at https://github.com/hrayrhar/T-CorEx.\nInput: Data matrix X \u2208 R^{n\u00d7p}, with n iid samples of vectors in R^p.\nResult: Weight matrix W optimizing (3).\n  Subtract the mean from each column of the data and scale to unit variance\n  Initialize Wj,i ~ N(0, 1/sqrt(p))\n  for \u03b5 in [0.6^1, 0.6^2, 0.6^3, 0.6^4, 0.6^5, 0.6^6, 0] do\n    repeat\n      X\u0304 = sqrt(1 - \u03b5^2) X + \u03b5E, with E \u2208 R^{n\u00d7p} and Ei,j iid ~ N(0, 1)\n      Let J\u0302(W) be the empirical version of (3) with X replaced by X\u0304\n      Do one step of the ADAM optimizer to update W using \u2207_W J\u0302(W)\n    until convergence or the maximum number of iterations is reached\n  end for\n\nWe implement the optimization problem (3) in PyTorch and optimize it using the ADAM optimizer [8]. In empirical evaluations, we were surprised to see that this update worked better for identifying weak correlations in noisy data than for very strong correlations with little or no noise. We conjecture that noiseless latent factor models exhibit stronger curvature in the optimization space, leading to sharp, spurious local minima. 
We implemented an annealing procedure to improve results for nearly deterministic factor models. The annealing procedure consists of rounds: at each round we pick a noise amount \u03b5 \u2208 [0, 1], and in each iteration of that round replace X with its noisy version X\u0304, computed as follows: X\u0304 = sqrt(1 - \u03b5^2) X + \u03b5E, with E ~ N(0, Ip). It can easily be seen that when E[Xi] = 0 and E[Xi^2] = 1, we get E[X\u0304i] = 0, E[X\u0304i^2] = 1, and E[X\u0304i X\u0304j] = (1 - \u03b5^2) E[Xi Xj] + \u03b5^2 \u03b4_{i,j}. In this way, adding noise weakens the correlations between observed variables. We train the objective (3) for the current round, then reduce \u03b5 and proceed to the next round, retaining the current values of the parameters. We do 7 rounds with the following schedule for \u03b5: [0.6^1, 0.6^2, . . . , 0.6^6, 0]. The final algorithm is shown in Alg. 1. Our implementation is available at https://github.com/hrayrhar/T-CorEx.\n\nThe only hyperparameter of the proposed method that needs significant tuning is the number of hidden variables, m. While one can select it using standard validation procedures, we observed that it is also possible to select it by increasing m until the gain in modeling performance, measured by log-likelihood, is insignificant. This is due to the fact that setting m to a larger value than needed has no effect on the solution of problem (3), as the method can learn to ignore the extra latent factors.\n\nThe stepwise computational complexity of linear CorEx is dominated by matrix multiplications of an m \u00d7 p weight matrix and a p \u00d7 n data matrix, giving a computational complexity of O(mnp). This is only linear in the number of observed variables assuming m is constant, making it an attractive alternative to standard methods, like GLASSO, that have at least cubic complexity. Furthermore, one can use GPUs to speed up the training up to 10 times. 
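The annealing perturbation described above is a one-liner; a minimal sketch, assuming standardized data as in the text:

```python
import numpy as np

def anneal(X, eps, rng):
    """X_bar = sqrt(1 - eps^2) * X + eps * E with E ~ N(0, I): unit variances are
    preserved while every correlation is shrunk by a factor of (1 - eps^2)."""
    return np.sqrt(1.0 - eps ** 2) * X + eps * rng.standard_normal(X.shape)

# The schedule from the text: 0.6^1, ..., 0.6^6, then 0 (no noise) in the last round.
schedule = [0.6 ** k for k in range(1, 7)] + [0.0]
```

Each round trains to convergence on the perturbed data before the next, smaller eps is used, so the final round (eps = 0) starts from parameters fit to a weakly correlated version of the problem.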
The memory complexity of linear CorEx is O((mT + n)p). Fig. 4 compares the scalability of the proposed method against other methods.\n\n4 Experiments\n\nIn this section we compare the proposed method against other methods on two tasks: learning the structure of a modular factor model (i.e., clustering observed variables) and estimating the covariance matrix of observed variables. Additionally, we demonstrate that linear CorEx scales to high-dimensional datasets and finds meaningful patterns. We present the essential details on experiments, baselines, and hyperparameters in the main text. The complete details are presented in the appendix (see Sec. D).\n\n4.1 Evidence of the blessing of dimensionality\n\nWe start by testing whether modular latent factor models allow better structure recovery as we increase dimensionality. We generate n = 300 samples from a modular latent factor model with p observed variables, m = 64 latent variables each having p/m children, and an additive white Gaussian noise channel from parent to child with fixed signal-to-noise ratio s = 0.1. By setting s = 0.1 we focus our experiment on the regime where each individual variable has a low signal-to-noise ratio. Therefore, one should expect poor recovery of the structure when p is small. In fact, the sample complexity lower bound of Thm. C.1 tells us that in this setting any method needs at least 576 observed variables for recovering the structure with \u03b5 = 0.01 error probability. As we increase p, we add more weakly correlated variables and the overall information that X contains about Z increases. One can expect that some methods will be able to leverage this additional information.\n\nAs recovering the structure corresponds to correctly clustering the observed variables, we consider various clustering approaches. 
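The generative setup just described can be sketched as follows. `sample_modular` is a hypothetical helper (not the paper's code): each observed variable gets one random latent parent, connected through an additive Gaussian noise channel with a fixed signal-to-noise ratio and unit marginal variance:

```python
import numpy as np

def sample_modular(n, p, m, snr=0.1, seed=0):
    """n samples from a Gaussian modular latent factor model: each X_i has one
    latent parent pi_i, X_i = a * Z_{pi_i} + noise, scaled so that E[X_i^2] = 1
    and the per-variable signal-to-noise ratio equals `snr`."""
    rng = np.random.default_rng(seed)
    parents = rng.integers(0, m, size=p)             # pi_i in {0, ..., m-1}
    z = rng.standard_normal((n, m))                  # independent standard-normal factors
    signal = np.sqrt(snr / (1.0 + snr)) * z[:, parents]
    noise = np.sqrt(1.0 / (1.0 + snr)) * rng.standard_normal((n, p))
    return signal + noise, z, parents

x, z, parents = sample_modular(n=300, p=128, m=64, snr=0.1)
```

Recovering `parents` from `x` alone is the structure-learning task; with snr = 0.1 each variable is only weakly informative, so one hopes recovery improves as p grows.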
For decomposition approaches like factor analysis (FA) [9], non-negative matrix factorization (NMF), probabilistic principal component analysis (PCA) [10], sparse PCA [11, 12], and independent component analysis (ICA), we cluster variables according to the latent factor whose weight has the maximum magnitude. As factor analysis suffers from an unidentifiability problem, we apply a varimax rotation (FA+V) [13] to find more meaningful clusters. Other clustering methods include k-means, hierarchical agglomerative clustering using Euclidean distance and the Ward linkage rule (Hier.), and spectral clustering (Spec.) [14]. Finally, we consider the latent tree modeling (LTM) method [15]. Since information distances are estimated from data, we use the "Relaxed RG" method. We slightly modify the algorithm to use the same prior information as the other methods in the comparison, namely, that there are exactly m groups and that observed nodes can be siblings, but not parent and child. We measure the quality of clusters using the adjusted Rand index (ARI), which is adjusted for chance to give 0 for a random clustering and 1 for a perfect clustering.\n\nThe left part of Fig. 2 shows the clustering results for varying values of p. While a few methods marginally improve as p increases, only the proposed method approaches perfect reconstruction.\n\nWe find that this blessing of dimensionality effect persists even when we violate the assumptions of a modular latent factor model by correlating the latent factors or adding extra parents for observed variables. For correlating the latent factors, we convolve each Zi with two other random latent factors. For adding extra parents, we randomly sample p extra edges from a latent factor to a non-child observed variable. By this we create on average one extra edge per observed variable. 
In both modifications, to keep the notion of clusters well-defined, we make sure that each observed variable has higher mutual information with its main parent compared to other factors. All details about synthetic data generation are presented in Sec. E. The right part of Fig. 2 demonstrates that the proposed method improves its results as p increases even if the data does not come from a modular latent factor model. This demonstrates that our regularization term for encouraging modular structures is indeed effective and leads to such structures (more evidence for this statement is presented in Sec. F.1).\n\nFigure 2: Evidence of the blessing of dimensionality effect when learning modular (on the left) or approximately modular (on the right) latent factor models. We report the adjusted Rand index (ARI) measured on 10^4 test samples. Error bars are standard deviations over 20 runs.\n\nFigure 3: Comparison of covariance estimation baselines on synthetic data coming from modular latent factor models. On the left: m = 8 latent factors each having 16 children; on the right: m = 32 latent factors each having 4 children. The reported score is the negative log-likelihood (lower is better) on test data with 1000 samples. Error bars are standard deviations over 5 runs. We jitter x-coordinates to avoid overlaps.\n\n4.2 Covariance estimation\n\nWe now investigate the usefulness of our proposed approach for estimating covariance matrices in the challenging undersampled regime where n \u226a p. For comparison, we include the following baselines: the empirical covariance matrix, the Ledoit-Wolf (LW) method [16], factor analysis (FA), sparse PCA, graphical lasso (GLASSO), and latent variable graphical lasso (LVGLASSO). To measure the quality of covariance matrix estimates, we evaluate the Gaussian negative log-likelihood on test data. 
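The evaluation metric used throughout this section can be sketched as the average zero-mean Gaussian negative log-likelihood of held-out samples, in nats per sample (our helper, not a library call):

```python
import numpy as np

def gaussian_nll(X_test, cov):
    """Average negative log-likelihood of the rows of X_test under N(0, cov)."""
    n, p = X_test.shape
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0, "covariance estimate must be positive definite"
    # quad[i] = x_i^T cov^{-1} x_i for each test sample x_i
    quad = np.einsum('ij,jk,ik->i', X_test, np.linalg.inv(cov), X_test)
    return 0.5 * (p * np.log(2.0 * np.pi) + logdet + quad.mean())
```

Lower is better: a covariance estimate that assigns higher density to the held-out samples scores lower.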
While the Gaussian likelihood is not the best evaluation metric for non-Gaussian data, we would like to note that our comparisons of baselines are still fair, as most of the baselines, such as [latent variable] GLASSO and [sparse] PCA, are derived under a Gaussian assumption. In all experiments hyperparameters are selected from a grid of values using a 3-fold cross-validation procedure.\n\nSynthetic data. We first evaluate covariance estimation on synthetic data sampled from a modular latent factor model. For this type of data, the ground truth covariance matrix is block-diagonal, with each block being a diagonal plus rank-one matrix. We consider two cases: 8 large groups with 16 variables in each block, and 32 small groups with 4 variables in each block. In both cases we set the signal-to-noise ratio s = 5 and vary the number of samples. The results for both cases are shown in Fig. 3. As expected, the empirical covariance estimate fails when n \u2264 p. PCA and factor analysis are not competitive when n is small, while LW nicely handles those cases. Methods with sparsity assumptions (sparse PCA, GLASSO, LVGLASSO) do well, especially in the second case, where the ground truth covariance matrix is very sparse. In most cases the proposed method performs best, losing only narrowly when n \u2264 16 samples and the covariance matrix is very sparse.\n\nStock market data. In finance, the covariance matrix plays a central role in estimating risk, and this has motivated many developments in covariance estimation. Because the stock market is highly non-stationary, it is desirable to estimate covariance using only a small number of samples consisting of the most recent data. We considered the weekly percentage returns for U.S. stocks from January 2000 to January 2017, freely available on http://quandl.com. After excluding stocks that did not have returns over the entire period, we were left with 1491 companies. 
We trained on n weeks of data to learn a covariance matrix using various methods, then evaluated the negative log-likelihood on the subsequent 26 weeks of test data. Each point in Fig. 5 is an average from rolling the training and testing sets over the entire time period.\n\nFigure 4: Runtime comparison of various methods. Points that do not appear either timed out at 10^4 seconds or ran out of memory. The experiment was done in the setting of Sec. 4.1 on an Intel Core i5 processor with 4 cores at 4 GHz and 64 GB memory. We used an Nvidia RTX 2080 GPU when running the proposed method on a GPU.\n\nFigure 5: Comparison of covariance estimation baselines on stock market data. The reported score is the negative log-likelihood (lower is better) on test data. Most of the Ledoit-Wolf points are above the top of the y-axis.\n\nTable 1: For the first ten latent factors, we give the top three stocks ranked by mutual information between stock and associated latent factor.\n\nFactor | Stock tickers | Sector/Industry\n0 | RF, KEY, FHN | Bank holding (NYSE, large cap)\n1 | ETN, IEX, ITW | Industrial machinery\n2 | GABC, LBAI, FBNC | Bank holding (NASDAQ, small cap)\n3 | SPN, MRO, CRZO | Oil & gas\n4 | AKR, BXP, HIW | Real estate investment trusts\n5 | CMS, ES, XEL | Electric utilities\n6 | POWI, LLTC, TXN | Semiconductors\n7 | REGN, BMRN, CELG | Biotech pharmaceuticals\n8 | BKE, JWN, M | Retail, apparel\n9 | DHI, LEN, MTH | Homebuilders\n\nFor component-based methods (probabilistic PCA, sparse PCA, FA, proposed method) we used 30 components. We omitted empirical covariance estimation since all cases have n < p. We see that Ledoit-Wolf does not help much in this regime. With enough samples, PCA and FA are able to produce competitive estimates. Methods with sparsity assumptions, such as GLASSO, LVGLASSO, and sparse PCA, perform better. We see that LVGLASSO consistently outperforms GLASSO, indicating that stock market data is better modeled with latent factors. 
The proposed method consistently outperforms all the other methods. Our approach leverages the high-dimensional data more efficiently than standard factor analysis. The stock market is not well modeled by sparsity, but attributing correlations to a small number of latent factors appears to be effective.\n\nTo examine the interpretability of the learned latent factors, we used weekly returns from January 2014 to January 2017 for training. This means we used only 156 samples and 1491 variables (stocks). For each factor, we use the mutual information between a latent factor and a stock to rank the top stocks related to that factor. We summarize the top stocks for the first ten latent factors in Table 1. Factor 0 appears to be not just banking related, but more specifically bank holding companies. Factor 5 has remarkably homogeneous correlations and consists of energy companies. Factor 9 is specific to home construction.\n\nOpenML datasets. To demonstrate the generality of our approach, we show results of covariance estimation on 51 real-world datasets. To avoid cherry-picking, we selected datasets from OpenML [17] according to the following criteria: between 100 and 11000 numeric features, at least twenty samples but fewer samples than features (samples with missing data were excluded), and the data is not in a sparse format. These datasets span many domains including gene expression, drug design, and mass spectrometry. For factor-based methods including our own, we chose the number of factors from the set m \u2208 {5, 20, 50, 100} using 3-fold cross-validation. We use an 80-20 train-test split, learning a covariance matrix from training data and then reporting the negative log-likelihood on test data. We standardized the data columns to have zero mean and unit variance. Numerical\n\nFigure 6: Some of the clusters linear CorEx finds. 
The cross-hairs correspond to the specified regions.\n\nproblems involving infinite log-likelihoods can arise in datasets which are low rank because of duplicate columns, for example. We add Gaussian noise with variance 10^{-12} to avoid this.\n\nWe compared the same methods as before, with three changes. We omitted empirical covariance estimation since all cases have n < p. We also omitted LVGLASSO, as it was too slow on datasets having about 10^4 variables. The standard GLASSO algorithm was also far too slow for these datasets; therefore, we used a faster version called BigQUIC [18]. For GLASSO, we considered sparsity hyperparameters in {2^0, 2^1, 2^2, 2^3}. We intended to use a larger range of sparsity parameters, but the speed of BigQUIC is highly sensitive to this parameter. In a test example with 10^4 variables, the running time was 130 times longer if we use a sparsity parameter of 0.5 versus 1. Due to space limits we present the complete results in the appendix (Sec. F.2, Table 2). The proposed method clearly outperformed the other methods, getting the best score on 32 out of 51 datasets. Ledoit-Wolf also performed well, getting the best results on 18 out of 51 datasets. Even when the proposed method was not the best, it was generally quite close to the best score. The fact that we had to use relatively large sparsity parameters to get reasonable running time may have contributed to BigQUIC's poor performance.\n\n4.3 High-resolution fMRI data\n\nThe low time and memory complexity of the proposed method allows us to apply it to extremely high-dimensional datasets, such as functional magnetic resonance images (fMRI), common in human brain mapping. The most common measurement in fMRI is Blood Oxygen Level-Dependent (BOLD) contrast, which measures blood flow changes in biological tissues ("activation"). In a typical fMRI session hundreds of high-resolution brain images are captured, each having 100K-600K volumetric pixels (voxels). 
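Both the stock ranking in Sec. 4.2 and the voxel grouping in Sec. 4.3 rank observed variables by mutual information with a latent factor. For jointly Gaussian pairs this reduces to the correlation, I(Zj; Xi) = -1/2 log(1 - rho^2). A minimal sketch with hypothetical helper names:

```python
import numpy as np

def gaussian_mi(rho):
    """I(Z; X) in nats for jointly Gaussian (Z, X) with correlation rho."""
    return -0.5 * np.log(1.0 - np.asarray(rho, dtype=float) ** 2)

def top_variables(z_col, X, k=3):
    """Indices of the k observed variables (columns of X) with largest MI with z_col."""
    rho = np.array([np.corrcoef(z_col, X[:, i])[0, 1] for i in range(X.shape[1])])
    # clip to keep the log finite for (near-)perfect sample correlations
    return np.argsort(-gaussian_mi(np.clip(rho, -0.999999, 0.999999)))[:k]
```

Since gaussian_mi is monotone in |rho|, ranking by MI and ranking by absolute correlation give the same order here; the MI form is what the text reports.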
We demonstrate the scalability and interpretability of linear CorEx by applying it with 100 latent factors to the resting-state fMRI of the first session (session id: 014) of the publicly available MyConnectome project [19]. The session has 518 images, each having 148262 voxels. We do spatial smoothing by applying a Gaussian filter with fwhm=8mm, helping our model to pick up the spatial information faster. Without spatial smoothing the training is unstable, and we suspect that more samples are needed to train the model. We assign each voxel to the latent factor that has the largest mutual information with it, forming a group for each factor.\n\nFig. 6 shows three clusters linear CorEx finds. Though appearing fragmented, the cluster on the left actually captures exactly a memory and reasoning network from the cognitive science literature [20]. This includes the activations in the Left Superior Parietal Lobule, the Left Frontal Middle and Superior Gyri, and the Right Cerebellum. Though the authors of [20] are describing activations during a task-based experiment, the correlation of these regions during resting state is unsurprising if they indeed have underlying functional correlations. The cluster in the middle is, with a few outlier exceptions, a contiguous block in the Right Medial Temporal cortex. This demonstrates the extraction of lateralized regions. The cluster on the right is a bilateral group in the Superior Parietal Lobules. Bilateral function and processing is common for many cortical regions, and this demonstrates the extraction of one such cluster.\n\n5 Related work\n\nPure one-factor models induce relationships among observed variables that can be used to detect latent factors [21, 22]. Tests using relationships among observed variables to detect latent factors have been adapted to the modeling of latent trees [15, 23]. 
Besides tree-like approaches and pure one-factor models, another line of work imposes sparsity on the connections between latent factors and observed variables [11, 12]. Another class of latent factor models can be cast as convex optimization problems [24, 25]. Unfortunately, the high computational complexity of these methods makes them infeasible for the high-dimensional problems considered in this work.
While sparse methods and tractable approximations have enjoyed a great deal of attention [1, 26-28, 18, 29, 30], marginalizing over a latent factor model does not necessarily lead to a sparse model over the observed variables. Many highly correlated systems, like financial markets [31], seem better modeled through a small number of latent factors. The benefit of adding more variables for learning latent factor models is also discussed in [32].
Learning through optimization of information-theoretic objectives has a long history focusing on mutual information [33-35]. Minimizing TC(Z) is well known as ICA [36, 37]. The problem of minimizing TC(X|Z) is less well known, but it is related to the Wyner common information [38] and has also recently been investigated as an optimization problem [39]. A similar objective was used in [40] to model discrete variables, and a nonlinear version for continuous variables, but without modularity regularization (i.e., only TC(Z) + TC(X|Z)), was used in [41].

6 Conclusion

By characterizing a class of structured latent factor models via an information-theoretic criterion, we were able to design a new approach for structure learning that outperforms standard approaches while also reducing stepwise computational complexity from cubic to linear. Better scaling allows us to apply our approach to very high-dimensional data like full-resolution fMRI, recovering biologically plausible structure thanks to our inductive prior on modular structure.
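For Gaussian variables, the total-correlation quantities appearing in the objectives discussed in the related work above have a closed form: TC(X) = Σᵢ H(Xᵢ) − H(X) reduces to −½ log det of the correlation matrix. A small numerical check (a sketch, not the paper's implementation):

```python
import numpy as np

def gaussian_tc(cov):
    """Total correlation TC(X) = sum_i H(X_i) - H(X) for a Gaussian X.

    The differential entropies reduce this to -0.5 * log det of the
    correlation matrix, which is zero iff the variables are independent.
    """
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)
    _, logdet = np.linalg.slogdet(corr)
    return -0.5 * logdet

# Independent variables: the correlation matrix is the identity, so TC = 0.
tc_independent = gaussian_tc(np.diag([1.0, 2.0, 3.0]))

# Correlated pair: TC = -0.5 * log(1 - 0.8**2), about 0.511 nats.
tc_correlated = gaussian_tc(np.array([[1.0, 0.8], [0.8, 1.0]]))

print(round(tc_independent, 6), round(tc_correlated, 3))
```

This makes the objective's behavior concrete: minimizing TC(X|Z) drives the observed variables toward conditional independence given the latent factors.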
A bias towards modular latent factors may not be appropriate in all domains, and, unlike methods encoding sparsity priors (e.g., GLASSO), our approach leads to a non-convex optimization and therefore offers no theoretical guarantees. Nevertheless, we demonstrated applicability across a diverse set of over fifty real-world datasets, with especially promising results in domains like gene expression and finance, where we outperform sparsity-based methods by large margins in both solution quality and computational cost.

Acknowledgments
We thank Andrey Lokhov, Marc Vuffray, and Seyoung Yun for valuable conversations about this work, and we thank the anonymous reviewers whose comments have greatly improved this manuscript. H. Harutyunyan is supported by a USC Annenberg Fellowship. This work is supported in part by DARPA via W911NF-16-1-0575 and W911NF-17-C-0011, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 2016-16041100004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References
[1] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

[2] Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky. Latent variable graphical model selection via convex optimization. Ann. Statist., 40(4):1935-1967, 2012.

[3] Nevin L Zhang and Leonard KM Poon. Latent tree analysis. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[4] Raymond B Cattell.
Factor analysis: an introduction and manual for the psychologist and social scientist. 1952.

[5] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66-82, 1960.

[6] Thomas M Cover and Joy A Thomas. Elements of Information Theory. Wiley-Interscience, 2006.

[7] Wei Wang, Martin J Wainwright, and Kannan Ramchandran. Information-theoretic bounds on model selection for Gaussian Markov random fields. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1373-1377. IEEE, 2010.

[8] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[9] Theodore W Anderson and Herman Rubin. Statistical inference in factor analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 5, pages 111-150, 1956.

[10] Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611-622, 1999.

[11] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.

[12] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 689-696, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1.

[13] Henry F Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3):187-200, 1958.

[14] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

[15] Myung Jin Choi, Vincent YF Tan, Animashree Anandkumar, and Alan S Willsky. Learning latent tree graphical models.
The Journal of Machine Learning Research, 12:1771-1812, 2011.

[16] Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365-411, 2004. ISSN 0047-259X.

[17] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49-60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.

[18] Cho-Jui Hsieh, Mátyás A Sustik, Inderjit S Dhillon, Pradeep K Ravikumar, and Russell Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pages 3165-3173, 2013.

[19] Russell A Poldrack, Timothy O Laumann, Oluwasanmi Koyejo, Brenda Gregory, Ashleigh Hover, Mei-Yen Chen, Krzysztof J Gorgolewski, Jeffrey Luci, Sung Jun Joo, Ryan L Boyd, et al. Long-term neural and physiological phenotyping of a single human. Nature Communications, 6:8885, 2015.

[20] Martin M Monti, Daniel N Osherson, Michael J Martinez, and Lawrence M Parsons. Functional neuroanatomy of deductive inference: a language-independent distributed network. NeuroImage, 37(3):1005-1016, 2007.

[21] Ricardo Silva, Richard Scheines, Clark Glymour, and Peter Spirtes. Learning the structure of linear latent variable models. The Journal of Machine Learning Research, 7:191-246, 2006.

[22] Erich Kummerfeld and Joseph Ramsey. Causal clustering for 1-factor measurement models. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1655-1664. ACM, 2016.

[23] Nevin L Zhang and Leonard KM Poon. Latent tree analysis. In AAAI, pages 4891-4898, 2017.

[24] Venkat Chandrasekaran, Pablo A Parrilo, and Alan S Willsky. Latent variable graphical model selection via convex optimization.
In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1610-1613. IEEE, 2010.

[25] Zhaoshi Meng, Brian Eriksson, and Al Hero. Learning latent variable Gaussian graphical models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1269-1277, 2014.

[26] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006. ISSN 00905364.

[27] Tony Cai, Weidong Liu, and Xi Luo. A constrained L1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594-607, 2011.

[28] Cho-Jui Hsieh, Mátyás A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar. QUIC: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15:2911-2947, 2014.

[29] Ying Liu and Alan Willsky. Learning Gaussian graphical models with observed or latent FVSs. In Advances in Neural Information Processing Systems, pages 1833-1841, 2013.

[30] Sidhant Misra, Marc Vuffray, Andrey Y Lokhov, and Michael Chertkov. Towards optimal sparse inverse covariance selection through non-convex optimization. arXiv preprint arXiv:1703.04886, 2017.

[31] Jianqing Fan, Yingying Fan, and Jinchi Lv. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147(1):186-197, 2008.

[32] Quefeng Li, Guang Cheng, Jianqing Fan, and Yuyan Wang. Embracing the blessing of dimensionality in factor models. Journal of the American Statistical Association, 113(521):380-389, 2018.

[33] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105-117, 1988.

[34] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution.
Neural Computation, 7(6):1129-1159, 1995.

[35] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint arXiv:physics/0004057, 2000.

[36] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287-314, 1994.

[37] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411-430, 2000.

[38] Aaron D Wyner. The common information of two dependent random variables. IEEE Transactions on Information Theory, 21(2):163-179, 1975.

[39] Giel Op't Veld and Michael C Gastpar. Caching Gaussians: Minimizing total correlation on the Gray-Wyner network. In Proceedings of the 50th Annual Conference on Information Systems and Sciences (CISS), 2016.

[40] Greg Ver Steeg and Aram Galstyan. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics (AISTATS), 2015.

[41] Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation explanation. arXiv preprint arXiv:1802.05822, 2018.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.