{"title": "Learning Gaussian Graphical Models with Observed or Latent FVSs", "book": "Advances in Neural Information Processing Systems", "page_first": 1833, "page_last": 1841, "abstract": "Gaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity $O(k^{2}n)$ using message-passing algorithms, where k is the size of the FVS, and n is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of high-degree nodes, or hubs, while the rest of the networks is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate in $O(kn^2+n^2\\log n)$ if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing a inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank correction with complexity $O(kn^{2}+n^{2}\\log n)$ per iteration. We also perform experiments using both synthetic data as well as real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. 
We show that empirically the family of GGMs with FVSs of size $O(\log n)$ strikes a good balance between the modeling capacity and the efficiency.", "full_text": "Learning Gaussian Graphical Models with Observed or Latent FVSs\n\nYing Liu\nDepartment of EECS\nMassachusetts Institute of Technology\nliu_ying@mit.edu\n\nAlan S. Willsky\nDepartment of EECS\nMassachusetts Institute of Technology\nwillsky@mit.edu\n\nAbstract\n\nGaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity $O(k^2 n)$ using message-passing algorithms, where $k$ is the size of the FVS, and $n$ is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of highly influential nodes, or hubs, while the rest of the network is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate with complexity $O(kn^2 + n^2 \log n)$ if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing an inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. 
By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank corrections with complexity $O(kn^2 + n^2 \log n)$ per iteration. We perform experiments using both synthetic data and real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes.\n\n1 Introduction\nIn undirected graphical models or Markov random fields, each node represents a random variable while the set of edges specifies the conditional independencies of the underlying distribution. When the random variables are jointly Gaussian, the models are called Gaussian graphical models (GGMs) or Gauss Markov random fields. GGMs, such as linear state space models, Bayesian linear regression models, and thin-membrane/thin-plate models, have been widely used in communication, image processing, medical diagnostics, and gene regulatory networks. In general, a larger family of graphs represents a larger collection of distributions and thus can better approximate arbitrary empirical distributions. However, many graphs lead to computationally expensive inference and learning algorithms. Hence, it is important to study the trade-off between modeling capacity and efficiency.\nBoth inference and learning are efficient for tree-structured graphs (graphs without cycles): inference can be computed exactly in linear time (with respect to the size of the graph) using belief propagation (BP) [1], while the learning problem can be solved exactly in quadratic time using the Chow-Liu algorithm [2]. Since trees have limited modeling capacity, many models beyond trees have been proposed [3, 4, 5, 6]. Thin junction trees (graphs with low tree-width) are extensions of trees, where inference can be solved efficiently using the junction tree algorithm [7]. However, learning junction trees with tree-width greater than one is NP-complete [6] and tractable learning algorithms (e.g. 
[8]) often have constraints on both the tree-width and the maximum degree. Since graphs with large-degree nodes are important in modeling applications such as social networks, flight networks, and robotic localization, we are interested in finding a family of models that allows arbitrarily large degrees while being tractable for learning.\nBeyond thin junction trees, the family of sparse GGMs is also widely studied [9, 10]. These models are often estimated using methods such as the graphical lasso (or $\ell_1$ regularization) [11, 12]. However, a sparse GGM (e.g. a grid) does not automatically lead to efficient algorithms for exact inference. Hence, we are interested in finding a family of models that are not only sparse but also have guaranteed efficient inference algorithms.\nIn this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all cycles [13]. The authors of [14] have demonstrated that the computation of exact means and variances for such a GGM can be accomplished using message-passing algorithms with complexity $O(k^2 n)$, where $k$ is the size of the FVS and $n$ is the total number of nodes. They have also presented results showing that for models with larger FVSs, approximate inference (obtained by replacing a full FVS by a pseudo-FVS) can work very well, with empirical evidence indicating that a pseudo-FVS of size $O(\log n)$ gives excellent results. In Appendix A we will provide some additional analysis of inference for such models (including the computation of the partition function), but the main focus is maximum likelihood (ML) learning of models with FVSs of modest size, including identifying the nodes to include in the FVS.\nIn particular, we investigate two cases. In the first, all of the variables, including any to be included in the FVS, are observed. 
We provide an algorithm for exact ML estimation that, regardless of the maximum degree, has complexity $O(kn^2 + n^2 \log n)$ if the FVS nodes are identified in advance, and polynomial complexity if the FVS is to be learned and of bounded size. Moreover, we provide an approximate and much faster greedy algorithm for the case where the FVS is unknown and large. In the second case, the FVS nodes are taken to be latent variables. In this case, the structure learning problem corresponds to the (exact or approximate) decomposition of an inverse covariance matrix into the sum of a tree-structured matrix and a low-rank matrix. We propose an algorithm that iterates between two projections, which can also be interpreted as alternating low-rank corrections. We prove that even though the second projection is onto a highly non-convex set, it is carried out exactly, thanks to the properties of GGMs of this family. By carefully incorporating efficient inference into the learning steps, we can further reduce the complexity to $O(kn^2 + n^2 \log n)$ per iteration. We also perform experiments using both synthetic data and real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. We show that empirically the family of GGMs with FVSs of size $O(\log n)$ strikes a good balance between modeling capacity and efficiency.\nRelated Work In the context of classification, the authors of [15] have proposed the tree-augmented naive Bayesian model, where the class label variable itself can be viewed as a size-one observed FVS; however, this model does not naturally extend to include a larger FVS. In [16], a convex optimization framework is proposed to learn GGMs with latent variables, where, conditioned on a small number of latent variables, the remaining nodes induce a sparse graph. 
In our setting with latent FVSs, we further require the sparse subgraph to have tree structure.\n2 Preliminaries\nEach undirected graphical model has an underlying graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of vertices (nodes) and $\mathcal{E}$ the set of edges. Each node $s \in \mathcal{V}$ corresponds to a random variable $x_s$. When the random vector $x_{\mathcal{V}}$ is jointly Gaussian, the model is a GGM with density function given by $p(x) = \frac{1}{Z} \exp\{-\frac{1}{2} x^T J x + h^T x\}$, where $J$ is the information matrix or precision matrix, $h$ is the potential vector, and $Z$ is the partition function. The parameters $J$ and $h$ are related to the mean $\mu$ and covariance matrix $\Sigma$ by $\mu = J^{-1} h$ and $\Sigma = J^{-1}$. The structure of the underlying graph is revealed by the sparsity pattern of $J$: there is an edge between $i$ and $j$ if and only if $J_{ij} \neq 0$.\nGiven samples $\{x^i\}_{i=1}^{s}$ independently generated from an unknown distribution $q$ in the family $\mathcal{Q}$, the ML estimate is defined as $q_{\mathrm{ML}} = \arg\max_{q \in \mathcal{Q}} \sum_{i=1}^{s} \log q(x^i)$. For Gaussian distributions, the empirical distribution is $\hat{p}(x) = \mathcal{N}(x; \hat{\mu}, \hat{\Sigma})$, where the empirical mean is $\hat{\mu} = \frac{1}{s} \sum_{i=1}^{s} x^i$ and the empirical covariance matrix is $\hat{\Sigma} = \frac{1}{s} \sum_{i=1}^{s} x^i (x^i)^T - \hat{\mu}\hat{\mu}^T$. The Kullback-Leibler (K-L) divergence between two distributions $p$ and $q$ is defined as $D_{KL}(p||q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$. Without loss of generality, we assume in this paper that the means are zero.\nTree-structured models are models whose underlying graphs do not have cycles. The ML estimate of a tree-structured model can be computed exactly using the Chow-Liu algorithm [2]. 
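As a concrete illustration of the Gaussian Chow-Liu step, here is a minimal sketch (function name ours, assuming numpy and scipy are available): for jointly Gaussian variables the mutual information between $i$ and $j$ is $-\frac{1}{2}\log(1 - \rho_{ij}^2)$, so the Chow-Liu tree is simply the maximum-weight spanning tree under these edge weights.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(cov):
    # Pairwise correlations from the (empirical) covariance matrix.
    d = np.sqrt(np.diag(cov))
    rho = cov / np.outer(d, d)
    np.fill_diagonal(rho, 0.0)
    # Gaussian mutual information: I(i;j) = -0.5 * log(1 - rho_ij^2).
    mi = -0.5 * np.log1p(-rho ** 2)
    # Maximum-weight spanning tree == minimum spanning tree on negated weights.
    # (Caveat of this sketch: exactly-zero correlations are treated as absent edges.)
    tree = minimum_spanning_tree(-mi)
    rows, cols = tree.nonzero()
    return sorted((int(min(i, j)), int(max(i, j))) for i, j in zip(rows, cols))
```

For a Markov-chain covariance (e.g., correlations 0.8 between neighbors), this recovers the chain edges, as the Chow-Liu theory predicts.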
We use $\Sigma_{CL} = \mathrm{CL}(\hat{\Sigma})$ and $\mathcal{E}_{CL} = \mathrm{CL}_{\mathcal{E}}(\hat{\Sigma})$ to denote respectively the covariance matrix and the set of edges learned using the Chow-Liu algorithm where the samples have empirical covariance matrix $\hat{\Sigma}$.\n3 Gaussian Graphical Models with Known FVSs\nIn this section we briefly discuss some of the ideas related to GGMs with FVSs of size $k$, where we will also refer to the nodes in the FVS as feedback nodes. An example of a graph and its FVS is given in Figure 1, where the full graph (Figure 1a) becomes a cycle-free graph (Figure 1b) if nodes 1 and 2 are removed, and thus the set $\{1, 2\}$ is an FVS.\nFigure 1: A graph with an FVS of size 2. (a) Full graph; (b) Tree-structured subgraph after removing nodes 1 and 2.\nGraphs with small FVSs have been studied in various contexts. The authors of [17] have characterized the family of graphs with small FVSs and their obstruction sets (sets of forbidden minors). FVSs are also related to the \u201cstable sets\u201d in the study of tournaments [18].\nGiven a GGM with an FVS of size $k$ (where the FVS may or may not be given), the marginal means and variances $\mu_i = (J^{-1} h)_i$ and $\Sigma_{ii} = (J^{-1})_{ii}$, for all $i \in \mathcal{V}$, can be computed exactly with complexity $O(k^2 n)$ using the feedback message passing (FMP) algorithm proposed in [14], where standard BP is employed two times on the cycle-free subgraph among the non-feedback nodes while a special message-passing protocol is used for the FVS nodes. We provide a new algorithm, described and proved in Appendix A, to compute $\det J$, the determinant of $J$, and hence the partition function of such a model, with complexity $O(k^2 n)$.\nAn important point to note is that the complexity of these algorithms depends simply on the size $k$ and the number of nodes $n$. 
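The identity behind this determinant computation can be illustrated with dense linear algebra. This is only a sketch (function name ours) of the factorization $\det J = \det(J_T)\,\det(J_F - J_M^T J_T^{-1} J_M)$; the $O(k^2 n)$ version replaces the dense solves against the tree part $J_T$ with belief propagation on the tree.

```python
import numpy as np

def logdet_fvs(J_F, J_M, J_T):
    # log det of J = [[J_F, J_M.T], [J_M, J_T]] via the Schur complement
    # of the tree block J_T.  Dense illustration only: in the efficient
    # algorithm, det(J_T) and the solves against J_T come from BP on the tree.
    sign_T, logdet_T = np.linalg.slogdet(J_T)
    S = J_F - J_M.T @ np.linalg.solve(J_T, J_M)   # k x k Schur complement
    sign_S, logdet_S = np.linalg.slogdet(S)
    assert sign_T > 0 and sign_S > 0              # J must be positive definite
    return logdet_T + logdet_S
```

Since both factors involve only a tree-structured matrix and a small $k \times k$ matrix, this is the source of the $O(k^2 n)$ complexity claimed above.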
There is no loss of generality in assuming that the size-$k$ FVS $F$ is fully connected and each of the feedback nodes has edges to every non-feedback node. In particular, after re-ordering the nodes so that the elements of $F$ are the first $k$ nodes ($T = \mathcal{V} \backslash F$ is the set of non-feedback nodes of size $n - k$), we have that $J = \begin{bmatrix} J_F & J_M^T \\\\ J_M & J_T \end{bmatrix} \succ 0$, where $J_T \succ 0$ corresponds to a tree-structured subgraph among the non-feedback nodes, $J_F \succ 0$ corresponds to a complete graph among the feedback nodes, and all entries of $J_M$ may be non-zero as long as $J_T - J_M J_F^{-1} J_M^T \succ 0$ (while $\Sigma = \begin{bmatrix} \Sigma_F & \Sigma_M^T \\\\ \Sigma_M & \Sigma_T \end{bmatrix} = J^{-1} \succ 0$). We will refer to the family of such models with a given FVS $F$ as $\mathcal{Q}_F$, and the class of models with some FVS of size at most $k$ as $\mathcal{Q}_k$.1 If we are not explicitly given an FVS, though the problem of finding an FVS of minimal size is NP-complete, the authors of [19] have proposed an efficient algorithm with complexity $O(\min\{m \log n, n^2\})$, where $m$ is the number of edges, that yields an FVS at most twice the minimum size (thus the inference complexity is increased only by a constant factor). However, the main focus of this paper, explored in the next section, is on learning models with small FVSs (so that when learned, the FVS is known). As we will see, the complexity of such algorithms is manageable. Moreover, as our experiments will demonstrate, for many problems, quite modestly sized FVSs suffice.\n4 Learning GGMs with Observed or Latent FVS of Size k\nIn this section, we study the problem of recovering a GGM from i.i.d. samples, where the feedback nodes are either observed or latent variables. If all nodes are observed, the empirical distribution\n1In general a graph does not have a unique FVS. 
The family of graphs with FVSs of size $k$ includes all graphs where there exists an FVS of size $k$.\n$\hat{p}(x_F, x_T)$ is parametrized by the empirical covariance matrix $\hat{\Sigma} = \begin{bmatrix} \hat{\Sigma}_F & \hat{\Sigma}_M^T \\\\ \hat{\Sigma}_M & \hat{\Sigma}_T \end{bmatrix}$. If the feedback nodes are latent variables, the empirical distribution $\hat{p}(x_T)$ has empirical covariance matrix $\hat{\Sigma}_T$. With a slight abuse of notation, for a set $A \subset \mathcal{V}$, we use $q(x_A)$ to denote the marginal distribution of $x_A$ under a distribution $q(x_{\mathcal{V}})$.\n4.1 When All Nodes Are Observed\nWhen all nodes are observed, we have two cases: 1) When an FVS of size $k$ is given, we propose the conditioned Chow-Liu algorithm, which computes the exact ML estimate efficiently; 2) When no FVS is given a priori, we propose both an exact algorithm and a greedy approximate algorithm for computing the ML estimate.\n4.1.1 Case 1: An FVS of Size k Is Given\nWhen a size-$k$ FVS $F$ is given, the learning problem becomes solving\n$$p_{ML}(x_F, x_T) = \arg\min_{q(x_F, x_T) \in \mathcal{Q}_F} D_{KL}(\hat{p}(x_F, x_T) \,||\, q(x_F, x_T)). \quad (1)$$\nThis optimization problem is defined on a highly non-convex set $\mathcal{Q}_F$ with combinatorial structure: indeed, there are $(n-k)^{n-k-2}$ possible spanning trees among the subgraph induced by the non-feedback nodes. However, we are able to solve Problem (1) exactly using the conditioned Chow-Liu algorithm described in Algorithm 1.2 The intuition behind this algorithm is that even though the entire graph is not a tree, the subgraph induced by the non-feedback nodes (which corresponds to the distribution of the non-feedback nodes conditioned on the feedback nodes) has tree structure, and thus we can find the best tree among the non-feedback nodes using the Chow-Liu algorithm applied to the conditional distribution. 
To obtain a concise expression, we also exploit a property of Gaussian distributions: the conditional information matrix (the information matrix of the conditional distribution) is simply a submatrix of the whole information matrix. In Step 1 of Algorithm 1, we compute the conditional covariance matrix using the Schur complement, and then in Step 2 we use the Chow-Liu algorithm to obtain the best approximation $\Sigma_{CL}$ (whose inverse is tree-structured). In Step 3, we match exactly the covariance matrix among the feedback nodes and the covariance matrix between the feedback nodes and the non-feedback nodes. For the covariance matrix among the non-feedback nodes, we add the matrix subtracted in Step 1 back to $\Sigma_{CL}$. Proposition 1 states the correctness and the complexity of Algorithm 1; its proof is included in Appendix B. We denote the output covariance matrix of this algorithm as $\mathrm{CCL}(\hat{\Sigma})$.\nAlgorithm 1 The conditioned Chow-Liu algorithm\nInput: $\hat{\Sigma} \succ 0$ and an FVS $F$\nOutput: $\mathcal{E}_{ML}$ and $\Sigma_{ML}$\n1. Compute the conditional covariance matrix $\hat{\Sigma}_{T|F} = \hat{\Sigma}_T - \hat{\Sigma}_M \hat{\Sigma}_F^{-1} \hat{\Sigma}_M^T$.\n2. Let $\Sigma_{CL} = \mathrm{CL}(\hat{\Sigma}_{T|F})$ and $\mathcal{E}_{CL} = \mathrm{CL}_{\mathcal{E}}(\hat{\Sigma}_{T|F})$.\n3. $\mathcal{E}_{ML} = \mathcal{E}_{CL}$ and $\Sigma_{ML} = \begin{bmatrix} \hat{\Sigma}_F & \hat{\Sigma}_M^T \\\\ \hat{\Sigma}_M & \Sigma_{CL} + \hat{\Sigma}_M \hat{\Sigma}_F^{-1} \hat{\Sigma}_M^T \end{bmatrix}$.\nProposition 1. Algorithm 1 computes the ML estimate $\Sigma_{ML}$ and $\mathcal{E}_{ML}$ exactly with complexity $O(kn^2 + n^2 \log n)$. In addition, all the non-zero entries of $J_{ML} \triangleq \Sigma_{ML}^{-1}$ can be computed with extra complexity $O(k^2 n)$.\n2Note that the conditioned Chow-Liu algorithm here is different from other variations of the Chow-Liu algorithm such as in [20] where the extensions are to enforce the inclusion or exclusion of a set of edges.\n4.1.2 Case 2: The FVS Is to Be Learned\nStructure learning becomes more computationally involved when the FVS is unknown. In this subsection, we present both exact and approximate algorithms for learning models with FVS of size no larger than $k$ (i.e., in $\mathcal{Q}_k$). 
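The pieces of this section can be sketched end to end: Algorithm 1 (conditioned Chow-Liu) as the inner routine, the objective d(F) it attains, and the greedy selection of Algorithm 2 on top. This is a dense numpy/scipy sketch with helper names of our choosing, not the $O(kn^2 + n^2 \log n)$ implementation of the paper.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_cov(cov):
    # CL(cov): max-mutual-information spanning tree, then the tree-structured
    # model that matches cov on all single-node and tree-edge marginals.
    n = cov.shape[0]
    d = np.sqrt(np.diag(cov))
    rho = cov / np.outer(d, d)
    np.fill_diagonal(rho, 0.0)
    mi = -0.5 * np.log1p(-rho ** 2)        # Gaussian mutual information
    tree = minimum_spanning_tree(-mi)      # max-weight tree via negation
    rows, cols = tree.nonzero()
    J = np.zeros((n, n))
    deg = np.zeros(n)
    for i, j in zip(rows, cols):
        idx = np.ix_([i, j], [i, j])
        J[idx] += np.linalg.inv(cov[idx])  # pairwise information blocks
        deg[i] += 1
        deg[j] += 1
    J[np.diag_indices(n)] -= (deg - 1) / np.diag(cov)  # node-marginal corrections
    edges = sorted((int(min(i, j)), int(max(i, j))) for i, j in zip(rows, cols))
    return np.linalg.inv(J), edges

def conditioned_chow_liu(Sig, F):
    # Algorithm 1 (sketch), with nodes re-ordered as [F, T].
    n = Sig.shape[0]
    T = [i for i in range(n) if i not in F]
    S_F, S_M, S_T = Sig[np.ix_(F, F)], Sig[np.ix_(T, F)], Sig[np.ix_(T, T)]
    corr = S_M @ np.linalg.solve(S_F, S_M.T)
    S_CL, edges = chow_liu_cov(S_T - corr)               # Steps 1-2
    S_ML = np.block([[S_F, S_M.T], [S_M, S_CL + corr]])  # Step 3
    return S_ML, edges

def d_fvs(Sig, F):
    # d(F): zero-mean Gaussian K-L divergence from the empirical model
    # to the ML estimate in Q_F (nodes re-ordered as [F, T] on both sides).
    n = Sig.shape[0]
    perm = list(F) + [i for i in range(n) if i not in F]
    S_hat = Sig[np.ix_(perm, perm)]
    S_ML, _ = conditioned_chow_liu(Sig, F)
    return 0.5 * (np.trace(np.linalg.solve(S_ML, S_hat)) - n
                  + np.linalg.slogdet(S_ML)[1] - np.linalg.slogdet(S_hat)[1])

def greedy_fvs(Sig, k):
    # Algorithm 2 (sketch): add the single best feedback node at a time.
    F = []
    for _ in range(k):
        rest = [v for v in range(Sig.shape[0]) if v not in F]
        F.append(min(rest, key=lambda v: d_fvs(Sig, F + [v])))
    return F
```

On a Markov-chain covariance, `chow_liu_cov` reproduces the input exactly, and `conditioned_chow_liu` always matches the feedback and cross blocks of the input covariance, mirroring Step 3 of Algorithm 1.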
For a fixed empirical distribution $\hat{p}(x_F, x_T)$, we define $d(F)$, a set function of the FVS $F$, as the minimum value of (1), i.e.,\n$$d(F) = \min_{q(x_F, x_T) \in \mathcal{Q}_F} D_{KL}(\hat{p}(x_F, x_T) \,||\, q(x_F, x_T)). \quad (2)$$\nWhen the FVS is unknown, the ML estimate can be computed exactly by enumerating all $\binom{n}{k}$ possible FVSs of size $k$ to find the $F$ that minimizes $d(F)$. Hence, the exact solution can be obtained with complexity $O(n^{k+2} k)$, which is polynomial in $n$ for fixed $k$. However, as our empirical results suggest, choosing $k = O(\log n)$ works well, leading to quasi-polynomial complexity even for this exact algorithm. That observation notwithstanding, the following greedy algorithm (Algorithm 2), which, at each iteration, selects the single best node to add to the current set of feedback nodes, has polynomial complexity for arbitrarily large FVSs. As we will demonstrate, this greedy algorithm works extremely well in practice.\nAlgorithm 2 Selecting an FVS by a greedy approach\nInitialization: $F_0 = \emptyset$\nFor $t = 1$ to $k$: $k_t^* = \arg\min_{k \in \mathcal{V} \backslash F_{t-1}} d(F_{t-1} \cup \{k\})$, and $F_t = F_{t-1} \cup \{k_t^*\}$.\n4.2 When the FVS Nodes Are Latent Variables\nWhen the feedback nodes are latent variables, the marginal distribution of the observed variables (the non-feedback nodes in the true model) has information matrix $\tilde{J}_T = \hat{\Sigma}_T^{-1} = J_T - J_M J_F^{-1} J_M^T$. If the exact $\tilde{J}_T$ is known, the learning problem is equivalent to decomposing a given inverse covariance matrix $\tilde{J}_T$ into the sum of a tree-structured matrix $J_T$ and a rank-$k$ matrix $-J_M J_F^{-1} J_M^T$.3 In general, we use the ML criterion\n$$q_{ML}(x_F, x_T) = \arg\min_{q(x_F, x_T) \in \mathcal{Q}_F} D_{KL}(\hat{p}(x_T) \,||\, q(x_T)), \quad (3)$$\nwhere the optimization is over all nodes (latent and observed) while the K-L divergence in the objective function is defined on the marginal distribution of the observed nodes only.\nWe propose the latent Chow-Liu algorithm, an alternating projection algorithm that is a variation of the EM algorithm and can be viewed as an instance of the majorization-minimization algorithm. The general form of the algorithm is as follows:\n1. Project onto the empirical distribution:\n$$\hat{p}^{(t)}(x_F, x_T) = \hat{p}(x_T) q^{(t)}(x_F | x_T).$$\n2. Project onto the best fitting structure on all variables:\n$$q^{(t+1)}(x_F, x_T) = \arg\min_{q(x_F, x_T) \in \mathcal{Q}_F} D_{KL}(\hat{p}^{(t)}(x_F, x_T) \,||\, q(x_F, x_T)).$$\nIn the first projection, we obtain a distribution (on both observed and latent variables) whose marginal (on the observed variables) matches exactly the empirical distribution while maintaining the conditional distribution (of the latent variables given the observed ones). In the second projection we compute a distribution (on all variables) in the family considered that is the closest to the distribution obtained in the first projection. We found that among various EM-type algorithms, this formulation is the most revealing for our problems because it clearly relates the second projection to the scenario where an FVS $F$ is both observed and known (Section 4.1.1). Therefore, we are able to compute the second projection exactly even though the graph structure is unknown (which allows any tree structure among the observed nodes). 
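The first projection has a simple matrix form: keep $J_F$ and $J_M$, and replace the block over the observed nodes so that the marginal on $x_T$ matches the empirical covariance exactly. A dense sketch (function name ours; note the accelerated version of the algorithm avoids forming $\hat{\Sigma}_T^{-1}$ explicitly, which this illustration does not):

```python
import numpy as np

def project_P1(J_F, J_M, Sig_T_emp):
    # P1 of the latent Chow-Liu algorithm: keep the J_F and J_M blocks and set
    # the observed block to Sig_T_emp^{-1} + J_M J_F^{-1} J_M^T, so that the
    # marginal covariance of x_T under the new model equals Sig_T_emp exactly.
    bottom = np.linalg.inv(Sig_T_emp) + J_M @ np.linalg.solve(J_F, J_M.T)
    return np.block([[J_F, J_M.T], [J_M, bottom]])
```

The claimed property follows from block inversion: the marginal covariance of $x_T$ is the inverse of the Schur complement, $(\hat{\Sigma}_T^{-1} + J_M J_F^{-1} J_M^T - J_M J_F^{-1} J_M^T)^{-1} = \hat{\Sigma}_T$.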
Note that when the feedback nodes are latent, we do not need to select the FVS since it is simply the set of latent nodes. This is the source of the simplification when we use latent nodes for the FVS: we have no search over sets of observed variables to include in the FVS.\n3It is easy to see that different models having the same $J_M J_F^{-1} J_M^T$ cannot be distinguished using the samples, and thus without loss of generality we can assume $J_F$ is normalized to be the identity matrix in the final solution.\nAlgorithm 3 The latent Chow-Liu algorithm\nInput: the empirical covariance matrix $\hat{\Sigma}_T$\nOutput: information matrix $J = \begin{bmatrix} J_F & J_M^T \\\\ J_M & J_T \end{bmatrix}$, where $J_T$ is tree-structured\n1. Initialization: $J^{(0)} = \begin{bmatrix} J_F^{(0)} & (J_M^{(0)})^T \\\\ J_M^{(0)} & J_T^{(0)} \end{bmatrix}$.\n2. Repeat for $t = 1, 2, 3, \ldots$:\n(a) P1: Project to the empirical distribution:\n$$\hat{J}^{(t)} = \begin{bmatrix} J_F^{(t)} & (J_M^{(t)})^T \\\\ J_M^{(t)} & \hat{\Sigma}_T^{-1} + J_M^{(t)} (J_F^{(t)})^{-1} (J_M^{(t)})^T \end{bmatrix}. \text{ Define } \hat{\Sigma}^{(t)} = (\hat{J}^{(t)})^{-1}.$$\n(b) P2: Project to the best fitting structure:\n$$\Sigma^{(t+1)} = \begin{bmatrix} \hat{\Sigma}_F^{(t)} & (\hat{\Sigma}_M^{(t)})^T \\\\ \hat{\Sigma}_M^{(t)} & \mathrm{CL}(\hat{\Sigma}_{T|F}^{(t)}) + \hat{\Sigma}_M^{(t)} (\hat{\Sigma}_F^{(t)})^{-1} (\hat{\Sigma}_M^{(t)})^T \end{bmatrix} = \mathrm{CCL}(\hat{\Sigma}^{(t)}),$$\nwhere $\hat{\Sigma}_{T|F}^{(t)} = \hat{\Sigma}_T^{(t)} - \hat{\Sigma}_M^{(t)} (\hat{\Sigma}_F^{(t)})^{-1} (\hat{\Sigma}_M^{(t)})^T$. Define $J^{(t+1)} = (\Sigma^{(t+1)})^{-1}$.\nIn Algorithm 3 we summarize the latent Chow-Liu algorithm specialized for our family of GGMs, where both projections have exact closed-form solutions and exhibit complementary structure\u2014one using the covariance and the other using the information parametrization. In projection P1, three blocks of the information matrix remain the same; in projection P2, three blocks of the covariance matrix remain the same.\nThe two projections in Algorithm 3 can also be interpreted as alternating low-rank corrections: indeed, in P1\n$$\hat{J}^{(t)} = \begin{bmatrix} 0 & 0 \\\\ 0 & \hat{\Sigma}_T^{-1} \end{bmatrix} + \begin{bmatrix} J_F^{(t)} \\\\ J_M^{(t)} \end{bmatrix} (J_F^{(t)})^{-1} \begin{bmatrix} J_F^{(t)} & (J_M^{(t)})^T \end{bmatrix},$$\nand in P2\n$$\Sigma^{(t+1)} = \begin{bmatrix} 0 & 0 \\\\ 0 & \mathrm{CL}(\hat{\Sigma}_{T|F}^{(t)}) \end{bmatrix} + \begin{bmatrix} \hat{\Sigma}_F^{(t)} \\\\ \hat{\Sigma}_M^{(t)} \end{bmatrix} (\hat{\Sigma}_F^{(t)})^{-1} \begin{bmatrix} \hat{\Sigma}_F^{(t)} & (\hat{\Sigma}_M^{(t)})^T \end{bmatrix},$$\nwhere the second terms of both expressions are of low rank when the size of the latent FVS is small. This formulation is the most intuitive and simple, but a naive implementation of Algorithm 3 has complexity $O(n^3)$ per iteration, where the bottleneck is inverting the full matrices $\hat{J}^{(t)}$ and $\Sigma^{(t+1)}$. By carefully incorporating the inference algorithms into the projection steps, we are able to further exploit the power of the models and reduce the per-iteration complexity to $O(kn^2 + n^2 \log n)$, which is the same as the complexity of the conditioned Chow-Liu algorithm alone. We have the following proposition.\nProposition 2. 
Using Algorithm 3, the objective function of (3) decreases with the number of iterations, i.e., $D_{KL}(\mathcal{N}(0, \hat{\Sigma}_T) \,||\, \mathcal{N}(0, \Sigma_T^{(t+1)})) \leq D_{KL}(\mathcal{N}(0, \hat{\Sigma}_T) \,||\, \mathcal{N}(0, \Sigma_T^{(t)}))$. Using an accelerated version of Algorithm 3, the complexity per iteration is $O(kn^2 + n^2 \log n)$.\nDue to the page limit, we defer the description of the accelerated version (the accelerated latent Chow-Liu algorithm) and the proof of Proposition 2 to Appendix C. In fact, we never need to explicitly invert the empirical covariance matrix $\hat{\Sigma}_T$ in the accelerated version.\nFigure 2: From left to right: 1) The true model (fBM with 64 time samples); 2) The best spanning tree; 3) The latent tree learned using the CLRG algorithm in [21]; 4) The latent tree learned using the NJ algorithm in [21]; 5) The model with a size-one latent FVS learned using Algorithm 3. The gray scale is normalized for visual clarity.\n(a) 32 nodes (b) 64 nodes (c) 128 nodes (d) 256 nodes\nFigure 3: The relationship between the K-L divergence and the latent FVS size. All models are learned using Algorithm 3 with 40 iterations.\nAs a rule of thumb, we often use the spanning tree obtained by the standard Chow-Liu algorithm as an initial tree among the observed nodes. But note that P2 involves solving a combinatorial problem exactly, so the algorithm is able to jump among different graph structures, which reduces the chance 
In the experiments, we will demonstrate that Algorithm 3 is not sensitive to the initial\ngraph structure.\n5 Experiments\nIn this section, we present experimental results for learning GGMs with small FVSs, observed or\nlatent, using both synthetic data and real data of \ufb02ight delays.\nFractional Brownian Motion: Latent FVS We consider a fractional Brownian motion (fBM)\nwith Hurst parameter H = 0.2 de\ufb01ned on the time interval (0, 1]. The covariance function is\n2 (|t1|2H + |t2|2H \u2212 |t1 \u2212 t2|2H ). Figure 2 shows the covariance matrices of approx-\n\u03a3(t1, t2) = 1\nimate models using spanning trees (learned by the Chow-Liu algorithm), latent trees (learned by\nthe CLRG and NJ algorithms in [21]) and our latent FVS model (learned by Algorithm 3) using 64\ntime samples (nodes). We can see that in the spanning tree the correlation decays quickly (in fact\nexponentially) with distance, which models the fBM poorly. The latent trees that are learned exhibit\nblocky artifacts and have little or no improvement over the spanning tree measured in the K-L di-\nvergence. In Figure 3, we plot the K-L divergence (between the true model and the learned models\nusing Algorithm 3) versus the size of the latent FVSs for models with 32, 64, 128, and 256 time\nsamples respectively. For these models, we need about 1, 3, 5, and 7 feedback nodes respectively\nto reduce the K-L divergence to 25% of that achieved by the best spanning tree model. Hence, we\nspeculate that empirically k = O(log n) is a proper choice of the size of the latent FVS. We also\nstudy the sensitivity of Algorithm 3 to the initial graph structure. In our experiments, for different\ninitial structures, Algorithm 3 converges to the same graph structures (that give the K-L divergence\nas shown in Figure 3) within three iterations.\nPerformance of the Greedy Algorithm: Observed FVS In this experiment, we examine the\nperformance of the greedy algorithm (Algorithm 2) when the FVS nodes are observed. 
For each run, we construct a GGM that has 20 nodes and an FVS of size three as the true model. We first generate a random spanning tree among the non-feedback nodes. The corresponding information matrix $J$ is then also randomly generated: non-zero entries of $J$ are drawn i.i.d. from the uniform distribution $U[-1, 1]$, with a multiple of the identity matrix added to ensure $J \succ 0$. From each generated GGM, we draw 1000 samples and use Algorithm 2 to learn the model. For the 100 runs that we have performed, we recover the true graph structures successfully. Figure 4 shows the graphs (and the K-L divergence) obtained using the greedy algorithm for a typical run. We can see that we have the most divergence reduction (from 12.7651 to 1.3832) when the first feedback node is selected. When the size of the FVS increases to three (Figure 4e), the graph structure is recovered correctly.\n[Figure 2 panel annotations: fBM true model, KL = 0; best spanning tree, KL = 4.055; CLRG, KL = 4.007; NJ, KL = 8.974; 1-FVS, KL = 1.881.]\n(a) True Model (b) KL = 12.7651 (c) KL = 1.3832 (d) KL = 0.6074 (e) KL = 0.0048\nFigure 4: Learning a GGM using Algorithm 2. The thicker blue lines represent the edges among the non-feedback nodes and the thinner red lines represent other edges. (a) True model; (b) Tree-structured model (0-FVS) learned from samples; (c) 1-FVS model; (d) 2-FVS model; (e) 3-FVS model.\n(a) Spanning Tree (b) 1-FVS GGM (c) 3-FVS GGM (d) 10-FVS GGM\nFigure 5: GGMs for modeling flight delays. 
The red dots denote selected feedback nodes and the blue lines represent edges among the non-feedback nodes (other edges involving the feedback nodes are omitted for clarity).\nFlight Delay Model: Observed FVS In this experiment, we model the relationships among airports for flight delays. The raw dataset comes from RITA of the Bureau of Transportation Statistics. It contains flight information in the U.S. from 1987 to 2008, including information such as scheduled departure time, scheduled arrival time, departure delay, arrival delay, cancellation, and reasons for cancellation for all domestic flights in the U.S. We want to model how the flight delays at different airports are related to each other using GGMs. First, we compute the average departure delay for each day and each airport (of the top 200 busiest airports) using data from the year 2008. Note that the average departure delay does not directly indicate whether an airport is one of the major airports that has heavy traffic. It is interesting to see whether major airports (especially those notorious for delays) correspond to feedback nodes in the learned models. Figure 5a shows the best tree-structured graph obtained by the Chow-Liu algorithm (with input being the covariance matrix of the average delay). Figures 5b\u20135d show the GGMs learned using Algorithm 2. It is interesting that the first node selected is Nashville (BNA), which is not one of the top \u201chubs\u201d of the air system. 
The reason BNA is selected first is that, in a 1-FVS approximation, much of the statistical relationship involving the hubs is already captured well by a spanning tree (excluding BNA), and it is breaking the cycles involving BNA that provides the greatest reduction in K-L divergence over a spanning tree. Starting with the next node selected by our greedy algorithm, we begin to see hubs being chosen. In particular, the first ten airports selected, in order, are: BNA, Chicago, Atlanta, Oakland, Newark, Dallas, San Francisco, Seattle, Washington DC, and Salt Lake City. Several major airports on the coasts (e.g., Los Angeles and JFK) are not selected, as their influence on delays at other domestic airports is well captured by a tree structure.

6 Future Directions

Our experimental results demonstrate the potential of these algorithms and, as in the work of [14], suggest that choosing FVSs of size O(log n) works well, leading to algorithms that can scale to large problems. Providing theoretical guarantees for this scaling (e.g., by specifying classes of models for which an FVS of this size yields asymptotically accurate approximations) is thus a compelling open problem. In addition, incorporating model complexity into the FVS-order selection (e.g., via criteria such as AIC or BIC) is another direction we are pursuing. We are also working to extend our results to non-Gaussian settings.

Acknowledgments

This research was supported in part by AFOSR under Grant FA9550-12-1-0287.

References

[1] J. Pearl, "A constraint propagation approach to probabilistic reasoning," Proc. Uncertainty in Artificial Intell. (UAI), 1986.

[2] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Inform. Theory, vol. 14, no.
3, pp. 462\u2013467, 1968.\n\n[3] M. Choi, V. Chandrasekaran, and A. Willsky, \u201cExploiting sparse Markov and covariance struc-\nture in multiresolution models,\u201d in Proc. 26th Annu. Int. Conf. on Machine Learning. ACM,\n2009, pp. 177\u2013184.\n\n[4] M. Comer and E. Delp, \u201cSegmentation of textured images using a multiresolution Gaussian\n\nautoregressive model,\u201d IEEE Trans. Image Process., vol. 8, no. 3, pp. 408\u2013420, 1999.\n\n[5] C. Bouman and M. Shapiro, \u201cA multiscale random \ufb01eld model for Bayesian image segmenta-\n\ntion,\u201d IEEE Trans. Image Process., vol. 3, no. 2, pp. 162\u2013177, 1994.\n\n[6] D. Karger and N. Srebro, \u201cLearning Markov networks: Maximum bounded tree-width graphs,\u201d\n\nin Proc. 12th Annu. ACM-SIAM Symp. on Discrete Algorithms, 2001, pp. 392\u2013401.\n\n[7] M. Jordan, \u201cGraphical models,\u201d Statistical Sci., pp. 140\u2013155, 2004.\n[8] P. Abbeel, D. Koller, and A. Ng, \u201cLearning factor graphs in polynomial time and sample com-\n\nplexity,\u201d J. Machine Learning Research, vol. 7, pp. 1743\u20131788, 2006.\n\n[9] A. Dobra, C. Hans, B. Jones, J. Nevins, G. Yao, and M. West, \u201cSparse graphical models for\n\nexploring gene expression data,\u201d J. Multivariate Anal., vol. 90, no. 1, pp. 196\u2013212, 2004.\n\n[10] M. Tipping, \u201cSparse Bayesian learning and the relevance vector machine,\u201d J. Machine Learn-\n\ning Research, vol. 1, pp. 211\u2013244, 2001.\n\n[11] J. Friedman, T. Hastie, and R. Tibshirani, \u201cSparse inverse covariance estimation with the graph-\n\nical lasso,\u201d Biostatistics, vol. 9, no. 3, pp. 432\u2013441, 2008.\n\n[12] P. Ravikumar, G. Raskutti, M. Wainwright, and B. Yu, \u201cModel selection in Gaussian graphical\nmodels: High-dimensional consistency of l1-regularized MLE,\u201d Advances in Neural Informa-\ntion Processing Systems (NIPS), vol. 21, 2008.\n\n[13] V. Vazirani, Approximation Algorithms. New York: Springer, 2004.\n[14] Y. Liu, V. 
Chandrasekaran, A. Anandkumar, and A. Willsky, "Feedback message passing for inference in Gaussian graphical models," IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4135–4150, 2012.

[15] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2, pp. 131–163, 1997.

[16] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky, "Latent variable graphical model selection via convex optimization," in Proc. 48th Annual Allerton Conf. on Communication, Control, and Computing. IEEE, 2010, pp. 1610–1613.

[17] M. Dinneen, K. Cattell, and M. Fellows, "Forbidden minors to graphs with small feedback sets," Discrete Mathematics, vol. 230, no. 1, pp. 215–252, 2001.

[18] F. Brandt, "Minimal stable sets in tournaments," J. Econ. Theory, vol. 146, no. 4, pp. 1481–1499, 2011.

[19] V. Bafna, P. Berman, and T. Fujito, "A 2-approximation algorithm for the undirected feedback vertex set problem," SIAM J. Discrete Mathematics, vol. 12, p. 289, 1999.

[20] S. Kirshner, P. Smyth, and A. W. Robertson, "Conditional Chow-Liu tree structures for modeling discrete-valued vector time series," in Proc. 20th Conf. on Uncertainty in Artificial Intelligence. AUAI Press, 2004, pp. 317–324.

[21] M. J. Choi, V. Y. Tan, A. Anandkumar, and A. S. Willsky, "Learning latent tree graphical models," J. Machine Learning Research, vol. 12, pp. 1729–1770, 2011.