{"title": "Algebraic tests of general Gaussian latent tree models", "book": "Advances in Neural Information Processing Systems", "page_first": 6298, "page_last": 6307, "abstract": "We consider general Gaussian latent tree models in which the observed variables are not restricted to be leaves of the tree. Extending related recent work, we give a full semi-algebraic description of the set of covariance matrices of any such model. In other words, we find polynomial constraints that characterize when a matrix is the covariance matrix of a distribution in a given latent tree model. However, leveraging these constraints to test a given such model is often complicated by the number of constraints being large and by singularities of individual polynomials, which may invalidate standard approximations to relevant probability distributions. Illustrating with the star tree, we propose a new testing methodology that circumvents singularity issues by trading off some statistical estimation efficiency and handles cases with many constraints through recent advances on Gaussian approximation for maxima of sums of high-dimensional random vectors. Our test avoids the need to maximize the possibly multimodal likelihood function of such models and is applicable to models with larger number of variables. These points are illustrated in numerical experiments.", "full_text": "Algebraic tests of general Gaussian latent tree models\n\nDennis Leung\n\nDepartment of Data Sciences and Operations\n\nUniversity of Southern California\n\nDepartment of Statistics, University of Washington &\n\nDepartment of Mathematical Sciences, University of Copenhagen\n\ndmhleung@uw.edu\n\nMathias Drton\n\nmd5@uw.edu\n\nAbstract\n\nWe consider general Gaussian latent tree models in which the observed variables\nare not restricted to be leaves of the tree. Extending related recent work, we\ngive a full semi-algebraic description of the set of covariance matrices of any\nsuch model. 
In other words, we \ufb01nd polynomial constraints that characterize\nwhen a matrix is the covariance matrix of a distribution in a given latent tree\nmodel. However, leveraging these constraints to test a given such model is often\ncomplicated by the number of constraints being large and by singularities of\nindividual polynomials, which may invalidate standard approximations to relevant\nprobability distributions. Illustrating with the star tree, we propose a new testing\nmethodology that circumvents singularity issues by trading off some statistical\nestimation ef\ufb01ciency and handles cases with many constraints through recent\nadvances on Gaussian approximation for maxima of sums of high-dimensional\nrandom vectors. Our test avoids the need to maximize the possibly multimodal\nlikelihood function of such models and is applicable to models with larger number\nof variables. These points are illustrated in numerical experiments.\n\n1\n\nIntroduction\n\nLatent tree models are associated to a tree-structured graph in which some nodes represent observed\nvariables and others represent unobserved (latent) variables. Due to their tractability, these models\nhave found many applications in \ufb01elds ranging from the traditional life sciences, biology and\npsychology to contemporary areas such as arti\ufb01cial intelligence and computer vision; refer to Mourad\net al. [2013] for a comprehensive review. In this paper, we study the problem of testing the goodness-\nof-\ufb01t of a postulated Gaussian latent tree model to an observed dataset. In a low dimensional\nsetting where the number of observed variables is small relative to the sample size at hand, testing is\nusually based on the likelihood ratio which measures the divergence in maximum likelihood between\nthe postulated latent tree model and an unconstrained Gaussian model. This, however, requires\nmaximization of the possibly multimodal likelihood function of latent tree models. 
In contrast, recent work of Shiers et al. [2016] takes a different approach and leverages known polynomial constraints on the covariance matrix of the observed variables in a given Gaussian latent tree. Specifically, the postulated latent tree is tested with an aggregate statistic formed from estimates of the polynomial quantities involved. This approach can be traced back to Spearman [1904] and Wishart [1928]; also see Drton et al. [2007, 2008].

We make the following new contributions. In Section 2, we extend the polynomial characterization of Shiers et al. [2016] to cases where observed nodes may also be inner nodes of the tree as considered, for example, in the tree learning algorithms of Choi et al. [2011]. Section 3 describes how we may use polynomial equality constraints to test a star tree model. We base ourselves on the recent groundbreaking work of Chernozhukov et al. [2013a], form our test statistic as the maximum of unbiased estimates of the relevant polynomials, and calibrate the critical value for testing based on multiplier bootstrapping techniques. This new way of using the polynomials to furnish a test allows us to handle latent trees with a larger number of observed variables and avoids potential singularity issues caused by individual polynomials. Numerical experiments in Section 4 make comparisons to the likelihood ratio test and assess the size of our tests in finite samples. Section 5 discusses future research directions.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Notation. Let 1 ≤ r ≤ m be two positive integers. We let [m] = {1, . . . , m} and write ([m] choose r) := {I ⊆ [m] : |I| = r} for the collection of subsets of [m] with cardinality r. The supremum norm of a vector is written ‖·‖∞. 
For two random variables R1 and R2, the symbol R1 =d R2 indicates that R1 and R2 have the same distribution, and R1 ≈d R2 indicates that their distributions are approximately equal. N(µ, σ²) denotes a normal distribution with mean µ and variance σ².

2 Characterization of general Gaussian latent trees

We first provide the definition of the models considered in this paper. A tree is an undirected graph in which any two nodes are connected by precisely one path. Let T = (V, E) be a tree, where V is the set of nodes, and E is the set of edges, which we take to be unordered pairs of nodes in V. We say that T is a latent tree if it is paired with a set X = {X1, . . . , Xm} ⊂ V, corresponding to m observed variables, such that v ∈ X whenever v ∈ V is of degree less than or equal to two. In particular, X contains all leaf nodes of the tree T (i.e., nodes of degree 1), but it may contain additional nodes. The nodes in V \ X correspond to latent variables that are not observed but each have at least three other neighbors in the tree. This minimal degree requirement of 3 on the latent nodes ensures identifiability [Choi et al., 2011, p. 1778]. In the terminology of mathematical phylogenetics, T is a semi-labeled tree on X with an injective labeling map; see Semple and Steel [2003, p. 16]. However, phylogenetic trees are latent trees restricted to have X equal to the set of leaves. While we have defined X as a set of nodes, it will be convenient to abuse notation slightly and let X also denote a random vector (X1, . . . , Xm)′ whose coordinates correspond to the nodes in question. The context will clarify whether we refer to nodes or random variables.

Now we present the polynomial characterization of a Gaussian latent tree graphical model that extends the results in Shiers et al. [2016]. 
The Gaussian graphical model on T, denoted M(T), is the set of all |V|-variate Gaussian distributions respecting the pairwise Markov property of T, i.e., for any pair u, v ∈ V with (u, v) ∉ E, the random variables associated to u and v are conditionally independent given the variables corresponding to V \ {u, v}. The T-Gaussian latent tree model on X, denoted MX(T), is the set of all m-variate Gaussian distributions that are the marginal distribution for X under some distribution in M(T). For a given distribution in M(T), let ρpq be the Pearson correlation of the pair (Xp, Xq) for any 1 ≤ p ≠ q ≤ m. The pairwise Markov property implies that

ρpq = ∏_{(u,v) ∈ phT(Xp,Xq)} ρ′uv,    (2.1)

where phT(Xp, Xq) denotes the set of edges on the unique path that connects Xp and Xq in T, and ρ′uv is the Pearson correlation between a pair of nodes u and v in V. Of course, ρ′uv = ρpq if u = Xp and v = Xq. In the sequel, we often abbreviate phT(Xp, Xq) as phT(p, q) for simplicity.

Suppose Σ = (σpq)1≤p,q≤m is the covariance matrix of X. Our task is to test whether Σ comes from MX(T) against a saturated Gaussian graphical model. We assume that all edges in the tree T correspond to a nonzero correlation, so that Σ contains no zero entries. The covariance matrices for MX(T) are parametrized via (2.1). As shown in Shiers et al. [2016], this set of covariance matrices may be characterized by leveraging results on pseudo-metrics defined on X. Suppose w : E → R≥0 is a function that assigns non-negative weights to the edges in E. One can then define a pseudo-metric δw : X × X → R≥0 by

δw(Xp, Xq) = ∑_{e ∈ phT(p,q)} w(e) if p ≠ q, and δw(Xp, Xq) = 0 if p = q.

This is known as a T-induced pseudo-metric on X. 
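To make (2.1) and the pseudo-metric construction concrete, here is a small sketch for a hypothetical star tree with a single latent hub h and four observed nodes. The edge correlations are invented for illustration, and the edge weighting w(e) = −log ρ′(e)², a standard way of turning the product (2.1) into an additive path length, is our choice rather than anything prescribed above:

```python
# Sketch: the path-product rule (2.1) on a hypothetical star tree with
# latent hub "h" and observed nodes 1..4. Edge correlations are made up.
from math import log, isclose

edge_rho = {("h", 1): 0.9, ("h", 2): 0.8, ("h", 3): 0.7, ("h", 4): 0.6}

def rho(p, q):
    # phT(p, q) = {(p, h), (h, q)} for a star tree, so (2.1) is a 2-term product.
    return edge_rho[("h", p)] * edge_rho[("h", q)]

# Weighting each edge by w(e) = -log rho'(e)^2 turns the product (2.1)
# into a sum along the path, i.e., a T-induced pseudo-metric delta_w.
w = {e: -log(r ** 2) for e, r in edge_rho.items()}

def delta(p, q):
    return -log(rho(p, q) ** 2)

assert isclose(delta(1, 2), w[("h", 1)] + w[("h", 2)])

# Four-point condition (2.2) for the partition {1,2}|{3,4}:
lhs = delta(1, 2) + delta(3, 4)
mid = delta(1, 3) + delta(2, 4)
rhs = delta(1, 4) + delta(2, 3)
assert isclose(mid, rhs) and lhs <= mid + 1e-9
```

For this star tree every pairwise path runs through h, so all three sums in (2.2) traverse the same multiset of edges and the inequality holds with equality.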
The following lemma characterizes all the pseudo-metrics on X that are T-induced. The proof is a bit delicate and is given in our supplementary material.

Lemma 2.1. Suppose δ : X × X → R≥0 is a pseudo-metric defined on X. Let δpq = δ(Xp, Xq) for any p, q ∈ [m] for simplicity. Then δ is a T-induced pseudo-metric if and only if for any four distinct 1 ≤ p, q, r, s ≤ m such that phT(p, q) ∩ phT(r, s) = ∅,

δpq + δrs ≤ δpr + δqs = δps + δqr,    (2.2)

and for any three distinct 1 ≤ p, q, r ≤ m,

δpq + δqr = δpr if phT(p, r) = phT(p, q) ∪ phT(q, r).    (2.3)

Lemma 2.1 modifies Corollary 1 in Shiers et al. [2016] by requiring the extra equality constraints in (2.3) concerning three distinct variable indices. For any subset S ⊂ X, let T|S be the restriction of T to S, that is, the minimal subtree of T induced by the elements in S with all the nodes of degree two not in S suppressed [Semple and Steel, 2003, p. 110]; refer to Section 7 in our supplementary material for the related graphical notions. Shiers et al. [2016] only consider phylogenetic trees in which the observed variables X always correspond to the set of nodes in T with degree one. In this case the constraint in (2.3) is vacuous. Indeed, if Xp, Xq, Xr are any three observed nodes in T, then T|{Xp, Xq, Xr} must have the configuration on the left panel of Figure 2.1, and it can be seen that phT(πp, πq) ∪ phT(πq, πr) ≠ phT(πp, πr) for any permutation (πp, πq, πr) of (p, q, r). 
However, for a general latent tree T whose observed nodes are not con\ufb01ned to be the leaves,\ncondition (2.3) is necessary for a pseudo-metric \u03b4 to be T -induced: T|{Xp, Xq, Xr} may take the\ncon\ufb01guration on the right panel of Figure 2.1, where for some permutation (\u03c0p, \u03c0q, \u03c0r) of (p, q, r),\nphT (\u03c0p, \u03c0r) = phT (\u03c0p, \u03c0q) \u222a phT (\u03c0q, \u03c0r), and it must hold that\n\n\u03b4\u03c0p\u03c0r = \u03b4\u03c0p\u03c0q + \u03b4\u03c0q\u03c0r\n\nif \u03b4 is T -induced.\nWhile condition (2.2) appears in the result of Shiers et al. [2016], it may lead to different patterns\nof constraints for a general latent tree. For four distinct indices 1 \u2264 p, q, r, s \u2264 m, there are\nthree possible partitions into two subsets of equal sizes, namely, {p, q}|{r, s}, {p, r}|{q, s} and\n{p, s}|{q, r}. These three partitions correspond to the path pairs\n\n(phT (p, q), phT (r, s)), (phT (p, r), phT (q, s)) and (phT (p, s), phT (q, r))\n\n(2.4)\nrespectively. Now refer to Figure 2.2 which shows all possible con\ufb01gurations of the restriction of T\nto the four observed variables Xp, Xq, Xr, Xs. In Figure 2.2(a)-(c), up to permutations of the indices\n{p, q, r, s}, only one of three pairs in (2.4) can give an empty set when the intersection of its two\ncomponent paths is taken. 
In light of (2.2), this implies that, for some permutation π of the indices p, q, r, s,

δπpπq + δπrπs ≤ δπpπr + δπqπs = δπpπs + δπqπr.    (2.5)

By contrast, in Figure 2.2(d) and (e), it must be the case that each of the three path pairs in (2.4) gives an empty set when an intersection is taken between its two component paths, giving the equalities δpq + δrs = δpr + δqs = δps + δqr in consideration of (2.2).

Lemma 2.1 readily implies a characterization of the latent tree model MX(T) via polynomial constraints in the entries of the covariance matrix Σ = (σpq), as spelt out in the ensuing corollary. Its proof employs similar arguments in Shiers et al. [2016] and is deferred to our supplementary material.

In what follows, we let Q ⊂ ([m] choose 4) be the set of all quadruples {p, q, r, s} ∈ ([m] choose 4) such that only one of the three path pairs in (2.4) gives an empty set when the intersection of its two component paths is taken. In other words, Q contains all S ∈ ([m] choose 4) such that T|S is one of the configurations in Figure 2.2(a)-(c). Given {p, q, r, s} ∈ Q, we write {p, q}|{r, s} ∈ Q to indicate that {p, q, r, s} belongs to Q in a way that it is the path pair phT(p, q) and phT(r, s) that has empty intersection. Similarly, we will let L be the set of all triples S = {p, q, r} ∈ ([m] choose 3) such that T|S has the configuration in Figure 2.1(b). We will use the notation p − q − r ∈ L to indicate that q is the "middle point" such that phT(p, q) ∩ phT(q, r) = ∅.

Corollary 2.2. 
Suppose Σ = (σpq)1≤p,q≤m is the covariance matrix of X and has no zero entries. The following are together necessary and sufficient for the distribution of X to belong to MX(T):

i. Inequality constraints:

(a) For any {p, q, r} ∈ ([m] choose 3), σpq σpr σqr ≥ 0.
(b) For any {p, q, r} ∈ ([m] choose 3) \ L, σ²pq σ²pr − σ²pp σ²qr ≤ 0, σ²pq σ²qr − σ²qq σ²pr ≤ 0, and σ²pr σ²qr − σ²rr σ²pq ≤ 0.
(c) For any {p, q}|{r, s} ∈ Q, σ²pr σ²qs − σ²pq σ²rs ≤ 0.

ii. Equality constraints:

(a) For any p − q − r ∈ L, σpq σqr − σqq σpr = 0.
(b) For any {p, q}|{r, s} ∈ Q, σpr σqs − σps σqr = 0.
(c) For any {p, q, r, s} ∉ Q, σps σqr − σpr σqs = σpq σrs − σpr σqs = 0.

Figure 2.1: The possible restrictions of a latent tree to three distinct observed variables, panels (a) and (b). Observed variables correspond to solid black dots, latent variables to grey circles.

Figure 2.2: The possible restrictions of a latent tree to four distinct observed variables. From left to right, (a)-(e). Observed variables correspond to solid black dots, latent variables to grey circles.

3 Testing a star tree model

In this section we illustrate how one can test a postulated Gaussian latent tree model using Corollary 2.2. In order to focus the discussion we treat the simple but important special case of a star tree, which corresponds to a single factor model. A single factor model with m observed variables X = {X1, . . .
, Xm} can be described by the linear system of equations

Xp = µp + βp H + εp,  1 ≤ p ≤ m,    (3.1)

where µp is the mean of Xp, H ∼ N(0, 1) is a latent variable, βp is the loading coefficient for variable Xp, and εp ∼ N(0, σ²p,ε) is the idiosyncratic error for variable Xp. All of H, ε1, . . . , εm are independent. The model postulates that X1, . . . , Xm are conditionally independent given H. It thus corresponds to the graphical model associated with a star tree T⋆ = (V, E) with V = X ∪ {H}, E = {(H, Xp)}1≤p≤m.

Let X1, . . . , Xn be i.i.d. draws from the distribution of X, which is assumed to be Gaussian. Our goal is to test whether the distribution of X belongs to the single factor model MX(T⋆). Without loss of generality, we may assume that µp = 0 for all p ∈ [m] [Anderson, 2003, Theorem 3.3.2]. We proceed by testing whether all the constraints in Corollary 2.2 are simultaneously satisfied with respect to the latent tree T⋆. For simplicity, we will focus on testing the equality constraints in Corollary 2.2(ii), and briefly discuss how one can incorporate the inequality constraints in Corollary 2.2(i) in Section 5. For T⋆, both sets L and Q are empty, so that Corollary 2.2(ii)(a) and (b) are automatically satisfied. Hence, we are only left with Corollary 2.2(ii)(c): For any {p, q, r, s} ∈ ([m] choose 4),

σps σqr − σpr σqs = σpq σrs − σpr σqs = 0.    (3.2)

The two polynomials above, equal to det(Σpq,sr) and det(Σps,qr) respectively, are known as tetrads in the literature of factor analysis. 
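Before turning to estimation, a quick numerical sanity check (ours, with arbitrary illustrative loadings and error variances) confirms that every tetrad in (3.2) vanishes exactly when Σ = ββ′ + diag(σ²ε) is a one-factor covariance matrix:

```python
# Sanity check (ours, arbitrary illustrative parameters): all tetrads in (3.2)
# vanish exactly for a one-factor covariance Sigma = beta beta' + diag(var_eps).
from itertools import combinations
from math import isclose

beta = [1.0, 0.5, -0.8, 1.2, 0.3]
var_eps = [0.4, 0.6, 0.5, 0.7, 0.9]
m = len(beta)
sigma = [[beta[p] * beta[q] + (var_eps[p] if p == q else 0.0)
          for q in range(m)] for p in range(m)]

# Off-diagonal entries are sigma_pq = beta_p * beta_q, so both tetrads factor out.
for p, q, r, s in combinations(range(m), 4):
    t1 = sigma[p][s] * sigma[q][r] - sigma[p][r] * sigma[q][s]
    t2 = sigma[p][q] * sigma[r][s] - sigma[p][r] * sigma[q][s]
    assert isclose(t1, 0.0, abs_tol=1e-12)
    assert isclose(t2, 0.0, abs_tol=1e-12)
```

The assertions hold because each quadruple involves four distinct indices, so no diagonal (error-variance) terms enter the tetrads.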
It is well-known that they define equality constraints for a single factor model [Bekker and de Leeuw, 1987, Bollen and Ting, 1993, Drton et al., 2007].

3.1 Estimating tetrads

The idea now is to estimate each one of the 2·(m choose 4) tetrads in (3.2), and aggregate the estimates in a test statistic. From the sample covariance matrix S = (spq) = n⁻¹ ∑_{i=1}^{n} Xi Xiᵀ, a straightforward sample tetrad estimate, say sps sqr − spr sqs, can be computed. If one defines the vectors t = (sps, sqr, spr, sqs)′ and t0 = (σps, σqr, σpr, σqs)′, as well as the function g(t) = sps sqr − spr sqs, by the delta method it is expected that √n (g(t) − g(t0)) → N(0, ∇g(t0)′ V ∇g(t0)), where V is the limiting covariance matrix of √n (t − t0) and ∇g(t0) is the gradient of g(·) evaluated at t0. However, the distribution of this sample tetrad becomes asymptotically degenerate at singularities, that is, when the gradient ∇g(t0) vanishes, which happens if the underlying true covariances are zero [Drton and Xiao, 2016]. Consequently, a standardized sample tetrad cannot be well approximated by a normal distribution if the underlying correlations are weak. More generally, even for stronger correlations, we found it difficult to reliably estimate the variance of all sample tetrads in larger-scale models.

We propose alternative estimators for which sampling variability can be estimated more easily. Due to the independence of samples, the tetrad det(Σpq,sr) = σps σqr − σpr σqs can be estimated unbiasedly with the differences

Yi,(pq)(sr) := Xp,i Xs,i Xq,i+1 Xr,i+1 − Xp,i Xr,i Xq,i+1 Xs,i+1,  i = 1, . . . , n − 1,    (3.3)

where the subscripts in Yi,(pq)(sr) are indicative of the row and column indices for the submatrix Σpq,sr. 
These differences can then be averaged for an estimate of the tetrad. Similarly, one can form Yi,(ps)(qr) to estimate det(Σps,qr) in (3.2). If we arrange all the tetrads from {det(Σpq,sr), det(Σps,qr)}, {p, q, r, s} ∈ ([m] choose 4), into a 2·(m choose 4)-vector Θ, and correspondingly arrange the estimates {Yi,(pq)(sr), Yi,(ps)(qr)} into a 2·(m choose 4)-vector Yi for each i, then the central limit theorem for 1-dependent sums ensures that for sufficiently large sample size n we have the distributional approximation

√(n − 1) (Ȳ − Θ) ≈d N(0, Υ),    (3.4)

where Ȳ = (n − 1)⁻¹ ∑_{i=1}^{n−1} Yi and Υ = Cov[Y1, Y1] + 2 Cov[Y1, Y2]. The latter limiting covariance matrix will not degenerate to a singular matrix even if the underlying covariance matrix for X has zeros at which some of the tetrads are singular (i.e. have zero gradient).

3.2 Bootstrap test

The fact from (3.4) could serve as the starting point for a test of model M(T⋆). However, the normal approximation quickly becomes of concern when moving beyond a small number of variables m. Indeed, the dimension of Θ, 2·(m choose 4), may well be close to the sample size n, or even larger. For instance, if n = 250, for a model with merely 8 observed variables the dimension of Θ is already 2·(8 choose 4) = 140, more than half the sample size. A recent work of Zhang and Wu [2017], which follows up on the groundbreaking paper of Chernozhukov et al. 
[2013a] on Gaussian approximation for maxima of high-dimensional independent sums, suggests that while the approximation in (3.4) may be dubious, by taking a supremum norm on both sides, the Gaussian approximation

√(n − 1) ‖Ȳ − Θ‖∞ ≈d ‖Z‖∞,    (3.5)

where Z =d N(0, Υ), can be valid even when the dimension of Θ is large compared to n. In fact, the original work of Chernozhukov et al. [2013a] suggested that asymptotically, the dimension can be sub-exponential in the sample size for the Gaussian approximation to hold. In what follows, we will discuss implementation of and experiments with a vanishing tetrad test based on (3.5). While it is possible to adapt the supporting theory for the present application, the technical details are involved and beyond the scope of this conference paper.

Since Ȳ from (3.4) and (3.5) is an estimator of the vector of tetrads Θ, it is natural to use ‖Ȳ‖∞ as the test statistic and reject the model M(T⋆) for large values of ‖Ȳ‖∞. The Gaussian approximation (3.5) suggests that when M(T⋆) is true, i.e. Θ = 0, √(n − 1) ‖Ȳ‖∞ is approximately distributed as ‖Z‖∞. Nevertheless, to calibrate critical values based on the distribution of ‖Z‖∞, one must estimate the unknown covariance matrix Υ. Zhang and Wu [2017] suggested the batched mean estimator

Υ̂ = (Bω)⁻¹ ∑_{b=1}^{ω} (∑_{i∈Lb} (Yi − Ȳ)) (∑_{i∈Lb} (Yi − Ȳ))ᵀ,    (3.6)

where for a batch size B and ω := ⌊(n − 1)/B⌋ one considers the non-overlapping sets of samples Lb = {1 + (b − 1)B, . . . , bB}, b = 1, . . . , ω. 
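Putting the pieces together, the following is a minimal numpy sketch of the pipeline (ours, not the authors' implementation): simulate data from a hypothetical one-factor model, form the 1-dependent estimates (3.3), compute the diagonal of (3.6) from batch sums, and obtain the statistic T together with a multiplier-bootstrap critical value as described in Section 3.3. All sizes, seeds, and parameter values are invented for illustration:

```python
# A minimal sketch of the testing pipeline (ours, not the authors' code):
# simulate one-factor data, form the 1-dependent tetrad estimates (3.3),
# compute the diagonal of the batched estimator (3.6), the statistic T,
# and a multiplier-bootstrap critical value. All parameters are invented.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
m, n, B, alpha = 6, 200, 3, 0.05
beta = rng.normal(size=m)
X = np.outer(rng.normal(size=n), beta) + rng.normal(scale=0.6, size=(n, m))

quads = list(combinations(range(m), 4))
Y = np.empty((n - 1, 2 * len(quads)))          # one row per i = 1, ..., n-1
A, Anext = X[:-1], X[1:]                       # samples i and i+1
for j, (p, q, r, s) in enumerate(quads):
    # unbiased for det(Sigma_pq,sr) = sigma_ps*sigma_qr - sigma_pr*sigma_qs
    Y[:, 2 * j] = (A[:, p] * A[:, s] * Anext[:, q] * Anext[:, r]
                   - A[:, p] * A[:, r] * Anext[:, q] * Anext[:, s])
    # unbiased for det(Sigma_ps,qr) = sigma_pq*sigma_rs - sigma_pr*sigma_qs
    Y[:, 2 * j + 1] = (A[:, p] * A[:, q] * Anext[:, r] * Anext[:, s]
                       - A[:, p] * A[:, r] * Anext[:, q] * Anext[:, s])

Ybar = Y.mean(axis=0)
omega = (n - 1) // B
S = (Y[: omega * B].reshape(omega, B, -1) - Ybar).sum(axis=1)  # batch sums
diag_hat = (S ** 2).mean(axis=0) / B           # diagonal entries of (3.6)

T = np.sqrt(n - 1) * np.max(np.abs(Ybar) / np.sqrt(diag_hat))

E = 500                                        # bootstrap replicates
boot = np.empty(E)
for k in range(E):
    e = rng.normal(size=omega)                 # multipliers e_1, ..., e_omega
    boot[k] = np.max(np.abs(e @ S) / np.sqrt(B * omega * diag_hat))
q = np.quantile(boot, 1 - alpha)
reject = T > q   # data come from the model, so this is typically False
```

Note that only the diagonal of Υ̂ is ever formed, and the batch sums S are reused across all bootstrap draws, which is the source of the O(m⁴nE) cost mentioned in Section 3.3.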
The "batching" aims to capture the dependence among the Yi's, and has been widely studied in the time series literature [Bühlmann, 2002, Lahiri, 2003]. If model M(T⋆) is true, then (3.5) yields that

T := √(n − 1) ‖diag(Υ̂)^{−1/2} Ȳ‖∞ ≈d ‖diag(Υ̂)^{−1/2} Z̃‖∞,

where the right-hand side is to be interpreted conditionally on Υ̂, with Z̃ ∼ N(0, Υ̂) and diag(Υ̂) comprising only the diagonal of Υ̂. More precisely, for a fixed test level α ∈ (0, 1), if we define q1−α to be the conditional (1 − α)-quantile of the distribution of ‖diag(Υ̂)^{−1/2} Z̃‖∞ given Υ̂, then

P(T > q1−α) ≈ α,    (3.7)

according to Zhang and Wu [2017, Corollary 5.4]. We will use T as our test statistic for the model M(T⋆), and calibrate the critical value based on (3.7) by simulating the conditional quantile q1−α from ‖diag(Υ̂)^{−1/2} Z̃‖∞ for fixed Υ̂.

3.3 Implementation

While our above presentation invoked the estimate Υ̂, which is a matrix with O(m⁸) entries, we may in fact bypass the problem of computing such a large covariance matrix for the tetrad estimates. To simulate the conditional quantile q1−α in (3.7), let e1, . . . , eω be i.i.d. standard normal random variables, and consider the expression

‖diag(Υ̂)^{−1/2} (Bω)^{−1/2} ∑_{b=1}^{ω} eb (∑_{i∈Lb} (Yi − Ȳ))‖∞,    (3.8)

which has exactly the same distribution as ‖diag(Υ̂)^{−1/2} Z̃‖∞ conditioning on the data X1, . . . 
, Xn. We emphasize that the O(m⁴) diagonal entries of Υ̂ are easily computed as variances in (3.6). In conclusion, we perform the following multiplier bootstrap procedure: (i) Generate many, say E = 1000, sets of {e1, . . . , eω}, (ii) evaluate (3.8) for each of these E sets, and (iii) take q1−α to be the 1 − α quantile from the resulting E numbers. Despite the bootstrap being a computationally intensive process, it is not hard to see that the evaluation of (3.8) for all E sets of multipliers will involve O(m⁴nE) operations, which even for moderate m is far less than the O(m⁸) operations needed to obtain an entire covariance matrix for all tetrads.

Remark. It is instructive to make a comparison with the testing methodology in Shiers et al. [2016], where the focus was on lower-dimensional applications. Suppose τ : Σ ↦ Θ is the function that maps the covariance matrix Σ into the vector Θ of tetrads in (3.2). To test the vanishing of the tetrads, Shiers et al. [2016] form plug-in estimates Θ̂ = τ(S) for Θ with the sample covariance matrix S = n⁻¹ ∑_{i=1}^{n} Xi Xiᵀ. Letting Var[τ(S)] be the covariance matrix for the 2·(m choose 4)-vector τ(S), they form a Hotelling's T² type statistic as

n τ(S)ᵀ (V̂ar[τ(S)])⁻¹ τ(S),    (3.9)

where V̂ar[τ(S)] is a consistent estimate for Var[τ(S)]; see also Drton et al. [2008]. For a test of model M(T⋆), this statistic is now compared to a chi-square distribution with 2·(m choose 4) degrees of freedom. While this calibration is justified for sufficiently large sample size n by a joint normal approximation analogous to (3.5), it can be problematic for large m. Even more pressing can be the computational disadvantage that one explicitly uses the entire matrix V̂ar[τ(S)] with its O(m⁸) entries.

4 Numerical experiments

We now report on some experiments with the bootstrap test based on the sup-norm of the estimated tetrads T proposed in Section 3. In the implementation we always use E = 1000 sets of normal multipliers to simulate the quantile q1−α and work with batch size B = 3 in (3.8). We also benchmark our methodology against the likelihood ratio test for factor models implemented by the function factanal in the base library of R, which implements a likelihood ratio (LR) test with Bartlett correction for more accurate asymptotic approximation. The critical value of the LR test is calibrated with the chi-square distribution with (m−1 choose 2) − 1 degrees of freedom [Drton et al., 2009, p. 99].

Figure 4.1: Empirical test sizes vs nominal test levels based on 500 experiments. Data are generated based on MX(T⋆) with parameters as prescribed in the text. Upper panels: (m, n) = (20, 250). Lower panels: (m, n) = (20, 500). Left panels: Setup 1. Right panels: Setup 2. Open circles: Test based on the statistic T. Crosses: LR test implemented by factanal.

4.1 Low dimensional setup

We first consider two experimental setups, each with data generated from the one-factor model in (3.1) for both (m, n) = (20, 250) and (m, n) = (20, 500). The model parameters are as follows: (i) Setup 1: all loadings βp and error variances σ²p,ε are taken to be 1. (ii) Setup 2: β1 and β2 are taken to be 10, while the other loadings are independently generated based on a normal distribution with mean 0 and variance 0.2. The error variances σ²p,ε all equal 1/3.

For different nominal test levels α in the range (0, 1) that are 0.01 apart, we compare the empirical sizes of our test based on the statistic T and the likelihood ratio (LR) test implemented by the function factanal, using 500 repetitions of experiments. The results are shown in Figure 4.1. The left two panels correspond to Setup 1 and the right two panels to Setup 2, while the upper panels correspond to (m, n) = (20, 250) and the lower to (m, n) = (20, 500). While we show the entire range (0, 1) for the x-axis, practical interest is typically in the initial part where the nominal error rate is in, say, (0, 0.1).

In Setup 1, for both sample sizes, the empirical test sizes of the LR test align almost perfectly with the 45° line, as one would expect from classical theory. The sizes of our test based on T also align better with the 45° line as sample sizes grow. Note that for nominal test levels that are of practical interest, T also gives conservative test sizes for both sample sizes.

In Setup 2, where parameters are close to being "singular", one can see the true advantage of using T over the LR test. The empirical test sizes of the LR test with factanal do not align well with the 45° line as one would normally expect from classical theory, whereas the test sizes of our statistic T lean closer to the 45° line as n increases. The performance of the LR test is particularly problematic since, by rejecting the true model (3.1) all too often, it fails to give even an approximate control on type 1 error. Note that the values of β and σp,ε are such that, for the most part, the observed variables X are rather weakly dependent on each other. If the observations were in fact independent then the likelihood ratio test statistic does not exhibit a chi-square limiting distribution [Drton, 2009, Theorem 6.1]. 
This highlights the fact that, in addition to avoiding any non-convex optimization of the likelihood function of the factor model, our approach based on the simple estimates from (3.3) is not subject to non-standard limiting behaviors that plague the LR test when the parameter values lean close to singularities of the parameter space [Drton, 2009].

Figure 4.2: Empirical test size vs nominal test levels based on 500 experiments for data generated from MX(T⋆) under Setup 1 and (m, n) = (100, 250). Open circles: Test based on T. Crosses: LR test implemented by factanal.

4.2 Higher dimensional setup

Our last experiment aims to compare the test sizes of the two tests when the number of observed variables m is relatively large compared to n. Data are exactly as in Setup 1, except that (m, n) = (100, 250). For such a model with large m, the number of tetrads involved in our testing methodology is so large that even after taking the supremum norm one shouldn't expect (3.5) to hold; for example, when m = 50, the dimension of Θ is 2·(50 choose 4) = 460600, and one should be skeptical about the validity of (3.5) when we only have the sample size n = 250. To implement our test, we first randomly select 10000 of the 2·(m choose 4) tetrads, and proceed with the bootstrapping procedure in (3.8) with Yi being estimates for this selected subset of tetrads alone. The choice of 10000 tetrads to be tested is based on the fact that, in the previous experiments with (m, n) = (20, 250), our test gives reasonable empirical test sizes for a practical range of nominal levels when the total number of tetrads being tested, 2·(20 choose 4), is approximately 10000. 
Since the subset of tetrads is randomly selected, our test is still expected to approximately control the test size at the nominal level. The results are reported in Figure 4.2.
As seen there, the test based on T retains the main features observed in the first experiment. In particular, it successfully controls the type I error rate for the practical range of α ∈ (0, 0.1). In contrast, with m increased to 100, the LR test drastically fails to control the type I error rate. This is despite the fact that the setup is regular, with parameter values far from any model singularity. The reason for the failure of the LR test is that the dimension is of the same order as the sample size of 250. The sample size is not large enough for chi-square asymptotics based on a fixed dimension m to "kick in".

5 Discussion

In this paper we have established a full set of polynomial constraints on the covariance matrix of the observed variables, in the form of both equalities and inequalities, that characterizes a general Gaussian latent tree model whose observed nodes are not confined to be leaves. Focusing on the special case of a star tree model, we also experimented with a new methodology for testing the equality constraints by forming unbiased estimates of the polynomials involved. In simulation studies, when the number of variables involved is large or the underlying parameters are close to being "singular", our test compares favorably with the likelihood ratio test in terms of test size.
Our results have paved the way for developing a full-fledged algebraic test for a Gaussian latent tree model. Although we have not pursued this generality in the present conference paper, we give a brief discussion here. Of course, to do so one would first need to write an efficient graph algorithm to tease out all the polynomials entailed by Corollary 2.2 for a given latent tree input.
Then the current testing methodology can be adopted by forming unbiased estimates of all these polynomials, which also brings to our attention that in Section 3 only the equality constraints in Corollary 2.2(ii) were used to test the single-factor model. For illustration, take the degree-3 monomial in Corollary 2.2(i)(a) as an example. Like (3.3), one may form a summand

Y_{i,(p,q,r)} = X_{p,i} X_{q,i} X_{p,i+1} X_{r,i+1} X_{q,i+2} X_{r,i+2},

which is unbiased for σ_pq σ_pr σ_qr, and then use (n − 2)⁻¹ Σ_{i=1}^{n−2} Y_{i,(p,q,r)} as an averaged estimator. To incorporate the constraints in Corollary 2.2(i) into our test, one can first arrange all those inequalities into "less than" conditions, i.e., Corollary 2.2(i)(a) becomes −σ_pq σ_pr σ_qr ≤ 0, and the corresponding estimate becomes −(n − 2)⁻¹ Σ_{i=1}^{n−2} Y_{i,(p,q,r)}. Following that, in the definition of the test statistic T, one can take a maximum over all the unbiased estimates for the "less than" versions of the polynomials in Corollary 2.2(i), in addition to the absolute values of the estimates for the polynomials in Corollary 2.2(ii). The resulting test statistic shall also reject the model M(T⋆) when its value is too large. While critical values can still be calibrated with the multiplier bootstrap, additional techniques such as inequality selection can be incorporated to contain the power loss that results from testing the inequalities; see Chernozhukov et al. [2013b] for more details.
Another challenge is the determination of the batch size B in (3.6). In our simulation studies of Section 4 we took B = 3, since we believe a batch size of 3 should be enough to capture the dependence among the 1-dependent summands.
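The averaged estimators and the batching step are straightforward to code. The following is a hypothetical numpy sketch of the degree-3 summands Y_{i,(p,q,r)} and of grouping dependent summands into non-overlapping batches of size B = 3; the function names are ours, and the paper's formulas (3.3), (3.6) and (3.8) remain the authoritative definitions:

```python
import numpy as np

def y_summands(X, p, q, r):
    """Summands Y_{i,(p,q,r)} = X_{p,i} X_{q,i} X_{p,i+1} X_{r,i+1}
    X_{q,i+2} X_{r,i+2} for i = 1, ..., n-2; each has mean
    sigma_pq * sigma_pr * sigma_qr when the columns of X are i.i.d."""
    n = X.shape[1]
    i = np.arange(n - 2)
    return (X[p, i] * X[q, i]
            * X[p, i + 1] * X[r, i + 1]
            * X[q, i + 2] * X[r, i + 2])

def averaged_estimate(X, p, q, r):
    # Unbiased for sigma_pq * sigma_pr * sigma_qr; negating it gives the
    # estimate of the "less than" version -sigma_pq*sigma_pr*sigma_qr <= 0.
    return y_summands(X, p, q, r).mean()

def batch(summands, B=3):
    """Group dependent summands into non-overlapping batches of size B
    and return the batch means (the inputs to a multiplier bootstrap)."""
    nb = len(summands) // B
    return summands[:nb * B].reshape(nb, B).mean(axis=1)
```

Because consecutive summands share samples, the batch means, rather than the raw summands, would be multiplied by the bootstrap weights when calibrating critical values.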
Batch size determination has been widely studied in the time series literature for low dimensional problems [Bühlmann, 2002, Hall et al., 1995, Lahiri, 2003]. To the best of our knowledge, in high dimensions this is still a wide open problem. Theoretical research on it is far beyond the scope of our current work.

Acknowledgments

Part of this work was undertaken while Dennis Leung was a postdoc at the Chinese University of Hong Kong, and he would like to thank Professor Qi-Man Shao for some helpful discussions during that time.

References

T. W. Anderson. An introduction to multivariate statistical analysis. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, third edition, 2003.

Paul A. Bekker and Jan de Leeuw. The rank of reduced dispersion matrices. Psychometrika, 52(1):125–135, 1987.

Kenneth A. Bollen and Kwok-fai Ting. Confirmatory tetrad analysis. Sociological Methodology, pages 147–175, 1993.

Peter Bühlmann. Bootstraps for time series. Statistical Science, pages 52–72, 2002.

Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist., 41(6):2786–2819, 2013a.

Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Testing many moment inequalities. arXiv preprint arXiv:1312.7614, 2013b.

Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, and Alan S. Willsky. Learning latent tree graphical models. J. Mach. Learn. Res., 12:1771–1812, 2011.

Mathias Drton. Likelihood ratio tests and singularities. Ann. Statist., 37(2):979–1012, 2009.

Mathias Drton and Han Xiao. Wald tests of singular hypotheses. Bernoulli, 22(1):38–59, 2016.

Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Algebraic factor analysis: tetrads, pentads and beyond. Probab. Theory Related Fields, 138(3-4):463–493, 2007.

Mathias Drton, Hélène Massam, and Ingram Olkin. Moments of minors of Wishart matrices. Ann. Statist., 36(5):2261–2283, 2008.

Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Lectures on algebraic statistics, volume 39 of Oberwolfach Seminars. Birkhäuser Verlag, Basel, 2009.

Peter Hall, Joel L. Horowitz, and Bing-Yi Jing. On blocking rules for the bootstrap with dependent data. Biometrika, 82(3):561–574, 1995.

S. N. Lahiri. Resampling methods for dependent data. Springer Series in Statistics. Springer-Verlag, New York, 2003.

Raphaël Mourad, Christine Sinoquet, Nevin L. Zhang, Tengfei Liu, and Philippe Leray. A survey on latent tree models and applications. J. Artificial Intelligence Res., 47:157–203, 2013.

Charles Semple and Mike Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, Oxford, 2003.

N. Shiers, P. Zwiernik, J. A. D. Aston, and J. Q. Smith. The correlation space of Gaussian latent tree models and model selection without fitting. Biometrika, 103(3):531–545, 2016.

C. Spearman. "General intelligence," objectively determined and measured. The American Journal of Psychology, 15(2):201–292, 1904.

John Wishart. Sampling errors in the theory of two factors. British Journal of Psychology, 19(2):180–187, 1928.

Danna Zhang and Wei Biao Wu. Gaussian approximation for high dimensional time series. Ann. Statist., 45(5):1895–1919, 2017.