{"title": "Identifiability and Unmixing of Latent Parse Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 1511, "page_last": 1519, "abstract": "This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models.", "full_text": "Identi\ufb01ability and Unmixing of Latent Parse Trees\n\nDaniel Hsu\n\nMicrosoft Research\n\nSham M. Kakade\nMicrosoft Research\n\nPercy Liang\n\nStanford University\n\nAbstract\n\nThis paper explores unsupervised learning of parsing models along two directions.\nFirst, which models are identi\ufb01able from in\ufb01nite data? 
We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models.\nSecond, for identifiable models, how do we estimate the parameters efficiently?\nEM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences.\nWe develop a strategy, unmixing, which deals with this additional complexity for\nrestricted classes of parsing models.\n\n1 Introduction\n\nGenerative parsing models, which define joint distributions over sentences and their parse trees, are\none of the core techniques in computational linguistics. We are interested in the unsupervised learning of these models [2\u20136], where the goal is to estimate the model parameters given only examples\nof sentences. Unsupervised learning can fail for a number of reasons [7]: model misspecification,\nnon-identifiability, estimation error, and computation error. In this paper, we delve into two of these\nissues: identifiability and computation. In doing so, we confront a central challenge of parsing\nmodels: the topology of the parse tree is unobserved and varies across sentences. This is in\ncontrast to standard phylogenetic models [8] and other latent tree models for which there is a single\nfixed global tree across all examples [9].\nA model is identifiable if there is enough information in the data to pinpoint the parameters (up\nto some trivial equivalence class); establishing the identifiability of a model is often a highly non-trivial task. A classic result of Kruskal [10] has been employed to prove the identifiability of a wide\nclass of latent variable models, including hidden Markov models and certain restricted mixtures of\nlatent tree models [11\u201313]. 
However, these techniques cannot be directly applied to parsing models\nsince the tree topology varies over an exponential set of possible topologies. Instead, we turn to\ntechniques from algebraic geometry [14\u201317]; we show that a simple numerical procedure can be\nused to check identifiability for a wide class of models in NLP. Using this tool, we discover that\nprobabilistic context-free grammars (PCFGs) are non-identifiable, but that simpler PCFG variants\nand dependency models are identifiable.\nThe most common way to estimate unsupervised parsing models is by using local techniques such\nas EM [18] or MCMC sampling [19], but these methods can suffer from local optima and slow\nmixing. Meanwhile, recent work [1,20\u201323] has shown that spectral methods can be used to estimate\nmixture models and HMMs with provable guarantees. These techniques express low-order moments\nof the observable distribution as a product of matrix parameters and use eigenvalue decomposition\nto recover these matrices. However, these methods are not directly applicable to parsing models\nbecause the tree topology again varies non-trivially. To address this, we propose a new technique,\nunmixing. The main idea is to express moments of the observable distribution as a mixture over\nthe possible topologies. For restricted parsing models, the moments for a fixed tree structure can\nbe \u201cunmixed\u201d, thereby reducing the problem to one with a fixed topology, which can be tackled\nusing standard techniques [1]. Importantly, our unmixing technique does not require the training\nsentences be annotated with the tree topologies a priori, in contrast to recent extensions of [21] to\nlearning PCFGs [24] and dependency trees [25, 26], which work on a fixed topology.\n\nE-mail: dahsu@microsoft.com, skakade@microsoft.com, pliang@cs.stanford.edu\n\nFigure 1: The two constituency trees and seven dependency trees over L = 3 words, x1, x2, x3. (a)\nA constituency tree consists of a hierarchical grouping of the words with a latent state zv for each\nnode v. (b) A dependency tree consists of a collection of directed edges between the words. In both\ncases, we have labeled each edge from i to j with the parameters used to generate the state of node\nj given i.\n\n
2 Notation\n\nFor a positive integer n, define [n] def= {1, . . . , n} and \u27e8n\u27e9 = {e1, . . . , en}, where ei is the vector\nwhich is 1 in component i and 0 elsewhere. For integers a, b \u2208 [n], let a \u2297_n b = (a \u2212 1)n + b \u2208 [n^2] be\nthe integer encoding of the pair (a, b). For a pair of matrices, A, B \u2208 R^{m\u00d7n}, define the columnwise\ntensor product A \u2297_C B \u2208 R^{m^2\u00d7n} to be such that (A \u2297_C B)_{(i1 \u2297_m i2) j} = A_{i1 j} B_{i2 j}. For a matrix\nA \u2208 R^{m\u00d7n}, let A\u2020 denote the Moore-Penrose pseudoinverse.\n\n3 Parsing models\n\nA sentence is a sequence of L words, x = (x1, . . . , xL), where each word xi \u2208 \u27e8d\u27e9 is one of\nd possible word types. A (generative) parsing model defines a joint distribution P\u03b8(x, z) over a\nsentence x and its parse tree z (to be made precise later), where \u03b8 are the model parameters (a\ncollection of multinomials). Each parse tree z has a topology Topology(z) \u2208 Topologies, which\nis both unobserved and varying across sentences. 
The learning problem is to recover \u03b8 given only\nsamples of x.\nTwo important classes of models of natural language syntax are constituency models, which represent a hierarchical grouping and labeling of the phrases of a sentence (e.g., Figure 1(a)), and\ndependency models, which represent pairwise relationships between the words of a sentence (e.g.,\nFigure 1(b)).\n\n[Figure 1 panels: (a) Constituency (PCFG-IE), Topology(z) = 1, 2; (b) Dependency (DEP-IE), Topology(z) = 1, . . . , 7.]\n\n3.1 Constituency models\n\nA constituency tree z = (V, s) consists of a set of nodes V and a collection of hidden states s =\n{s_v}_{v\u2208V}. Each state s_v \u2208 \u27e8k\u27e9 represents one of k possible syntactic categories. Each node v has\nthe form [i : j] for 0 \u2264 i < j \u2264 L corresponding to the phrase between positions i and j of the\nsentence. These nodes form a binary tree as follows: the root node is [0 : L] \u2208 V, and for each node\n[i : j] \u2208 V with j \u2212 i > 1, there exists a unique m with i < m < j defining the two children nodes\n[i : m] \u2208 V and [m : j] \u2208 V. Let Topology(z) be an integer encoding of V.\n\nPCFG. Perhaps the most well-known constituency parsing model is the probabilistic context-free\ngrammar (PCFG). 
The parameters of a PCFG are \u03b8 = (\u03c0, B, O), where \u03c0 \u2208 R^k specifies the initial\nstate distribution, B \u2208 R^{k^2\u00d7k} specifies the binary production distributions, and O \u2208 R^{d\u00d7k} specifies\nthe emission distributions.\nA PCFG corresponds to the following generative process (see Figure 1(a) for an example): choose a\ntopology Topology(z) uniformly at random; generate the state of the root node using \u03c0; recursively\ngenerate pairs of children states given their parents using B; and finally generate words xi given\ntheir parents using O. This generative process defines a joint probability over a sentence x and a\nparse tree z:\n\nP\u03b8(x, z) = |Topologies|^{\u22121} \u03c0^\u22a4 s_{[0:L]} \u220f_{[i:m],[m:j]\u2208V} (s_{[i:m]} \u2297_k s_{[m:j]})^\u22a4 B s_{[i:j]} \u220f_{i=1}^{L} x_i^\u22a4 O s_{[i\u22121:i]}.    (1)\n\nWe will also consider two variants of the PCFG with additional restrictions:\n\nPCFG-I. The left and right children states are generated independently; that is, we have the following factorization: B = T1 \u2297_C T2 for some T1, T2 \u2208 R^{k\u00d7k}.\nPCFG-IE. The left and the right productions are independent and equal: B = T \u2297_C T.\n\n3.2 Dependency tree models\n\nIn contrast to constituency trees, which posit internal nodes with latent states, dependency trees\nconnect the words directly. A dependency tree z is a set of directed edges (i, j), where i, j \u2208 [L]\nare distinct positions in the sentence. Let Root(z) denote the position of the root node of z. We\nconsider only projective dependency trees [27]: z is projective if for every path from i to j to k in\nz, we have that j and k are on the same side of i (that is, j \u2212 i and k \u2212 i have the same sign). Let\nTopology(z) be an integer encoding of z.\n\nDEP-I. We consider the simple dependency model of [4]. 
The parameters of this model are \u03b8 =\n(\u03c0, A\u2199, A\u2198), where \u03c0 \u2208 R^d is the initial word distribution and A\u2199, A\u2198 \u2208 R^{d\u00d7d} are the left and\nright argument distributions. The generative process is as follows: choose a topology Topology(z)\nuniformly at random, generate the root word using \u03c0, and recursively generate argument words to\nthe left and to the right given the parent word using A\u2199 and A\u2198, respectively. The corresponding joint\nprobability distribution is as follows:\n\nP\u03b8(x, z) = |Topologies|^{\u22121} \u03c0^\u22a4 x_{Root(z)} \u220f_{(i,j)\u2208z} x_j^\u22a4 A_{dir(i,j)} x_i,    (2)\n\nwhere dir(i, j) = \u2199 if j < i and \u2198 if j > i.\nWe also consider the following two variants:\n\nDEP-IE. The left and right argument distributions are equal: A = A\u2199 = A\u2198.\n\nDEP-IES. A = A\u2199 = A\u2198 and \u03c0 is the stationary distribution of A (that is, \u03c0 = A\u03c0).\n\nFootnote: Usually a PCFG induces a topology via a state-dependent probability of choosing a binary production\nversus an emission. Our model is a restriction which corresponds to a state-independent probability.\n\n4 Identifiability\n\nOur goal is to estimate model parameters \u03b80 \u2208 \u0398 given only access to sentences x \u223c P\u03b80. Specifically, suppose we have an observation function \u03c6(x) \u2208 R^m, which is the only lens through which an\nalgorithm can view the data. 
We ask a basic question: in the limit of infinite data, is it information-theoretically possible to identify \u03b80 from the observed moments \u00b5(\u03b80) def= E\u03b80[\u03c6(x)]?\nTo be more precise, define the equivalence class of \u03b80 to be the set of parameters \u03b8 that yield the\nsame observed moments:\n\nS\u0398(\u03b80) = {\u03b8 \u2208 \u0398 : \u00b5(\u03b8) = \u00b5(\u03b80)}.    (3)\n\nIt is impossible for an algorithm to distinguish among the elements of S\u0398(\u03b80). Therefore, one might\nwant to ensure that |S\u0398(\u03b80)| = 1 for all \u03b80 \u2208 \u0398. However, this requirement is too strong for two reasons. First, models often have natural symmetries; e.g., the k states of any PCFG can be permuted\nwithout changing \u00b5(\u03b8), so |S\u0398(\u03b80)| \u2265 k!. Second, |S\u0398(\u03b80)| = \u221e for some pathological \u03b80\u2019s; e.g.,\nPCFGs where all states have the same emission distribution O are indistinguishable regardless of\nthe production distributions B. The following definition of identifiability accommodates these two\nexceptional cases:\nDefinition 1 (Identifiability). A model family with parameter space \u0398 is (globally) identifiable from\n\u03c6 if there exists a measure zero set E such that |S\u0398(\u03b80)| is finite for every \u03b80 \u2208 \u0398\\E. It is locally\nidentifiable from \u03c6 if there exists a measure zero set E such that, for every \u03b80 \u2208 \u0398\\E, there exists an\nopen neighborhood N (\u03b80) around \u03b80 such that S\u0398(\u03b80) \u2229 N (\u03b80) = {\u03b80}.\n\nExample of non-identifiability. Consider the DEP-IE model with L = 2 with the full observation\nfunction \u03c6(x) = x1 \u2297 x2. The corresponding observed moments are \u00b5(\u03b8) = 0.5A diag(\u03c0) +\n0.5 diag(\u03c0)A^\u22a4. 
Note that A diag(\u03c0) is an arbitrary d \u00d7 d matrix whose entries sum to 1, which\nhas d^2 \u2212 1 degrees of freedom, whereas \u00b5(\u03b8) is a symmetric matrix whose entries sum to 1, which\nhas (d+1 choose 2) \u2212 1 degrees of freedom. Therefore, S\u0398(\u03b8) has dimension (d choose 2) and therefore the model is\nnon-identifiable.\n\nParameter counting. It is important to compute the degrees of freedom correctly; simple parameter counting is insufficient. For example, consider the PCFG-IE model with L = 2. The observed\nmoments with respect to \u03c6(x) = x1 \u2297 x2 form a d \u00d7 d matrix, which places d^2 constraints on the\nk^2 + (d \u2212 1)k parameters. When d \u2265 2k, there are more constraints than parameters, but the PCFG-IE model with L = 2 is actually non-identifiable (as we will see later). The issue here is that the\nnumber of constraints does not reveal the fact that some of these constraints are redundant.\n\n4.1 Observation functions\n\nAn observation function \u03c6(x) and its associated observed moments \u00b5(\u03b80) = E\u03b80[\u03c6(x)] reveal\naspects of the distribution P\u03b80(x). For example, \u03c6(x) = x1 would only reveal the marginal distribution of the first word, whereas \u03c6(x) = x1 \u2297 \u00b7\u00b7\u00b7 \u2297 xL reveals the entire distribution of x. There is a\ntradeoff: Higher-order moments provide more information, but are harder to estimate reliably given\nfinite data, and are also computationally more expensive. In this paper, we consider the following\nintermediate moments:\n\n\u03c612(x) def= x1 \u2297 x2\n\u03c6123(x) def= x1 \u2297 x2 \u2297 x3\n\u03c6123\u03b7(x) def= (x1 \u2297 x2)(\u03b7^\u22a4 x3)\n\u03c6all(x) def= x1 \u2297 \u00b7\u00b7\u00b7 \u2297 xL\n\u03c6\u2217\u2217(x) def= (xi \u2297 xj : i, j \u2208 [L])\n\u03c6\u2217\u2217\u2217(x) def= (xi \u2297 xj \u2297 xk : i, j, k \u2208 [L])\n\u03c6\u2217\u2217\u2217\u03b7(x) def= ((xi \u2297 xj)(\u03b7^\u22a4 xk) : i, j, k \u2208 [L])\n\nAbove, \u03b7 \u2208 R^d denotes a unit vector in R^d (e.g., e1) which picks out a linear combination of matrix\nslices from a third-order d \u00d7 d \u00d7 d tensor.\n\n4.2 Automatically checking identifiability\n\nOne immediate goal is to determine which models in Section 3 are identifiable from which of the\nobserved moments (Section 4.1). 
A powerful analytic tool that has been successfully applied in\nprevious work is Kruskal\u2019s theorem [10, 11], but (i) it does not immediately apply to models with\nrandom topologies, and (ii) it only gives sufficient conditions for identifiability, and cannot be used to\ndetermine non-identifiability. Furthermore, since it is common practice to explore many different\nmodels for a given problem in rapid succession, we would like to check identifiability quickly and\nreliably. In this section, we develop an automatic procedure to do this.\nTo establish identifiability, let us examine the algebraic structure of S\u0398(\u03b80) for \u03b80 \u2208 \u0398, where we\nassume that the parameter space \u0398 is an open subset of [0, 1]^n. Recall that S\u0398(\u03b80) is defined by the\nmoment constraints \u00b5(\u03b8) = \u00b5(\u03b80). We can write these constraints as h\u03b80(\u03b8) = 0, where\n\nh\u03b80(\u03b8) def= \u00b5(\u03b8) \u2212 \u00b5(\u03b80)\n\nis a vector of m polynomials in \u03b8.\nLet us now compute the number of degrees of freedom of h\u03b80 around \u03b80. The key quantity is\nJ(\u03b8) \u2208 R^{m\u00d7n}, the Jacobian of h\u03b80 at \u03b8 (note that the Jacobian of h\u03b80 does not depend on \u03b80; it\nis precisely the Jacobian of \u00b5). 
This Jacobian criterion is well-established in algebraic geometry,\nand has been adopted in the statistical literature for testing model identifiability and other related\nproperties [14\u201317].\nIntuitively, each row of J(\u03b80) corresponds to a direction of a constraint violation, and thus the row\nspace of J(\u03b80) corresponds to all directions that would take us outside the equivalence class S\u0398(\u03b80).\nIf J(\u03b80) has rank less than n, then there is a direction orthogonal to all the rows along which we\ncan move and still satisfy all the constraints; in other words, |S\u0398(\u03b80)| is infinite, and therefore the\nmodel is non-identifiable. This intuition leads to the following algorithm:\n\nCHECKIDENTIFIABILITY:\n1. Choose a point \u03b8\u0303 \u2208 \u0398 uniformly at random.\n2. Compute the Jacobian matrix J(\u03b8\u0303).\n3. Return \u201cyes\u201d if the rank of J(\u03b8\u0303) = n and \u201cno\u201d otherwise.\n\nThe following theorem asserts the correctness of CHECKIDENTIFIABILITY. It is largely based on\ntechniques in [16], although we have not seen it explicitly stated in this form.\nTheorem 1 (Correctness of CHECKIDENTIFIABILITY). Assume the parameter space \u0398 is a non-empty open connected subset of [0, 1]^n; and the observed moments \u00b5 : R^n \u2192 R^m, with respect to\nobservation function \u03c6, is a polynomial map. Then with probability 1, CHECKIDENTIFIABILITY\nreturns \u201cyes\u201d iff the model family is locally identifiable from \u03c6. Moreover, if it returns \u201cyes\u201d, then\nthere exists E \u2282 \u0398 of measure zero such that the model family with parameter space \u0398 \\ E is\nidentifiable from \u03c6.\n\nThe proof of Theorem 1 is given in Appendix A.\n\n4.3 Implementation of CHECKIDENTIFIABILITY\n\nComputing the Jacobian. 
The rows of J correspond to \u2202E\u03b8[\u03c6j(x)]/\u2202\u03b8 and can be computed efficiently by adapting dynamic programs used in the E-step of an EM algorithm for parsing models.\nThere are two main differences: (i) we must sum over possible values of x in addition to z, and (ii)\nwe are not computing moments, but rather gradients thereof. Specifically, we adapt the CKY algorithm for constituency models and the algorithm of [27] for dependency models. See Appendix C.1\nfor more details.\n\nNumerical issues. Because we implemented CHECKIDENTIFIABILITY on a finite precision machine, the results are subject to numerical precision errors. However, we verified that our numerical\nresults are consistent with various analytically-derived identifiability results (e.g., from [11]).\n\nFootnote: While we initially defined \u03b8 to be a tuple of conditional probability matrices, we will now use its non-redundant vectorized form \u03b8 \u2208 R^n.\n\nModel \\ Observation function\n\nPCFG\n\nDEP-I\n\nPCFG-I / PCFG-IE\n\nDEP-IE / DEP-IES\n\n\u03c6\u2217\u2217\n\n\u03c612\nNo Yes iff L \u2265 4\nNo\n\n\u03c6123e1\n\n\u03c6\u2217\u2217\u2217e1\n\u03c6123\nNo, even from \u03c6all for L \u2208 {3, 4, 5}\nYes iff L \u2265 3\n\nYes iff L \u2265 3\n\nYes iff L \u2265 3\n\n\u03c6\u2217\u2217\u2217\n\nFigure 2: Local identifiability of parsing models. These findings, given by\nCHECKIDENTIFIABILITY, have the semantics from Theorem 1. These were checked for d \u2208\n{2, 3, . . . , 8}, k \u2208 {2, . . . , d} (applies only for PCFG models), L \u2208 {2, 3, . . . , 9}.\n\n4.4 Identifiability of constituency and dependency tree models\n\nWe checked the identifiability status of various constituency and dependency tree models using our\nimplementation of CHECKIDENTIFIABILITY. We focus on the regime where d \u2265 k for PCFGs;\nadditional results for d < k are given in Appendix B.\nThe results are reported in Figure 2. 
First, we found that the PCFG is not identifiable from \u03c6all (and\ntherefore not identifiable from any \u03c6) for L \u2208 {3, 4, 5}; we believe that the same holds for all L. This\nnegative result motivates exploring restricted subclasses of PCFGs, such as PCFG-I and PCFG-IE,\nwhich factorize the binary productions. For these classes, we found that the sentence length L and\nchoice of observation function can influence identifiability: Both models are identifiable for large\nenough L (e.g., L \u2265 3) and with a sufficiently rich observation function (e.g., \u03c6123\u03b7).\nThe dependency models, DEP-I and DEP-IE, were all found to be identifiable for L \u2265 3 from\nsecond-order moments \u03c6\u2217\u2217. The conditions for identifiability are less stringent than their constituency counterparts (PCFG-I and PCFG-IE), which is natural since dependency models are simpler without the latent states. Note that in all identifiable models, second-order moments suffice to\ndetermine the distribution; this is good news because low-order moments are easier to estimate.\n\n5 Unmixing algorithms\n\nHaving established which parsing models are identifiable, we now turn to parameter estimation for\nthese models. We will consider algorithms based on moment matching, that is, those that try to find a \u03b8\nsatisfying \u00b5(\u03b8) = u for some u. Typically, u is an empirical estimate of \u00b5(\u03b80) = E\u03b80[\u03c6(x)] based\non samples x \u223c P\u03b80.\nIn general, solving \u00b5(\u03b8) = u corresponds to finding solutions to systems of multivariate polynomials, which is NP-hard [28]. 
However, \u00b5(\u03b8) often has additional structure which we can exploit.\nFor instance, for an HMM, the sliced third-order moments \u00b5123\u03b7(\u03b8) can be written as a product of\nparameter matrices in \u03b8, and each matrix can be recovered by decomposing the product [1].\nFor parsing models, the challenge is that the topology is random, so the moments are not a single product, but a mixture over products. To deal with this complication, we propose a new technique, which\nwe call unmixing: We \u201cunmix\u201d the products from the mixtures, essentially reducing the problem to\none with a fixed topology.\nWe will first present the general idea of unmixing (Section 5.1) and then apply it to the PCFG-IE\nmodel (Section 5.2) and the DEP-IES model (Section 5.3).\n\nFootnote: Note that the restricted PCFG subclasses occupy measure zero subsets of the PCFG parameter space, which is expected\ngiven the non-identifiability of the general PCFG.\n\nFootnote: We will develop our algorithms assuming true moments (u = \u00b5(\u03b80)). The empirical moments converge\nto the true moments at Op(n^{\u22121/2}), and matrix perturbation arguments (e.g., [1]) can be used to derive sample\ncomplexity arguments for the parameter error.\n\n5.1 General case\n\nWe assume the observation function \u03c6(x) consists of a collection of observation matrices\n{\u03c6o(x)}_{o\u2208O} (e.g., for o = (i, j), \u03c6o(x) = xi \u2297 xj). Given an observation matrix \u03c6o(x) and a\ntopology t \u2208 Topologies, consider the mapping that computes the observed moment conditioned on\nthat topology: \u03a8o,t(\u03b8) = E\u03b8[\u03c6o(x) | Topology = t]. As we range o over O and t over Topologies,\nwe will encounter a finite number of such mappings. 
We call these mappings compound parameters,\ndenoted {\u03a8p}_{p\u2208P}.\nNow write the observed moments as a weighted sum:\n\n\u00b5o(\u03b8) = \u2211_{p\u2208P} Mop \u03a8p for all o \u2208 O, with Mop def= P(\u03a8o,Topology = \u03a8p),    (4)\n\nwhere we have defined Mop to be the probability mass over tree topologies that yield compound\nparameter \u03a8p. We let {Mop}_{o\u2208O,p\u2208P} be the mixing matrix. Note that (4) defines a system of\nequations \u00b5 = M \u03a8, where the variables are the compound parameters and the constraints are the\nobserved moments. In a sense, we have replaced the original system of polynomial equations (in \u03b8)\nwith a system of linear equations (in \u03a8).\nThe key to the utility of this technique is that the number of compound parameters can be polynomial\nin L even when the number of possible topologies is exponential in L. Previous analytic techniques\n[13] based on Kruskal\u2019s theorem [10] cannot be applied here because the possible topologies are too\nmany and too varied.\nNote that the mixing equation \u00b5 = M \u03a8 holds for each sentence length L, but many compound parameters p appear in the equations of multiple L. Therefore, we can combine the equations across all\nobserved sentence lengths, yielding a more constrained system than if we considered the equations\nof each L separately.\nThe following proposition shows how we can recover \u03b8 by unmixing the observed moments \u00b5:\nProposition 1 (Unmixing). Suppose that there exists an efficient base algorithm to recover \u03b8 from\nsome subset of compound parameters {\u03a8p(\u03b8) : p \u2208 P0}, and that e_p^\u22a4 is in the row space of M for\neach p \u2208 P0. Then we can recover \u03b8 as follows:\n\nUNMIX(\u00b5):\n1. Compute the mixing matrix M (4).\n2. Retrieve the compound parameters \u03a8p(\u03b8) = (M\u2020\u00b5)p for each p \u2208 P0.\n3. 
Call the base algorithm on {\u03a8p(\u03b8) : p \u2208 P0} to obtain \u03b8.\n\nFor all our parsing models, M can be computed efficiently using dynamic programming (Appendix C.2). Note that M is data-independent, so this computation can be done once in advance.\n\n5.2 Application to the PCFG-IE model\n\nAs a concrete example, consider the PCFG-IE model over L = 3 words. Write A = OT. For\nany \u03b7 \u2208 R^d, we can express the observed moments as a sum over the two possible topologies in\nFigure 1(a):\n\n\u00b5123\u03b7 def= E[x1 \u2297 x2 (\u03b7^\u22a4 x3)] = 0.5\u03a81;\u03b7 + 0.5\u03a82;\u03b7,    \u03a81;\u03b7 def= A diag(T diag(\u03c0)A^\u22a4\u03b7)A^\u22a4,\n\u00b5132\u03b7 def= E[x1 \u2297 x3 (\u03b7^\u22a4 x2)] = 0.5\u03a83;\u03b7 + 0.5\u03a82;\u03b7,    \u03a82;\u03b7 def= A diag(\u03c0)T^\u22a4 diag(A^\u22a4\u03b7)A^\u22a4,\n\u00b5231\u03b7 def= E[x2 \u2297 x3 (\u03b7^\u22a4 x1)] = 0.5\u03a83;\u03b7 + 0.5\u03a81;\u03b7,    \u03a83;\u03b7 def= A diag(A^\u22a4\u03b7)T diag(\u03c0)A^\u22a4,\n\nor compactly in matrix form:\n\n[\u00b5123\u03b7; \u00b5132\u03b7; \u00b5231\u03b7] = [0.5I, 0.5I, 0; 0, 0.5I, 0.5I; 0.5I, 0, 0.5I] [\u03a81;\u03b7; \u03a82;\u03b7; \u03a83;\u03b7],\n\nwhere semicolons separate block rows, the left-hand side stacks the observed moments \u00b5\u03b7, the middle factor is the mixing matrix M, and the right-hand side stacks the compound parameters \u03a8\u03b7.\nLet us observe \u00b5\u03b7 at two different values of \u03b7, say at \u03b7 = 1 and \u03b7 = \u03c4 for some random \u03c4. 
Since\nthe mixing matrix M is invertible, we can obtain the compound parameters \u03a82;1 = (M^{\u22121}\u00b51)_2 and\n\u03a82;\u03c4 = (M^{\u22121}\u00b5\u03c4)_2.\n\nNow we will recover \u03b8 from \u03a82;1 and \u03a82;\u03c4 by first extracting A = OT via an eigenvalue decomposition, and then recovering \u03c0, T, and O in turn (all up to the same unknown permutation) via\nelementary matrix operations.\nFor the first step, we will use the following tool (adapted from Algorithm A of [1]), which allows us\nto decompose two related matrix products:\nLemma 1 (Spectral decomposition). Let M1, M2 \u2208 R^{d\u00d7k} have full column rank and D be a diagonal matrix with distinct diagonal entries. Suppose we observe X = M1 M2^\u22a4 and Y = M1 D M2^\u22a4.\nThen DECOMPOSE(X, Y) recovers M1 up to a permutation and scaling of the columns.\n\nDECOMPOSE(X, Y):\n1. Find U1, U2 \u2208 R^{d\u00d7k} such that range(U1) = range(X) and range(U2) = range(X^\u22a4).\n2. Perform an eigenvalue decomposition of (U1^\u22a4 Y U2)(U1^\u22a4 X U2)^{\u22121} = V S V^{\u22121}.\n3. Return (U1^\u22a4)\u2020 V.\n\nFirst, run DECOMPOSE(X = \u03a82;1^\u22a4, Y = \u03a82;\u03c4^\u22a4) (Lemma 1), which corresponds to M1 = A and\nM2 = A diag(\u03c0)T^\u22a4. This produces A\u03a0S for some permutation matrix \u03a0 and diagonal scaling S.\nSince we know that the columns of A sum to one, we can identify A\u03a0.\nTo recover the initial distribution \u03c0 (up to permutation), take \u03a82;1 1 = A\u03c0 (where 1 is the all-ones vector) and left-multiply by\n(A\u03a0)\u2020 to get \u03a0^{\u22121}\u03c0. For T, put the entries of \u03c0 in a diagonal matrix: \u03a0^{\u22121} diag(\u03c0)\u03a0. Take \u03a82;1^\u22a4 =\nAT diag(\u03c0)A^\u22a4 and multiply by (A\u03a0)\u2020 on the left and ((A\u03a0)^\u22a4)\u2020(\u03a0^{\u22121} diag(\u03c0)\u03a0)^{\u22121} on the right,\nwhich yields \u03a0^{\u22121}T\u03a0. (Note that \u03a0 is orthogonal, so \u03a0^{\u22121} = \u03a0^\u22a4.) 
Finally, multiply A\u03a0 = OT\u03a0\nand (\u03a0^{\u22121}T\u03a0)^{\u22121}, which yields O\u03a0.\nThe above algorithm identifies the PCFG-IE from only length 3 sentences. To exploit sentences of\ndifferent lengths, we can compute a mixing matrix M which includes constraints from sentences\nof length 1 \u2264 L \u2264 Lmax up to some upper bound Lmax. For example, Lmax = 10 results in a\n990 \u00d7 2376 mixing matrix. We can retrieve the same compound parameters (\u03a82;1 and \u03a82;\u03c4) from\nthe pseudoinverse of M and proceed as before.\n\n5.3 Application to the DEP-IES model\n\nWe now turn to the DEP-IES model over L = 3 words. Our goal is to recover the parameters\n\u03b8 = (\u03c0, A). Let D = diag(\u03c0) = diag(A\u03c0), where the second equality is due to stationarity of \u03c0.\n\n\u00b51 def= E[x1] = \u03c0,\n\u00b512 def= E[x1 \u2297 x2] = 7^{\u22121}(DA^\u22a4 + DA^\u22a4 + DA^\u22a4A^\u22a4 + AD + ADA^\u22a4 + AD + DA^\u22a4),\n\u00b513 def= E[x1 \u2297 x3] = 7^{\u22121}(DA^\u22a4 + DA^\u22a4A^\u22a4 + DA^\u22a4 + ADA^\u22a4 + AD + AAD + AD),\n\u00b5\u030312 def= E\u0303[x1 \u2297 x2] = 2^{\u22121}(DA^\u22a4 + AD),\n\nwhere E\u0303[\u00b7] is taken with respect to length 2 sentences. Having recovered \u03c0 from \u00b51, it remains to\nrecover A. By selectively combining the moments above, we can compute AA + A = [7(\u00b513 \u2212\n\u00b512) + 2\u00b5\u030312] diag(\u00b51)^{\u22121}. Assuming A is in generic position, it is diagonalizable: A = Q\u039bQ^{\u22121} for\nsome diagonal matrix \u039b = diag(\u03bb1, . . . , \u03bbd), possibly with complex entries. Therefore, we can\nrecover \u039b^2 + \u039b = Q^{\u22121}(AA + A)Q. Since \u039b is diagonal, we simply have d independent quadratic\nequations in \u03bbi, which can be solved in closed form. 
After obtaining \u039b, we retrieve A = Q\u039bQ^{\u22121}.\n\n6 Discussion\n\nIn this work, we have shed some light on the identifiability of standard generative parsing models using our numerical identifiability checker. Given the ease with which this checker can be applied, we\nbelieve it should be a useful tool for analyzing more sophisticated models [6], as well as developing\nnew ones which are expressive yet identifiable.\nThere is still a large gap between showing identifiability and developing explicit algorithms. We\nhave made some progress on closing it with our unmixing technique, which can deal with models\nwhere the tree topology varies non-trivially.\n\nReferences\n[1] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012.\n[2] F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL, 1992.\n[3] G. Carroll and E. Charniak. Two experiments on learning probabilistic dependency grammars from corpora. In Workshop Notes for Statistically-Based NLP Techniques, AAAI, pages 1\u201313, 1992.\n[4] M. A. Paskin. Grammatical bigrams. In NIPS, 2002.\n[5] D. Klein and C. D. Manning. Conditional structure versus conditional estimation in NLP models. In EMNLP, 2002.\n[6] D. Klein and C. D. Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL, 2004.\n[7] P. Liang and D. Klein. Analyzing the errors of unsupervised learning. In HLT/ACL, 2008.\n[8] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137:51\u201373, 1996.\n[9] A. Anandkumar, K. Chaudhuri, D. Hsu, S. M. Kakade, L. Song, and T. Zhang. Spectral methods for learning multivariate latent tree structure. In NIPS, 2011.\n[10] J. B. Kruskal. 
Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18:95\u2013138, 1977.\n[11] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37:3099\u20133132, 2009.\n[12] E. S. Allman, S. Petrovi\u0107, J. A. Rhodes, and S. Sullivant. Identifiability of 2-tree mixtures for group-based models. Transactions on Computational Biology and Bioinformatics, 8:710\u2013722, 2011.\n[13] J. A. Rhodes and S. Sullivant. Identifiability of large phylogenetic mixture models. Bulletin of Mathematical Biology, 74(1):212\u2013231, 2012.\n[14] T. J. Rothenberg. Identification in parametric models. Econometrica, 39:577\u2013591, 1971.\n[15] L. A. Goodman. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2):215\u2013231, 1974.\n[16] D. Bamber and J. P. H. van Santen. How many parameters can a model have and still be testable? Journal of Mathematical Psychology, 29:443\u2013473, 1985.\n[17] D. Geiger, D. Heckerman, H. King, and C. Meek. Stratified exponential families: graphical models and model selection. Annals of Statistics, 29:505\u2013529, 2001.\n[18] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35\u201356, 1990.\n[19] M. Johnson, T. Griffiths, and S. Goldwater. Bayesian inference for PCFGs via Markov chain Monte Carlo. In HLT/NAACL, 2007.\n[20] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583\u2013614, 2006.\n[21] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.\n[22] S. M. Siddiqi, B. Boots, and G. J. Gordon. 
Reduced-rank hidden Markov models. In AISTATS, 2010.\n[23] A. Parikh, L. Song, and E. P. Xing. A spectral algorithm for latent tree graphical models. In ICML, 2011.\n[24] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Spectral learning of latent-variable PCFGs. In ACL, 2012.\n[25] F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. Spectral learning for non-deterministic dependency parsing. In EACL, 2012.\n[26] P. Dhillon, J. Rodu, M. Collins, D. P. Foster, and L. Ungar. Spectral dependency parsing with latent variables. In EMNLP-CoNLL, 2012.\n[27] J. Eisner. Three new probabilistic models for dependency parsing: An exploration. In COLING, 1996.\n[28] S. Sahni. Computationally related problems. SIAM Journal on Computing, 3:262\u2013279, 1974.\n[29] J. Eisner. Bilexical grammars and their cubic-time parsing algorithms. In Advances in Probabilistic and Other Parsing Technologies, pages 29\u201362, 2000.\n", "award": [], "sourceid": 718, "authors": [{"given_name": "Daniel", "family_name": "Hsu", "institution": ""}, {"given_name": "Sham", "family_name": "Kakade", "institution": ""}, {"given_name": "Percy", "family_name": "Liang", "institution": ""}]}