{"title": "Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs", "book": "Advances in Neural Information Processing Systems", "page_first": 2519, "page_last": 2527, "abstract": "We describe an approach to speed-up inference with latent variable PCFGs, which have been shown to  be highly effective for natural language parsing. Our approach is based on a tensor formulation recently introduced for spectral estimation of latent-variable PCFGs coupled with a tensor decomposition algorithm well-known in the multilinear algebra literature.  We also describe an error bound for this approximation, which bounds the difference between the probabilities calculated by the algorithm and the true probabilities that the approximated model gives. Empirical evaluation on real-world natural language parsing data demonstrates a significant speed-up at minimal cost for parsing performance.", "full_text": "Tensor Decomposition for Fast Parsing with\n\nLatent-Variable PCFGs\n\nShay B. Cohen and Michael Collins\n\nDepartment of Computer Science\n\nColumbia University\nNew York, NY 10027\n\nscohen,mcollins@cs.columbia.edu\n\nAbstract\n\nWe describe an approach to speed-up inference with latent-variable PCFGs, which\nhave been shown to be highly effective for natural language parsing. Our approach\nis based on a tensor formulation recently introduced for spectral estimation of\nlatent-variable PCFGs coupled with a tensor decomposition algorithm well-known\nin the multilinear algebra literature. We also describe an error bound for this\napproximation, which gives guarantees showing that if the underlying tensors are\nwell approximated, then the probability distribution over trees will also be well\napproximated. 
Empirical evaluation on real-world natural language parsing data demonstrates a significant speed-up at minimal cost for parsing performance.

1 Introduction

Latent variable models have shown great success in various fields, including computational linguistics and machine learning. In computational linguistics, for example, latent-variable models are widely used for natural language parsing using models called latent-variable PCFGs (L-PCFGs; [14]).

The mainstay for estimation of L-PCFGs has been the expectation-maximization algorithm [14, 16], though other algorithms, such as spectral algorithms, have been devised [5]. A by-product of the spectral algorithm presented in [5] is a tensor formulation for computing the inside-outside probabilities of an L-PCFG. Tensor products (or matrix-vector products, in certain cases) are used as the basic operation for marginalization over the latent annotations of the L-PCFG.

The computational complexity of the tensor formulation (or of plain CKY, for that matter) is cubic in the number of latent states in the L-PCFG. This multiplicative factor can be prohibitive for a large number of hidden states; various heuristics are used in practice to avoid this problem [16].

In this paper, we show that tensor decomposition can be used to significantly speed up parsing with L-PCFGs. Our approach also comes with a theoretical guarantee: given the accuracy of the tensor decomposition, one can compute how accurate the approximate parser is.

The rest of this paper is organized as follows. We give notation and background in §2–3, and then present the main approach in §4. We describe experimental results in §5 and conclude in §6.

2 Notation

Given a matrix A or a vector v, we write A^T or v^T for the associated transpose. For any integer n ≥ 1, we use [n] to denote the set {1, 2, ..., n}. 
We will make use of tensors of rank 3:¹

¹All PCFGs in this paper are assumed to be in Chomsky normal form. Our approach generalizes to arbitrary PCFGs, which require tensors of higher rank.

Definition 1. A tensor C ∈ R^{m×m×m} is a set of m³ parameters C_{i,j,k} for i, j, k ∈ [m]. Given a tensor C, and vectors y1 ∈ R^m and y2 ∈ R^m, we define C(y1, y2) to be the m-dimensional row vector with components

[C(y1, y2)]_i = Σ_{j∈[m],k∈[m]} C_{i,j,k} y1_j y2_k.

Hence C can be interpreted as a function C : R^m × R^m → R^{1×m} that maps vectors y1 and y2 to a row vector C(y1, y2) ∈ R^{1×m}. In addition, for any tensor C ∈ R^{m×m×m} we define the function C_{(1,2)} : R^m × R^m → R^{m×1} as

[C_{(1,2)}(y1, y2)]_k = Σ_{i∈[m],j∈[m]} C_{i,j,k} y1_i y2_j.

Similarly, for any tensor C we define C_{(1,3)} : R^m × R^m → R^{m×1} as

[C_{(1,3)}(y1, y2)]_j = Σ_{i∈[m],k∈[m]} C_{i,j,k} y1_i y2_k.

Note that C_{(1,2)}(y1, y2) and C_{(1,3)}(y1, y2) are both column vectors.

For two vectors x ∈ R^m and y ∈ R^m we denote by x ⊙ y ∈ R^m the Hadamard product of x and y, i.e. [x ⊙ y]_i = x_i y_i. Finally, for vectors x, y, z ∈ R^m, xy^T z^T is the tensor D ∈ R^{m×m×m} where D_{i,j,k} = x_i y_j z_k (this is analogous to the outer product: [xy^T]_{i,j} = x_i y_j).

3 Latent-Variable Parsing

In this section we describe latent-variable PCFGs and their parsing algorithms.

3.1 Latent-Variable PCFGs

This section gives a definition of the L-PCFG formalism used in this paper; we follow the definitions given in [5]. An L-PCFG is a 5-tuple (N, I, P, m, n) where:

• N is the set of non-terminal symbols in the grammar. I ⊂ N is a finite set of in-terminals. P ⊂ N is a finite set of pre-terminals. 
We assume that N = I ∪ P, and I ∩ P = ∅. Hence we have partitioned the set of non-terminals into two subsets.

• [m] is the set of possible hidden states.
• [n] is the set of possible words.
• For all a ∈ I, b ∈ N, c ∈ N, h1, h2, h3 ∈ [m], we have a context-free rule a(h1) → b(h2) c(h3).
• For all a ∈ P, h ∈ [m], x ∈ [n], we have a context-free rule a(h) → x.

Note that for any binary rule, a → b c, it holds that a ∈ I, and for any unary rule a → x, it holds that a ∈ P.

The set of "skeletal rules" is defined as R = {a → b c : a ∈ I, b ∈ N, c ∈ N}. The parameters of the model are as follows:

• For each a → b c ∈ R, and h1, h2, h3 ∈ [m], we have a parameter t(a → b c, h2, h3 | h1, a).
• For each a ∈ P, x ∈ [n], and h ∈ [m], we have a parameter q(a → x | h, a).

An L-PCFG corresponds to a regular PCFG with non-terminals annotated with latent states. For each triplet of latent states and a rule a → b c, we have a rule probability p(a(h1) → b(h2) c(h3) | a(h1)) = t(a → b c, h2, h3 | h1, a). Similarly, we also have parameters p(a(h) → x | a(h)) = q(a → x | h, a). In addition, there are initial probabilities of generating a non-terminal with a latent state at the top of the tree, denoted by π(a, h).

L-PCFGs induce distributions over two types of trees: skeletal trees, i.e. trees without values for latent states (these trees are observed in data), and full trees (trees with values for latent states). A skeletal tree consists of a sequence of rules r1 ... rN where ri ∈ R or ri = a → x. See Figure 1 for an example.

We now turn to the problem of computing the probability of a skeletal tree, by marginalizing out the latent states of full trees. Let r1 ... rN be a derivation, and let ai be the non-terminal on the left hand-side of rule ri. 
For any ri = a → b c, define h_i^{(2)} to be the latent state associated with the left child of the rule ri and h_i^{(3)} to be the hidden variable value associated with the right child. The distribution over full trees is then:

p(r1 ... rN, h1 ... hN) = π(a1, h1) × Π_{i: ai ∈ I} t(ri, h_i^{(2)}, h_i^{(3)} | hi, ai) × Π_{i: ai ∈ P} q(ri | hi, ai)

Figure 1: An s-tree with its sequence of rules:
(S1 (NP2 (D3 the) (N4 man)) (VP5 (V6 saw) (P7 him)))
r1 = S → NP VP, r2 = NP → D N, r3 = D → the, r4 = N → man, r5 = VP → V P, r6 = V → saw, r7 = P → him.
(The nodes in the tree are indexed by the derivation order, which is canonicalized as a top-down, left-most derivation.)

Marginalizing out the latent states leads to the distribution over the skeletal tree r1 ... rN:

p(r1 ... rN) = Σ_{h1 ... hN} p(r1 ... rN, h1 ... hN).

It will be important for the rest of this paper to use the matrix form of the parameters of an L-PCFG, as follows:

• For each a → b c ∈ R, we define T^{a→b c} ∈ R^{m×m×m} to be the tensor with values T^{a→b c}_{h1,h2,h3} = t(a → b c, h2, h3 | a, h1).
• For each a ∈ P, x ∈ [n], we define Q^{a→x} ∈ R^{1×m} to be the vector with values q(a → x | h, a) for h = 1, 2, ..., m.
• For each a ∈ I, we define the vector π^a ∈ R^m where [π^a]_h = π(a, h).

Parameter Estimation  Several ways to estimate the parameters T^{a→b c}, Q^{a→x} and π^a have been suggested in the literature. For example, vanilla EM has been used in [14], hierarchical state splitting EM has been suggested in [16], and a spectral algorithm is proposed in [5]. In the rest of the paper, we assume that the parameters for these tensors have been identified, and focus mostly on the problem of inference – i.e.
parsing unseen sentences. The reason for this is two-fold: (a) in real-world applications, training can be done off-line to identify a set of parameters once, and therefore its computational efficiency is of lesser interest; (b) our approach can speed up the inference problem within the EM algorithm, but that speed-up is of lesser interest, because the inference problem in the EM algorithm is linear in the tree size (and not cubic, as in the case of parsing). The reason for this linear complexity is that the skeletal trees are observed during EM training. Still, EM stays cubic in the number of states.

3.2 Tensor Formulation for Inside-Outside

There are several ways to parse a sentence with latent-variable PCFGs. Most of these approaches use an inside-outside algorithm [12], which computes marginals for various non-terminals and spans in the sentence, and then finds a parse tree maximizing a score which is the sum of the marginals of the spans that appear in the tree.

More formally, let μ(a, i, j) = Σ_{τ∈T(x): (a,i,j)∈τ} p(τ) for each non-terminal a ∈ N and each pair (i, j) such that 1 ≤ i ≤ j ≤ N. Here T(x) denotes the set of all possible s-trees for the sentence x, and we write (a, i, j) ∈ τ if non-terminal a spans words x_i ... x_j in the parse tree τ. Then, for a given sentence x = x_1 ... x_N, the parsing algorithm seeks the skeletal tree

arg max_{τ∈T(x)} Σ_{(a,i,j)∈τ} μ(a, i, j).

Given the marginals μ(a, i, j), one can use the dynamic programming algorithm described in [7] in order to find this highest scoring tree. A key question is how to compute the marginals μ(a, i, j) using the inside-outside algorithm. Dynamic programming solutions are available for this problem as well.
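As an illustrative sketch (not part of the paper's implementation; the dimension m and all values below are arbitrary), the three tensor contractions of Definition 1 that these dynamic programs rely on can be written with NumPy:

```python
import numpy as np

m = 4
rng = np.random.default_rng(0)
C = rng.random((m, m, m))            # a rank-3 tensor: m^3 parameters C[i, j, k]
y1, y2 = rng.random(m), rng.random(m)

# [C(y1, y2)]_i = sum_{j,k} C[i, j, k] * y1[j] * y2[k]      (a row vector)
row = np.einsum('ijk,j,k->i', C, y1, y2)

# [C_(1,2)(y1, y2)]_k = sum_{i,j} C[i, j, k] * y1[i] * y2[j]  (a column vector)
col12 = np.einsum('ijk,i,j->k', C, y1, y2)

# [C_(1,3)(y1, y2)]_j = sum_{i,k} C[i, j, k] * y1[i] * y2[k]  (a column vector)
col13 = np.einsum('ijk,i,k->j', C, y1, y2)
```

Each such contraction touches all m³ entries of the tensor, which is the source of the cubic factor discussed next.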
The complexity of a naïve implementation of the dynamic programming algorithm for this problem is cubic in the number of latent states. This is where we suggest an alternative to the traditional dynamic programming solutions. Our alternative relies on an existing tensor formulation for the inside-outside algorithm [5], which re-formalizes the dynamic programming algorithm using tensor, matrix and vector product operations.

Figure 2 presents the re-formulation of the inside-outside algorithm using tensors. For more details and proofs of correctness, refer to [5]. The re-formalized algorithm is still cubic in the number of hidden states, and spends most of the time computing the tensor applications T^{a→b c}(α_{b,i,k}, α_{c,k+1,j}), T^{b→c a}_{(1,2)}(β_{b,k,j}, α_{c,k,i−1}) and T^{b→a c}_{(1,3)}(β_{b,i,k}, α_{c,j+1,k}). This is the main set of computations we aim to speed up, as we show in the next section.

Inputs: Sentence x_1 ... x_N, L-PCFG (N, I, P, m, n), parameters T^{a→b c} ∈ R^{m×m×m} for all a → b c ∈ R, Q^{a→x} ∈ R^{1×m} for all a ∈ P, x ∈ [n], π^a ∈ R^{m×1} for all a ∈ I.

Data structures:
• Each α_{a,i,j} ∈ R^{1×m} for a ∈ N, 1 ≤ i ≤ j ≤ N is a row vector of inside terms.
• Each β_{a,i,j} ∈ R^{m×1} for a ∈ N, 1 ≤ i ≤ j ≤ N is a column vector of outside terms.
• Each μ(a, i, j) ∈ R for a ∈ N, 1 ≤ i ≤ j ≤ N is a marginal probability.

Algorithm:
(Inside base case) ∀a ∈ P, i ∈ [N]: α_{a,i,i} = Q^{a→x_i}
(Inside recursion) ∀a ∈ I, 1 ≤ i < j ≤ N:

α_{a,i,j} = Σ_{k=i}^{j−1} Σ_{a→b c} T^{a→b c}(α_{b,i,k}, α_{c,k+1,j})

(Outside base case) ∀a ∈ I: β_{a,1,N} = π^a
(Outside recursion) ∀a ∈ N, 1 ≤ i ≤ j ≤ N:

β_{a,i,j} = Σ_{k=1}^{i−1} Σ_{b→c a} T^{b→c a}_{(1,2)}(β_{b,k,j}, α_{c,k,i−1}) + Σ_{k=j+1}^{N} Σ_{b→a c} T^{b→a c}_{(1,3)}(β_{b,i,k}, α_{c,j+1,k})

(Marginals) ∀a ∈ N, 1 ≤ i ≤ j ≤ N:

μ(a, i, j) = α_{a,i,j} β_{a,i,j} = Σ_{h∈[m]} α^h_{a,i,j} β^h_{a,i,j}

Figure 2: The tensor form of the inside-outside algorithm, for calculation of marginal terms μ(a, i, j).

4 Tensor Decomposition

As mentioned earlier, most computation for the inside-outside algorithm is spent on the application of the tensors T^{a→b c} to the intermediate inside/outside quantities. These computations, appearing as T^{a→b c}(α_{b,i,k}, α_{c,k+1,j}), T^{b→c a}_{(1,2)}(β_{b,k,j}, α_{c,k,i−1}) and T^{b→a c}_{(1,3)}(β_{b,i,k}, α_{c,j+1,k}), output a vector of length m, where computation of each element in the vector is O(m²).
Therefore, the inside-outside algorithm has a multiplicative O(m³) factor in its computational complexity, which we aim to reduce.

For the rest of this section, fix a binary grammar rule a → b c and consider the tensor T ≜ T^{a→b c} associated with it. Consider a pair of vectors y1, y2 ∈ R^m, associated with the distributions over latent states for the left (y1) and right (y2) child of a given node in a parse tree. Our method for improving the speed of this tensor computation relies on a simple observation. Given an integer r ≥ 1, assume that the tensor T had the following special form, also called "Kruskal form":

T = Σ_{i=1}^{r} u_i v_i^T w_i^T,

i.e. it would be the sum of r tensors, each the tensor product of three vectors. In that case, the cost of computing T(y1, y2) could be greatly reduced by computing:

T(y1, y2) = [Σ_{i=1}^{r} u_i v_i^T w_i^T](y1, y2) = Σ_{i=1}^{r} u_i (v_i^T y1)(w_i^T y2) = U^T (V y1 ⊙ W y2)    (1)

where U, V, W ∈ R^{r×m} with the i-th rows being u_i, v_i and w_i respectively. The total complexity of this computation is O(mr). We see later that our approach can be used effectively for r as small as 2, turning the inside-outside algorithm for latent-variable PCFGs into a linear algorithm in the number of hidden states.

We note that it is well-known that an exact tensor decomposition can be achieved by using r = m² [11]. In that case, there is no computational gain.
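The saving in Eq. 1 can be sketched as follows (an illustration with arbitrary dimensions; the factors U, V, W are random rather than learned):

```python
import numpy as np

m, r = 16, 2
rng = np.random.default_rng(0)
# Rows of U, V, W play the role of the Kruskal-form vectors u_i, v_i, w_i.
U, V, W = rng.random((r, m)), rng.random((r, m)), rng.random((r, m))
# Assemble the rank-r tensor T = sum_i u_i v_i^T w_i^T explicitly (shape m x m x m).
T = np.einsum('ia,ib,ic->abc', U, V, W)

y1, y2 = rng.random(m), rng.random(m)

# Direct contraction [T(y1, y2)]_a = sum_{b,c} T[a,b,c] y1[b] y2[c]: O(m^3) work.
naive = np.einsum('abc,b,c->a', T, y1, y2)

# Eq. 1: U^T (V y1 Hadamard W y2): O(mr) work, never touching the m^3 tensor.
fast = U.T @ ((V @ y1) * (W @ y2))
```

The two computations agree exactly whenever T is in Kruskal form; the point of §4.1 is that a general rule tensor can be approximated in this form.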
The minimal r required for an exact solution can be smaller than m², but identifying that minimal r is NP-hard [9]. We focused in this section on the computation T^{a→b c}(α_{b,i,k}, α_{c,k+1,j}), but the steps above generalize easily to the computations T^{b→c a}_{(1,2)}(β_{b,k,j}, α_{c,k,i−1}) and T^{b→a c}_{(1,3)}(β_{b,i,k}, α_{c,j+1,k}).

4.1 CP Decomposition of Tensors

In the general case, for a fixed r, our latent-variable PCFG tensors will not have the exact decomposed form from the previous section. Still, by using decomposition algorithms from multilinear algebra, we can approximate the latent-variable tensors, where the quality of approximation is measured according to some norm over the set of tensors R^{m×m×m}.

An example of such a decomposition is the canonical polyadic decomposition (CPD), also known as the CANDECOMP/PARAFAC decomposition [3, 8, 10]. Given an integer r, least squares CPD aims to find the nearest tensor in Kruskal form according to the analogue for tensors of the Frobenius norm for matrices.

More formally, for a given tensor D ∈ R^{m×m×m}, let ||D||_F = √(Σ_{i,j,k} D²_{i,j,k}). Let the set of tensors in Kruskal form be:

C_r = {C ∈ R^{m×m×m} | C = Σ_{i=1}^{r} u_i v_i^T w_i^T s.t. u_i, v_i, w_i ∈ R^m}.

The least squares CPD of a tensor C is a tensor Ĉ ∈ arg min_{C'∈C_r} ||C − C'||_F.

There are various algorithms to perform CPD, such as alternating least squares, direct linear decomposition, alternating trilinear decomposition and pseudo alternating least squares [6]. Most of these implementations treat the problem of identifying the approximate tensor as an optimization problem. These algorithms are not exact.
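The alternating least squares approach can be sketched as follows (an illustrative NumPy sketch, not the toolbox implementation used in our experiments; the function names and the iteration count are arbitrary):

```python
import numpy as np

def cp_als(T, r, n_iter=500):
    """Least-squares rank-r CP decomposition of a 3-way tensor by alternating
    least squares. Columns of A, B, C play the role of the u_i, v_i, w_i."""
    rng = np.random.default_rng(0)
    A, B, C = (rng.random((dim, r)) for dim in T.shape)
    for _ in range(n_iter):
        # Solve for one factor at a time, holding the other two fixed;
        # each update is the closed-form solution of a least squares problem.
        A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def kruskal_tensor(A, B, C):
    """Reassemble the full tensor sum_i A[:,i] (outer) B[:,i] (outer) C[:,i]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

On a tensor of exact rank r this typically drives the Frobenius error close to zero, though, as noted above, such solvers are inexact and may stop at a local optimum.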
Any of these implementations can be used in our approach. We note that the decomposition optimization problem is hard, and often has multiple local optima. Therefore, the algorithms mentioned above are inexact. In our experiments, we use the alternating least squares algorithm. This method works by iteratively improving U, V and W from Eq. 1 (until convergence), each time solving a least squares problem.

4.2 Propagation of Errors

We next present a theoretical guarantee about the quality of the CP-approximated tensor formulation of the inside-outside algorithm. We measure the propagation of errors in probability calculations through a given parse tree. We derive a similar result for the marginals.

We denote by p̂ the distribution induced over trees (skeletal and full) when we approximate each T^{a→b c} using the tensor T̂^{a→b c}. Similarly, we denote by μ̂(a, i, j) the approximated marginals.

Lemma 4.1. Let C ∈ R^{m×m×m} and let y1, y2, ŷ1, ŷ2 ∈ R^m. Then the following inequalities hold:

||C(y1, y2)||₂ ≤ ||C||_F ||y1||₂ ||y2||₂    (2)

||C(y1, y2) − C(ŷ1, ŷ2)||₂ ≤ ||C||_F max{||y1||₂, ||ŷ2||₂} (||y1 − ŷ1||₂ + ||y2 − ŷ2||₂)    (3)

Proof. Eq. 2 is the result of applying the Cauchy-Schwarz inequality twice:

||C(y1, y2)||₂² = Σ_i (Σ_{j,k} C_{i,j,k} y1_j y2_k)² ≤ Σ_i (Σ_{j,k} C²_{i,j,k}) (Σ_j (y1_j)²) (Σ_k (y2_k)²) = ||C||_F² · ||y1||₂² · ||y2||₂²

For Eq.
3, note that C(y1, y2) − C(ŷ1, ŷ2) = C(y1, y2) − C(y1, ŷ2) + C(y1, ŷ2) − C(ŷ1, ŷ2), and therefore from the triangle inequality and the bi-linearity of C:

||C(y1, y2) − C(ŷ1, ŷ2)||₂ ≤ ||C(y1, y2 − ŷ2)||₂ + ||C(y1 − ŷ1, ŷ2)||₂
≤ ||C||_F (||y1||₂ ||y2 − ŷ2||₂ + ||y1 − ŷ1||₂ ||ŷ2||₂)
≤ ||C||_F max{||y1||₂, ||ŷ2||₂} (||y1 − ŷ1||₂ + ||y2 − ŷ2||₂)

Equipped with this Cauchy-Schwarz style lemma, we can prove the following theorem:

Theorem 4.2. Let γ = max_{a→b c} ||T^{a→b c} − T̂^{a→b c}||_F be the "tensor approximation error", and let

d* = (log(1/γ) + 1) / (log(2(√m + 1)) + log(γ + √m)).

Then:

• For a given skeletal tree r1, ..., rN, if the depth of the tree, denoted d, is such that

d ≤ min{ (log(1/γ) − log(m/ε)) / (log(2(√m + 1)) + log(γ + √m)), d* }

then |p(r1, ..., rN) − p̂(r1, ..., rN)| ≤ ε.

• For a given sentence x1, ..., xM, it holds for all triplets (a, i, j) that if

M ≤ min{ (log(1/γ) − log(m/ε)) / (2 log(4|N|) + log(2(√m + 1)) + log(γ + √m)), d* }

then |μ(a, i, j) − μ̂(a, i, j)| ≤ ε.

Proof. For the first part, the proof uses structural induction on the structure of the tree. Assume a fixed skeletal tree r1, ..., rN. The probability p(r1, ..., rN) can be computed by using a sequence of applications of T^{a→b c} on distributions over latent states for the left and right children.
More specifically, it can be shown that the vector of probabilities defined as [y_i]_h = p(t_i | a_i, h_i = h) (ranging over h ∈ [m]), where t_i is the skeletal subtree rooted at node i, can be defined recursively as:

• y_i = Q^{a→x_i} if i is a leaf node with word x_i and,
• y_i = T^{a→b c}(y_j, y_k) if i is a non-leaf node with node j being the left child and node k being the right child of node i.

Define the same quantities ŷ_i, only using the approximate tensors T̂^{a→b c}. Let δ_i = ||y_i − ŷ_i||₂. We will prove inductively that if d_i is the depth of the subtree at node i, then:

δ_i ≤ min{ γm · ((2(√m + 1)(γ + √m))^{d_i} − 1) / (2(√m + 1)(γ + √m) − 1), 1 }

For any leaf node (base case): ||y_i − ŷ_i||₂ = 0. For a given non-leaf node i:

δ_i = ||y_i − ŷ_i||₂ = ||T^{a→b c}(y_j, y_k) − T̂^{a→b c}(ŷ_j, ŷ_k)||₂
≤ ||T^{a→b c}(y_j, y_k) − T̂^{a→b c}(y_j, y_k)||₂ + ||T̂^{a→b c}(y_j, y_k) − T̂^{a→b c}(ŷ_j, ŷ_k)||₂    (4)
≤ ||T^{a→b c} − T̂^{a→b c}||_F ||y_j||₂ ||y_k||₂ + ||T̂^{a→b c}||_F max{||y_j||₂, ||ŷ_k||₂} (||y_j − ŷ_j||₂ + ||y_k − ŷ_k||₂)    (5)
≤ γm + (√m + 1)(γ + √m)(δ_j + δ_k)
≤ γm · (1 + 2(√m + 1)(γ + √m) · ((2(√m + 1)(γ + √m))^{d_i − 1} − 1) / (2(√m + 1)(γ + √m) − 1))    (6)
= γm · ((2(√m + 1)(γ + √m))^{d_i} − 1) / (2(√m + 1)(γ + √m) − 1)

where Eq. 4 is the result of the triangle inequality, Eq. 5 comes from Lemma 4.1, the third inequality uses the facts that ||T̂^{a→b c}||_F ≤ ||T̂^{a→b c} − T^{a→b c}||_F + ||T^{a→b c}||_F ≤ γ + √m and ||ŷ_k||₂ ≤ δ_k + ||y_k||₂ ≤ 1 + √m for any node k (under the induction hypothesis), and Eq. 6 is the result of applying the induction hypothesis. It can also be verified that since d_i ≤ d ≤ d* we have δ_i ≤ 1.

Since m ≥ 1, it holds that δ_i ≤ γm (2(√m + 1)(γ + √m))^{d_i}. Consider |p(r1, ..., rN) − p̂(r1, ..., rN)| = |π^{a_1}(y_1 − ŷ_1)| ≤ ||π^{a_1}||₂ δ_1 ≤ δ_1, where a_1 is the non-terminal at the root of the tree. It is easy to verify that if d_1 ≤ (log(1/γ) − log(m/ε)) / (log(2(√m + 1)) + log(γ + √m)), then δ_1 ≤ ε, as needed.

For the marginals, consider that |μ(a, i, j) − μ̂(a, i, j)| ≤ Σ_{τ∈T(x)} |p(τ) − p̂(τ)|. We have d_1 ≤ M. 

Figure 3: Speed and performance of parsing with tensor decomposition for m ∈ {8, 16, 20} (left plots, middle plots and right plots respectively). The left y axis is running time (red circles), the right y axis is F1 performance of the parser (blue squares), and the x axis corresponds to log t. Solid lines describe decomposition with r = 2, dashed lines describe decomposition with r = 8. In addition, we include the numerical results for various m for r = 8:

threshold t:               no approx.   10^-8   10^-5   0.001
m = 8   seconds/sentence      4.6        4.2     3.4     3.5
        F1                   85.72      85.72   85.60   85.61
m = 16  seconds/sentence     14.6        9.8     3.5     3.2
        F1                   85.59      85.58   85.46   85.49
m = 20  seconds/sentence     23.7       15.6     3.6     3.2
        F1                   85.20      85.21   85.15   85.14
In addition, if

M ≤ (log(1/γ) − 2M log(4|N|) − log(m/ε)) / (log(2(√m + 1)) + log(γ + √m))

then

d_1 ≤ (log(1/γ) − log(m/(ε/|T(x)|))) / (log(2(√m + 1)) + log(γ + √m))    (7)

because the number of labeled binary trees for a sentence of length M is at most (4|N|)^{2M} (and therefore |T(x)| ≤ (4|N|)^{2M}; 4^l is a bound on the Catalan number, the number of binary trees over l nodes), and then |μ(a, i, j) − μ̂(a, i, j)| ≤ ε. It can be verified that the left hand-side of Eq. 7 is satisfied if M ≤ (log(1/γ) − log(m/ε)) / (2 log(4|N|) + log(2(√m + 1)) + log(γ + √m)).

As expected, the longer a sentence is, or the deeper a parse tree is, the better we need the tensor approximation to be (smaller γ) for the inside-outside algorithm to be more accurate.

5 Experiments

We devote this section to empirical evaluation of our approach. Our goal is to evaluate the trade-off between the accuracy of the tensor decomposition and the speed-up in the parsing algorithm.

Experimental Setup  We use sections 2–21 of the Penn treebank [13] to train a latent-variable parsing model using the expectation-maximization algorithm (EM was run for 15 iterations) for various numbers of latent states (m ∈ {8, 16, 20}), and then parse section 22 of the same treebank (sentences of length ≤ 40) in various settings. Whenever we report parsing accuracy, we use the traditional F1 measure from the Parseval metric [2].
It computes the F1 measure of spans (a, i, j) appearing in the gold standard and the hypothesized trees.

The total number of tensors extracted from the training data using EM was 7,236 (corresponding to the number of grammar rules). Let γ^{a→b c} = ||T^{a→b c} − T̂^{a→b c}||_F. In our experiments, we vary a threshold t ∈ {0.1, 0.001, 10^-5, 10^-6, 10^-8, 0} – an approximate tensor T̂^{a→b c} is used instead of T^{a→b c} only if γ^{a→b c} ≤ t. The value t = 0 implies using vanilla inference, without any approximate tensors. We describe experiments with r ∈ {2, 8}. For the tensor approximation, we use the implementation provided in the Matlab tensor toolbox from [1]. The toolbox implements the alternating least squares method.

As is common, we use a pruning technique to make the parser faster – items in the dynamic programming chart are pruned if their value according to a base vanilla maximum likelihood model is less than 0.00005 [4]. We report running times considering this pruning as part of the execution. The parser was run on a single Intel Xeon 2.67GHz CPU.

We note that the performance of the parser improves as we add more latent states. The performance of the parser with a vanilla PCFG (m = 1) is 70.26 F1.

Experimental Results  Figure 3 reports F1 performance and running time as we vary t. It is interesting to note that the speed-up, for the same threshold t, seems to be larger when using r = 8 instead of r = 2. At first this may sound counter-intuitive. The reason is that with r = 8, more of the tensors have an approximation error smaller than t, and therefore more approximate tensors are used than in the case of r = 2.

Using t = 0.1, the speed-up is significant over the non-approximate version of the parser.
More specifically, for r = 8, it takes 72% of the time (without considering the pruning phase) of the non-approximate parser to parse section 22 with m = 8, 24% of the time with m = 16 and 21% of the time with m = 20. The larger m is, the more significant the speed-up is.

The loss in performance because of the approximation, on the other hand, is negligible. More specifically, for r = 8, performance is decreased by 0.12 points of F1 for m = 8, 0.11 for m = 16 and 0.08 for m = 20.

6 Conclusion

We described an approach to significantly improve the speed of inference with latent-variable PCFGs. The approach approximates tensors which are used in the inside-outside algorithm. The approximation comes with a minimal cost to the performance of the parser. Our algorithm can be used in tandem with estimation algorithms such as EM or spectral algorithms [5]. We note that tensor formulations are used with graphical models [15], for which our technique is also applicable. Similarly, our technique can be applied to other dynamic programming algorithms which compute marginals of a given statistical model.

References

[1] B. W. Bader and T. G. Kolda. Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Transactions on Mathematical Software, 32(4):635–653, 2006.

[2] E. Black, S. Abney, D. Flickenger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proc. of DARPA Workshop on Speech and Natural Language, 1991.

[3] J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35:283–319, 1970.

[4] E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and maxent discriminative reranking.
In Proceedings of ACL, 2005.

[5] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Spectral learning of latent-variable PCFGs. In Proceedings of ACL, 2012.

[6] N. M. Faber, R. Bro, and P. Hopke. Recent developments in CANDECOMP/PARAFAC algorithms: a critical review. Chemometrics and Intelligent Laboratory Systems, 65(1):119–137, 2003.

[7] J. Goodman. Parsing algorithms and metrics. In Proceedings of ACL, 1996.

[8] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA working papers in phonetics, 16:1–84, 1970.

[9] J. Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11:644–654, 1990.

[10] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51:455–500, 2009.

[11] J. B. Kruskal. Rank, decomposition, and uniqueness for 3-way and N-way arrays. In R. Coppi and S. Bolasco, editors, Multiway Data Analysis, pages 7–18, 1989.

[12] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[13] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19:313–330, 1993.

[14] T. Matsuzaki, Y. Miyao, and J. Tsujii. Probabilistic CFG with latent annotations. In Proceedings of ACL, 2005.

[15] A. Parikh, L. Song, and E. P. Xing. A spectral algorithm for latent tree graphical models. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 2011.

[16] S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, 2006.", "award": [], "sourceid": 1208, "authors": [{"given_name": "Michael", "family_name": "Collins", "institution": null}, {"given_name": "Shay", "family_name": "Cohen", "institution": null}]}