{"title": "Hierarchical Methods of Moments", "book": "Advances in Neural Information Processing Systems", "page_first": 1901, "page_last": 1911, "abstract": "Spectral methods of moments provide a powerful tool for learning the parameters of latent variable models. Despite their theoretical appeal, the applicability of these methods to real data is still limited due to a lack of robustness to model misspecification. In this paper we present a hierarchical approach to methods of moments to circumvent such limitations. Our method is based on replacing the tensor decomposition step used in previous algorithms with approximate joint diagonalization. Experiments on topic modeling show that our method outperforms previous tensor decomposition methods in terms of speed and model quality.", "full_text": "Hierarchical Methods of Moments\n\nMatteo Ruf\ufb01ni \u21e4\n\nUniversitat Polit\u00e8cnica\n\nde Catalunya\n\nGuillaume Rabusseau \u2020\n\nMcGill University\n\nAbstract\n\nBorja Balle \u2021\n\nAmazon Research\n\nCambridge\n\nSpectral methods of moments provide a powerful tool for learning the parameters\nof latent variable models. Despite their theoretical appeal, the applicability of\nthese methods to real data is still limited due to a lack of robustness to model\nmisspeci\ufb01cation. In this paper we present a hierarchical approach to methods of\nmoments to circumvent such limitations. Our method is based on replacing the\ntensor decomposition step used in previous algorithms with approximate joint\ndiagonalization. Experiments on topic modeling show that our method outperforms\nprevious tensor decomposition methods in terms of speed and model quality.\n\n1\n\nIntroduction\n\nUnsupervised learning of latent variable models is a fundamental machine learning problem. 
Algorithms for learning a variety of latent variable models, including topic models, hidden Markov models, and mixture models, are routinely used in practical applications for solving tasks ranging from representation learning to exploratory data analysis. For practitioners faced with the problem of training a latent variable model, the decades-old Expectation-Maximization (EM) algorithm [1] is still the tool of choice. Despite its theoretical limitations, EM owes its appeal to (i) the robustness of the maximum-likelihood principle to model misspecification, and (ii) the need, in most cases, to tune a single parameter: the dimension of the latent variables.

On the other hand, method of moments (MoM) algorithms for learning latent variable models via efficient tensor factorization have been proposed in the last few years [2-9]. Compared to EM, moment-based algorithms provide a stronger theoretical foundation for learning latent variable models. In particular, it is known that in the realizable setting the output of a MoM algorithm converges to the parameters of the true model as the amount of training data increases. Furthermore, MoM algorithms only make a single pass over the training data, are highly parallelizable, and always terminate in polynomial time. However, despite their apparent advantages over EM, the adoption of MoM algorithms in practical applications is still limited.

Empirical studies indicate that initializing EM with the output of a MoM algorithm can improve the convergence speed of EM by several orders of magnitude, yielding a very efficient strategy to accurately learn latent variable models [8-10]. In the case of relatively simple models this approach can be backed by intricate theoretical analyses [11].
Nonetheless, these strategies are not widely deployed in practice either.

The main reason why MoM algorithms are not adopted by practitioners is their lack of robustness to model misspecification. Even when combined with EM, MoM algorithms fail to provide an initial estimate for the parameters of a model leading to fast convergence when the learning problem is too far from the realizable setting. For example, this happens when the number of latent variables used in a MoM algorithm is too small to accurately represent the training data. In contrast, the model obtained by standalone EM in this case is reasonable and desirable: when asked for a small number of latent variables EM yields a model which is easy to interpret and can be useful for data visualization and exploration. For example, an important application of low-dimensional learning can be found in mixture models, where latent class assignments provided by a simple model can be used to split the training data into disjoint datasets to which EM is applied recursively to produce a hierarchical clustering [12, 13]. The tree produced by such a clustering procedure provides a useful aid in data exploration and visualization even if the models learned at each branching point do not accurately represent the training data.

* mruffini@cs.upc.edu
† guillaume.rabusseau@mail.mcgill.ca
‡ pigem@amazon.co.uk

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper we develop a hierarchical method of moments that produces meaningful results even in misspecified settings. Our approach is different from previous attempts to design MoM algorithms for misspecified models.
Instead of looking for convex relaxations of existing MoM algorithms like in [14-16] or analyzing the behavior of a MoM algorithm with a misspecified number of latent states like in [17, 18], we generalize well-known simultaneous diagonalization approaches to tensor decomposition by phrasing the problem as a non-convex optimization problem. Despite its non-convexity, the hierarchical nature of our method allows for a fast accurate solution based on low-dimensional grid search. We test our method on synthetic and real-world datasets on the topic modeling task, showcasing the advantages of our approach and obtaining meaningful results.

2 Moments, Tensors, and Latent Variable Models

This section starts by recalling the basic ideas behind methods of moments for learning latent variable models via tensor decompositions. Then we review existing tensor decomposition algorithms and discuss the effect of model misspecification on the output of such algorithms.

For simplicity we consider first a single topic model with $k$ topics over a vocabulary with $d$ words. A single topic model defines a generative process for text documents where first a topic $Y \in [k]$ is drawn from some discrete distribution $\mathbb{P}[Y = i] = \omega_i$, and then each word $X_t \in [d]$, $1 \le t \le T$, in a document of length $T$ is independently drawn from some distribution $\mathbb{P}[X_t = j \mid Y = i] = \mu_{i,j}$ over words conditioned on the document topic. The model is completely specified by the vector of topic proportions $\omega \in \mathbb{R}^k$ and the word distributions $\mu_i \in \mathbb{R}^d$ for each topic $i \in [k]$. We collect the word distributions of the model as the columns of a matrix $M = [\mu_1 \cdots \mu_k] \in \mathbb{R}^{d \times k}$.

It is convenient to represent the words in a document using one-hot encodings so that $X_t \in \mathbb{R}^d$ is an indicator vector.
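For concreteness, this generative process can be sampled in a few lines. The snippet below is a toy sketch with made-up sizes and parameters (not taken from the paper), one-hot encoding each word:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 10, 3, 5                                # vocabulary size, topics, doc length (toy values)
omega = np.array([0.5, 0.3, 0.2])                 # topic proportions
M = rng.dirichlet(np.ones(d), size=k).T           # d x k; column i is the word distribution of topic i

# Sample one document: draw a topic, then T i.i.d. words, one-hot encoded.
y = rng.choice(k, p=omega)
words = rng.choice(d, size=T, p=M[:, y])
X = np.eye(d)[words]                              # T x d matrix of indicator vectors
```

Each row of `X` is the indicator vector of one word, so the document-level statistics below are simple averages of outer products of these rows.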
With this notation, the conditional expectation of any word in a document drawn from topic $i$ is $\mathbb{E}[X_t \mid Y = i] = \mu_i$, and the random vector $X = \sum_{t=1}^{T} X_t$ is conditionally distributed as a multinomial random variable with parameters $\mu_i$ and $T$. Integrating over topics drawn from $\omega$ we obtain the first moment of the distribution over words $M_1 = \mathbb{E}[X_t] = \sum_i \omega_i \mu_i = M\omega$. Generalizing this argument to pairs and triples of distinct words in a document yields the matrix of second order moments and the tensor of third order moments of a single topic model:

$$M_2 = \mathbb{E}[X_s \otimes X_t] = \sum_i \omega_i \, \mu_i \otimes \mu_i \in \mathbb{R}^{d \times d} , \quad (1)$$

$$M_3 = \mathbb{E}[X_r \otimes X_s \otimes X_t] = \sum_i \omega_i \, \mu_i \otimes \mu_i \otimes \mu_i \in \mathbb{R}^{d \times d \times d} , \quad (2)$$

where $\otimes$ denotes the tensor (Kronecker) product between vectors. By defining the matrix $\Omega = \mathrm{diag}(\omega)$ one also obtains the expression $M_2 = M \Omega M^\top$.

A method of moments for learning single topic models proceeds by (i) using a collection of $n$ documents to compute empirical estimates $\hat{M}_1, \hat{M}_2, \hat{M}_3$ of the moments, and (ii) using matrix and tensor decomposition methods to (approximately) factor these empirical moments and extract the model parameters from their decompositions. From the algorithmic point of view, the appeal of this scheme resides in the fact that step (i) requires a single pass over the data which can be trivially parallelized using map-reduce primitives, while step (ii) only requires linear algebra operations whose running time is independent of $n$. The specifics of step (ii) will be discussed in Section 2.1.

Estimating moments $\hat{M}_m$ from data with the property that $\mathbb{E}[\hat{M}_m] = M_m$ for $m \in \{1, 2, 3\}$ is the essential requirement for step (i). In the case of single topic models, and more generally multi-view models, such estimations are straightforward. For example, a simple consistent estimator takes a collection of documents $\{x^{(i)}\}_{i=1}^{n}$ and computes $\hat{M}_3 = (1/n) \sum_{i=1}^{n} x^{(i)}_1 \otimes x^{(i)}_2 \otimes x^{(i)}_3$ using the first three words from each document. More data-efficient estimators for datasets containing long documents can be found in the literature [19].

For more complex models the method sketched above requires some modifications. Specifically, it is often necessary to correct the statistics directly observable from data in order to obtain vectors/matrices/tensors whose expectation over a training dataset exhibits precisely the relation with the parameters $\omega$ and $M$ described above. For example, this is the case for Latent Dirichlet Allocation and mixtures of spherical Gaussians [4, 6]. For models with temporal dependence between observations, e.g. hidden Markov models, the method requires a spectral projection of observables to obtain moments behaving in a multi-view-like fashion [3, 20]. Nonetheless, methods of moments for these models and many others always reduce to the factorization of a matrix and tensor of the form $M_2$ and $M_3$ given above.
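As a sketch of step (i), the following toy snippet (hypothetical sizes and parameters, not from the paper) draws the first three words of $n$ synthetic documents and forms the simple consistent estimators; their expectations match Eqs. (1)-(2):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 20000                            # toy sizes (assumptions, not from the paper)
omega = np.array([0.5, 0.3, 0.2])                 # topic proportions
M = rng.dirichlet(np.ones(d), size=k).T           # d x k word-distribution matrix

# Draw the first three words of each document from its (latent) topic.
topics = rng.choice(k, size=n, p=omega)
words = np.array([[rng.choice(d, p=M[:, y]) for _ in range(3)] for y in topics])

# One-hot encode and average the outer products: a single pass over the data.
eye = np.eye(d)
x1, x2, x3 = eye[words[:, 0]], eye[words[:, 1]], eye[words[:, 2]]
M1_hat = x1.mean(axis=0)
M2_hat = x1.T @ x2 / n
M3_hat = np.einsum('na,nb,nc->abc', x1, x2, x3) / n

# Population counterparts M1 = M omega and M2 = M diag(omega) M^T for comparison.
M1 = M @ omega
M2 = M @ np.diag(omega) @ M.T
err = max(np.abs(M1_hat - M1).max(), np.abs(M2_hat - M2).max())
```

With 20,000 documents the empirical moments land close to their expectations, illustrating the consistency claim; only the averaging pass touches the data.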
Starting from a random initialization of the factors composing a tensor, ALS iteratively fixes two of the three factors, and updates the remaining one by solving an overdetermined linear least squares problem. ALS is easy to implement and to understand, but is known to be prone to local minima, needing several random restarts to yield meaningful results. These limitations fostered the research for methods with guarantees that, in the unperturbed setting, optimally decompose a tensor like the one in Eq. (2). We now briefly analyze some of these methods.

The tensor power method (TPM) [2] starts with a whitening step where, given the SVD $M_2 = U S U^\top$, the whitening matrix $E = U S^{1/2} \in \mathbb{R}^{d \times k}$ is used to transform $M_3$ into a symmetric orthogonally decomposable tensor

$$T = \sum_{i=1}^{k} \omega_i \, E^\dagger \mu_i \otimes E^\dagger \mu_i \otimes E^\dagger \mu_i \in \mathbb{R}^{k \times k \times k} . \quad (3)$$

The weights $\omega_i$ and vectors $\mu_i$ are then recovered from $T$ using a tensor power method and inverting the whitening step.

The same whitening matrix is used in [3, 4], where the authors observe that the whitened slices of $M_3$ are simultaneously diagonalized by the Moore-Penrose pseudoinverse of $M \Omega^{1/2}$. Indeed, since $M_2 = M \Omega M^\top = E E^\top$, there exists a unique orthonormal matrix $O \in \mathbb{R}^{k \times k}$ such that $M \Omega^{1/2} = E O$. Writing $M_{3,r} \in \mathbb{R}^{d \times d}$ for the $r$th slice of $M_3$ across its second mode and $m_r$ for the $r$th row of $M$, it follows that

$$M_{3,r} = M \Omega^{1/2} \mathrm{diag}(m_r) \Omega^{1/2} M^\top = E O \, \mathrm{diag}(m_r) \, O^\top E^\top .$$

Thus, the problem can be reduced to searching for the common diagonalizer $O$ of the whitened slices of $M_3$ defined as

$$H_r = E^\dagger M_{3,r} E^{\dagger\top} = O \, \mathrm{diag}(m_r) \, O^\top . \quad (4)$$

In the noiseless setting it is sufficient to diagonalize any of the slices $M_{3,r}$.
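The whitening step and Eq. (4) can be checked numerically. Below is a minimal sketch in the noiseless setting on a made-up toy model: the whitened slices $H_r$ share a common orthonormal diagonalizer $O$, which is recovered here from the eigenvectors of a random linear combination of the $H_r$; the diagonals of $O^\top H_r O$ then stack into the rows of $M$ up to a column permutation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3                                       # toy sizes (assumptions, not from the paper)
omega = np.array([0.5, 0.3, 0.2])
M = rng.dirichlet(np.ones(d), size=k).T

# Exact population moments (noiseless, realizable setting).
M2 = M @ np.diag(omega) @ M.T
M3 = np.einsum('ai,bi,ci,i->abc', M, M, M, omega)

# Whitening: rank-k SVD of M2 gives E = U S^{1/2}, so E^+ M2 E^{+T} = I.
U, S, _ = np.linalg.svd(M2)
E = U[:, :k] * np.sqrt(S[:k])
Epinv = np.linalg.pinv(E)

# The whitened, reweighted centers sqrt(omega_i) E^+ mu_i form an orthonormal O.
O_true = Epinv @ M @ np.diag(np.sqrt(omega))

# Whitened slices H_r = E^+ M3[:, r, :] E^{+T} = O diag(m_r) O^T (Eq. 4).
H = np.einsum('ia,arb,jb->rij', Epinv, M3, Epinv)

# Recover O as the eigenvectors of a random linear combination of the H_r.
theta = rng.standard_normal(d)
_, O_hat = np.linalg.eigh(np.einsum('r,rij->ij', theta, H))

# diag(O_hat^T H_r O_hat) equals the r-th row of M up to a fixed column permutation.
diags = np.einsum('ji,rjk,ki->ri', O_hat, H, O_hat)
```

The random combination makes the eigenvalues distinct almost surely, which is what pins down the common diagonalizer in one eigendecomposition.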
However, one can also recover $O$ as the eigenvectors of a random linear combination of the various $H_r$, which is more robust to noise [3].

Lastly, the method proposed in [22] consists in directly performing simultaneous diagonalization of random linear combinations of slices of $M_3$ without any whitening step. This method, which in practice is slower than the others (see Section 4.1), can, under an incoherence assumption on the vectors $\mu_i$, robustly recover the weights $\omega_i$ and vectors $\mu_i$ from the tensor $M_3$, even when it is not orthogonally decomposable.

2.2 The Misspecified Setting

The methods listed in the previous section have been analyzed in the case where the algorithm only has access to noisy estimates of the moments. However, such analyses assume that the data was generated by a model from the hypothesis class, that the matrix $M$ has rank $k$, and that this rank is known to the algorithm. In practice the dimension $k$ of the latent variable can be cross-validated, but in many cases this is not enough: data may come from a model outside the class, or from a model with a very large true $k$. Besides, the moment estimates might be too noisy to provide reliable estimates for a large number of latent variables. It is thus frequent to use these algorithms to estimate $l < k$ latent variables. However, existing algorithms are not robust in this setting, as they have not been designed to work in this regime, and there is no theoretical explanation of what their outputs will be.

The methods relying on a whitening step [2-4] will perform the whitening using the matrix $E_l^\dagger$ obtained from the low-rank SVD truncated at rank $l$: $M_2 \approx U_l S_l U_l^\top = E_l E_l^\top$. TPM will use $E_l$ to whiten the tensor $M_3$ to a tensor $T_l \in \mathbb{R}^{l \times l \times l}$. However, when $k > l$, $T_l$ may not admit a symmetric orthogonal decomposition.⁴ Consequently, it is not clear what TPM will return in this case and there are no guarantees it will even converge.
The methods from [3, 4] will compute the matrices $H_{l,r} = E_l^\dagger M_{3,r} E_l^{\dagger\top}$ for $r \in [d]$ that may not be jointly diagonalizable, and in this case there is no theoretical justification of what the result of these algorithms will be. Similarly, the simultaneous diagonalization method proposed in [22] produces a matrix that nearly diagonalizes the slices of $M_3$, but no analysis is given for this setting.

3 Simultaneous Diagonalization Based on Whitening and Optimization

This section presents the main contribution of the paper: a simultaneous diagonalization algorithm based on whitening and optimization that we call SIDIWO (Simultaneous Diagonalization based on Whitening and Optimization). When asked to produce $l = k$ components in the noiseless setting, SIDIWO will return the same output as any of the methods discussed in Section 2.1. However, in contrast with those methods, SIDIWO will provide useful results with a clear interpretation even in a misspecified setting ($l < k$).

3.1 SIDIWO in the Realizable Setting

To derive our SIDIWO algorithm we first observe that in the noiseless setting and when $l = k$, the pair $(M, \omega)$ returned by all methods described in Section 2.1 is the solution of the optimization problem given in the following lemma.⁵

Lemma 3.1 Let $M_{3,r}$ be the $r$-th slice across the second mode of the tensor $M_3$ from (2) with parameters $(M, \omega)$. Suppose $\mathrm{rank}(M) = k$ and let $\Omega = \mathrm{diag}(\omega)$. Then the matrix $(M \Omega^{1/2})^\dagger$ is the unique optimum (up to column rescaling) of the optimization problem

$$\min_{D \in \mathcal{D}_k} \left( \sum_{i \neq j} \sum_{r=1}^{d} (D M_{3,r} D^\top)_{i,j}^2 \right)^{1/2} , \quad (5)$$

where $\mathcal{D}_k = \{ D : D = (E O_k)^\dagger \text{ for some } O_k \text{ s.t. } O_k O_k^\top = I_k \}$ and $E$ is the whitening matrix defined in Section 2.1.

Remark 1 (The role of the constraint) Consider the cost function of Problem (5): in an unconstrained setting, there may be several matrices minimizing that cost. A trivial example is the zero matrix.
A less trivial example is when the rows of $D$ belong to the orthogonal complement of the column space of the matrix $M$. The constraint $D = (E O_k)^\dagger$ for some orthonormal matrix $O_k$ first excludes the zero matrix from the set of feasible solutions, and second guarantees that all feasible solutions lie in the space generated by the columns of $M$.

Problem (5) opens a new perspective on using simultaneous diagonalization to learn the parameters of a latent variable model. In fact, one could recover the pair $(M, \omega)$ from the relation $M \Omega^{1/2} = D^\dagger$ by first finding the optimal $D$ and then individually retrieving $M$ and $\omega$ by solving a linear system using the vector $M_1$. This approach, outlined in Algorithm 1, is an alternative to the ones presented in the literature up to now (even though in the noiseless, realizable setting, it will provide the same

4 See the supplementary material for an example corroborating this statement.
5 The proofs of all the results are provided in the supplementary material.

Algorithm 1 SIDIWO: Simultaneous Diagonalization based on Whitening and Optimization
Require: $M_1$, $M_2$, $M_3$, the number of latent states $l$
1: Compute an SVD of $M_2$ truncated at the $l$-th singular vector: $M_2 \approx U_l S_l U_l^\top$
2: Define the matrix $E_l = U_l S_l^{1/2} \in \mathbb{R}^{d \times l}$.
3: Find the matrix $D \in \mathcal{D}_l$ optimizing Problem (5).
4: Find $(\tilde{M}, \tilde{\omega})$ solving $\{\, \tilde{M} \tilde{\Omega}^{1/2} = D^\dagger \; ; \; \tilde{M} \tilde{\omega} = M_1 \,\}$.
5: return $(\tilde{M}, \tilde{\omega})$

results). Similarly to existing methods, this approach requires knowing the number of latent states. We will however see in the next section that Algorithm 1 provides meaningful results even when a misspecified number of latent states $l < k$ is used.

3.2 The Misspecified Setting

Algorithm 1 requires as inputs the low order moments $M_1$, $M_2$, $M_3$ along with the desired number of latent states $l$ to recover.
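Before analyzing the misspecified case, step 4 of Algorithm 1 can be sketched numerically: in the noiseless realizable setting with $l = k$, the optimal $D$ of Lemma 3.1 is available in closed form, and the pair $(\tilde{M}, \tilde{\omega})$ then follows from a single least-squares solve. The toy model below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 8, 3                                       # toy sizes (assumptions, not from the paper)
omega = np.array([0.5, 0.3, 0.2])
M = rng.dirichlet(np.ones(d), size=k).T
M1 = M @ omega

# In the noiseless realizable setting with l = k, Lemma 3.1 gives the optimum of
# Problem (5) in closed form: D = (M Omega^{1/2})^+, used here in place of step 3.
D = np.linalg.pinv(M * np.sqrt(omega))

# Step 4: write B = D^+ = Mtilde diag(sqrt(omega_tilde)). Substituting into
# Mtilde omega_tilde = M1 gives B s = M1 with s_j = sqrt(omega_tilde_j).
B = np.linalg.pinv(D)
s, *_ = np.linalg.lstsq(B, M1, rcond=None)
omega_tilde = s ** 2
M_tilde = B / s                                   # rescale column j by 1/s_j
```

The key observation is that both equations of step 4 involve the same unknown scaling $s_j = \sqrt{\tilde{\omega}_j}$, so the whole step reduces to one linear solve followed by a rescaling.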
If $l = k$, it will return the exact model parameters $(M, \omega)$; we will now see that it will also provide meaningful results when $l < k$. In this setting, Algorithm 1 returns a pair $(\tilde{M}, \tilde{\omega}) \in \mathbb{R}^{d \times l} \times \mathbb{R}^l$ such that the matrix $D = (\tilde{M} \tilde{\Omega}^{1/2})^\dagger$ is optimal for the optimization problem

$$\min_{D \in \mathcal{D}_l} \left( \sum_{i \neq j} \sum_{r=1}^{d} (D M_{3,r} D^\top)_{i,j}^2 \right)^{1/2} . \quad (6)$$

Analyzing the space of feasible solutions (Theorem 3.1) and the optimization function (Theorem 3.2), we will obtain theoretical guarantees on what SIDIWO returns when $l < k$, showing that the trivial solutions are not feasible, and that, in the space of feasible solutions, SIDIWO's optima will approximate the true model parameters according to an intuitive geometric interpretation.

Remarks on the constraints. The first step consists in analyzing the space of feasible solutions $\mathcal{D}_l$ when $l < k$. The observations outlined in Remark 1 still hold in this setting: the zero solution and the matrices lying in the orthogonal complement of $M$ are not feasible. Furthermore, the following theorem shows that other undesirable solutions will be avoided.

Theorem 3.1 Let $D \in \mathcal{D}_l$ with rows $d_1, ..., d_l$, and let $I_{r,s}$ denote the $r \times s$ identity matrix. The following facts hold under the hypotheses of Lemma 3.1:

1. For any row $d_i$, there exists at least one column of $M$ such that $\langle d_i, \mu_j \rangle \neq 0$.

2. The columns of any $\tilde{M}$ satisfying $\tilde{M} \tilde{\Omega}^{1/2} = D^\dagger$ are a linear combination of those of $M$, lying in the best-fit $l$-dimensional subspace of the space spanned by the columns of $M$.

3. Let $\pi$ be any permutation of $\{1, ..., k\}$, and let $M_\pi$ and $\Omega_\pi$ be obtained by permuting the columns of $M$ and $\Omega$ according to $\pi$.
If $\langle \mu_i, \mu_j \rangle \neq 0$ for any $i, j$, then $((M_\pi \Omega_\pi^{1/2}) I_{k,l})^\dagger \notin \mathcal{D}_l$, and similarly $I_{l,k} (M_\pi \Omega_\pi^{1/2})^\dagger \notin \mathcal{D}_l$.

The second point of Theorem 3.1 states that the feasible solutions will lie in the best $l$-dimensional subspace approximating the one spanned by the columns of $M$. This has two interesting consequences: if the columns of $M$ are not orthogonal, point 3 guarantees that $\tilde{M}$ cannot simply be a sub-block of the original $M$, but rather a non-trivial linear combination of its columns lying in the best $l$-dimensional subspace approximating its column space. In the single topic model case with $k$ topics, when asked to recover $l < k$ topics, Algorithm 1 will not return a subset of the original $k$ topics, but a matrix $\tilde{M}$ whose columns gather the original topics via a non-trivial linear combination: the original topics will all be represented in the columns of $\tilde{M}$ with different weights. When the columns of $M$ are orthogonal, this space coincides with the space of the $l$ columns of $M$ associated with the $l$ largest $\omega_i$; in this setting, the matrix $((M_\pi \Omega_\pi^{1/2}) I_{k,l})^\dagger$ (for some permutation $\pi$) is a feasible solution and minimizes Problem (6). Thus, Algorithm 1 will recover the top $l$ topics.

Interpreting the optima. Let $\tilde{M}$ be such that $D = (\tilde{M} \tilde{\Omega}^{1/2})^\dagger \in \mathcal{D}_l$ is a minimizer of Problem (6). In order to better understand the relation between $\tilde{M}$ and the original matrix $M$, we will show that the cost function of Problem (6) can be written in an equivalent form that unveils a geometric interpretation.

Theorem 3.2 Let $d_1, ..., d_l$ denote the rows of $D \in \mathcal{D}_l$ and introduce the following optimization problem

$$\min_{D \in \mathcal{D}_l} \sum_{i \neq j} \sup_{v \in V_M} \sum_{h=1}^{k} \langle d_i, \mu_h \rangle \langle d_j, \mu_h \rangle \, \omega_h v_h \quad (7)$$

where $V_M = \{ v \in \mathbb{R}^k : v = \alpha^\top M, \text{ where } \|\alpha\|_2 \le 1 \}$.
Then this problem is equivalent to (6).

First, observe that the cost function in Equation (7) prefers $D$'s such that the vectors $u_i = [\langle d_i, \mu_1 \sqrt{\omega_1} \rangle, ..., \langle d_i, \mu_k \sqrt{\omega_k} \rangle]$, $i \in [l]$, have disjoint support. This is a consequence of the $\sup_{v \in V_M}$, and requires that, for each $j$, the entries $\langle d_i, \mu_j \sqrt{\omega_j} \rangle$ are close to zero for at least all but one of the various $d_i$. Consequently, each center will be almost orthogonal to all but one row of the optimal $D$; however, the number of centers is greater than the number of rows of $D$, so the same row $d_i$ may be non-orthogonal to various centers.

For illustration, consider the single topic model: a solution $D$ to Problem (7) would have rows that should be as orthogonal as possible to some topics and as aligned as possible to the others; in other words, for a given topic $j$, the optimization problem is trying to set $\langle d_i, \mu_j \sqrt{\omega_j} \rangle = 0$ for all but one of the various $d_i$. Consequently, each column of the output $\tilde{M}$ of Algorithm 1 should be in essence aligned with some of the topics and orthogonal to the others.

It is worth mentioning that the constraint set $\mathcal{D}_l$ forbids trivial solutions such as the zero matrix, the pseudo-inverse of any subset of $l$ columns of $M \Omega^{1/2}$, and any subset of $l$ rows of $(M \Omega^{1/2})^\dagger$ (which all have an objective value of 0).

We remark that Theorem 3.2 doesn't require the matrix $M$ to be full rank $k$: we only need its rank to be greater than or equal to $l$, in order to guarantee that the constraint set $\mathcal{D}_l$ is well defined.

An optimal solution when l = 2. While Problem (5) can be solved in general using an extension of the Jacobi technique [23, 24], we provide a simple and efficient method for the case $l = 2$. This method will then be used to perform hierarchical topic modeling in Section 4.
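As a sketch of this grid-search idea (anticipating the parameterization of Theorem 3.3 below): rather than deriving the coefficients $c_1, ..., c_5$, one can evaluate the cost of Problem (6) directly at each grid point $a = \cos\theta$ over $[-1, 1]$. The toy model below is made up for illustration; reflections and row sign flips leave the cost unchanged, so rotations suffice:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, l = 8, 4, 2                                 # misspecified: 4 true topics, 2 recovered
omega = np.array([0.4, 0.3, 0.2, 0.1])
M = rng.dirichlet(np.ones(d), size=k).T

M2 = M @ np.diag(omega) @ M.T
M3 = np.einsum('ai,bi,ci,i->abc', M, M, M, omega)

# Rank-l truncated SVD of M2 gives the whitening matrix E_2 of Algorithm 1.
U, S, _ = np.linalg.svd(M2)
E2 = U[:, :l] * np.sqrt(S[:l])

def objective(D):
    # Cost of Problem (6): summed squared off-diagonals of D M_{3,r} D^T over all slices.
    G = np.einsum('ia,arb,jb->rij', D, M3, D)
    return np.sqrt((G[:, 0, 1] ** 2 + G[:, 1, 0] ** 2).sum())

# Grid a = cos(theta) in [-1, 1]; each a defines a 2x2 rotation O_a and D = (E2 O_a)^+.
best_a, best_val, best_D = None, np.inf, None
for a in np.linspace(-1.0, 1.0, 2001):
    c = np.sqrt(max(0.0, 1.0 - a * a))
    D = np.linalg.pinv(E2 @ np.array([[a, -c], [c, a]]))
    val = objective(D)
    if val < best_val:
        best_a, best_val, best_D = a, val, D
```

Each grid point costs one $2 \times d$ pseudo-inverse and one pass over the $d$ slices, so an arbitrarily fine grid remains cheap, which is what makes the hierarchical splits in Section 4 fast.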
When $l = 2$, Equation (6) can be solved optimally in a few simple steps; in fact, the following theorem shows that solving (6) is equivalent to minimizing a continuous function on the compact one-dimensional set $I = [-1, 1]$, which can easily be done by gridding $I$. Using this in Step 3 of Algorithm 1, one can efficiently compute an arbitrarily good approximation of the optimal matrix $D \in \mathcal{D}_2$.

Theorem 3.3 Consider the continuous function $F(x) = c_1 x^4 + c_2 x^3 \sqrt{1 - x^2} + c_3 x \sqrt{1 - x^2} + c_4 x^2 + c_5$, where the coefficients $c_1, ..., c_5$ are functions of the entries of $M_2$ and $M_3$. Let $a$ be the minimizer of $F$ on $[-1, 1]$, and consider the matrix

$$O_a = \begin{pmatrix} a & -\sqrt{1 - a^2} \\ \sqrt{1 - a^2} & a \end{pmatrix} .$$

Then, the matrix $D = (E_2 O_a)^\dagger$ is a minimizer of Problem (6) when $l = 2$.

4 Case Study: Hierarchical Topic Modeling

In this section, we show how SIDIWO can be used to efficiently recover hierarchical representations of latent variable models. Given a latent variable model with $k$ states, our method allows us to recover a pair $(\tilde{M}, \tilde{\omega})$ from estimates of the moments $M_1$, $M_2$ and $M_3$, where the $l$ columns of $\tilde{M}$ offer a synthetic representation of the $k$ original centers. We will refer to these $l$ vectors as pseudo-centers: each pseudo-center is representative of a group of the original centers. Consider the case $l = 2$. A dataset $C$ of $n$ samples can be split into two smaller subsets according to their similarity to the two pseudo-centers. Formally, this assignment is done using Maximum A Posteriori (MAP) estimation to find the pseudo-center giving maximum conditional likelihood to each sample. The splitting procedure can

Figure 1: Figure 1a provides a visualization of the topics used to generate the sample. Figure 1b represents the hierarchy recovered with the proposed method.
Table 1c reports the average and standard deviation over 10 runs of the clustering accuracy for the various methods, along with average running times.

be iterated recursively to obtain a divisive binary tree, leading to a hierarchical clustering algorithm. While this hierarchical clustering method can be applied to any latent variable model that can be learned with the tensor method of moments (e.g. Latent Dirichlet Allocation), we present it here for the single topic model for the sake of simplicity.

We consider a corpus $C$ of $n$ texts encoded as in Section 2 and we split $C$ into two smaller corpora according to their similarity to the two pseudo-centers in two steps: project the pseudo-centers onto the simplex to obtain discrete probability distributions (using for example the method described in [25]), and use MAP assignment to assign each text $x$ to a pseudo-center. This process is summarized in Algorithm 2.

Algorithm 2 Splitting a corpus into two parts
Require: A corpus of texts $C = (x^{(1)}, ..., x^{(n)})$.
1: Estimate $M_1$, $M_2$ and $M_3$.
2: Recover $l = 2$ pseudo-centers with Algorithm 1.
3: Project the pseudo-centers onto the simplex.
4: for $i \in [n]$ do
5: Assign the text $x^{(i)}$ to the cluster $\mathrm{Cluster}(i) = \arg\max_j \mathbb{P}[X = x^{(i)} \mid Y = j, \tilde{\omega}, \tilde{M}]$, where $\mathbb{P}[X \mid Y = j, \tilde{\omega}, \tilde{M}]$ is the multinomial distribution associated to the $j$-th projected pseudo-center.
6: end for
7: return The cluster assignments Cluster.

Once the corpus $C$ has been split into two subsets $C_1$ and $C_2$, each of these subsets may still contain the full set of topics but the topic distribution will differ in the two: topics similar to the first pseudo-center will be predominant in the first subset, the others in the second.
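A minimal numerical sketch of steps 3-6 of Algorithm 2 follows. The corpus, the pseudo-centers, and the clip-and-renormalize simplex projection are simplifying assumptions made up for illustration (the paper's step 3 uses the Euclidean projection of [25]):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 10, 200                                    # toy vocabulary and corpus size (assumptions)

# Hypothetical output of Algorithm 1: two pseudo-centers (columns) and their weights.
M_tilde = rng.dirichlet(np.ones(d), size=2).T + rng.normal(0, 0.01, size=(d, 2))
omega_tilde = np.array([0.6, 0.4])

# Step 3: project each pseudo-center onto the simplex. Clipping and renormalizing
# is a simpler stand-in for the Euclidean projection used in the paper.
P = np.clip(M_tilde, 1e-12, None)
P = P / P.sum(axis=0)

# Toy corpus of bag-of-words count vectors, half drawn from each projected center.
X = rng.multinomial(20, P[:, 0], size=n)
X[n // 2:] = rng.multinomial(20, P[:, 1], size=n - n // 2)

# Steps 4-6: MAP assignment under the multinomial likelihood of each center
# (log P[x | Y = j] = sum_w x_w log P_{w,j} + const, plus the log prior).
log_post = X @ np.log(P) + np.log(omega_tilde)
cluster = log_post.argmax(axis=1)
```

Applying this split to each resulting subset, and recursing, produces the divisive binary tree described in the text.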
By recursively iterating this process, we obtain a binary tree where topic distributions in the nodes with higher depth are expected to be more concentrated on fewer topics.

In the next sections, we assess the validity of this approach on both synthetic and real-world data.⁶

4.1 Experiment on Synthetic Data

In order to test the ability of SIDIWO to recover latent structures in data, we generate a dataset distributed as a single topic model (with a vocabulary of 100 words) whose 8 topics have an intrinsic hierarchical structure depicted in Figure 1a. In this figure, topics are on the x-axis, words on the y-axis, and green (resp. red) points represent high (resp. low) probability. We see for example that the first 4 topics are concentrated over the 1st half of the vocabulary, and that topics 1 and 2 have high probability on the 1st and 3rd fourth of the words while for the other two it is on the 1st and 4th.

6 The experiments in this section have been performed in Python 2.7, using the numpy [26] library for linear algebra operations, with the exception of the implementation of the method from [22], for which we used the author's Matlab implementation: https://github.com/kuleshov/tensor-factorization. All the experiments were run on a MacBook Pro with an Intel Core i5 processor. The implementation of the described algorithms can be found at this link: https://github.com/mruffini/Hierarchical-Methods-of-Moments.

Figure 2: Experiment on the NIPS dataset.

We generate 400 samples according to this model and we iteratively run Algorithm 2 to create a hierarchical binary tree with 8 leaves. We expect leaves to contain samples from a unique topic and internal nodes to gather similar topics.
Results are displayed in Figure 1b where each chart represents a node of the tree (child nodes lie below their parent) and contains the heatmap of the samples clustered in that node (the x-axis corresponds to samples and the y-axis to words; red points are infrequent words and clear points frequent ones). The results are as expected: each leaf contains samples from one of the topics and internal nodes group similar topics together.

We compare the clustering accuracy of SIDIWO with other methods using the Adjusted Rand Index [27] of the partition of the data obtained at the leaves w.r.t. the one obtained using the true topics; comparisons are with the flat clustering on $k = 8$ topics with TPM, the method from [3] (SVD), the one from [22] (Rand. Proj.) and ALS from [21], where ALS is applied to decompose a whitened $8 \times 8 \times 8$ tensor $T$, calculated as in Equation (3). We repeat the experiment 10 times with different random samples and we report the average results in Table 1c; SIDIWO always recovers the original topics almost perfectly, unlike competing methods. One intuition for this improvement is that each split in the divisive clustering helps remove noise in the moments.

4.2 Experiment on NIPS Conference Papers 1987-2015

We consider the full set of NIPS papers accepted between 1987 and 2015, containing $n = 11{,}463$ papers [28]. We assume that the papers are distributed according to a single topic model, we keep the $d = 3000$ most frequent words as vocabulary and we iteratively run Algorithm 2 to create a binary tree of depth 4. The resulting tree is shown in Figure 2, where each node contains the most relevant words of the cluster, where the relevance [29] of a word $w \in C_{node} \subset C$ is defined by

$$r(w, C_{node}) = \lambda \log \mathbb{P}[w \mid C_{node}] + (1 - \lambda) \log \frac{\mathbb{P}[w \mid C_{node}]}{\mathbb{P}[w \mid C]} ,$$

where the weight parameter is set to $\lambda = 0.7$ and $\mathbb{P}[w \mid C_{node}]$ (resp. $\mathbb{P}[w \mid C]$) is the empirical frequency of $w$ in $C_{node}$ (resp. in $C$).
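The relevance score can be computed directly from word counts; here is a small sketch where the count vectors are made up for illustration:

```python
import numpy as np

# Relevance of each vocabulary word in a cluster, as defined above, with lambda = 0.7.
def relevance(counts_node, counts_corpus, lam=0.7, eps=1e-12):
    """counts_node / counts_corpus: word-count vectors over the same vocabulary."""
    p_node = counts_node / counts_node.sum()
    p_all = counts_corpus / counts_corpus.sum()
    return lam * np.log(p_node + eps) + (1 - lam) * np.log((p_node + eps) / (p_all + eps))

# Tiny illustration: word 0 is over-represented in the node relative to the corpus.
node = np.array([50.0, 10.0, 10.0])
corpus = np.array([60.0, 100.0, 100.0])
r = relevance(node, corpus)
top_word = int(r.argmax())
```

The first term favors words that are simply frequent in the node; the second favors words that are distinctive of it, and $\lambda$ trades the two off.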
The clustering at the leaves and the whole hierarchy have a neat interpretation. Looking at the leaves, we can easily hypothesize the dominant topics for the 8 clusters. From left to right we have: [image processing, probabilistic models], [neuroscience, neural networks], [kernel methods, algorithms], [online optimization, reinforcement learning]. Also, each node of the lower levels gathers meaningful keywords, confirming the ability of the method to hierarchically find meaningful topics. The running time for this experiment was 59 seconds.

4.3 Experiment on Wikipedia Mathematics Pages

We consider a subset of the full Wikipedia corpus, containing all articles ($n = 809$ texts) from the following math-related categories: linear algebra, ring theory, stochastic processes and optimization.

Figure 3: Experiment on the Wikipedia Mathematics Pages dataset.

We remove a set of 895 stop-words, keep a vocabulary of $d = 3000$ words and run SIDIWO to perform hierarchical topic modeling (using the same methodology as in the previous section). The resulting hierarchical clustering is shown in Figure 3, where we see that each leaf is characterized by one of the dominant topics: [ring theory, linear algebra], [stochastic processes, optimization] (from left to right). It is interesting to observe that the first level of the clustering has separated pure mathematical topics from applied ones. The running time for this experiment was 6 seconds.

5 Conclusions and future works

We proposed a novel spectral algorithm (SIDIWO) that generalizes recent method of moments algorithms relying on tensor decomposition. While previous algorithms lack robustness to model misspecification, SIDIWO provides meaningful results even in misspecified settings. Moreover, SIDIWO can be used to perform hierarchical method of moments estimation for latent variable models.
In particular, we showed through hierarchical topic modeling experiments on synthetic and real data that SIDIWO provides meaningful results while being very computationally efficient.
A natural direction for future work is to investigate the capability of the proposed hierarchical method to learn overcomplete latent variable models, a task that has received significant attention in the recent literature [30, 31]. We are also interested in comparing the learning performance of SIDIWO with that of other existing methods of moments in the realizable setting. On the applications side, we are interested in applying the methods developed in this paper to the healthcare analytics field, for instance to perform hierarchical clustering of patients using electronic healthcare records or more complex genetic data.

Acknowledgments
Guillaume Rabusseau acknowledges the support of an IVADO postdoctoral fellowship. Borja Balle completed this work while at Lancaster University.

References
[1] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
[2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
[3] Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden Markov models. In COLT, volume 1, page 4, 2012.
[4] Animashree Anandkumar, Yi-Kai Liu, Daniel J Hsu, Dean P Foster, and Sham M Kakade. A spectral algorithm for Latent Dirichlet Allocation. In NIPS, pages 917–925, 2012.
[5] Prateek Jain and Sewoong Oh. Learning mixtures of discrete product distributions using spectral decompositions. In COLT, pages 824–856, 2014.
[6] Daniel Hsu and Sham M Kakade.
Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In ITCS, pages 11–20. ACM, 2013.
[7] Le Song, Eric P Xing, and Ankur P Parikh. A spectral algorithm for latent tree graphical models. In ICML, pages 1065–1072, 2011.
[8] Borja Balle, William L Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In ICML, pages 1386–1394, 2014.
[9] Arun T Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In ICML, pages 1040–1048, 2013.
[10] Raphael Bailly. Quadratic weighted automata: Spectral algorithm and likelihood maximization. Journal of Machine Learning Research, 20:147–162, 2011.
[11] Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In NIPS, pages 1260–1268, 2014.
[12] Michael Steinbach, George Karypis, Vipin Kumar, et al. A comparison of document clustering techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526. Boston, 2000.
[13] Sergio M Savaresi and Daniel L Boley. On the performance of bisecting K-means and PDDP. In SDM, pages 1–14. SIAM, 2001.
[14] Borja Balle, Ariadna Quattoni, and Xavier Carreras. Local loss optimization in operator models: a new insight into spectral learning. In ICML, pages 1819–1826, 2012.
[15] Borja Balle and Mehryar Mohri. Spectral learning of general weighted automata via constrained matrix completion. In NIPS, pages 2159–2167, 2012.
[16] Ariadna Quattoni, Borja Balle, Xavier Carreras, and Amir Globerson. Spectral regularization for max-margin sequence tagging. In ICML, pages 1710–1718, 2014.
[17] Alex Kulesza, N Raj Rao, and Satinder Singh. Low-rank spectral learning.
In Artificial Intelligence and Statistics, pages 522–530, 2014.
[18] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-rank spectral learning with weighted loss functions. In Artificial Intelligence and Statistics, pages 517–525, 2015.
[19] Matteo Ruffini, Marta Casanellas, and Ricard Gavaldà. A new spectral method for latent variable models. arXiv preprint arXiv:1612.03409, 2016.
[20] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
[21] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[22] Volodymyr Kuleshov, Arun Chaganty, and Percy Liang. Tensor factorization via matrix factorization. In AISTATS, pages 507–516, 2015.
[23] Jean-Francois Cardoso and Antoine Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161–164, 1996.
[24] Angelika Bunse-Gerstner, Ralph Byers, and Volker Mehrmann. Numerical methods for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 14(4):927–949, 1993.
[25] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
[26] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
[27] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[28] Valerio Perrone, Paul A Jenkins, Dario Spano, and Yee Whye Teh. Poisson random fields for dynamic feature models. arXiv preprint arXiv:1611.07460, 2016.
[29] Carson Sievert and Kenneth E Shirley.
LDAvis: A method for visualizing and interpreting topics. In ACL Workshop on Interactive Language Learning, Visualization, and Interfaces, 2014.
[30] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In COLT, pages 36–112, 2015.
[31] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor power method dynamics in overcomplete regime. Journal of Machine Learning Research, 18(22):1–40, 2017.
[32] Elina Mihaylova Robeva. Decomposing Matrices, Tensors, and Images. PhD thesis, University of California, Berkeley, 2016.