{"title": "Spectral Methods for Supervised Topic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1511, "page_last": 1519, "abstract": "Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on either variational approximation or Monte Carlo sampling. This paper presents a novel spectral decomposition algorithm to recover the parameters of supervised latent Dirichlet allocation (sLDA) models. The Spectral-sLDA algorithm is provably correct and computationally efficient. We prove a sample complexity bound and subsequently derive a sufficient condition for the identifiability of sLDA. Thorough experiments on a diverse range of synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the algorithm.", "full_text": "Spectral Methods for Supervised Topic Models

Yining Wang†    Jun Zhu‡

†Machine Learning Department, Carnegie Mellon University, yiningwa@cs.cmu.edu
‡Dept. of Comp. Sci. & Tech.; Tsinghua National TNList Lab; State Key Lab of Intell. Tech. & Sys., Tsinghua University, dcszj@mail.tsinghua.edu.cn

Abstract

Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on either variational approximation or Monte Carlo sampling. This paper presents a novel spectral decomposition algorithm to recover the parameters of supervised latent Dirichlet allocation (sLDA) models. The Spectral-sLDA algorithm is provably correct and computationally efficient. We prove a sample complexity bound and subsequently derive a sufficient condition for the identifiability of sLDA. 
Thorough experiments on a diverse range of synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the algorithm.

1 Introduction

Topic modeling offers a suite of useful tools that automatically learn the latent semantic structure of a large collection of documents. Latent Dirichlet allocation (LDA) [9] represents one of the most popular topic models. The vanilla LDA is an unsupervised model built on the input contents of documents. In many applications side information is available apart from the raw contents, e.g., user-provided rating scores of an online review text. Such side information usually provides an additional signal that reveals the underlying structure of the documents under study. There have been extensive studies on developing topic models that incorporate various kinds of side information, e.g., by treating it as supervision. Some representative models are supervised LDA (sLDA) [8], which captures a real-valued regression response for each document; multiclass sLDA [21], which learns with discrete classification responses; discriminative LDA (DiscLDA) [14], which incorporates the classification response via discriminative linear transformations on topic mixing vectors; and MedLDA [22, 23], which employs a max-margin criterion to learn discriminative latent topic representations.

Topic models are typically learned by finding maximum likelihood estimates (MLE) through local search or sampling methods [12, 18, 19], which may suffer from local optima. Much recent progress has been made on developing spectral decomposition [1, 2, 3] and nonnegative matrix factorization (NMF) [4, 5, 6, 7] methods to infer latent topic-word distributions. Instead of finding the MLE, which is a known NP-hard problem [6], these methods assume that the documents are i.i.d. sampled from a topic model, and attempt to recover the underlying model parameters. 
Compared to local search and sampling algorithms, these methods enjoy the advantage of being provably effective. In fact, sample complexity bounds have been proved to show that, given a sufficiently large collection of documents, these algorithms can recover the model parameters accurately with high probability.

Although spectral decomposition (as well as NMF) methods have achieved increasing success in recovering latent variable models, their applicability is quite limited. For example, previous work has mainly focused on unsupervised latent variable models, leaving the broad family of supervised models (e.g., sLDA) largely unexplored. The only exception is [10], which presents a spectral method for mixtures of regression models, quite different from sLDA. This gap is not a coincidence, as supervised models impose new technical challenges. For instance, a direct application of previous techniques [1, 2] to sLDA cannot handle regression models with duplicate entries. In addition, the sample complexity bound gets much worse if we try to match entries in regression models with their corresponding topic vectors. On the practical side, few quantitative experimental results (if any at all) are available for spectral decomposition based methods on LDA models.

In this paper, we extend the applicability of spectral learning methods by presenting a novel spectral decomposition algorithm to recover the parameters of sLDA models from empirical low-order moments estimated from the data. We provide a sample complexity bound and analyze the identifiability conditions. A key step in our algorithm is a power update step that recovers the regression model in sLDA. The method uses a newly designed empirical moment to recover regression model entries directly from the data and the reconstructed topic distributions. 
It places no constraints on the underlying regression model, and it does not increase the sample complexity much. We also provide thorough experiments on both synthetic and real-world datasets to demonstrate the practical effectiveness of our proposed algorithm. By combining our spectral recovery algorithm with a Gibbs sampling procedure, we show superior performance in terms of language modeling, prediction accuracy and running time compared to traditional inference algorithms.

2 Preliminaries
We first overview the basics of sLDA, orthogonal tensor decomposition, and the notation to be used.

2.1 Supervised LDA
Latent Dirichlet allocation (LDA) [9] is a generative model for topic modeling of text documents. It assumes k different topics with topic-word distributions µ1, ···, µk ∈ ∆^{V−1}, where V is the vocabulary size and ∆^{V−1} denotes the probability simplex of a V-dimensional random vector. For a document, LDA models a topic mixing vector h ∈ ∆^{k−1} as a probability distribution over the k topics. A conjugate Dirichlet prior with parameter α is imposed on the topic mixing vectors. A bag-of-words model is then adopted, which generates each word in the document based on h and the topic-word vectors µ. Supervised latent Dirichlet allocation (sLDA) [8] incorporates an extra response variable y ∈ R for each document. The response variable is modeled by a linear regression model η ∈ R^k on either the topic mixing vector h or the average topic assignment vector z̄, where z̄_i = (1/m) Σ_j 1[z_j = i], with m the number of words in the document. The noise is assumed to be Gaussian with zero mean and variance σ².

Fig. 1 shows the graph structure of the two sLDA variants mentioned above. 
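As a concrete illustration, the generative process of variant (a) can be simulated in a few lines. This is only a sketch: the dimensions k, V, M and the noise level below are small, made-up values, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V, M = 3, 10, 50           # topics, vocabulary size, words per document (illustrative)
alpha = np.ones(k) / k        # Dirichlet prior over topic mixing vectors
mu = rng.dirichlet(np.ones(V), size=k)    # rows: topic-word distributions mu_1..mu_k
eta = rng.standard_normal(k)  # linear regression model
sigma = 0.1                   # std of the Gaussian response noise

def generate_document():
    h = rng.dirichlet(alpha)                   # topic mixing vector h ~ Dir(alpha)
    z = rng.choice(k, size=M, p=h)             # per-word topic assignments
    words = np.array([rng.choice(V, p=mu[t]) for t in z])   # bag of words
    y = eta @ h + sigma * rng.standard_normal()             # variant (a): y = eta^T h + eps
    return words, y

words, y = generate_document()
```

Variant (b) would instead compute y from the empirical topic frequencies z̄ of the sampled assignments z.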
Although previous work has mainly focused on model (b), which is convenient for Gibbs sampling and variational inference, we consider model (a) because it considerably simplifies our spectral algorithm and analysis. One may assume that whenever a document is not too short, the empirical distribution of its word topic assignments should be close to the document's topic mixing vector. Such a scheme was adopted to learn sparse topic coding models [24], and has demonstrated promising results in practice.

2.2 High-order tensor product and orthogonal tensor decomposition

A real p-th order tensor A ∈ ⊗_{i=1}^p R^{n_i} belongs to the tensor product of the Euclidean spaces R^{n_i}. Generally we assume n_1 = n_2 = ··· = n_p = n, and we can identify each coordinate of A by a p-tuple (i_1, ···, i_p), where i_1, ···, i_p ∈ [n]. For instance, a p-th order tensor is a vector when p = 1 and a matrix when p = 2. We can also consider a p-th order tensor A as a multilinear mapping. For A ∈ ⊗^p R^n and matrices X_1, ···, X_p ∈ R^{n×m}, the mapping A(X_1, ···, X_p) is a p-th order tensor in ⊗^p R^m, with

[A(X_1, ···, X_p)]_{i_1,···,i_p} ≜ Σ_{j_1,···,j_p ∈ [n]} A_{j_1,···,j_p} [X_1]_{j_1,i_1} [X_2]_{j_2,i_2} ··· [X_p]_{j_p,i_p}.

Consider some concrete examples of such a multilinear mapping. When A, X_1, X_2 are matrices, we have A(X_1, X_2) = X_1^T A X_2. Similarly, when A is a matrix and x is a vector, A(I, x) = Ax.

An orthogonal tensor decomposition of a tensor A ∈ ⊗^p R^n is a collection of orthonormal vectors {v_i}_{i=1}^k and scalars {λ_i}_{i=1}^k such that A = Σ_{i=1}^k λ_i v_i^{⊗p}. Without loss of generality, we assume the λ_i are nonnegative when p is odd. 
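The multilinear mapping above is a plain index contraction; in numpy it is a one-line einsum. A small, paper-independent sketch that checks the two special cases mentioned in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2
A = rng.standard_normal((n, n))        # a 2nd-order tensor (matrix)
X1 = rng.standard_normal((n, m))
X2 = rng.standard_normal((n, m))

# [A(X1, X2)]_{i1,i2} = sum_{j1,j2} A_{j1,j2} [X1]_{j1,i1} [X2]_{j2,i2}
AX = np.einsum('ab,ai,bj->ij', A, X1, X2)
assert np.allclose(AX, X1.T @ A @ X2)          # A(X1, X2) = X1^T A X2

x = rng.standard_normal(n)
Ax = np.einsum('ab,ai,b->i', A, np.eye(n), x)  # A(I, x)
assert np.allclose(Ax, A @ x)                  # A(I, x) = A x

# the same contraction for a 3rd-order tensor
T = rng.standard_normal((n, n, n))
TX = np.einsum('abc,ai,bj,ck->ijk', T, X1, X2, X1)
```

The third-order case `T(W, W, W)` with a V×k matrix W is exactly the whitening operation used later in the paper's recovery algorithm.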
Although orthogonal tensor decomposition in the matrix case can be done efficiently by singular value decomposition (SVD), it has several delicate issues in higher-order tensor spaces [2]. For instance, tensors may not have unique decompositions, and an orthogonal decomposition may not exist for every symmetric tensor [2]. Such issues are further complicated when only noisy estimates of the desired tensors are available. For these reasons, we need more advanced techniques to handle high-order tensors. In this paper, we will apply the robust tensor power method [2] to recover robust eigenvalues and eigenvectors of an (estimated) third-order tensor. The algorithm recovers eigenvalues and eigenvectors up to an absolute error ε, while running in polynomial time with respect to the tensor dimension and log(1/ε). Further details and analysis of the robust tensor power method are presented in Appendix A.2 and [2].

(a) y_d = η^T h_d + ε_d    (b) y_d = η^T z̄_d + ε_d
Figure 1: Plate notations for two variants of sLDA

2.3 Notation
Throughout, we use v^{⊗p} ≜ v ⊗ v ⊗ ··· ⊗ v to denote the p-th order tensor generated by a vector v. We use ‖v‖ = √(Σ_i v_i²) to denote the Euclidean norm of a vector v, ‖M‖ to denote the spectral norm of a matrix M, and ‖T‖ to denote the operator norm of a high-order tensor. ‖M‖_F = √(Σ_{i,j} M_{ij}²) denotes the Frobenius norm of a matrix. We use an indicator vector x ∈ R^V to represent a word in a document, e.g., for the i-th word in the vocabulary, x_i = 1 and x_j = 0 for all j ≠ i. 
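The robust tensor power method referenced above can be sketched compactly. The version below (L random restarts, T power iterations each, greedy deflation) follows the overall shape of the method in [2] but omits its robustness checks; it is an illustrative simplification, not the paper's implementation.

```python
import numpy as np

def tensor_apply(T3, v):
    # T(I, v, v): contract the last two modes of a 3rd-order tensor with v
    return np.einsum('abc,b,c->a', T3, v, v)

def tensor_power_method(T3, k, L=20, T=50, rng=None):
    """Simplified sketch of the robust tensor power method of [2]:
    L random restarts, T power iterations each, then greedy deflation."""
    rng = np.random.default_rng(0) if rng is None else rng
    T3 = T3.copy()
    n = T3.shape[0]
    lams, vecs = [], []
    for _ in range(k):
        best_v, best_lam = None, -np.inf
        for _ in range(L):                       # L random restarts
            v = rng.standard_normal(n)
            v /= np.linalg.norm(v)
            for _ in range(T):                   # T power iterations
                v = tensor_apply(T3, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('abc,a,b,c->', T3, v, v, v)   # T(v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam)
        vecs.append(best_v)
        T3 -= best_lam * np.einsum('a,b,c->abc', best_v, best_v, best_v)  # deflate
    return np.array(lams), np.array(vecs)

# usage: recover the spectrum of an orthogonally decomposable toy tensor
e1, e2 = np.eye(3)[0], np.eye(3)[1]
T3 = 3.0 * np.einsum('a,b,c->abc', e1, e1, e1) + 1.5 * np.einsum('a,b,c->abc', e2, e2, e2)
lams, vecs = tensor_power_method(T3, k=2)
```

On this toy input the method recovers the eigenpairs (3.0, e1) and (1.5, e2) up to numerical precision.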
We also use O ≜ (µ1, µ2, ···, µk) ∈ R^{V×k} to denote the topic distribution matrix, and Õ ≜ (µ̃1, µ̃2, ···, µ̃k) to denote the canonical version of O, where µ̃i = √(αi/(α0(α0+1))) µi with α0 = Σ_{i=1}^k αi.

3 Spectral Parameter Recovery
We now present a novel spectral parameter recovery algorithm for sLDA. The algorithm consists of two key components: the orthogonal tensor decomposition of observable moments to recover the topic distribution matrix O, and a power update method to recover the linear regression model η. We elaborate on these techniques and give a rigorous theoretical analysis in the following sections.

3.1 Moments of observable variables
Our spectral decomposition methods recover the topic distribution matrix O and the linear regression model η by manipulating moments of observable variables. In Definition 1, we define a list of moments on random variables from the underlying sLDA model.

Definition 1. We define the following moments of observable variables:

M1 = E[x1],    M2 = E[x1 ⊗ x2] − (α0/(α0+1)) M1 ⊗ M1,                            (1)
M3 = E[x1 ⊗ x2 ⊗ x3]
       − (α0/(α0+2)) (E[x1 ⊗ x2 ⊗ M1] + E[x1 ⊗ M1 ⊗ x2] + E[M1 ⊗ x1 ⊗ x2])
       + (2α0²/((α0+1)(α0+2))) M1 ⊗ M1 ⊗ M1,                                     (2)
My = E[y x1 ⊗ x2]
       − (α0/(α0+2)) (E[y] E[x1 ⊗ x2] + E[x1] ⊗ E[y x2] + E[y x1] ⊗ E[x2])
       + (2α0²/((α0+1)(α0+2))) E[y] M1 ⊗ M1.                                     (3)

Note that the moments M1, M2 and M3 were also defined and used in previous work [1, 2] for the parameter recovery of LDA models. 
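Because these moments are built from observable variables, they can be estimated directly from a bag-of-words representation. A sketch of empirical estimates of M1 and M2 from Definition 1 is below; the (N, V) count-matrix input is an assumption of this sketch, and M3 and My follow the same pattern with word triples and the response y.

```python
import numpy as np

def empirical_moments(C, alpha0):
    """Sketch: empirical estimates of M1 and M2 from Definition 1,
    computed from an (N, V) matrix C of per-document word counts.
    Assumes every document has at least 2 words."""
    C = np.asarray(C, dtype=float)
    N = C.shape[0]
    m = C.sum(axis=1)                              # words per document
    M1 = (C / m[:, None]).mean(axis=0)             # E[x1]
    scale = 1.0 / (m * (m - 1))                    # ordered pairs of distinct positions
    E12 = (C * scale[:, None]).T @ C / N           # mean of c c^T / (m (m - 1))
    E12 -= np.diag((C * scale[:, None]).sum(axis=0) / N)  # drop same-position pairs
    M2 = E12 - alpha0 / (alpha0 + 1.0) * np.outer(M1, M1)
    return M1, M2

# usage on a tiny 2-document, 2-word-vocabulary corpus
M1, M2 = empirical_moments([[2, 0], [0, 2]], alpha0=1.0)
```

On this toy corpus M1 = [0.5, 0.5], and the pair moment E[x1 ⊗ x2] is diagonal because each document repeats a single word.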
For the sLDA model, we need to define a new moment My in order to recover the linear regression model η. The moments are based on observable variables in the sense that they can be estimated from i.i.d. sampled documents. For instance, M1 can be estimated by computing the empirical distribution of all words, and M2 can be estimated using M1 and word co-occurrence frequencies. Though the moments in the above forms look complicated, we can apply elementary calculations based on the conditional independence structure of sLDA to significantly simplify them and, more importantly, to connect them with the model parameters to be recovered, as summarized in Proposition 1. The proof is deferred to Appendix B.

Proposition 1. The moments can be expressed using the model parameters as:

M2 = (1/(α0(α0+1))) Σ_{i=1}^k αi µi ⊗ µi,    M3 = (2/(α0(α0+1)(α0+2))) Σ_{i=1}^k αi µi ⊗ µi ⊗ µi,   (4)
My = (2/(α0(α0+1)(α0+2))) Σ_{i=1}^k αi ηi µi ⊗ µi.                                                   (5)

3.2 Simultaneous diagonalization
Proposition 1 shows that the moments in Definition 1 are all weighted sums of tensor products of the {µi}_{i=1}^k from the underlying sLDA model. One idea to reconstruct {µi}_{i=1}^k is to perform simultaneous diagonalization on tensors of different orders. The idea has been used in a number of recent developments of spectral methods for latent variable models [1, 2, 10]. Specifically, we first whiten the second-order tensor M2 by finding a matrix W ∈ R^{V×k} such that W^T M2 W = I_k. This whitening procedure is possible whenever the topic distribution vectors {µi}_{i=1}^k are linearly independent (and hence M2 has rank k). 
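The whitening step can be sketched with a truncated eigendecomposition of the symmetric estimate of M2. This is one standard way to build W under the assumption that M2 is positive semidefinite with rank k; it is a sketch, not the paper's implementation.

```python
import numpy as np

def whiten(M2, k):
    # top-k eigenpairs of the symmetric PSD matrix M2
    vals, vecs = np.linalg.eigh(M2)
    order = np.argsort(vals)[::-1][:k]
    U, s = vecs[:, order], vals[order]
    return U / np.sqrt(s)          # W = U diag(s)^{-1/2}, so W^T M2 W = I_k

# usage on a synthetic rank-k moment M2 = sum_i w_i mu_i mu_i^T
rng = np.random.default_rng(2)
V, k = 6, 2
mus = rng.dirichlet(np.ones(V), size=k)        # linearly independent w.h.p.
M2 = sum(0.5 * np.outer(mu, mu) for mu in mus)
W = whiten(M2, k)
assert np.allclose(W.T @ M2 @ W, np.eye(k), atol=1e-8)
```

With noisy moment estimates, the same construction is applied to the empirical M̂2, at the cost of the perturbation terms analyzed in the paper's appendix.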
The whitening procedure and the linear independence assumption also imply that {W^T µi}_{i=1}^k are orthogonal vectors (see Appendix A.2 for details), and they can subsequently be recovered by performing an orthogonal tensor decomposition on the simultaneously whitened third-order tensor M3(W, W, W). Finally, by multiplying by the pseudo-inverse W^+ of the whitening matrix, we obtain the topic distribution vectors {µi}_{i=1}^k.

It should be noted that Jennrich's algorithm [13, 15, 17] could recover {µi}_{i=1}^k directly from the third-order tensor M3 alone when the {µi}_{i=1}^k are linearly independent. However, we still adopt the above simultaneous diagonalization framework, because the intermediate vectors {W^T µi}_{i=1}^k play a vital role in the recovery procedure of the linear regression model η.

3.3 The power update method
Although the linear regression model η could be recovered in a similar manner by performing simultaneous diagonalization on M2 and My, such a method has several disadvantages, thereby calling for novel solutions. First, after obtaining the entry values {ηi}_{i=1}^k we need to match them to the previously recovered topic distributions {µi}_{i=1}^k. This can easily be done when we have access to the true moments, but it becomes difficult when only estimates of the observable tensors are available, because the estimated moments may not share the same singular vectors due to sampling noise. A more serious problem is that when η has duplicate entries the orthogonal decomposition of My is no longer unique. Though a randomized strategy similar to the one used in [1] might solve the problem, it could substantially increase the sample complexity [2] and render the algorithm impractical. We develop a power update method to resolve these difficulties. 
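Concretely (as detailed in the next paragraph), the power update reads off each regression entry as a quadratic form per recovered eigenvector. A minimal numpy sketch of the mechanics; the (α0 + 2)/2 scaling is taken from step 6 of Alg. 1, and the toy inputs below are purely illustrative.

```python
import numpy as np

def power_update_eta(My_hat, W, Vhat, alpha0):
    """Read off each regression entry as the quadratic form
    eta_i = (alpha0 + 2)/2 * v_i^T My(W, W) v_i,
    where the v_i are the columns of Vhat (whitened, orthonormal)."""
    MyWW = W.T @ My_hat @ W                            # My(W, W)
    quad = np.einsum('ji,jk,ki->i', Vhat, MyWW, Vhat)  # v_i^T MyWW v_i for each i
    return 0.5 * (alpha0 + 2.0) * quad

# mechanics check with trivially whitened toy inputs (alpha0 = 0 gives scale 1)
eta_hat = power_update_eta(np.diag([2.0, 3.0]), np.eye(2), np.eye(2), alpha0=0.0)
assert np.allclose(eta_hat, [2.0, 3.0])
```

Because each ηi is computed from its own vi, the entries come out already matched to the recovered topics, and duplicate values in η cause no ambiguity.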
Specifically, after obtaining the whitened (orthonormal) vectors {vi} ≜ {ci · W^T µi}¹, we recover the entry ηi of the linear regression model directly by computing a power update v_i^T My(W, W) v_i. In this way, the matching problem is automatically solved, because we know which topic distribution vector µi is used when recovering ηi. Furthermore, the singular values (corresponding to the entries of η) do not need to be distinct, because we are not using any unique SVD properties of My(W, W). As a result, our proposed algorithm works for any linear model η.

3.4 Parameter recovery algorithm
An outline of our parameter recovery algorithm for sLDA (Spectral-sLDA) is given in Alg. 1. First, empirical estimates of the observable moments in Definition 1 are computed from the given documents. The simultaneous diagonalization method is then used to reconstruct the topic distribution matrix O and its prior parameter α. After obtaining O = (µ1, ···, µk), we use the power update method introduced in the previous section to recover the linear regression model η.

Alg. 1 admits three hyper-parameters: α0, L and T. α0 is defined as the sum of all entries of the prior parameter α. Following the conventions in [1, 2], we assume that α0 is known a priori and use this value to perform parameter estimation. It should be noted that this is a mild assumption, as in practice usually a homogeneous vector α is assumed and the entire vector is known [20]. The L and T parameters are used to control the number of iterations in the robust tensor power method. In general, the robust tensor power method runs in O(k³LT) time. To ensure sufficient recovery accuracy,

¹ci is a scalar coefficient that depends on α0 and αi. 
See Appendix A.2 for details.

Algorithm 1 Spectral parameter recovery algorithm for sLDA. Input parameters: α0, L, T.
1: Compute empirical moments and obtain M̂2, M̂3 and M̂y.
2: Find Ŵ ∈ R^{V×k} such that M̂2(Ŵ, Ŵ) = I_k.
3: Find robust eigenvalues and eigenvectors (λ̂i, v̂i) of M̂3(Ŵ, Ŵ, Ŵ) using the robust tensor power method [2] with parameters L and T.
4: Recover prior parameters: α̂i ← 4α0(α0+1) / ((α0+2)² λ̂i²).
5: Recover topic distributions: µ̂i ← (α0+2)/2 · λ̂i (Ŵ⁺)^T v̂i.
6: Recover the linear regression model: η̂i ← (α0+2)/2 · v̂i^T M̂y(Ŵ, Ŵ) v̂i.
7: Output: η̂, α̂ and {µ̂i}_{i=1}^k.

L should be at least a linear function of k, and T should be set as T = Ω(log(k) + log log(λmax/ε)), where λmax = (2/(α0+2)) √(α0(α0+1)/αmin) and ε is an error-tolerance parameter. Appendix A.2 and [2] provide a deeper analysis of the choice of the L and T parameters.

3.5 Speeding up moment computation
In Alg. 1, a straightforward computation of the third-order tensor M̂3 requires O(NM³) time and O(V³) storage, where N is the corpus size and M is the number of words per document. Such time and space complexities are clearly prohibitive for real applications, where the vocabulary usually contains tens of thousands of terms. However, we can employ a trick similar to the one in [11] to speed up the moment computation. We first note that only the whitened tensor M̂3(Ŵ, Ŵ, Ŵ) is needed in our algorithm, which takes only O(k³) storage. Another observation is that the most difficult term in M̂3 can be written as Σ_{i=1}^r ci · u_{i,1} ⊗ u_{i,2} ⊗ u_{i,3}, where r is proportional to N and each u_{i,·} contains at most M non-zero entries. This allows us to compute M̂3(Ŵ, Ŵ, Ŵ) in O(NMk) time by computing Σ_{i=1}^r ci · (Ŵ^T u_{i,1}) ⊗ (Ŵ^T u_{i,2}) ⊗ (Ŵ^T u_{i,3}). Appendix B.2 provides more details about this speed-up trick. The overall time complexity is O(NM(M + k²) + V² + k³LT) and the space complexity is O(V² + k³).

4 Sample Complexity Analysis
We now analyze the sample complexity of Alg. 1 required to achieve ε-error with high probability. For clarity, we focus on presenting the main results, while deferring the proof details to Appendix A, including the proofs of the important lemmas needed for the main theorem.

Theorem 1. Let σ1(Õ) and σk(Õ) be the largest and the smallest singular values of the canonical topic distribution matrix Õ. Define λmin ≜ (2/(α0+2)) √(α0(α0+1)/αmax) and λmax ≜ (2/(α0+2)) √(α0(α0+1)/αmin), with αmax and αmin the largest and the smallest entries of α. Suppose µ̂, α̂ and η̂ are the outputs of Algorithm 1, and L is at least a linear function of k. Fix δ ∈ (0, 1). For any small error-tolerance parameter ε > 0, if Algorithm 1 is run with parameter T = Ω(log(k) + log log(λmax/ε)) on N i.i.d. sampled documents (each containing at least 3 words) with N ≥ max(n1, n2, n3), where

n1 = C1 · (1 + √log(6/δ))² · α0²(α0+1)²,
n2 = C2 · ((1 + √log(15/δ))² / (ε² σk(Õ)⁴)) · max((‖η‖ + Φ⁻¹(δ/60σ))², αmax² σ1(Õ)²),
n3 = C3 · ((1 + √log(9/δ))² / σk(Õ)¹⁰) · max(1/ε², k²/λmin²),

and C1, C2 and C3 are universal constants, then with probability at least 1 − δ there exists a permutation π : [k] → [k] such that for every topic i the following holds:

1. |αi − α̂π(i)| ≤ (4α0(α0+1)(λmax + 5ε) / ((α0+2)² λmin² (λmin − 5ε)²)) · 5ε, if λmin > 5ε;
2. ‖µi − µ̂π(i)‖ ≤ (8αmax/λmin + 3σ1(Õ)) · ε;
3. |ηi − η̂π(i)| ≤ (‖η‖/λmin + 5(α0+2)/2 + (α0+2)/λmin + 1) · ε.

Figure 2: Reconstruction errors of Alg. 1. The X axis denotes the training size. Error bars denote the standard deviations measured on 3 independent trials under each setting.

In brief, the proof is based on matrix perturbation lemmas (see Appendix A.1) and an analysis of the orthogonal tensor decomposition methods (including SVD and the robust tensor power method) performed on inaccurate tensor estimations (see Appendix A.2). The sample complexity lower bound consists of three terms, from n1 to n3. 
The n3 term comes from the sample complexity bound for the robust tensor power method [2]; the (‖η‖ + Φ⁻¹(δ/60σ))² term in n2 characterizes the recovery accuracy for the linear regression model η, and the αmax² σ1(Õ)² term arises when we try to recover the topic distribution vectors µ; finally, the term n1 is required so that some technical conditions are met. The n1 term does not depend on either k or σk(Õ), and can be largely neglected in practice.

An important implication of Theorem 1 is that it provides a sufficient condition for a supervised LDA model to be identifiable, as shown in Remark 1. To some extent, Remark 1 is the best identifiability result possible under our inference framework, because it makes no restriction on the linear regression model η, and the linear independence assumption is unavoidable without making further assumptions on the topic distribution matrix O.

Remark 1. Given a sufficiently large number of i.i.d. sampled documents with at least 3 words per document, a supervised LDA model M = (α, µ, η) is identifiable if α0 = Σ_{i=1}^k αi is known and {µi}_{i=1}^k are linearly independent.

We also make remarks on the indirect quantities appearing in Theorem 1 (e.g., σk(Õ)) and give a simplified sample complexity bound for some special cases. They can be found in Appendix A.4.

5 Experiments
5.1 Datasets description and algorithm implementation details
We perform experiments on both synthetic and real-world datasets. The synthetic data are generated in a similar manner as in [22], with a fixed vocabulary of size V = 500. We generate the topic distribution matrix O by first sampling each entry from a uniform distribution and then normalizing every column of O. 
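The synthetic-data recipe of Sec. 5.1 (uniform entries for O with column normalization, a standard-Gaussian η, and a homogeneous prior) is easy to reproduce; a sketch, with an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(42)
V, k = 500, 20
O = rng.uniform(size=(V, k))          # entries sampled from Uniform(0, 1)
O /= O.sum(axis=0, keepdims=True)     # normalize every column to a distribution
eta = rng.standard_normal(k)          # regression model from a standard Gaussian
alpha = np.full(k, 1.0 / k)           # homogeneous prior, so alpha0 = 1
```

Documents and responses are then drawn from this model exactly as in the generative process of Sec. 2.1.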
The linear regression model η is sampled from a standard Gaussian distribution. The prior parameter α is assumed to be homogeneous, i.e., α = (1/k, ···, 1/k). Documents and response variables are then generated from the sLDA model specified in Sec. 2.1.

For real-world data, we use the large-scale dataset built on Amazon movie reviews [16] to demonstrate the practical effectiveness of our algorithm. The dataset contains 7,911,684 movie reviews written by 889,176 users from Aug 1997 to Oct 2012. Each movie review is accompanied by a score from 1 to 5 indicating how much the user likes a particular movie. The median number of words per review is 101. A vocabulary with V = 5,000 terms is built by selecting high-frequency words. We also pre-process the dataset by shifting the review scores so that they have zero mean.

Both Gibbs sampling for the sLDA model in Fig. 1 (b) and the proposed spectral recovery algorithm are implemented in C++. For our spectral algorithm, the hyper-parameters L and T are set to 100, which is sufficiently large for all settings in our experiments. Since Alg. 1 can only recover the topic model itself, we use Gibbs sampling to iteratively sample the topic mixing vector h and the topic assignment z for each word in order to perform prediction on a held-out dataset.

5.2 Convergence of reconstructed model parameters
We demonstrate how the sLDA model reconstructed by Alg. 1 converges to the underlying true model when more observations are available. Fig. 2 presents the 1-norm reconstruction errors of α, η and µ. The number of topics k is set to 20 and the number of words per document (i.e., M) is set

Figure 3: Mean square errors and negative per-word log-likelihood of Alg. 
1 and Gibbs-sLDA. Each document contains M = 500 words. The X axis denotes the training size (×10³).

Figure 4: pR2 scores and negative per-word log-likelihood. The X axis indicates the number of topics. Error bars indicate the standard deviation of 5-fold cross-validation.

to 250 and 500. Since Spectral-sLDA can only recover topic distributions up to a permutation over [k], a minimum-weight graph matching was computed between O and Ô to find an optimal permutation. Fig. 2 shows that the reconstruction errors for all the parameters go down rapidly as we obtain more documents. Furthermore, though Theorem 1 does not involve the number of words per document, the simulation results demonstrate a significant improvement when more words are observed in each document, which is a nice complement to the theoretical analysis.

5.3 Prediction accuracy and per-word likelihood
We compare the prediction accuracy and per-word likelihood of Spectral-sLDA and Gibbs-sLDA on both synthetic and real-world datasets. On the synthetic dataset, the regression error is measured by the mean square error (MSE), and the per-word log-likelihood is defined as log2 p(w | h, O) = log2 Σ_{k=1}^K p(w | z = k, O) p(z = k | h). The hyper-parameters used in our Gibbs sampling implementation are the same as the ones used to generate the datasets.

Fig. 3 shows that Spectral-sLDA consistently outperforms Gibbs-sLDA. Our algorithm also enjoys the advantage of being less variable, as indicated by the curves and error bars. Moreover, when the number of training documents is sufficiently large, the performance of the reconstructed model is very close to that of the underlying true model², which implies that Alg. 1 can correctly identify an sLDA model from its observations, therefore supporting our theory.

We also test both algorithms on the large-scale Amazon movie review dataset. 
The quality of the prediction is assessed with predictive R2 (pR2) [8], a normalized version of the MSE, which is defined as pR2 ≜ 1 − (Σ_i (yi − ŷi)²) / (Σ_i (yi − ȳ)²), where ŷi is the estimate, yi is the truth, and ȳ is the average true value. We report the results under various settings of α and k in Fig. 4, with the σ hyper-parameter of Gibbs-sLDA selected via cross-validation on a smaller subset of documents. Apart from Gibbs-sLDA and Spectral-sLDA, we also test the performance of a hybrid algorithm which performs Gibbs sampling using the models reconstructed by Spectral-sLDA as initializations.

Fig. 4 shows that in general Spectral-sLDA does not perform as well as Gibbs sampling. One possible reason is that real-world datasets are not exact i.i.d. samples from an underlying sLDA model. However, a significant improvement can be observed when the Gibbs sampler is initialized with models reconstructed by Spectral-sLDA instead of random initializations. This is because Spectral-sLDA helps avoid the local-optimum problem of local search methods like Gibbs sampling. Similar improvements for spectral methods were also observed in previous papers [10].

²Due to the randomness in the data generating process, the true model has a non-zero prediction error.

Neg. Log-likeli. 
Table 1: Training time of Gibbs-sLDA and Spectral-sLDA, measured in minutes. k is the number of topics and n is the number of documents used in training.

                        k = 10                              k = 50
n (×10⁴)       1     5     10     50     100       1      5      10      50      100
Gibbs-sLDA    0.6   3.0   6.0   30.5   61.1      2.9   14.3   28.2   145.4   281.8
Spec-sLDA     1.5   1.6   1.7    2.9    4.3      3.1    3.6    4.3     9.5    16.2

Table 2: Prediction accuracy and per-word log-likelihood of Gibbs-sLDA and the hybrid algorithm. The initialization solution is obtained by running Alg. 1 on a collection of 1 million documents, while n is the number of documents used in Gibbs sampling. k = 8 topics are used.

                      predictive R2                  Negative per-word log-likelihood
log10 n        3       4       5       6          3       4       5       6
Gibbs-sLDA   0.00    0.04    0.11    0.14       7.72    7.55    7.45    7.42
            (0.01)  (0.02)  (0.02)  (0.01)     (0.01)  (0.01)  (0.01)  (0.01)
Hybrid       0.02    0.17    0.18    0.18       7.70    7.49    7.40    7.36
            (0.01)  (0.03)  (0.03)  (0.03)     (0.01)  (0.02)  (0.01)  (0.01)

Note that for k > 8 the performance of Spectral-sLDA significantly deteriorates. This phenomenon can be explained by the nature of Spectral-sLDA itself: one crucial step in Alg. 1 is to whiten the empirical moment M̂2, which is only possible when the underlying topic matrix O has full rank. For the Amazon movie review dataset, it is impossible to whiten M̂2 when the underlying model contains more than 8 topics. This interesting observation shows that the Spectral-sLDA algorithm can also be used for model selection, to avoid the overfitting caused by using too many topics.

5.4 Time efficiency
The proposed spectral recovery algorithm is very time efficient because it avoids the time-consuming iterative steps of traditional inference and sampling methods. 
Furthermore, empirical moment computation, the most time-consuming part of Alg. 1, consists of only elementary operations and can be easily optimized. Table 1 compares the training time of Gibbs-sLDA and Spectral-sLDA and shows that our proposed algorithm is over 15 times faster than Gibbs sampling, especially for large document collections. Although both algorithms are implemented in a single-threaded manner, Spectral-sLDA is very easy to parallelize because, unlike iterative local search methods, the moment computation step in Alg. 1 does not require much communication or synchronization.
There might be concerns about the claimed time efficiency, however, because significant performance improvements could only be observed when Spectral-sLDA is used together with Gibbs-sLDA, and the Gibbs sampling step might slow down the entire procedure. To see why this is not the case, we show in Table 2 that in order to obtain high-quality models and predictions, only a very small collection of documents is needed after the model reconstruction of Alg. 1. In contrast, Gibbs-sLDA with random initialization requires more data to achieve reasonable performance.
To get a more intuitive sense of how fast our proposed method is, we combine Tables 1 and 2: by running Spectral-sLDA on 10⁶ documents and then post-processing the reconstructed models with Gibbs sampling on only 10⁴ documents, we obtain a pR² score of 0.17 in 5.8 minutes, while Gibbs-sLDA takes over an hour to process a million documents with a pR² score of only 0.14. Similarly, the hybrid method takes only 10 minutes to reach a per-word likelihood comparable to that of the Gibbs sampling algorithm, which requires more than an hour of running time.

6 Conclusion
We propose a novel spectral decomposition based method to reconstruct supervised LDA models from labeled documents.
Although our work has mainly focused on tensor decomposition based algorithms, it is an interesting open problem whether NMF based methods could also be applied to obtain a better sample complexity bound and superior practical performance for supervised topic models.

Acknowledgement
The work was done when Y.W. was at Tsinghua. The work is supported by the National Basic Research Program of China (No. 2013CB329403), National NSF of China (Nos. 61322308, 61332007), and Tsinghua University Initiative Scientific Research Program (No. 20121088071).

References
[1] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and Y.-K. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. arXiv:1204.6703, 2012.
[2] A. Anandkumar, R. Ge, D. Hsu, S. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv:1210.7559, 2012.
[3] A. Anandkumar, D. Hsu, and S. Kakade. A method of moments for mixture models and hidden Markov models. arXiv:1203.0683, 2012.
[4] S. Arora, R. Ge, Y. Halpern, D. Mimno, and A. Moitra. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.
[5] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization - provably. In STOC, 2012.
[6] S. Arora, R. Ge, and A. Moitra. Learning topic models - going beyond SVD. In FOCS, 2012.
[7] V. Bittorf, B. Recht, C. Re, and J. Tropp. Factoring nonnegative matrices with linear programs. In NIPS, 2012.
[8] D. Blei and J. McAuliffe. Supervised topic models. In NIPS, 2007.
[9] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[10] A. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. In ICML, 2013.
[11] S. Cohen and M. Collins. Tensor decomposition for fast parsing with latent-variable PCFGs. In NIPS, 2012.
[12] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[13] J. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with applications to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.
[14] S. Lacoste-Julien, F. Sha, and M. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, 2008.
[15] S. Leurgans, R. Ross, and R. Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.
[16] J. McAuley and J. Leskovec. From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. In WWW, 2013.
[17] A. Moitra. Algorithmic aspects of machine learning. 2014.
[18] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In SIGKDD, 2008.
[19] R. Redner and H. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
[20] M. Steyvers and T. Griffiths. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2007.
[21] C. Wang, D. Blei, and F.-F. Li. Simultaneous image classification and annotation. In CVPR, 2009.
[22] J. Zhu, A. Ahmed, and E. Xing. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research, 13:2237–2278, 2012.
[23] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with data augmentation. Journal of Machine Learning Research, 15:1073–1110, 2014.
[24] J. Zhu and E. Xing. Sparse topic coding. In UAI, 2011.