{"title": "A provable SVD-based algorithm for learning topics in dominant admixture corpus", "book": "Advances in Neural Information Processing Systems", "page_first": 1997, "page_last": 2005, "abstract": "Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from such a collection of documents drawn from admixtures, is NP-hard. Making a strong assumption called separability, [4] gave the first provable algorithm for inference. For the widely used LDA model, [6] gave a provable algorithm using clever tensor-methods. But [4, 6] do not learn topic vectors with bounded $l_1$ error (a natural measure for probability vectors). Our aim is to develop a model which makes intuitive and empirically supported assumptions and to design an algorithm with natural, simple components such as SVD, which provably solves the inference problem for the model with bounded $l_1$ error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic specific Catchwords, a group of words which occur with strictly greater frequency in a topic than any other topic individually and are required to have high frequency together rather than individually. A major contribution of the paper is to show that under this more realistic assumption, which is empirically verified on real corpora, a singular value decomposition (SVD) based algorithm with a crucial pre-processing step of thresholding, can provably recover the topics from a collection of documents drawn from Dominant admixtures. Dominant admixtures are convex combination of distributions in which one distribution has a significantly higher contribution than the others. 
Apart from the simplicity of the algorithm, the sample complexity has near optimal dependence on $w_0$, the lowest probability that a topic is dominant, and is better than [4]. Empirical evidence shows that on several real world corpora, both Catchwords and Dominant admixture assumptions hold and the proposed algorithm substantially outperforms the state of the art [5].", "full_text": "A provable SVD-based algorithm for learning topics\n\nin dominant admixture corpus\n\nTrapit Bansal\u2020, C. Bhattacharyya\u2021\u2217\n\nDepartment of Computer Science and Automation\n\nIndian Institute of Science\nBangalore -560012, India\n\n\u2020trapitbansal@gmail.com\n\u2021chiru@csa.iisc.ernet.in\n\nRavindran Kannan\nMicrosoft Research\n\nIndia\n\nkannan@microsoft.com\n\nAbstract\n\nTopic models, such as Latent Dirichlet Allocation (LDA), posit that documents\nare drawn from admixtures of distributions over words, known as topics. The\ninference problem of recovering topics from such a collection of documents drawn\nfrom admixtures, is NP-hard. Making a strong assumption called separability, [4]\ngave the \ufb01rst provable algorithm for inference. For the widely used LDA model,\n[6] gave a provable algorithm using clever tensor-methods. But [4, 6] do not learn\ntopic vectors with bounded l1 error (a natural measure for probability vectors).\nOur aim is to develop a model which makes intuitive and empirically supported\nassumptions and to design an algorithm with natural, simple components such as\nSVD, which provably solves the inference problem for the model with bounded l1\nerror. A topic in LDA and other models is essentially characterized by a group of\nco-occurring words. Motivated by this, we introduce topic speci\ufb01c Catchwords,\na group of words which occur with strictly greater frequency in a topic than any\nother topic individually and are required to have high frequency together rather\nthan individually. 
A major contribution of the paper is to show that under this\nmore realistic assumption, which is empirically veri\ufb01ed on real corpora, a singu-\nlar value decomposition (SVD) based algorithm with a crucial pre-processing step\nof thresholding, can provably recover the topics from a collection of documents\ndrawn from Dominant admixtures. Dominant admixtures are convex combination\nof distributions in which one distribution has a signi\ufb01cantly higher contribution\nthan the others. Apart from the simplicity of the algorithm, the sample complexity\nhas near optimal dependence on w0, the lowest probability that a topic is domi-\nnant, and is better than [4]. Empirical evidence shows that on several real world\ncorpora, both Catchwords and Dominant admixture assumptions hold and the pro-\nposed algorithm substantially outperforms the state of the art [5].\n\n1\n\nIntroduction\n\nTopic models [1] assume that each document in a text corpus is generated from an ad-mixture of\ntopics, where, each topic is a distribution over words in a Vocabulary. An admixture is a convex\ncombination of distributions. Words in the document are then picked in i.i.d. trials, each trial has a\nmultinomial distribution over words given by the weighted combination of topic distributions. The\nproblem of inference, recovering the topic distributions from such a collection of documents, is\nprovably NP-hard. Existing literature pursues techniques such as variational methods [2] or MCMC\nprocedures [3] for approximating the maximum likelihood estimates.\n\n\u2217http://mllab.csa.iisc.ernet.in/tsvd\n\n1\n\n\fGiven the intractability of the problem one needs further assumptions on topics to derive polynomial\ntime algorithms which can provably recover topics. A possible (strong) assumption is that each\ndocument has only one topic but the collection can have many topics. A document with only one\ntopic is sometimes referred as a pure topic document. 
[7] proved that a natural algorithm, based\non SVD, recovers topics when each document is pure and in addition, for each topic, there is a set\nof words, called primary words, whose total frequency in that topic is close to 1. More recently,\n[6] show using tensor methods that if the topic weights have Dirichlet distribution, we can learn\nthe topic matrix. Note that while this allows non-pure documents, the Dirichlet distribution gives\nessentially uncorrelated topic weights.\nIn an interesting recent development [4, 5] gave the \ufb01rst provable algorithm which can recover topics\nfrom a corpus of documents drawn from admixtures, assuming separability. Topics are said to be\nseparable if in every topic there exists at least one Anchor word. A word in a topic is said to be an\nAnchor word for that topic if it has a high probability in that topic and zero probability in remaining\ntopics. The requirement of high probability in a topic for a single word is unrealistic.\n\nOur Contributions: Topic distributions, such as those learnt in LDA, try to model the co-\noccurrence of a group of words which describes a theme. Keeping this in mind we introduce the\nnotion of Catchwords. A group of words are called Catchwords of a topic, if each word occurs\nstrictly more frequently in the topic than other topics and together they have high frequency. This\nis a much weaker assumption than separability. Furthermore we observe, empirically, that posterior\ntopic weights assigned by LDA to a document often have the property that one of the weights is\nsigni\ufb01cantly higher than the rest. Motivated by this observation, which has not been exploited by\ntopic modeling literature, we suggest a new assumption. It is natural to assume that in a text corpus,\na document, even if it has multiple themes, will have an overarching dominant theme. In this paper\nwe focus on document collections drawn from dominant admixtures. 
A document collection is said to be drawn from a dominant admixture if, for every document, there is one topic whose weight is significantly higher than the other topics and, in addition, for every topic there is a small fraction of documents which are nearly purely on that topic. The main contribution of the paper is to show that under these assumptions, our algorithm, which we call TSVD, indeed provably finds a good approximation in total l1 error to the topic matrix. We prove a bound on the error of our approximation which does not grow with dictionary size d, unlike [5], where the error grows linearly with d. Empirical evidence shows that on semi-synthetic corpora constructed from several real world datasets, as suggested by [5], TSVD substantially outperforms the state of the art [5]. In particular, compared to [5], TSVD gives 27% lower error in terms of l1 recovery on 90% of the topics.

Problem Definition: d, k, s will denote, respectively, the number of words in the dictionary, the number of topics and the number of documents. d, s are large, whereas k is to be thought of as much smaller. Let $S_n = \{x = (x_1, x_2, \ldots, x_n) : x_l \ge 0; \sum_l x_l = 1\}$ denote the probability simplex. For each topic, there is a fixed vector in $S_d$ giving the probability of each word in that topic. Let M be the d x k matrix with these vectors as its columns. Documents are picked in i.i.d. trials. To pick document j, one first picks a k-vector $W_{1j}, W_{2j}, \ldots, W_{kj}$ of topic weights according to a fixed distribution on $S_k$. Let $P_{\cdot,j} = MW_{\cdot,j}$ be the weighted combination of the topic vectors. Then the m words of the document are picked in i.i.d. trials; each trial picks a word according to the multinomial distribution with $P_{\cdot,j}$ as the probabilities. All that is given as data is the frequency of words in each document; namely, we are given the d x s matrix A, where $A_{ij} = (\text{number of occurrences of word } i \text{ in document } j)/m$. Note that E(A|W) = P, where the expectation is taken entry-wise. In this paper we consider the problem of finding M given A.

2 Previous Results

In this section we review literature related to designing provable algorithms for topic models. For an overview of topic models we refer the reader to the excellent survey of [1]. Provable algorithms for recovering topic models were started by [7]. Latent Semantic Indexing (LSI) [8] remains a successful method for retrieving similar documents by using SVD. [7] showed that one can recover M from a collection of documents with pure topics by an SVD based procedure under the additional Primary Words assumption. [6] showed that in the admixture case, if one assumes a Dirichlet distribution for the topic weights, then, indeed, using tensor methods, one can learn M to l2 error, provided some added assumptions on numerical parameters like the condition number are satisfied.

The first provably polynomial time algorithm for an admixture corpus was given in [4, 5]. For a topic l, a word i is an anchor word if: $M_{i,l} \ge p_0$ and $M_{i,l'} = 0\ \forall l' \ne l$.

Theorem 2.1 [4] If every topic has an anchor word, there is a polynomial time algorithm that returns an $\hat{M}$ such that, with high probability,
$$\sum_{l=1}^{k}\sum_{i=1}^{d} |\hat{M}_{il} - M_{il}| \le d\varepsilon \quad \text{provided} \quad s \ge \mathrm{Max}\left\{ O\!\left(\frac{k^6 \log d}{a^4 \varepsilon^2 p_0^6 \gamma^2 m}\right),\ O\!\left(\frac{k^4}{\gamma^2 a^2}\right) \right\},$$
where $\gamma$ is the condition number of $E(WW^T)$, a is the minimum expected weight of a topic and m is the number of words in each document.

Note that the error grows linearly in the dictionary size d, which is often large. Note also the dependence of s on the parameter $p_0$, which is $1/p_0^6$, and on a, which is $1/a^4$.
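For concreteness, the generative process in the problem definition above can be simulated in a few lines. This is a minimal sketch assuming numpy; the sizes d, k, s, m and the Dirichlet parameters are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, s, m = 500, 5, 200, 100  # dictionary size, topics, documents, words per document

# Topic matrix M: each column is a probability distribution over the d words.
M = rng.dirichlet(np.ones(d) * 0.05, size=k).T           # shape (d, k)

# Topic weights W: each column is a distribution over the k topics.
W = rng.dirichlet(np.ones(k) * 0.1, size=s).T            # shape (k, s)

# P = MW gives each document's word distribution; renormalize columns to
# guard against floating-point drift before multinomial sampling.
P = M @ W
P = P / P.sum(axis=0, keepdims=True)

# Each document: m i.i.d. multinomial draws; A stores relative frequencies.
A = np.column_stack([rng.multinomial(m, P[:, j]) for j in range(s)]) / m
```

Here `A` plays the role of the observed word-document frequency matrix, and `E(A|W) = P` holds entry-wise by construction.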
If, say, the word "run" is an anchor word for the topic "baseball" and $p_0 = 0.1$, then the requirement is that every 10th word in a document on this topic is "run". This seems too strong to be realistic. It would be more realistic to ask that a set of words like "run", "hit", "score", etc. together have frequency at least 0.1, which is what our Catchwords assumption does.

3 Learning Topics from Dominant Admixtures

Informally, a document is said to be drawn from a Dominant Admixture if the document has one dominant topic. Besides its simplicity, we show empirical evidence from real corpora to demonstrate that topic dominance is a reasonable assumption. The Dominant Topic assumption is weaker than the Pure Topic assumption. More importantly, the SVD based procedures proposed by [7] will not apply. Inspired by the Primary Words assumption, we introduce the assumption that each topic has a set of Catchwords which individually occur more frequently in that topic than in others. This is again a much weaker assumption than both the Primary Words and Anchor Words assumptions and can be verified experimentally. In this section we establish that applying SVD to a matrix obtained by thresholding the word-document matrix, followed by k-means clustering, can learn topics having Catchwords from a Dominant Admixture corpus.

3.1 Assumptions: Catchwords and Dominant admixtures

Let $\alpha, \beta, \rho, \delta, \varepsilon_0$ be non-negative reals satisfying:
$$\beta + \rho \le (1-\delta)\alpha, \qquad \alpha + 2\delta \le 0.5, \qquad \delta \le 0.08.$$

Dominant Topic Assumption: (a) For $j = 1, 2, \ldots, s$, document j has a dominant topic $l(j)$ such that $W_{l(j),j} \ge \alpha$ and $W_{l',j} \le \beta,\ \forall l' \ne l(j)$. (b) For each topic l, there are at least $\varepsilon_0 w_0 s$ documents in each of which topic l has weight at least $1 - \delta$.

Catchwords Assumption: There are k disjoint sets of words $S_1, S_2, \ldots, S_k$ such that, with $\varepsilon$ defined in (5),
$$\forall i \in S_l,\ \forall l' \ne l:\ M_{il'} \le \rho M_{il}, \qquad \sum_{i \in S_l} M_{il} \ge p_0, \qquad \forall i \in S_l:\ m\delta^2\alpha M_{il} \ge 8 \ln\!\left(\frac{20}{\varepsilon w_0}\right). \tag{1}$$

Part (b) of the Dominant Topic Assumption is in a sense necessary for "identifiability", namely for the model to have a set of k document vectors such that every document vector is in the convex hull of these vectors. The Catchwords Assumption is natural for describing a theme, as it tries to model a unique group of words which is likely to co-occur when a theme is expressed. This assumption is close to topics discovered by LDA-like models, which try to model co-occurrence of words. If $\alpha, \delta \in \Omega(1)$, then assumption (1) says $M_{il} \in \Omega^*(1/m)$. In fact, if $M_{il} \in o(1/m)$, we do not expect to see word i (in topic l), so it cannot be called a catchword at all.

A slightly different (but equivalent) description of the model will be useful to keep in mind. What is fixed (not stochastic) are the matrix M and the distribution of the weight matrix W. To pick document j, we can first pick the dominant topic l in document j and condition the distribution of $W_{\cdot,j}$ on this being the dominant topic. One could instead also think of $W_{\cdot,j}$ being picked from a mixture of k distributions. Then we let $P_{ij} = \sum_{l=1}^{k} M_{il}W_{lj}$ and pick the m words of the document in i.i.d. multinomial trials as before. We will assume that
$$T_l = \{j : l \text{ is the dominant topic in document } j\} \text{ satisfies } |T_l| = w_l s,$$
where $w_l$ is the probability of topic l being dominant. This is only approximately valid, but the error is small enough that we can disregard it.

For $\zeta \in \{0, 1, 2, \ldots, m\}$, let $p_i(\zeta, l)$ be the probability that $j \in T_l$ and $A_{ij} = \zeta/m$, and $q_i(\zeta, l)$ the corresponding "empirical probability":
$$p_i(\zeta, l) = \int_{W_{\cdot,j}} \binom{m}{\zeta} P_{ij}^{\zeta}(1-P_{ij})^{m-\zeta}\, \mathrm{Prob}(W_{\cdot,j} \mid j \in T_l)\, \mathrm{Prob}(j \in T_l), \quad \text{where } P_{\cdot,j} = MW_{\cdot,j}, \tag{2}$$
$$q_i(\zeta, l) = \frac{1}{s}\,\bigl|\{j \in T_l : A_{ij} = \zeta/m\}\bigr|. \tag{3}$$

Note that $p_i(\zeta, l)$ is a real number, whereas $q_i(\zeta, l)$ is a random variable with $E(q_i(\zeta, l)) = p_i(\zeta, l)$. We need a technical assumption on the $p_i(\zeta, l)$ (which is weaker than unimodality).

No-Local-Min Assumption: We assume that $p_i(\zeta, l)$ does not have a local minimum, in the sense:
$$p_i(\zeta, l) > \mathrm{Min}\bigl(p_i(\zeta - 1, l),\ p_i(\zeta + 1, l)\bigr) \quad \forall\ \zeta \in \{1, 2, \ldots, m-1\}. \tag{4}$$

The justification for this assumption is two-fold. First, a Zipf's-law kind of behaviour, where the number of words plotted against relative frequency declines as a power function, has often been observed. Such a plot is monotonically decreasing and indeed satisfies our assumption. But for Catchwords we do not expect this behaviour: indeed, we expect the curve to go up initially as the relative frequency increases, then reach a maximum and then decline. This is a unimodal function and also satisfies our assumption.

Relative sizes of parameters: Before we close this section, a discussion on the values of the parameters is in order. Here, s is large; for asymptotic analysis, we can think of it as going to infinity. $1/w_0$ is also large and can be thought of as going to infinity. [In fact, if $1/w_0 \in O(1)$, then, intuitively, there is no use for a corpus of more than constant size; since our model has i.i.d. documents, intuitively, the number of samples we need should depend mainly on $1/w_0$.]
m is (much) smaller, but need not be constant. c refers to a generic constant independent of $m, s, 1/w_0, \varepsilon, \delta$; its value may be different in different contexts.

3.2 The TSVD Algorithm

Existing SVD based procedures for clustering on raw word-document matrices fail because the spread of frequencies of a word within a topic is often more than (at least not significantly less than) the gap between the word's frequencies in two different topics. Hypothetically, the frequency of the word run in the topic Sports may range upwards of 0.01, say, while in other topics it may range from, say, 0 to 0.005. The success of the algorithm lies in correctly identifying dominant topics such as Sports by identifying that the word run has occurred with high frequency. In this example, the gap (0.01 - 0.005) between Sports and other topics is less than the spread within Sports (1.0 - 0.01), so a 2-clustering approach (based on SVD) will split the topic Sports into two. While this is a toy example, note that if we threshold the frequencies at, say, 0.01, ideally Sports will be all above and the rest all below the threshold, making the succeeding job of clustering easy.

There are several issues in extending beyond the toy case. Data is not one-dimensional. We will use a different threshold for each word; word i will have a threshold $\zeta_i/m$, which we have to compute. Ideally, we would not like to split any $T_l$; namely, we would like that for each l and each i, either most $j \in T_l$ have $A_{ij} > \zeta_i/m$ or most $j \in T_l$ have $A_{ij} \le \zeta_i/m$. We will show that our threshold procedure indeed achieves this. One other nuance: to avoid conditioning, we split the data A into two parts A(1) and A(2), compute the thresholds using A(1) and actually do the thresholding on A(2). We will assume that the initial A had 2s columns, so each part now has s columns. Also, $T_1, T_2, \ldots, T_k$ partitions the columns of A(1) as well as those of A(2). The columns of the thresholded matrix B are then clustered by a technique we call Project and Cluster, namely, we project the columns of B to its k-dimensional SVD subspace and cluster in the projection. The projection before clustering has recently been proven [9] (see also [10]) to yield good starting cluster centers. The clustering so found is not yet satisfactory; we refine it using the classic Lloyd's k-means algorithm [12]. As we will show, the partition $\{R_1, \ldots, R_k\}$ of A(2) produced after clustering is close to the partition $\{T_1, \ldots, T_k\}$ induced by the Dominant Topics. Catchwords of topic l are then (approximately) identified as the most frequently occurring words in documents in $R_l$. Finally, we identify nearly pure documents in $T_l$ (approximately) as the documents in which the catchwords occur the most, and we get an approximation to $M_{\cdot,l}$ by averaging these nearly pure documents. We now describe the precise algorithm.

3.3 Topic recovery using Thresholded SVD

Threshold SVD based K-means (TSVD). Let
$$\varepsilon = \mathrm{Min}\left(\frac{1}{900c_0^2},\ \frac{\sqrt{\varepsilon_0}\sqrt{\alpha p_0 \delta}}{640\,m\,k},\ \frac{\alpha p_0}{k^3 m}\right). \tag{5}$$

1. Randomly partition the columns of A into two matrices A(1) and A(2) of s columns each.
2. Thresholding
   (a) Compute thresholds on A(1): for each i, let $\zeta_i$ be the highest value of $\zeta \in \{0, 1, 2, \ldots, m\}$ such that $|\{j : A^{(1)}_{ij} > \zeta/m\}| \ge \frac{w_0}{2}s$ and $|\{j : A^{(1)}_{ij} = \zeta/m\}| \le 3\varepsilon w_0 s$.
   (b) Do the thresholding on A(2): $B_{ij} = \sqrt{\zeta_i}$ if $A^{(2)}_{ij} > \zeta_i/m$ and $\zeta_i \ge 8\ln(20/\varepsilon w_0)$; $B_{ij} = 0$ otherwise.
3. SVD: Find the best rank-k approximation B(k) to B.
4. Identify Dominant Topics
   (a) Project and Cluster: Find an (approximately) optimal k-means clustering of the columns of B(k).
   (b) Lloyd's Algorithm: Using the clustering found in Step 4(a) as the starting clustering, apply Lloyd's k-means algorithm to the columns of B (B, not B(k)).
   (c) Let $R_1, R_2, \ldots, R_k$ be the k-partition of [s] corresponding to the clustering after Lloyd's. //* We will prove that $R_l \approx T_l$ *//
5. Identify Catchwords
   (a) For each i, l, compute $g(i, l) =$ the $(\lfloor \varepsilon_0 w_0 s/2 \rfloor)$th highest element of $\{A^{(2)}_{ij} : j \in R_l\}$.
   (b) Let $J_l = \left\{ i : g(i, l) > \mathrm{Max}\left( \frac{4}{m\delta^2}\ln\left(\frac{20}{\varepsilon w_0}\right),\ \gamma\, \mathrm{Max}_{l' \ne l}\, g(i, l') \right) \right\}$, where $\gamma = \frac{1-2\delta}{(1+\delta)(\beta+\rho)}$.
6. Find Topic Vectors: Find the $\lfloor \varepsilon_0 w_0 s/2 \rfloor$ highest values of $\sum_{i \in J_l} A^{(2)}_{ij}$ among all $j \in [s]$ and return the average of these $A_{\cdot,j}$ as our approximation $\hat{M}_{\cdot,l}$ to $M_{\cdot,l}$.

Theorem 3.1 (Main Theorem) Under the Dominant Topic, Catchwords and No-Local-Min assumptions, the algorithm succeeds with high probability in finding an $\hat{M}$ so that
$$\sum_{i,l} |M_{il} - \hat{M}_{il}| \in O(k\delta), \quad \text{provided} \quad s \in \Omega^*\!\left(\frac{1}{w_0}\left(\frac{k^6 m^2}{\alpha^2 p_0^2} + \frac{m^2 k}{\varepsilon_0^2 \delta^2 \alpha p_0} + \frac{d}{\varepsilon_0 \delta^2}\right)\right).$$

(Footnote 1: The superscript * hides a logarithmic factor in $dsk/\delta_{\text{fail}}$, where $\delta_{\text{fail}} > 0$ is the desired upper bound on the probability of failure.)

A note on the sample complexity is in order.
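Steps 1-4 of the algorithm above can be sketched in code. This is an illustrative re-implementation, not the authors' released code: the k-means initialization is simplified to random column seeding, Lloyd's loop runs a fixed number of iterations, and the parameters `w0` and `eps` are left to the caller:

```python
import numpy as np
from numpy.linalg import svd

def tsvd_cluster(A, k, m, w0, eps, rng):
    """Steps 1-4 of TSVD (sketch): split, threshold, rank-k SVD, k-means."""
    d, n = A.shape
    s = n // 2
    cols = rng.permutation(n)
    A1, A2 = A[:, cols[:s]], A[:, cols[s:]]          # Step 1: random split

    # Step 2(a): per-word thresholds zeta_i computed on A1. zeta_i is the
    # highest zeta with >= (w0/2)s entries above zeta/m and few entries at it.
    zeta = np.zeros(d, dtype=int)
    for i in range(d):
        for z in range(m, -1, -1):
            if (np.sum(A1[i] > z / m) >= w0 * s / 2 and
                    np.sum(np.isclose(A1[i], z / m)) <= 3 * eps * w0 * s):
                zeta[i] = z
                break

    # Step 2(b): thresholding applied to A2.
    cutoff = 8 * np.log(20 / (eps * w0))
    B = np.where((A2 > zeta[:, None] / m) & (zeta[:, None] >= cutoff),
                 np.sqrt(zeta[:, None]), 0.0)

    # Step 3: best rank-k approximation B(k) of B.
    U, S, Vt = svd(B, full_matrices=False)
    Bk = (U[:, :k] * S[:k]) @ Vt[:k]

    # Step 4: seed centers from columns of B(k), then Lloyd's iterations on B.
    centers = Bk[:, rng.choice(s, k, replace=False)].T   # shape (k, d)
    labels = np.zeros(s, dtype=int)
    for _ in range(20):
        dist = ((B.T[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for l in range(k):
            if np.any(labels == l):
                centers[l] = B[:, labels == l].mean(axis=1)
    return labels
```

The returned `labels` corresponds to the partition $R_1, \ldots, R_k$ of the columns of A(2).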
Notably, the dependence of s on $w_0$ is best possible (namely $s \in \Omega^*(1/w_0)$) up to logarithmic factors, since, if we had fewer than $1/w_0$ documents, a topic which is dominant with probability only $w_0$ may have none of the documents in the collection. The dependence of s on d needs to be at least $d/\varepsilon_0 w_0 \delta^2$: to see this, note that we only assume that there are $r = O(\varepsilon_0 w_0 s)$ nearly pure documents on each topic. Assuming we can find this set (the algorithm approximately does), their average has standard deviation of about $\sqrt{M_{il}}/\sqrt{r}$ in coordinate i. If the topic vector $M_{\cdot,l}$ has O(d) entries, each of size O(1/d), then to get an approximation of $M_{\cdot,l}$ to l1 error $\delta$, we need the per-coordinate error $1/\sqrt{dr}$ to be at most $\delta/d$, which implies $s \ge d/\varepsilon_0 w_0 \delta^2$. Note that to get comparable error in [4], we need a quadratic dependence on d.

There is a long sequence of lemmas to prove the theorem; to improve the readability of the paper, we relegate the proofs to the supplementary material [14]. The essence of the proof lies in showing that the clustering step correctly identifies the partition induced by the dominant topics. For this, we take advantage of a recent development on the k-means algorithm from [9] (see also [10]), where it is shown that under a condition called the Proximity Condition, Lloyd's k-means algorithm, starting with the centers provided by the SVD-based algorithm, correctly identifies almost all the documents' dominant topics. We prove that the Proximity Condition indeed holds. This calls for machinery from random matrix theory (in particular, bounds on singular values); we prove that the singular values of the thresholded word-document matrix are nicely bounded. Once the dominant topic of each document is identified, we are able to find the Catchwords for each topic. Now we rely upon part (b) of the Dominant Topic Assumption: there is a small fraction of nearly pure-topic documents for each topic. The Catchwords help isolate the nearly pure-topic documents and hence find the topic vectors. The proofs are complicated by the fact that each step of the algorithm induces conditioning on the data; for example, after clustering, the document vectors in one cluster are not independent anymore.

4 Experimental Results

We compare the thresholded SVD based k-means algorithm (TSVD2) of Section 3.3 with the algorithms of [5], Recover-KL and Recover-L2, using the code made available by the authors3. We observed the results of Recover-KL to be better than those of Recover-L2, and report here the results of Recover-KL (abbreviated R-KL); the full set of results can be found in supplementary Section 5. We first provide empirical support for the algorithm assumptions of Section 3.1, namely the dominant topic and the catchwords assumptions. Then we show on 4 different semi-synthetic datasets that TSVD provides as good or better recovery of topics than the Recover algorithms. Finally, on real-life datasets, we show that the algorithm performs as well as [5] in terms of perplexity and topic coherence.

Implementation Details: The TSVD parameters $(w_0, \varepsilon, \varepsilon_0, \gamma)$ are not known in advance for a real corpus. We tested empirically for multiple settings and the following values gave the best performance. Thresholding parameters used were: $w_0 = 1/k$, $\varepsilon = 1/6$. For finding the catchwords (step 5), $\gamma = 1.1$, $\varepsilon_0 = 1$. For finding the topic vectors (step 6), taking the top 50% ($\varepsilon_0 w_0 = 1/k$) gave empirically better results. The same values were used on all the datasets tested. The new algorithm is sensitive to the initialization of the first k-means step in the projected SVD space.
To remedy this, we run 10 independent random initializations of the algorithm with K-Means++ [13] and report the best result.

Datasets: We use four real world datasets in the experiments. As pre-processing steps, we removed standard stop-words, selected the vocabulary by term frequency and removed documents with fewer than 20 words. Datasets used are: (1) NIPS4: 1,500 NIPS full papers, vocabulary of 2,000 words and mean document length 1023. (2) NYT4: a random subset of 30,000 documents from the New York Times dataset, vocabulary of 5,000 words and mean document length 238. (3) Pubmed4: a random subset of 30,000 documents from the Pubmed abstracts dataset, vocabulary of 5,030 words and mean document length 58. (4) 20NewsGroup5 (20NG): 13,389 documents, vocabulary of 7,118 words and mean document length 160.

(Footnote 2: Resources available at: http://mllab.csa.iisc.ernet.in/tsvd)
(Footnote 3: http://www.cs.nyu.edu/~halpern/files/anchor-word-recovery.zip)
(Footnote 4: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words)
(Footnote 5: http://qwone.com/~jason/20Newsgroups)

Corpus | s     | k  | % s with Dominant Topics (alpha = 0.4) | % s with Pure Topics (delta = 0.05) | % Topics with CW | CW Mean Frequency
NIPS   | 1500  | 50 | 56.6% | 2.3%  | 96% | 0.05
NYT    | 30000 | 50 | 63.7% | 8.5%  | 98% | 0.07
Pubmed | 30000 | 50 | 62.2% | 5.1%  | 78% | 0.05
20NG   | 13389 | 20 | 74.1% | 39.5% | 85% | 0.06

Table 1: Algorithm Assumptions. For the dominant topic assumption, the fraction of documents which satisfy the assumption for $(\alpha, \beta) = (0.4, 0.3)$ is shown. The fraction of documents with almost pure topics ($\delta = 0.05$, i.e. 95% pure) is also shown. The last two columns show results for the catchwords (CW) assumption.

4.1 Algorithm Assumptions

To check the dominant topic and catchwords assumptions, we first run 1000 iterations of Gibbs sampling on the real corpus and learn the posterior document-topic distribution ($\{W_{\cdot,j}\}$) for each document in the corpus (by averaging over 10 saved states separated by 50 iterations, after the 500 burn-in iterations). We use this posterior document-topic distribution as the document-generating distribution to check the two assumptions.

Dominant topic assumption: Table 1 shows the fraction of the documents in each corpus which satisfy this assumption with $\alpha = 0.4$ (minimum probability of the dominant topic) and $\beta = 0.3$ (maximum probability of non-dominant topics). The fraction of documents which have almost pure topics, with highest topic weight at least 0.95 ($\delta = 0.05$), is also shown. The results indicate that the dominant topic assumption is well justified (on average 64% of documents satisfy the assumption) and there is also a substantial fraction of documents satisfying the almost pure topic assumption.

Catchwords assumption: We first find a k-clustering $\{T_1, \ldots, T_k\}$ of the documents by assigning all documents which have highest posterior probability for the same topic to one cluster. Then we use step 5 of TSVD (Section 3.3) to find the set of catchwords $\{S_1, \ldots, S_k\}$ for each topic-cluster, with the parameters: $\varepsilon_0 w_0 = 1/3k$, $\gamma = 2.3$ (taking into account the constraints in Section 3.1, with $\alpha = 0.4$, $\beta = 0.3$, $\delta = 0.05$, $\rho = 0.07$). Table 1 reports the fraction of topics with a non-empty set of catchwords and the average per-topic frequency of the catchwords6. The results indicate that most topics on real data contain catchwords (Table 1, second-last column).
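The dominant-topic check just described can be sketched as follows. This is a hypothetical helper (the function name and toy data are illustrative); it assumes a k x s posterior topic-weight matrix W, e.g. from Gibbs sampling, with alpha, beta, delta as in Table 1:

```python
import numpy as np

def dominant_topic_stats(W, alpha=0.4, beta=0.3, delta=0.05):
    """Fraction of documents satisfying the dominant topic assumption
    (top weight >= alpha, all other weights <= beta) and the almost-pure
    assumption (top weight >= 1 - delta). W is k x s, columns sum to 1."""
    top = W.max(axis=0)
    second = np.sort(W, axis=0)[-2]          # largest non-dominant weight
    dominant = np.mean((top >= alpha) & (second <= beta))
    pure = np.mean(top >= 1 - delta)
    return dominant, pure

# Toy check: one clearly dominant document, one with two strong topics.
W = np.array([[0.8, 0.4],
              [0.1, 0.35],
              [0.1, 0.25]])
dom, pure = dominant_topic_stats(W)   # dom = 0.5, pure = 0.0
```

Only the first document passes: the second has a large non-dominant weight (0.35 > beta), mirroring the fractions reported in Table 1.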
Moreover, the average per-topic frequency of the group of catchwords for a topic is also quite high (Table 1, last column).

4.2 Empirical Results

Semi-synthetic Data: Following [5], we generate semi-synthetic corpora from LDA models trained by MCMC, to ensure that the synthetic corpora retain the characteristics of real data. Gibbs sampling7 is run for 1000 iterations on all four datasets and the final word-topic distribution is used to generate varying numbers of synthetic documents, with document-topic distributions drawn from a symmetric Dirichlet with hyper-parameter 0.01. For NIPS, NYT and Pubmed we use k = 50 topics, for 20NewsGroup k = 20, with mean document lengths of 1000, 300, 100 and 200 respectively. Note that the synthetic data is not guaranteed to satisfy the dominant topic assumption for every document (on average about 80% of documents satisfy the assumption for the values of $(\alpha, \beta)$ tested in Section 4.1).

Topic Recovery on Semi-synthetic Data: We learn the word-topic distribution ($\hat{M}$) for the semi-synthetic corpora using TSVD and the Recover algorithms of [5]. Given these learned topic distributions and the original data-generating distribution (M), we align the topics of M and $\hat{M}$ by bipartite matching and evaluate the l1 distance between each pair of matched topics. We report the average l1 error across topics (called l1 reconstruction-error in [5]) in Table 2 for TSVD and Recover-KL (R-KL). TSVD has smaller error on most datasets than the R-KL algorithm; we observed the performance of TSVD to be always better than Recover-L2 (see supplement Table 1 for full results). The best performance is observed on NIPS, which has the largest mean document length, indicating that larger m leads to better recovery. Results on 20NG are slightly worse than R-KL for smaller sample sizes, but performance improves for larger numbers of documents.
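The topic alignment and l1 reconstruction-error just described can be computed with a minimum-cost bipartite matching between learned and true topic columns; a sketch, assuming scipy is available (the toy matrices are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def l1_reconstruction_error(M, M_hat):
    """Average l1 distance between true topics (columns of M) and the
    best bipartite matching of learned topics (columns of M_hat)."""
    # cost[a, b] = l1 distance between true topic a and learned topic b
    cost = np.abs(M[:, :, None] - M_hat[:, None, :]).sum(axis=0)
    rows, cols = linear_sum_assignment(cost)   # min-cost perfect matching
    return cost[rows, cols].mean()

# Toy check: a column-permuted copy of M should match with zero error.
M = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
err = l1_reconstruction_error(M, M[:, [1, 0]])
```

The matching step matters: without it, a learned model whose topics come back in a different order would be penalized for a pure relabeling.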
While the error values in Table 2 are averages across topics, Figure 1 shows that the TSVD algorithm achieves much better topic recovery for the majority of topics (>90%) on most datasets (overall average improvement of 27%, full results in supplement Figure 1).

Footnote 6: Computed as (1/k) Σ_{l=1}^{k} (1/|T_l|) Σ_{i∈S_l} Σ_{j∈T_l} A_ij.
Footnote 7: Dirichlet hyperparameters used: document-topic = 0.03 and topic-word = 1.

Table 2: l1 reconstruction error on various semi-synthetic datasets. Brackets in the last column give percent improvement over R-KL (the best performing Recover algorithm). Full results in supplementary.

Corpus   Documents   R-KL    TSVD
NIPS     40,000      0.308   0.115 (62.7%)
NIPS     50,000      0.308   0.145 (52.9%)
NIPS     60,000      0.311   0.131 (57.9%)
Pubmed   40,000      0.332   0.288 (13.3%)
Pubmed   50,000      0.326   0.280 (14.1%)
Pubmed   60,000      0.328   0.284 (13.4%)
20NG     40,000      0.120   0.124 (-3.3%)
20NG     50,000      0.114   0.113 (0.9%)
20NG     60,000      0.110   0.106 (3.6%)
NYT      40,000      0.208   0.195 (6.3%)
NYT      50,000      0.206   0.185 (10.2%)
NYT      60,000      0.200   0.194 (3.0%)

Figure 1: Histogram of l1 error across topics (40,000 documents). TSVD (blue, solid border) gets smaller error on most topics than R-KL (green, dashed border).

Topic Recovery on Real Data: To evaluate perplexity [2] on real data, the held-out sets consist of 350 documents for NIPS, 10,000 documents for NYT and Pubmed, and 6,780 documents for 20NewsGroup. TSVD achieved perplexity of 835 (NIPS), 1307 (Pubmed), 1555 (NYT), 2390 (20NG) while Recover-KL achieved 754 (NIPS), 1188 (Pubmed), 1579 (NYT), 2431 (20NG) (refer to supplement Table 2 for complete results). TSVD gives perplexity comparable to Recover-KL, with results slightly better on NYT and 20NewsGroup, which are larger datasets with moderately high mean document lengths.
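Held-out perplexity is the exponential of the negative average log-likelihood per token. A minimal sketch, assuming each word's probability in a document is the topic mixture M^T W; the model and corpus below are toy values, not the paper's evaluation setup:

```python
# Hypothetical sketch of held-out perplexity: exp of the negative
# average per-token log-likelihood under the learned model.
import math

def perplexity(docs, M, W):
    """docs: list of word-id lists; M: k x V word-topic distributions;
    W: k x n document-topic weights (W[i][j] = weight of topic i in doc j)."""
    k = len(M)
    log_lik, total_tokens = 0.0, 0
    for j, doc in enumerate(docs):
        for w in doc:
            # p(word w | doc j) under the admixture model.
            p = sum(W[i][j] * M[i][w] for i in range(k))
            log_lik += math.log(p)
            total_tokens += 1
    return math.exp(-log_lik / total_tokens)

# Toy: 2 topics over a 3-word vocabulary, 2 held-out documents.
M = [[0.7, 0.2, 0.1],
     [0.1, 0.1, 0.8]]
W = [[0.9, 0.2],
     [0.1, 0.8]]
docs = [[0, 0, 1], [2, 2, 1]]
ppl = perplexity(docs, M, W)
```

Lower perplexity indicates a better fit to the held-out documents; in practice the document-topic weights W for held-out documents are themselves inferred, which this sketch leaves out.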
We also find comparable results on Topic Coherence [11] (see Table 2 in the supplementary for topic coherence results and Table 3 for the list of top words per topic).

Summary: We evaluated the proposed algorithm, TSVD, rigorously on multiple datasets against the state of the art [5] (Recover-KL and Recover-L2), following the evaluation methodology of [5]. In Table 2 we show that the l1 reconstruction error of the new algorithm is small and on average 19.6% better than the best results of the Recover algorithms [5]. In Figure 1, we show that TSVD achieves significantly better recovery on the majority of topics. We also demonstrate that on real datasets the algorithm achieves perplexity and topic coherence comparable to the Recover algorithms. Moreover, we show on multiple real world datasets that the algorithm's assumptions are well justified in practice.

Conclusion

Real world corpora often exhibit the property that in every document one topic is dominantly present. A standard SVD based procedure will not be able to detect these topics; however TSVD, the thresholded SVD based procedure suggested in this paper, discovers them. While SVD is time-consuming, a host of recent sampling-based approaches make SVD easier to apply to massive corpora which may be distributed among many servers. We believe that, beyond topic recovery, thresholded SVD can be applied more broadly to similar problems, such as matrix factorization, and will be the basis for future research.

Acknowledgements TB was supported by a Department of Science and Technology (DST) grant.

References

[1] Blei, D. Introduction to probabilistic topic models. Communications of the ACM, pp. 77–84, 2012.

[2] Blei, D., Ng, A., and Jordan, M.
Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. Preliminary version in Neural Information Processing Systems 2001.

[3] Griffiths, T. L. and Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.

[4] Arora, S., Ge, R., and Moitra, A. Learning topic models – going beyond SVD. In Foundations of Computer Science, 2012.

[5] Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning, 2013.

[6] Anandkumar, A., Foster, D., Hsu, D., Kakade, S., and Liu, Y. A spectral algorithm for Latent Dirichlet Allocation. In Neural Information Processing Systems, 2012.

[7] Papadimitriou, C., Raghavan, P., Tamaki, H., and Vempala, S. Latent semantic indexing: a probabilistic analysis. Journal of Computer and System Sciences, pp. 217–235, 2000. Preliminary version in PODS 1998.

[8] Deerwester, S., Dumais, S., Landauer, T., Furnas, G., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, pp. 391–407, 1990.

[9] Kumar, A., and Kannan, R. Clustering with spectral norm and the k-means algorithm. In Foundations of Computer Science, 2010.

[10] Awasthi, P., and Sheffet, O. Improved spectral-norm bounds for clustering. In Proceedings of Approx/Random, 2012.

[11] Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. Optimizing semantic coherence in topic models. In Empirical Methods in Natural Language Processing, pp. 262–272, 2011.

[12] Lloyd, S. P. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[13] Arthur, D., and Vassilvitskii, S. K-means++: The advantages of careful seeding.
In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035, 2007.

[14] Supplementary material.