{"title": "On some provably correct cases of variational inference for topic models", "book": "Advances in Neural Information Processing Systems", "page_first": 2098, "page_last": 2106, "abstract": "Variational inference is an efficient, popular heuristic used in the context of latent variable models. We provide the first analysis of instances where variational inference algorithms converge to the global optimum, in the setting of topic models. Our initializations are natural, one of them being used in LDA-c, the most popular implementation of variational inference. In addition to providing intuition into why this heuristic might work in practice, the multiplicative, rather than additive, nature of the variational inference updates forces us to use non-standard proof arguments, which we believe might be of general theoretical interest.", "full_text": "On some provably correct cases of variational inference for topic models

Pranjal Awasthi
Department of Computer Science
Rutgers University
New Brunswick, NJ 08901
pranjal.awasthi@rutgers.edu

Andrej Risteski
Department of Computer Science
Princeton University
Princeton, NJ 08540
risteski@cs.princeton.edu

Abstract

Variational inference is an efficient, popular heuristic used in the context of latent variable models. We provide the first analysis of instances where variational inference algorithms converge to the global optimum, in the setting of topic models. Our initializations are natural, one of them being used in LDA-c, the most popular implementation of variational inference.
In addition to providing intuition into why this heuristic might work in practice, the multiplicative, rather than additive, nature of the variational inference updates forces us to use non-standard proof arguments, which we believe might be of general theoretical interest.

1 Introduction

Over the last few years, heuristics for non-convex optimization have emerged as one of the most fascinating phenomena for theoretical study in machine learning. Methods like alternating minimization, EM, variational inference and the like enjoy immense popularity among ML practitioners, and with good reason: they are vastly more efficient than alternative methods like convex relaxations, and are usually easily modified to suit different applications. Theoretical understanding, however, is sparse, and we know of very few instances where these methods come with formal guarantees. Among the more classical results in this direction are the analyses of Lloyd's algorithm for K-means, which is very closely related to the EM algorithm for mixtures of Gaussians [20], [13], [14]. The recent work of [9] also characterizes global convergence properties of the EM algorithm in more general settings. Another line of recent work has focused on a different heuristic, alternating minimization, in the context of dictionary learning: [1], [6] prove that with appropriate initialization, alternating minimization provably recovers the ground truth, and [22] prove similar results in the context of phase retrieval. Another popular heuristic which has so far eluded such attempts is variational inference [19]. We provide the first characterization of global convergence of variational inference based algorithms for topic models [12].
We show that under natural assumptions on the topic-word matrix and the topic priors, along with natural initialization, variational inference converges to the parameters of the underlying ground truth model. To prove our result we need to overcome a number of technical hurdles which are unique to the nature of variational inference. First, the difficulty of analyzing alternating minimization methods for dictionary learning is alleviated by the fact that one can derive closed-form expressions for the updates of the dictionary matrix; we do not have this luxury. Second, the "norm" in which variational inference naturally operates is KL divergence, which can be difficult to work with. We stress that the focus of this work is not to identify new instances of topic modeling that were previously not known to be efficiently solvable, but rather to provide understanding of the behaviour of variational inference, the de facto method for learning and inference in the context of topic models.

2 Latent variable models and EM

We briefly review EM and variational methods. The setting is latent variable models, where the observations X_i are generated according to a distribution P(X_i|θ) = P(Z_i|θ)P(X_i|Z_i, θ), where θ are the parameters of the model and Z_i is a latent variable. Given the observations X_i, a common task is to find the max-likelihood value of the parameter θ: argmax_θ Σ_i log(P(X_i|θ)). The EM algorithm is an iterative method to achieve this, dating all the way back to [15] and [24] in the 70s. In the above framework it can be formulated as the following procedure, maintaining estimates θ^t, P̃^t(Z) of the model parameters and the posterior distribution over the hidden variables. In the E-step, we compute the distribution P̃^t(Z) = P(Z|X, θ^t). In the M-step, we set θ^{t+1} = argmax_θ Σ_i E_{P̃^t}[log P(X_i, Z_i|θ)]. Sometimes even these two steps may not be computationally feasible, in which case they can be relaxed by choosing a family of simple distributions F and performing the following updates. In the variational E-step, we compute the distribution P̃^t(Z) = argmin_{P^t ∈ F} KL(P^t(Z) || P(Z|X, θ^t)). In the M-step, we set θ^{t+1} = argmax_θ Σ_i E_{P̃^t}[log P(X_i, Z_i|θ)]. By picking the family F appropriately, it is often possible to make both steps run in polynomial time. As expected, neither of the above two families of approximations comes with any provable global convergence guarantees. With EM, the problem is ensuring that one does not get stuck in a local optimum. With variational EM, additionally, we face the issue of, in principle, not even exploring the entire space of solutions.

3 Topic models and prior work

We focus on a particular, popular latent variable model: topic models [12]. The generative model over documents is the following. For each document in the corpus, a proportion of topics γ_1, γ_2, ..., γ_K is sampled according to a prior distribution α. Then, for each position p in the document, we pick a topic Z_p according to a multinomial with parameters γ_1, ..., γ_K. Conditioned on Z_p = i, we pick a word j to put in position p from a multinomial with parameters (β_{i,1}, β_{i,2}, ..., β_{i,N}). The matrix of values {β_{i,j}} is known as the topic-word matrix. The body of work on topic models is vast [11]. Prior theoretical work relevant in the context of this paper includes the sequence of works by [7], [4], as well as [2], [16], [17] and [10]. [7] and [4] assume that the topic-word matrix contains "anchor words": each topic has a word which appears in that topic and in no other.
[2], on the other hand, work with a certain expansion assumption on the word-topic graph, which says that if one takes a subset S of topics, the number of words in the support of these topics should be at least |S| + s_max, where s_max is the maximum support size of any topic. Neither paper needs any assumption on the topic priors, and both can handle (almost) arbitrarily short documents.

The assumptions we make on the word-topic matrix will be related to the ones in the above works, but our documents will need to be long, so that the empirical counts of the words are close to their expected counts. Our priors will also be more structured. This is expected, since we are trying to analyze an existing heuristic rather than develop a new algorithmic strategy. The case where the documents are short seems significantly more difficult. Namely, in that case there are two issues to consider. One is proving that the variational approximation to the posterior distribution over topics is not too bad. The second is proving that the updates actually reach the global optimum. Assuming long documents allows us to focus on the second issue alone, which is already challenging. On a high level, the instances we consider will have the following structure:
• The topics will satisfy a weighted expansion property: for any set S of topics of constant size, and any topic i in this set, the probability mass on words which belong to i and to no other topic in S will be large. (Similar to the expansion in [2], but only over constant-sized subsets.)
• The number of topics per document will be small. Further, the probability of including a given topic in a document is almost independent of any other topics that might be included in the document already. Similar properties are satisfied by the Dirichlet prior, one of the most popular priors in topic modeling. (Originally introduced by [12].) The documents will also have a "dominating topic", similarly as in [10].
• For each word j and each topic i it appears in, there will be a decent proportion of documents that contain topic i and no other topic containing j. These can be viewed as "local anchor documents" for that word-topic pair.

We state below, informally, our main result. See Sections 6 and 7 for more details.
Theorem. Under the above mentioned assumptions, popular variants of variational inference for topic models, with suitable initializations, provably recover the ground truth model in polynomial time.

4 Variational relaxation for learning topic models

In this section we briefly review the variational relaxation for topic models, following closely [12]. Throughout the paper, we will denote by N the total number of words and by K the number of topics. We will assume that we are working with a sample set of D documents. We will also denote by f̃_{d,j} the fractional count of word j in document d (i.e. f̃_{d,j} = Count(j)/N_d, where Count(j) is the number of times word j appears in the document, and N_d is the number of words in the document). For topic models, variational updates are a way to approximate the computationally intractable E-step [23] described in Section 2. Recall that the model parameters for topic models are the topic prior parameters α and the topic-word matrix β. The observable X is the list of words in the document. The latent variables are the topic assignments Z_j at each position j in the document and the topic proportions γ. The variational E-step hence becomes P̃^t(Z, γ) = argmin_{P^t ∈ F} KL(P^t(Z, γ) || P(Z, γ|X, α^t, β^t)) for some family F of distributions. The family F one usually considers is P^t(γ, Z) = q(γ) Π_{j=1}^{N_d} q'_j(Z_j), i.e. a mean field family.
In [12] it is shown that for Dirichlet priors α the optimal distributions q, q'_j are a Dirichlet distribution for q, with some parameter γ̃, and multinomials for the q'_j, with some parameters φ_j. The variational EM updates are shown to have the following form. In the E-step, one runs to convergence the following updates on the φ and γ̃ parameters: φ_{d,j,i} ∝ β^t_{i,w_{d,j}} e^{E_q[log(γ_{d,i})|γ̃_d]}, γ̃_{d,i} = α^t_i + Σ_{j=1}^{N_d} φ_{d,j,i}. In the M-step, one updates the β parameters by setting β^{t+1}_{i,j} ∝ Σ_{d=1}^{D} Σ_{j'=1}^{N_d} φ^t_{d,j',i} w_{d,j,j'}, where φ^t_{d,j',i} is the converged value of φ_{d,j',i}; w_{d,j} is the word in document d, position j; and w_{d,j,j'} is an indicator variable which is 1 if the word in position j' in document d is word j. The α Dirichlet parameters do not have a closed form expression and are updated via gradient descent.

4.1 Simplified updates in the long document limit

From the above updates it is difficult to assign an intuitive meaning to the γ̃ and φ parameters. (Indeed, it is not even clear what one would like them to be, ideally, at the global optimum.) We will however be working in the long document limit, and this simplifies the updates. In particular, in the E-step, in the long document limit, the first term in the update equation for γ̃ has a vanishing contribution. In this case, we can simplify the E-update as: φ_{d,j,i} ∝ β^t_{i,j} γ_{d,i}, γ_{d,i} ∝ Σ_{j=1}^{N_d} φ_{d,j,i}. Notice, importantly, that in the second update we now use variables γ_{d,i} instead of γ̃_{d,i}, which are normalized such that Σ_{i=1}^{K} γ_{d,i} = 1. These correspond to the max-likelihood topic proportions, given our current estimates β^t_{i,j} of the model parameters. The M-step will remain as is, but we will focus on the β updates only and ignore the α updates, as the α estimates disappeared from the E updates: β^{t+1}_{i,j} ∝ Σ_{d=1}^{D} f̃_{d,j} γ^t_{d,i}, where γ^t_{d,i} is the converged value of γ_{d,i}. In this case, the intuitive meaning of the β^t and γ^t variables is clear: they are estimates of the model parameters and of the max-likelihood topic proportions given those parameter estimates, respectively.

The way we derived them, these updates appear to be an approximate form of the variational updates in [12]. However, it is possible to also view them in a more principled manner. These updates approximate the posterior distribution P(Z, γ|X, α^t, β^t) by first approximating it by P(Z|X, γ*, α^t, β^t), where γ* is the max-likelihood value for γ given our current estimates of α, β, and then setting P(Z|X, γ*, α^t, β^t) to be a product distribution. It is intuitively clear that in the long document limit this approximation should not be much worse than the one in [12], as the posterior concentrates around the maximum likelihood value. (And in fact, our proofs will work for finite, but long, documents.) Finally, we rewrite the above equations in a slightly more convenient form. Denoting f_{d,j} = Σ_{i=1}^{K} γ_{d,i} β^t_{i,j}, the E-step can be written as: iterate until convergence γ_{d,i} = γ_{d,i} Σ_{j=1}^{N} (f̃_{d,j}/f_{d,j}) β^t_{i,j}. The M-step becomes: β^{t+1}_{i,j} = β^t_{i,j} (Σ_{d=1}^{D} (f̃_{d,j}/f^t_{d,j}) γ^t_{d,i}) / (Σ_{d=1}^{D} γ^t_{d,i}), where f^t_{d,j} = Σ_{i=1}^{K} γ^t_{d,i} β^t_{i,j} and γ^t_{d,i} is the converged value of γ_{d,i}.

4.2 Alternating KL minimization and thresholded updates

We will further modify the E- and M-step update equations derived above. In a slightly modified form, these updates were used in a paper by [21] in the context of non-negative matrix factorization. There the authors proved that under these updates Σ_{d=1}^{D} KL(f^t_{d,j} || f̃_{d,j}) is non-increasing. One can easily modify their arguments to show that the same property is preserved if the E-step is replaced by the step γ^t_d = argmin_{γ^t_d ∈ Δ_K} KL(f̃_d || f_d), where Δ_K is the K-dimensional simplex, i.e. minimizing the KL divergence between the counts and the "predicted counts" with respect to the γ variables. (In fact, iterating the γ updates above is a way to solve this convex minimization problem via a version of gradient descent which makes multiplicative, rather than additive, updates.) Thus the updates perform alternating minimization using the KL divergence as the distance measure (with the difference that for the β variables one essentially just performs a single gradient step). In this paper, we will make a very natural modification of the M-step. Intuitively, the update for β^t_{i,j} goes over all appearances of the word j and adds the "fractional assignment" of the word j to topic i under our current estimates of the variables β, γ. In the modified version we will only average over those documents d where γ^t_{d,i} > γ^t_{d,i'}, ∀ i' ≠ i. The intuitive reason behind this modification is the following.
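As a concrete reference, the rewritten updates above can be transcribed as the following minimal Python sketch (our transcription, not code from the paper; the explicit renormalization of each topic row in the M-step is an addition for numerical safety):

```python
# Sketch of the simplified variational updates in the long-document limit.
# beta: K topic rows over N words; f_tilde: empirical word frequencies of
# one document. Transcribed from the update equations; illustrative only.

def e_step(f_tilde, beta, iters=100):
    # Multiplicative updates converging to argmin_gamma KL(f_tilde || f).
    K, N = len(beta), len(beta[0])
    gamma = [1.0 / K] * K
    for _ in range(iters):
        f = [sum(gamma[i] * beta[i][j] for i in range(K)) for j in range(N)]
        gamma = [gamma[i] * sum((f_tilde[j] / f[j]) * beta[i][j]
                                for j in range(N) if f[j] > 0)
                 for i in range(K)]
    return gamma  # the update preserves sum(gamma) == 1 when f_tilde sums to 1

def m_step(docs, gammas, beta):
    # beta^{t+1}_{i,j} = beta^t_{i,j} * sum_d (f~_{d,j}/f^t_{d,j}) gamma_{d,i}
    #                    / sum_d gamma_{d,i}; each row is then renormalized
    #                    (the renormalization is our addition).
    K, N = len(beta), len(beta[0])
    new_beta = []
    for i in range(K):
        denom = sum(g[i] for g in gammas)
        row = []
        for j in range(N):
            num = sum(g[i] * d[j] / sum(g[ii] * beta[ii][j] for ii in range(K))
                      for d, g in zip(docs, gammas))
            row.append(beta[i][j] * num / denom)
        s = sum(row)
        new_beta.append([r / s for r in row])
    return new_beta
```

On an exact model, i.e. f̃_{d,j} = Σ_i γ*_{d,i} β*_{i,j}, the true proportions and topic-word matrix are a fixed point of both updates.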
The EM updates we are studying work with the KL divergence, which puts more weight on the larger entries. Thus, for the documents in D_i, the estimates for γ^t_{d,i} should be better than they might be in the documents D \ D_i. (Of course, since the terms f^t_{d,j} involve all the variables γ^t_{d,i}, it is not a priori clear that this modification will gain us much, but we will prove that it in fact does.) Formally, we discuss the three modifications of variational inference specified as Algorithms 1, 2 and 3 (we call them tEM, for thresholded EM):

Algorithm 1 KL-tEM
(E-step) Solve the following convex program for each document d: min_{γ^t_d} Σ_j f̃_{d,j} log(f̃_{d,j}/f^t_{d,j}), s.t. γ^t_{d,i} ≥ 0, Σ_i γ^t_{d,i} = 1 and γ^t_{d,i} = 0 if i is not in the support of document d.
(M-step) Let D_i be the set of documents d s.t. γ^t_{d,i} > γ^t_{d,i'}, ∀ i' ≠ i. Set β^{t+1}_{i,j} = β^t_{i,j} (Σ_{d∈D_i} (f̃_{d,j}/f^t_{d,j}) γ^t_{d,i}) / (Σ_{d∈D_i} γ^t_{d,i}).

5 Initializations

We will consider two different strategies for initialization. First, we will consider the case where we initialize with the topic-word matrix and the document priors having the correct support. The analysis of tEM in this case will be the cleanest. While the main focus of the paper is tEM, we will show that this initialization can actually be done efficiently for our case. Second, we will consider an initialization that is inspired by what the current LDA-c implementation uses.
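Before turning to the initializations in detail, the thresholded M-step of Algorithm 1 can be sketched as follows (again our transcription; the per-row renormalization is an added numerical convenience):

```python
# Sketch of the KL-tEM thresholded M-step: topic i is updated using only the
# documents D_i in which it is currently the (strictly) largest topic.

def threshold_m_step(docs, gammas, beta):
    K, N = len(beta), len(beta[0])
    new_beta = [row[:] for row in beta]
    for i in range(K):
        D_i = [d for d in range(len(docs))
               if all(gammas[d][i] > gammas[d][ii]
                      for ii in range(K) if ii != i)]
        denom = sum(gammas[d][i] for d in D_i)
        if denom == 0:
            continue  # topic i dominates no document: leave its row unchanged
        for j in range(N):
            num = sum(gammas[d][i] * docs[d][j] /
                      sum(gammas[d][ii] * beta[ii][j] for ii in range(K))
                      for d in D_i)
            new_beta[i][j] = beta[i][j] * num / denom
        s = sum(new_beta[i])
        new_beta[i] = [v / s for v in new_beta[i]]
    return new_beta
```

As with the unthresholded update, the ground-truth parameters are a fixed point when the document frequencies match the model exactly.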
Algorithm 2 Iterative tEM
(E-step) Initialize γ_{d,i} uniformly among the topics in the support of document d. Repeat

γ_{d,i} = γ_{d,i} Σ_{j=1}^{N} (f̃_{d,j}/f_{d,j}) β^t_{i,j}    (4.1)

until convergence.
(M-step) Same as above.

Algorithm 3 Incomplete tEM
(E-step) Initialize γ_{d,i} with the values obtained in the previous iteration, then perform just one step of (4.1).
(M-step) Same as before.

Concretely, we'll assume that the user has some way of finding, for each topic i, a seed document in which the proportion of topic i is at least C_l. Then, when initializing, one treats this document as if it were pure: namely, one sets β^0_{i,j} to be the fractional count of word j in this document. We do not attempt to design an algorithm to find these documents.

6 Case study 1: Sparse topic priors, support initialization

We start with a simple case. As mentioned, all of our results only hold in the long documents regime: we will assume that for each document d the number of sampled words is large enough that one can approximate the expected frequencies of the words, i.e., one can find values γ*_{d,i} such that f̃_{d,j} = (1 ± ε) Σ_{i=1}^{K} γ*_{d,i} β*_{i,j}. We'll split the rest of the assumptions into those that apply to the topic-word matrix and those that apply to the topic priors. Let's first consider the assumptions on the topic-word matrix. We will impose conditions that ensure the topics don't overlap too much. Namely, we assume:
• Words are discriminative: Each word appears in o(K) topics.
• Almost disjoint supports: ∀ i, i', if the intersection of the supports of i and i' is S, then Σ_{j∈S} β*_{i,j} ≤ o(1) · Σ_j β*_{i,j}.
We also need assumptions on the topic priors. The documents will be sparse, and all topics will be roughly equally likely to appear.
There will be virtually no dependence between the topics: conditioning on the size or presence of a certain topic will not much influence the probability of another topic being included. These are analogues of distributions that have been analyzed for dictionary learning [6]. Formally:
• Sparse and gapped documents: Each of the documents in our samples has at most T = O(1) topics. Furthermore, for each document d, the largest topic i_0 = argmax_i γ*_{d,i} is such that for any other topic i', γ*_{d,i_0} − γ*_{d,i'} > ρ for some (arbitrarily small) constant ρ.
• Dominant topic equidistribution: The probability that topic i is such that γ*_{d,i} > γ*_{d,i'}, ∀ i' ≠ i, is Θ(1/K).
• Weak topic correlations and independent topic distribution: For all sets S with o(K) topics, it must be the case that E[γ*_{d,i} | γ*_{d,i} is dominating, γ*_{d,i'}, i' ∈ S] = (1 ± o(1)) E[γ*_{d,i} | γ*_{d,i} is dominating, γ*_{d,i'} = 0, i' ∈ S]. Furthermore, for any set S of topics s.t. |S| ≤ T − 1, Pr[γ*_{d,i} > 0 | γ*_{d,i'} > 0, ∀ i' ∈ S] = Θ(1/K).
These assumptions are a less smooth version of properties of the Dirichlet prior. Namely, it's a folklore result that Dirichlet draws are sparse with high probability, for a certain reasonable range of parameters. This was formally proven by [25], though sparsity there means a small number of large coordinates. It's also well known that the Dirichlet prior essentially cannot enforce any correlation between different topics. 1

1We show analogues of the weak topic correlations property and equidistribution in the supplementary material for completeness' sake.

The above assumptions can be viewed as a local notion of separability of the model, in the following sense.
First, consider a particular document d. For each topic i that participates in that document, consider the words j which appear in the support of topic i only, among the topics of the document. In some sense, these words are local anchor words for that document: they appear in only one topic of that document. Because of the "almost disjoint supports" property, there will be a decent mass on these words in each document. Similarly, consider a particular non-zero element β*_{i,j} of the topic-word matrix. Let's call D_l the set of documents where β*_{i',j} = 0 for all other topics i' ≠ i appearing in that document. These documents are like local anchor documents for that word-topic pair: in those documents, the word appears as part of topic i only. It turns out the above properties imply there is a decent number of these for any word-topic pair. Finally, a technical condition: we will also assume that all nonzero γ*_{d,i}, β*_{i,j} are at least 1/poly(N). Intuitively, this means that if a topic is present, it needs to be reasonably large, and similarly for words in topics. Such assumptions also appear in the context of dictionary learning [6].

We will prove the following:
Theorem 1. Given an instance of topic modelling satisfying the properties specified above, where the number of documents is Ω(K log^2 N / ε^2), if we initialize the supports of the β^t_{i,j} and γ^t_{d,i} variables correctly, then after O(log(1/ε') + log N) KL-tEM, iterative-tEM or incomplete-tEM updates, we recover the topic-word matrix and topic proportions to multiplicative accuracy 1 + ε', for any ε' s.t. 1 + ε' ≤ 1/(1 − ε)^7.
Theorem 2. If the number of documents is Ω(K^4 log^2 K), there is a polynomial-time procedure which with probability 1 − Ω(1/K) correctly identifies the supports of the β*_{i,j} and γ*_{d,i} variables.

Provable convergence of tEM: The correctness of the tEM updates is proven in 3 steps:
• Identifying the dominating topic: First, we prove that if γ^t_{d,i} is the largest among all topics in the document, then topic i is actually the largest topic in that document.
• Phase I: Getting constant multiplicative factor estimates: After initialization, after O(log N) rounds, we get to variables β^t_{i,j}, γ^t_{d,i} which are within a constant multiplicative factor of β*_{i,j}, γ*_{d,i}.
• Phase II (Alternating minimization - lower and upper bound evolution): Once the β and γ estimates are within a constant factor of their true values, we show that the lone words and documents have a boosting effect: they cause the multiplicative upper and lower bounds to improve at each round.

The updates we are studying are multiplicative, not additive, in nature, and the objective they optimize is non-convex, so standard techniques do not work. The intuition behind our proof in Phase II can be described as follows. Consider one update for one of the variables, say β^t_{i,j}. We show that β^{t+1}_{i,j} ≈ α β*_{i,j} + (1 − α) C^t β*_{i,j} for some constant C^t at time step t. Here α is fairly large (one should think of it as 1 − o(1)), and comes from the existence of the local anchor documents. A similar equation holds for the γ variables, in which case the "good" term comes from the local anchor words.
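The recursion β^{t+1}_{i,j} ≈ α β*_{i,j} + (1 − α) C^t β*_{i,j} can be illustrated with a toy numeric iteration (the constants below are hypothetical, chosen only to show the contraction, and the error in the ≈ is ignored):

```python
# Toy illustration: with alpha close to 1, the multiplicative error factor
# C_t contracts toward 1 geometrically, at rate (1 - alpha) per round.

beta_star = 0.3   # hypothetical ground-truth entry
alpha = 0.9       # "good" mass coming from the local anchor documents
c = 5.0           # initial multiplicative overestimate: beta_0 = c * beta_star
for t in range(20):
    beta = alpha * beta_star + (1 - alpha) * c * beta_star
    c = beta / beta_star   # error factor after the update: 1 + (1-alpha)*(c-1)
```

After 20 rounds, c − 1 has shrunk by a factor (1 − α)^20, so the estimate is essentially β*.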
Furthermore, we show that the error in the ≈ decreases over time, as does the value of C^t, so that eventually we can reach β*_{i,j}. The analysis bears a resemblance to the state evolution and density evolution methods in the analysis of error-decoding algorithms, in the sense that we maintain a quantity about the evolving system and analyze how it evolves under the specified iterations. The quantities we maintain are quite simple: upper and lower multiplicative bounds on our estimates at any round t.

Initialization: Recall that the goal of this phase is to recover the supports, i.e. to find out which topics are present in a document, and to identify the support of each topic. We will find the topic supports first. This uses an idea inspired by [8] in the setting of dictionary learning. Roughly, we devise a test which takes as input two documents d, d' and tries to determine whether the two documents have a topic in common. The test will have no false positives, i.e., it will never say YES if the documents don't have a topic in common, but it might say NO even if they do. We then ensure that with high probability, for each topic we find a pair of documents intersecting in that topic such that the test says YES. 2

2The detailed initialization algorithm is included in the supplementary material.

7 Case study 2: Dominating topics, seeded initialization

Next, we'll consider an initialization which is essentially what the current implementation of LDA-c uses. Namely, we will call the following initialization a seeded initialization:
• For each topic i, the user supplies a document d in which γ*_{d,i} ≥ C_l.
• We treat the document as if it only contains topic i and initialize with β^0_{i,j} = f*_{d,j}.
We show how to modify the previous analysis to show that, with a few more assumptions, this strategy works as well.
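A minimal sketch of seeded initialization as just described (our illustrative code; seed_docs[i] is the user-supplied seed document for topic i, given as a list of word ids):

```python
# Seeded initialization: treat each seed document as if it were pure topic i
# and set row i of beta^0 to the document's fractional word counts.

def seeded_init(seed_docs, vocab_size):
    beta0 = []
    for words in seed_docs:
        counts = [0] * vocab_size
        for w in words:
            counts[w] += 1
        n = float(len(words))
        beta0.append([c / n for c in counts])
    return beta0
```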
Firstly, we will have to assume anchor words that make up a decent fraction of the mass of each topic. Second, we also assume that the words have a bounded dynamic range, i.e. the values of a word in two different topics are within a constant factor B of each other. The documents are still gapped, but the gap now must be larger. Finally, in roughly a 1/B fraction of the documents where topic i is dominant, that topic has proportion 1 − δ, for some small (but still constant) δ. A similar assumption (a small fraction of almost pure documents) appeared in a recent paper by [10]. Formally, we have:
• Small dynamic range and large fraction of anchors: For each discriminative word j, if β*_{i,j} ≠ 0 and β*_{i',j} ≠ 0, then β*_{i,j} ≤ B β*_{i',j}. Furthermore, each topic i has anchor words, such that their total weight is at least p.
• Gapped documents: In each document, the largest topic has proportion at least C_l, and all the other topics are at most C_s, s.t. C_l − C_s ≥ (1/p)(2 √(p log(1/C_l) + (1 − p) log(B C_l)) + √(log(1 + ε))) + ε.
• Small fraction of 1 − δ dominant documents: Among all the documents where topic i is dominating, in an 8/B fraction of them, γ*_{d,i} ≥ 1 − δ, where δ := min( C_l^2/(2B^3) − (1/p)(2 √(p log(1/C_l) + (1 − p) log(B C_l)) + √(log(1 + ε))) − ε, 1 − √(C_l) ).

The dependency between the parameters B, p, C_l is a little difficult to parse, but if one thinks of C_l as 1 − η for η small, and p ≥ 1 − η/log B, then since log(1/C_l) ≈ η, roughly we want that C_l − C_s ≫ 2√η. (In other words, the weight we require on the anchors depends only logarithmically on the range B.) In the documents where the dominant topic has proportion 1 − δ, a similar reasoning gives that what we want is approximately γ*_{d,i} ≥ 1 − (1 − 2η)/(2B^3) + (2/p)√η. The precise statement is as follows:
Theorem 3. Given an instance of topic modelling satisfying the properties specified above, where the number of documents is Ω(K log^2 N / ε^2), if we initialize with seeded initialization, then after O(log(1/ε') + log N) KL-tEM updates we recover the topic-word matrix and topic proportions to multiplicative accuracy 1 + ε', if 1 + ε' ≥ 1/(1 − ε)^7.

The proof is carried out in a few phases:
• Phase I: Anchor identification: We show that as long as we can identify the dominating topic in each of the documents, the anchor words make progress: after O(log N) rounds, the topic-word estimates will be almost zero for the topics for which word w is not an anchor, while for the topic for which a word is an anchor we'll have a good estimate.
• Phase II: Discriminative word identification: After the anchor words are properly identified in the previous phase, if β*_{i,j} = 0, then β^t_{i,j} will keep dropping and quickly reach almost zero. The values corresponding to β*_{i,j} ≠ 0 will be decently estimated.
• Phase III: Alternating minimization: After Phases I and II above, we are back to the scenario of the previous section: namely, there is improvement in each subsequent round.

During Phases I and II the intuition is the following: due to our initialization, even in the beginning, each topic is "correlated" with the correct values.
In a γ update, we are minimizing KL(f̃_d || f_d) with respect to the γ_d variables, so we need a way to argue that whenever the β estimates are not too bad, minimizing this quantity provides an estimate of how far the optimal γ_d variables are from γ*_d. We show the following useful claim:
Lemma 4. If, for all topics i, KL(β*_i || β^t_i) ≤ R_β, and min_{γ_d ∈ Δ_K} KL(f̃_d || f_d) ≤ R_f, then after running a KL divergence minimization step with respect to the γ_d variables, we get that ||γ*_d − γ_d||_1 ≤ (1/p)(√((1/2) R_β) + √(R_f)) + ε.
This lemma critically uses the existence of anchor words: namely, we show ||β* v||_1 ≥ p ||v||_1. Intuitively, if one thinks of v as γ* − γ^t, then ||β* v||_1 will be large if ||v||_1 is large. Hence, if ||β* − β^t||_1 is not too large, whenever ||f* − f^t||_1 is small, so is ||γ* − γ^t||_1. We will be able to maintain R_β and R_f small enough throughout the iterations that we can identify the largest topic in each of the documents.

8 On common words

We briefly remark on common words: words such that β*_{i,j} ≤ κ β*_{i',j}, ∀ i, i', for κ ≤ B. In this case the proofs above, as they are, will not work, 3 since common words do not have any lone documents. However, if a 1 − 1/κ^100 fraction of the documents where topic i is dominant contain topic i with proportion 1 − 1/κ^100, and furthermore, in each topic, the weight on these words is no more than 1/κ^100, then our proofs still work with either initialization. 4 The idea for the argument is simple: when the dominating topic is very large, we show that f*_{d,j}/f^t_{d,j} is very highly correlated with β*_{i,j}/β^t_{i,j}, so these documents behave like anchor documents. Namely, one can show:
Theorem 5. If we additionally have common words satisfying the properties specified above, then after O(log(1/ε') + log N) KL-tEM updates in Case Study 2, or any of the tEM variants in Case Study 1, with the same initializations as before, we recover the topic-word matrix and topic proportions to multiplicative accuracy 1 + ε', if 1 + ε' ≥ 1/(1 − ε)^7.

9 Discussion and open problems

In this work we provide the first characterization of sufficient conditions under which variational inference leads to optimal parameter estimates for topic models. Our proofs also suggest possible hard cases for variational inference, namely instances with large dynamic range compared to the proportion of anchor words and/or correlated topic priors. It's not hard to hand-craft such instances where support initialization performs very badly, even with only anchor and common words. We made no effort to explore the optimal relationship between the dynamic range and the proportion of anchor words, as it's not clear what the "worst case" instances for this trade-off are.

Seeded initialization, on the other hand, empirically works much better. We found that when C_l ≥ 0.6, and when the proportion of anchor words is as low as 0.2, variational inference recovers the ground truth, even on instances with fairly large dynamic range. Our current proof methods are too weak to capture this observation. (In fact, even the largest topic is sometimes misidentified in the initial stages, so one cannot even run tEM, only the vanilla variational inference updates.) Analyzing the dynamics of variational inference in this regime seems like a challenging problem which would require significantly new ideas.

References
[1] A. Agarwal, A. Anandkumar, P. Jain, and P. Netrapalli.
Learning sparsely used overcomplete
dictionaries via alternating minimization. In Proceedings of The 27th Conference on Learning
Theory (COLT), 2014.

[2] A. Anandkumar, D. Hsu, A. Javanmard, and S. Kakade. Learning latent bayesian networks and
topic models under expansion constraints. In Proceedings of the 30th International Conference
on Machine Learning (ICML), 2013.

³We stress that we want to analyze whether variational inference will work or not. Handling common words
algorithmically is easy: they can be detected and "filtered out" initially. Then we can perform the variational
inference updates over the rest of the words only. This is in fact often done in practice.

⁴See supplementary material.

[3] A. Anandkumar, S. Kakade, D. Foster, Y. Liu, and D. Hsu. Two svds suffice: Spectral de-
compositions for probabilistic topic modeling and latent dirichlet allocation. Technical report,
2012.

[4] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical
algorithm for topic modeling with provable guarantees. In Proceedings of the 30th Interna-
tional Conference on Machine Learning (ICML), 2013.

[5] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization –
provably. In Proceedings of the forty-fourth annual ACM symposium on Theory of Computing,
pages 145–162. ACM, 2012.

[6] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms for sparse
coding. In Proceedings of The 28th Conference on Learning Theory (COLT), 2015.

[7] S. Arora, R. Ge, and A. Moitra. Learning topic models – going beyond svd. In Proceedings of
the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2012.

[8] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and overcomplete
dictionaries. In Proceedings of The 27th Conference on Learning Theory (COLT), 2014.

[9] S. Balakrishnan, M.J.
Wainwright, and B. Yu. Statistical guarantees for the em algorithm:
From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.

[10] T. Bansal, C. Bhattacharyya, and R. Kannan. A provable svd-based algorithm for learning
topics in dominant admixture corpus. In Advances in Neural Information Processing Systems
(NIPS), 2014.

[11] D. Blei and J.D. Lafferty. Topic models. Text mining: classification, clustering, and applica-
tions, 10:71, 2009.

[12] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning
Research, 3:993–1022, 2003.

[13] S. Dasgupta and L. Schulman. A two-round variant of em for gaussian mixtures. In Proceed-
ings of Uncertainty in Artificial Intelligence (UAI), 2000.

[14] S. Dasgupta and L. Schulman. A probabilistic analysis of em for mixtures of separated, spher-
ical gaussians. Journal of Machine Learning Research, 8:203–226, 2007.

[15] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the em
algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[16] W. Ding, M.H. Rohban, P. Ishwar, and V. Saligrama. Topic discovery through data dependent
and random projections. arXiv preprint arXiv:1303.3664, 2013.

[17] W. Ding, M.H. Rohban, P. Ishwar, and V. Saligrama. Efficient distributed topic modeling
with provable guarantees. In Proceedings of the 17th International Conference on Artificial
Intelligence and Statistics, pages 167–175, 2014.

[18] M. Hoffman, D. Blei, J. Paisley, and C. Wang. Stochastic variational inference. Journal of
Machine Learning Research, 14:1303–1347, 2013.

[19] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods
for graphical models. Machine learning, 37(2):183–233, 1999.

[20] A. Kumar and R. Kannan.
Clustering with spectral norm and the k-means algorithm. In
Proceedings of Foundations of Computer Science (FOCS), 2010.

[21] D. Lee and S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural
Information Processing Systems (NIPS), 2000.

[22] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. In
Advances in Neural Information Processing Systems (NIPS), 2013.

[23] D. Sontag and D. Roy. Complexity of inference in latent dirichlet allocation. In Advances in
Neural Information Processing Systems (NIPS), 2011.

[24] R. Sundberg. Maximum likelihood theory for incomplete data from an exponential family. Scandinavian
Journal of Statistics, 1:49–58, 1974.

[25] M. Telgarsky. Dirichlet draws are sparse with high probability. Manuscript, 2013.