{"title": "Geometric Dirichlet Means Algorithm for topic inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2505, "page_last": 2513, "abstract": "We propose a geometric algorithm for topic learning and inference that is built on the convex geometry of topics arising from the Latent Dirichlet Allocation (LDA) model and its nonparametric extensions. To this end we study the optimization of a geometric loss function, which is a surrogate to the LDA's likelihood. Our method involves a fast optimization based weighted clustering procedure augmented with geometric corrections, which overcomes the computational and statistical inefficiencies encountered by other techniques based on Gibbs sampling and variational inference, while achieving the accuracy comparable to that of a Gibbs sampler. The topic estimates produced by our method are shown to be statistically consistent under some conditions. The algorithm is evaluated with extensive experiments on simulated and real data.", "full_text": "Geometric Dirichlet Means algorithm\n\nfor topic inference\n\nMikhail Yurochkin\nDepartment of Statistics\nUniversity of Michigan\n\nmoonfolk@umich.edu\n\nXuanLong Nguyen\n\nDepartment of Statistics\nUniversity of Michigan\n\nxuanlong@umich.edu\n\nAbstract\n\nWe propose a geometric algorithm for topic learning and inference that is built on\nthe convex geometry of topics arising from the Latent Dirichlet Allocation (LDA)\nmodel and its nonparametric extensions. To this end we study the optimization of a\ngeometric loss function, which is a surrogate to the LDA\u2019s likelihood. Our method\ninvolves a fast optimization based weighted clustering procedure augmented with\ngeometric corrections, which overcomes the computational and statistical inef\ufb01-\nciencies encountered by other techniques based on Gibbs sampling and variational\ninference, while achieving the accuracy comparable to that of a Gibbs sampler. 
The\ntopic estimates produced by our method are shown to be statistically consistent\nunder some conditions. The algorithm is evaluated with extensive experiments on\nsimulated and real data.\n\n1\n\nIntroduction\n\nMost learning and inference algorithms in the probabilistic topic modeling literature can be delineated\nalong two major lines: the variational approximation popularized in the seminal paper of Blei et al.\n(2003), and the sampling based approach studied by Pritchard et al. (2000) and other authors. Both\nclasses of inference algorithms, their virtues notwithstanding, are known to exhibit certain de\ufb01ciencies,\nwhich can be traced back to the need for approximating or sampling from the posterior distributions\nof the latent variables representing the topic labels. Since these latent variables are not geometrically\nintrinsic \u2014 any permutation of the labels yields the same likelihood \u2014 the manipulation of these\nredundant quantities tend to slow down the computation, and compromise with the learning accuracy.\nIn this paper we take a convex geometric perspective of the Latent Dirichlet Allocation, which may\nbe obtained by integrating out the latent topic label variables. As a result, topic learning and inference\nmay be formulated as a convex geometric problem: the observed documents correspond to points\nrandomly drawn from a topic polytope, a convex set whose vertices represent the topics to be inferred.\nThe original paper of Blei et al. (2003) (see also Hofmann (1999)) contains early hints about a convex\ngeometric viewpoint, which is left unexplored. 
This viewpoint had laid dormant for quite some time,\nuntil studied in depth in the work of Nguyen and co-workers, who investigated posterior contraction\nbehaviors for the LDA both theoretically and practically (Nguyen, 2015; Tang et al., 2014).\nAnother fruitful perspective on topic modeling can be obtained by partially stripping away the\ndistributional properties of the probabilistic model and turning the estimation problem into a form\nof matrix factorization (Deerwester et al., 1990; Xu et al., 2003; Anandkumar et al., 2012; Arora\net al., 2012). We call this the linear subspace viewpoint. For instance, the Latent Semantic Analysis\napproach (Deerwester et al., 1990), which can be viewed as a precursor of the LDA model, looks\nto \ufb01nd a latent subspace via singular-value decomposition, but has no topic structure. Notably, the\nRecoverKL by Arora et al. (2012) is one of the recent fast algorithms with provable guarantees\ncoming from the linear subspace perspective.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fThe geometric perspective continues to be the main force driving this work. We develop and analyze a\nnew class of algorithms for topic inference, which exploits both the convex geometry of topic models\nand the distributional properties they carry. 
The main contributions in this work are the following: (i)\nwe investigate a geometric loss function to be optimized, which can be viewed as a surrogate to the\nLDA\u2019s likelihood; this leads to a novel estimation and inference algorithm, the Geometric Dirichlet\nMeans (GDM) algorithm, which builds upon a weighted k-means clustering procedure and is augmented with\na geometric correction for obtaining polytope estimates; (ii) we prove that the GDM algorithm is\nconsistent, under conditions on the Dirichlet distribution and the geometry of the topic polytope; (iii)\nwe propose a nonparametric extension of GDM and discuss geometric treatments for some of the\nLDA extensions; (iv) finally we provide a thorough evaluation of our method against a Gibbs sampler,\na variational algorithm, and the RecoverKL algorithm. Our method is shown to be comparable to a\nGibbs sampler in terms of estimation accuracy, but much more efficient in runtime. It outperforms\nthe RecoverKL algorithm in terms of accuracy, in some realistic settings of simulations and in real data.\nThe paper proceeds as follows. Section 2 provides a brief background of the LDA and its convex\ngeometric formulation. Section 3 carries out the contributions outlined above. Section 4 presents\nexperimental results. We conclude with a discussion in Section 5.\n\n2 Background on topic models\n\nIn this section we give an overview of the well-known Latent Dirichlet Allocation model for topic\nmodeling (Blei et al., 2003), and the geometry it entails. Let \u03b1 \u2208 R_+^K and \u03b7 \u2208 R_+^V be hyperparameters,\nwhere V denotes the number of words in a vocabulary, and K the number of topics. The K topics are\nrepresented as distributions on words: \u03b2k|\u03b7 \u223c DirV (\u03b7), for k = 1, . . . , K. Each of the M documents\ncan be generated as follows. First, draw the document topic proportions: \u03b8m|\u03b1 \u223c DirK(\u03b1), for\nm = 1, . . . , M. 
Next, for each of the Nm words in document m, pick a topic label z and then sample\na word d from the chosen topic:\n\nz_{n_m} | \u03b8m \u223c Categorical(\u03b8m); d_{n_m} | z_{n_m}, \u03b2_{1...K} \u223c Categorical(\u03b2_{z_{n_m}}). (1)\n\nEach of the resulting documents is a vector of length Nm with entries d_{n_m} \u2208 {1, . . . , V }, where\nn_m = 1, . . . , Nm. Because these words are exchangeable under the model, they are equivalently\nrepresented as a vector of word counts wm \u2208 N^V . In practice, the Dirichlet distributions are often\nsimplified to be symmetric Dirichlet, in which case hyperparameters \u03b1, \u03b7 \u2208 R_+ and we will proceed\nwith this setting. The two most common approaches for inference with the LDA are Gibbs sampling\n(Griffiths & Steyvers, 2004), based on the Multinomial-Dirichlet conjugacy, and mean-field inference\n(Blei et al., 2003). The former approach produces more accurate estimates but is less computationally\nefficient than the latter. The inefficiency of both techniques can be traced to the need for sampling or\nestimating the (redundant) topic labels. These labels are not intrinsic: any permutation of the topic\nlabels yields the same likelihood function.\n\nConvex geometry of topics. By integrating out the latent variables that represent the topic labels,\nwe obtain a geometric formulation of the LDA. Indeed, integrating z\u2019s out yields that, for m =\n1, . . . , M,\n\nwm | \u03b8m, \u03b2_{1...K}, Nm \u223c Multinomial(pm1, . . . , pmV , Nm),\n\nwhere pmi denotes the probability of observing the i-th word from the vocabulary in the m-th document,\nand is given by\n\npmi = \u2211_{k=1}^{K} \u03b8mk \u03b2ki for i = 1, . . . , V ; m = 1, . . . , M. (2)\n\nThe model\u2019s geometry becomes clear. Each topic is represented by a point \u03b2k lying in the V \u2212 1\ndimensional probability simplex \u2206^{V \u22121}. Let B := Conv(\u03b21, . . . 
, \u03b2K) be the convex hull of the\nK topics \u03b2k, then each document corresponds to a point pm := (pm1, . . . , pmV ) lying inside the\npolytope B. This point of view has been proposed before (Hofmann, 1999), although topic proportions\n\u03b8 were not given any geometric meaning. The following treatment of \u03b8 lets us relate to the LDA\u2019s\nDirichlet prior assumption and complete the geometric perspective of the problem. The Dirichlet\ndistribution generates probability vectors \u03b8m, which can be viewed as the (random) barycentric\ncoordinates of the document m with respect to the polytope B. Each pm = \u2211_k \u03b8mk \u03b2k is a vector of\ncartesian coordinates of the m-th document\u2019s multinomial probabilities. Given pm, document m is\ngenerated by taking wm \u223c Multinomial(pm, Nm). In Section 4 we will show how this interpretation\nof topic proportions can be utilized by other topic modeling approaches, including for example the\nRecoverKL algorithm of Arora et al. (2012). In the following the model geometry is exploited to\nderive a fast and effective geometric algorithm for inference and parameter estimation.\n\n3 Geometric inference of topics\n\nWe shall introduce a geometric loss function that can be viewed as a surrogate to the LDA\u2019s likelihood.\nTo begin, let \u03b2 denote the K \u00d7 V topic matrix with rows \u03b2k, \u03b8 be an M \u00d7 K document topic\nproportions matrix with rows \u03b8m, and W be an M \u00d7 V normalized word counts matrix with rows\nw\u0304m = wm/Nm.\n\n3.1 Geometric surrogate loss to the likelihood\n\nUnlike the original LDA formulation, here the Dirichlet distribution on \u03b8 can be viewed as a prior on\nparameters \u03b8. 
The log-likelihood of the observed corpus of M documents is\n\nL(\u03b8, \u03b2) = \u2211_{m=1}^{M} \u2211_{i=1}^{V} wmi log( \u2211_{k=1}^{K} \u03b8mk \u03b2ki ),\n\nwhere the parameters \u03b2 and \u03b8 are subject to the constraints \u2211_i \u03b2ki = 1 for each k = 1, . . . , K, and\n\u2211_k \u03b8mk = 1 for each m = 1, . . . , M. Partially relaxing these constraints and keeping only the one\nthat the sum of all entries for each row of the matrix product \u03b8\u03b2 is 1, yields the upper bound that\nL(\u03b8, \u03b2) \u2264 L(W ), where function L(W ) is given by\n\nL(W ) = \u2211_m \u2211_i wmi log w\u0304mi.\n\nWe can establish a tighter bound, which will prove useful (the proofs of this and other technical results\nare in the Supplement):\nProposition 1. Fix a topic polytope B and \u03b8. Let Um be the set of words present in document\nm, and assume that pmi > 0 for all i \u2208 Um. Then\n\nL(W ) \u2212 (1/2) \u2211_{m=1}^{M} Nm \u2211_{i \u2208 Um} (w\u0304mi \u2212 pmi)^2 \u2265 L(\u03b8, \u03b2) \u2265 L(W ) \u2212 \u2211_{m=1}^{M} Nm \u2211_{i \u2208 Um} (1/pmi) (w\u0304mi \u2212 pmi)^2.\n\nSince L(W ) is constant, the proposition above shows that maximizing the likelihood has the effect of\nminimizing the following quantity with respect to both \u03b8 and \u03b2:\n\n\u2211_m Nm \u2211_i (w\u0304mi \u2212 pmi)^2.\n\nFor each fixed \u03b2 (and thus B), minimizing first with respect to \u03b8 leads to the following\n\nG(B) := min_\u03b8 \u2211_m Nm \u2211_i (w\u0304mi \u2212 pmi)^2 = \u2211_{m=1}^{M} Nm min_{x: x \u2208 B} \u2016x \u2212 w\u0304m\u2016_2^2, (3)\n\nwhere the second equality in the above display is due to pm = \u2211_k \u03b8mk \u03b2k \u2208 B. 
The proposition\nsuggests a strategy for parameter estimation: \u03b2 (and B) can be estimated by minimizing the geometric\nloss function G:\n\nmin_B G(B) = min_B \u2211_{m=1}^{M} Nm min_{x: x \u2208 B} \u2016x \u2212 w\u0304m\u2016_2^2. (4)\n\nIn words, we aim to find a convex polytope B \u2282 \u2206^{V \u22121}, which is closest to the normalized word\ncounts w\u0304m of the observed documents. It is interesting to note the presence of document length Nm,\nwhich provides the weight for the squared \u21132 error for each document. Thus, our loss function adapts\nto the varying length of documents in the collection. Without the weights, our objective is similar to\nthe sum of squared errors of the Nonnegative Matrix Factorization (NMF). Ding et al. (2006) studied\nthe relation between the likelihood function of interest and NMF, but with a different objective of\nthe NMF problem and without geometric considerations. Once B\u0302 is obtained, \u03b8\u0302 can be obtained as\nthe barycentric coordinates of the projection of w\u0304m onto B\u0302 for each document m = 1, . . . , M (cf.\nEq. (3)). We note that if K \u2264 V , then B is a simplex and \u03b21, . . . , \u03b2K in general positions are the\nextreme points of B, and the barycentric coordinates are unique. (If K > V , the uniqueness no\nlonger holds). Finally, p\u0302m = \u03b8\u0302_m^T \u03b2\u0302 gives the cartesian coordinates of a point in B that minimizes\nEuclidean distance to the maximum likelihood estimate: p\u0302m = argmin_{x \u2208 B} \u2016x \u2212 w\u0304m\u2016_2. 
This projection\nis not available in the closed form, but a fast algorithm is available (Golubitsky et al., 2012), which\ncan easily be extended to find the corresponding distance and to evaluate our geometric objective.\n\n3.2 Geometric Dirichlet Means algorithm\n\nWe proceed to devise a procedure for approximately solving for the topic polytope B via Eq. (4): first,\nobtain an estimate of the underlying subspace based on weighted k-means clustering and then,\nestimate the vertices of the polytope that lie on the subspace just obtained via a geometric correction\ntechnique. Please refer to the Supplement for a clarification of the concrete connection between our\ngeometric loss function and other objectives which arise in subspace learning and weighted k-means\nclustering literature, the connection that motivates the first step of our algorithm.\n\nThe Geometric Dirichlet Means (GDM) algorithm estimates a topic polytope B based on the training\ndocuments (see Algorithm 1). The algorithm is conceptually simple, and consists of two main steps:\nFirst, we perform a (weighted) k-means clustering on the M points w\u03041, . . . , w\u0304M to obtain the K\ncentroids \u00b51, . . . , \u00b5K, and second, construct a ray emanating from a (weighted) center of the polytope\nand extending through each of the centroids \u00b5k until it intersects with a sphere of radius Rk or\nwith the simplex \u2206^{V \u22121} (whichever comes first). The intersection point will be our estimate for\nvertices \u03b2k, k = 1, . . . , K of the polytope B. The center C of the sphere is given in step 1 of the\nalgorithm, while Rk = max_{1\u2264m\u2264M} \u2016C \u2212 w\u0304m\u2016_2, where the maximum is taken over those documents m\nthat are clustered with label k. 
To see the intuition behind the algorithm, let us consider a simple\nsimulation experiment. We use the LDA data generative model with \u03b1 = 0.1, \u03b7 = 0.1, V = 5,\nK = 4, M = 5000, Nm = 100. Multidimensional scaling is used for visualization (Fig. 1). We\nobserve that the k-means centroids (pink) do not represent the topics very well, but our geometric\nmodification finds extreme points of the tetrahedron: red and yellow spheres overlap, meaning we\nfound the true topics. In this example, we have used a very small vocabulary size, but in practice V is\nmuch higher and the cluster centroids are often on the boundary of the vocabulary simplex, therefore\nwe have to threshold the betas at 0. Extending length until Rk is our default choice for the extension\nparameters:\n\nmk = Rk / \u2016C \u2212 \u00b5k\u2016_2 for k = 1, . . . , K, (5)\n\nbut we will see in our experiments that a careful tuning of the extension parameters based on\noptimizing the geometric objective (4) over a small range of mk helps to improve the performance\nconsiderably. We call this the tGDM algorithm (tuning details are presented in the Supplement). The\nconnection between extension parameters and the thresholding is the following: if the cluster centroid\nassigns probability to a word smaller than the whole data does on average, this word will be excluded\nfrom topic k with large enough mk. Therefore, the extension parameters can as well be used to\ncontrol for the sparsity of the inferred topics.\n\nAlgorithm 1 Geometric Dirichlet Means (GDM)\nInput: documents w1, . . . , wM , K,\n       extension scalar parameters m1, . . . , mK\nOutput: topics \u03b21, . . . , \u03b2K\n1: C = (1/M) \u2211_m w\u0304m {find center of the data}\n2: \u00b51, . . . , \u00b5K = weighted k-means(w\u03041, . . . , w\u0304M , K) {find centers of K clusters}.\n3: for all k = 1, . . . , K do\n4:   \u03b2k = C + mk (\u00b5k \u2212 C).\n5:   if any \u03b2ki < 0 then {threshold topic if it is outside vocabulary simplex \u2206^{V \u22121}}\n6:     for all i = 1, . . . , V do\n7:       \u03b2ki = \u03b2ki 1{\u03b2ki > 0} / \u2211_j \u03b2kj 1{\u03b2kj > 0}.\n8:     end for\n9:   end if\n10: end for\n11: \u03b21, . . . , \u03b2K.\n\nFigure 1: Visualization of GDM: Black, green, red and blue are cluster assignments; purple is the\ncenter, pink are cluster centroids, dark red are estimated topics and yellow are the true topics.\n\n3.3 Consistency of Geometric Dirichlet Means\n\nWe shall present a theorem which provides a theoretical justification for the Geometric Dirichlet\nMeans algorithm. 
In particular, we will show that the algorithm can achieve consistent estimates\nof the topic polytope, under conditions on the parameters of the Dirichlet distribution of the topic\nproportion vector \u03b8m, along with conditions on the geometry of the convex polytope B. The problem\nof estimating vertices of a convex polytope given data drawn from the interior of the polytope has\nlong been a subject of convex geometry; the usual setting in this literature is to assume the uniform\ndistribution for the data sample. Our setting is somewhat more general: the distribution of the points\ninside the polytope will be driven by a symmetric Dirichlet distribution setting, i.e., \u03b8m \u223c DirK(\u03b1) i.i.d.\n(If \u03b1 = 1 this results in the uniform distribution on B.) Let n = K \u2212 1. Assume that the document\nmultinomial parameters p1, . . . , pM (given in Eq. (2)) are the actual data. 
Now we formulate a\ngeometric problem linking the population version of k-means and polytope estimation:\nProblem 1. Given a convex polytope A \u2282 R^n, a continuous probability density function f (x)\nsupported by A, find a K-partition A = \u2294_{k=1}^{K} Ak that minimizes:\n\n\u2211_{k=1}^{K} \u222b_{Ak} \u2016\u00b5k \u2212 x\u2016_2^2 f (x) dx,\n\nwhere \u00b5k is the center of mass of Ak: \u00b5k := (1 / \u222b_{Ak} f (x) dx) \u222b_{Ak} x f (x) dx.\n\nThis problem is closely related to the Centroidal Voronoi Tessellations (Du et al., 1999). This\nconnection can be exploited to show that\nLemma 1. Problem 1 has a unique global minimizer.\n\nIn the following lemma, a median of a simplex is a line segment joining a vertex of a simplex with\nthe centroid of the opposite face.\nLemma 2. If A \u2282 R^n is an equilateral simplex with symmetric Dirichlet density f parameterized by\n\u03b1, then the optimal centers of mass of Problem 1 lie on the corresponding medians of A.\n\nBased upon these two lemmas, consistency is established under two distinct asymptotic regimes.\nTheorem 1. Let B = Conv(\u03b21, . . . , \u03b2K) be the true convex polytope from which the M-sample\np1, . . . , pM \u2208 \u2206^{V \u22121} are drawn via Eq. (2), where \u03b8m \u223c DirK(\u03b1) i.i.d. for m = 1, . . . , M.\n\n(a) If B is also an equilateral simplex, then topic estimates obtained by the GDM algorithm\nusing the extension parameters given in Eq. (5) converge to the vertices of B in probability,\nas \u03b1 is fixed and M \u2192 \u221e.\n\n(b) If M is fixed, while \u03b1 \u2192 0 then the topic estimates obtained by the GDM also converge to\nthe vertices of B in probability.\n\n3.4 nGDM: nonparametric geometric inference of topics\n\nIn practice, the number of topics K may be unknown, necessitating a nonparametric probabilistic\napproach such as the well-known Hierarchical Dirichlet Process (HDP) (Teh et al., 2006). 
Our\ngeometric approach can be easily extended to this situation. The objective (4) is now given by\n\nmin_B G(B) = min_B \u2211_{m=1}^{M} Nm min_{x \u2208 B} \u2016x \u2212 w\u0304m\u2016_2^2 + \u03bb|B|, (6)\n\nwhere |B| denotes the number of extreme points of convex polytope B = Conv(\u03b21, . . . , \u03b2K).\nAccordingly, our nGDM algorithm now consists of two steps: (i) solve a penalized and weighted\nk-means clustering to obtain the cluster centroids (e.g. using DP-means (Kulis & Jordan, 2012));\n(ii) apply geometric correction for recovering the extreme points, which proceeds as before. Our\ntheoretical analysis can be also extended to this nonparametric framework. We note that the penalty\nterm is reminiscent of the DP-means algorithm of Kulis & Jordan (2012), which was derived under a\nsmall-variance asymptotics regime. For the HDP this corresponds to \u03b1 \u2192 0, the regime in part\n(b) of Theorem 1. This is an unrealistic assumption in practice. Our geometric correction arguably\nenables the accounting of the non-vanishing variance in data. We perform a simulation experiment\nfor varying values of \u03b1 and show that nGDM outperforms the KL version of DP-means (Jiang et al.,\n2012) in terms of perplexity. This result is reported in the Supplement.\n\n4 Performance evaluation\n\nSimulation experiments. We use the LDA model to simulate data and focus our attention on the\nperplexity of held-out data and minimum-matching Euclidean distance between the true and estimated\ntopics (Tang et al., 2014). We explore settings with varying document lengths (Nm increasing from\n10 to 1400 - Fig. 2(a) and Fig. 3(a)), different number of documents (M increasing from 100 to 7000\n- Fig. 2(b) and Fig. 3(b)) and when lengths of documents are small, while number of documents\nis large (Nm = 50, M ranging from 1000 to 15000 - Fig. 2(c) and Fig. 3(c)). 
This last setting is\nof particular interest, since it is the most challenging for our algorithm, which in theory works well\ngiven long documents, but this is not always the case in practice. We compare two versions of the\nGeometric Dirichlet Means algorithm: with tuned extension parameters (tGDM) and the default one\n(GDM) (cf. Eq. (5)) against the variational EM (VEM) algorithm (Blei et al., 2003) (with tuned\nhyperparameters), collapsed Gibbs sampling (Griffiths & Steyvers, 2004) (with true data generating\nhyperparameters), and RecoverKL (Arora et al., 2012) and verify the theoretical upper bounds for\ntopic polytope estimation (i.e. either (log M/M )^{0.5} or (log Nm/Nm)^{0.5}) - cf. Tang et al. (2014)\nand Nguyen (2015). We are also interested in estimating each document\u2019s topic proportion via\nthe projection technique. RecoverKL produced only a topic matrix, which is combined with our\nprojection based estimates to compute the perplexity (Fig. 3). Unless otherwise specified, we set\n\u03b7 = 0.1, \u03b1 = 0.1, V = 1200, M = 1000, K = 5; Nm = 1000 for each m; the number of held-out\ndocuments is 100; results are averaged over 5 repetitions. Since finding an exact solution to the k-means\nobjective is NP-hard, we use the algorithm of Hartigan & Wong (1979) with 10 restarts and the\nk-means++ initialization. Our results show that (i) Gibbs sampling and tGDM have the best and\nalmost identical performance in terms of statistical estimation; (ii) RecoverKL and GDM are the\nfastest while sharing comparable statistical accuracy; (iii) VEM is the worst in most scenarios due\nto its instability (i.e. often producing poor topic estimates); (iv) short document lengths (Fig. 2(c)\nand Fig. 
3(c)) do not degrade performance of GDM (this appears to be an effect of the law of large\nnumbers, as the algorithm relies on the cluster means, which are obtained by averaging over a large\nnumber of documents); (v) our procedure for estimating document topic proportions results in a\ngood quality perplexity of the RecoverKL algorithm in all scenarios (Fig. 3) and could be potentially\nutilized by other algorithms. Additional simulation experiments are presented in the Supplement,\nwhich considers settings with varying Nm, \u03b1 and the nonparametric extension.\n\nFigure 2: Minimum-matching Euclidean distance: increasing Nm, M = 1000 (a); increasing M,\nNm = 1000 (b); increasing M, Nm = 50 (c); increasing \u03b7, Nm = 50, M = 5000 (d).\n\nFigure 3: Perplexity of the held-out data: increasing Nm, M = 1000 (a); increasing M, Nm = 1000\n(b); increasing M, Nm = 50 (c); increasing \u03b7, Nm = 50, M = 5000 (d).\n\nComparison to RecoverKL. Both tGDM and RecoverKL exploit the geometry of the model, but\nthey rely on very different assumptions: RecoverKL requires the presence of anchor words in the\ntopics and exploits this in a crucial way (Arora et al., 2012); our method relies on long documents in\ntheory, even though the violation of this does not appear to degrade its performance in practice, as we\nhave shown earlier. The comparisons are performed by varying the document length Nm, and varying\nthe Dirichlet parameter \u03b7 (recall that \u03b2k|\u03b7 \u223c DirV (\u03b7)). In terms of perplexity, RecoverKL, GDM\nand tGDM perform similarly (see Fig. 4(c,d)), with a slight edge to tGDM. Pronounced differences\ncome in the quality of the topics\u2019 word distribution estimates. To give RecoverKL the advantage, we\nconsidered manually inserting anchor words for each topic generated, while keeping the document\nlength short, Nm = 50 (Fig. 4(a,c)). 
We found that tGDM outperforms RecoverKL when \u03b7 \u2264 0.3,\nan arguably more common setting, while RecoverKL is more accurate when \u03b7 \u2265 0.5. However, if the\npresence of anchor words is not explicitly enforced, tGDM always outperforms RecoverKL in terms\nof topic distribution estimation accuracy for all \u03b7 (Fig. 2(d)). The superiority of tGDM persists even\nas Nm varies from 50 to 10000 (Fig. 4(b)), while GDM is comparable to RecoverKL in this setting.\n\nNIPS corpora analysis. We proceed with the analysis of the NIPS corpus.1 After preprocessing,\nthere are 1738 documents and 4188 unique words. The length of documents ranges from 39 to 1403 with\na mean of 272. We consider K = 5, 10, 15, 20, \u03b1 = 5/K, \u03b7 = 0.1. For each value of K we set aside\n300 documents chosen at random to compute the perplexity and average results over 3 repetitions.\nOur results are compared against Gibbs sampling, Variational EM and RecoverKL (Table 1). For\nK = 10, GDM with 1500 k-means iterations and 5 restarts in R took 50 sec; Gibbs sampling with\n5000 iterations took 10.5 min; VEM with 750 variational, 1500 EM iterations and 3 restarts took\n25.2 min; RecoverKL coded in Python took 1.1 min. 
We note that with recent developments (e.g.,\nHoffman et al., 2013) VEM could be made faster, but its statistical accuracy remains poor. Although\nRecoverKL is as fast as GDM, its perplexity performance is poor and is getting worse with more\ntopics, which we believe could be due to lack of anchor words in the data. We present topics found\nby Gibbs sampling, GDM and RecoverKL for K = 10 in the Supplement.\n\n1 https://archive.ics.uci.edu/ml/datasets/Bag+of+Words\n\nFigure 4: MM distance and Perplexity for varying \u03b7, Nm = 50 with anchors (a,c); varying Nm (b,d).\n\nTable 1: Perplexities of the 4 topic modeling algorithms trained on the NIPS dataset.\n\n         GDM    RecoverKL   VEM    Gibbs sampling\nK = 5    1269   1378        1980   1168\nK = 10   1061   1235        1953   924\nK = 15   957    1409        1545   802\nK = 20   763    1586        1352   704\n\n5 Discussion\n\nWe wish to highlight a conceptual aspect of GDM distinguishing it from moment-based methods\nsuch as RecoverKL. GDM operates on the document-to-document distance/similarity matrix, as\nopposed to the second-order word-to-word matrix. 
So, from an optimization viewpoint, our method\ncan be viewed as the dual to RecoverKL method, which must require anchor-word assumption to\nbe computationally feasible and theoretically justi\ufb01able. While the computational complexity of\nRecoverKL grows with the vocabulary size and not the corpora size, our convex geometric approach\ncontinues to be computationally feasible when number of documents is large: since only documents\nnear the polytope boundary are relevant in the inference of the extreme points, we can discard most\ndocuments residing near the polytope\u2019s center.\nWe discuss some potential improvements and extensions next. The tGDM algorithm showed a superior\nperformance when the extension parameters are optimized. This procedure, while computationally\neffective relative to methods such as Gibbs sampler, may still be not scalable to massive datasets. It\nseems possible to reformulate the geometric objective as a function of extension parameters, whose\noptimization can be performed more ef\ufb01ciently. In terms of theory, we would like to establish the\nerror bounds by exploiting the connection of topic inference to the geometric problem of Centroidal\nVoronoi Tessellation of a convex polytope.\nThe geometric approach to topic modeling and inference may lend itself naturally to other LDA\nextensions, as we have demonstrated with nGDM algorithm for the HDP (Teh et al., 2006). Correlated\ntopic models of Blei & Lafferty (2006a) also \ufb01t naturally into the geometric framework \u2014 we would\nneed to adjust geometric modi\ufb01cation to capture logistic normal distribution of topic proportions\ninside the topic polytope. Another interesting direction is to consider dynamic (Blei & Lafferty,\n2006b) (extreme points of topic polytope evolving over time) and supervised (McAuliffe & Blei,\n2008) settings. 
Such settings appear relatively more challenging, but they are worth pursuing further.\n\nAcknowledgments\n\nThis research is supported in part by grants NSF CAREER DMS-1351362 and NSF CNS-1409303.\n\n8\n\n0.010.020.030.00.30.60.9hMM distanceGDMtGDMRecoverKL0.0050.010025005000750010000Document length NmGDMtGDMRecoverKL02505007500.00.30.60.9hPerplexityGDMtGDMRecoverKL260265270275280025005000750010000Document length NmGDMtGDMRecoverKL\fReferences\nAnandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y. A spectral algorithm for Latent Dirichlet\n\nAllocation. Advances in Neural Information Processing Systems, 2012.\n\nArora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. A practical algorithm for\n\ntopic modeling with provable guarantees. arXiv preprint arXiv:1212.4777, 2012.\n\nBlei, D. M. and Lafferty, J. D. Correlated topic models. Advances in Neural Information Processing Systems,\n\n2006a.\n\nBlei, D. M. and Lafferty, J. D. Dynamic topic models. In Proceedings of the 23rd international conference on\n\nMachine learning, pp. 113\u2013120. ACM, 2006b.\n\nBlei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993\u20131022, March\n\n2003.\n\nDeerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic\n\nanalysis. Journal of the American Society for Information Science, 41(6):391, Sep 01 1990.\n\nDing, C., Li, T., and Peng, W. Nonnegative matrix factorization and probabilistic latent semantic indexing:\nEquivalence chi-square statistic, and a hybrid method. In Proceedings of the National Conference on Arti\ufb01cial\nIntelligence, volume 21, pp. 342. AAAI Press; MIT Press, 2006.\n\nDu, Q., Faber, V., and Gunzburger, M. Centroidal Voronoi Tessellations: applications and algorithms. SIAM\n\nReview, 41(4):637\u2013676, 1999.\n\nGolubitsky, O., Mazalov, V., and Watt, S. M. 
An algorithm to compute the distance from a point to a simplex. ACM Commun. Comput. Algebra, 46:57\u201357, 2012.\n\nGriffiths, T. L. and Steyvers, M. Finding scientific topics. PNAS, 101(suppl. 1):5228\u20135235, 2004.\n\nHartigan, J. A. and Wong, M. A. Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100\u2013108, 1979.\n\nHoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303\u20131347, May 2013.\n\nHofmann, T. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR \u201999, pp. 50\u201357. ACM, 1999.\n\nJiang, K., Kulis, B., and Jordan, M. I. Small-variance asymptotics for exponential family Dirichlet process mixture models. In Advances in Neural Information Processing Systems, pp. 3158\u20133166, 2012.\n\nKulis, B. and Jordan, M. I. Revisiting k-means: new algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning. ACM, 2012.\n\nMcAuliffe, J. D. and Blei, D. M. Supervised topic models. In Advances in Neural Information Processing Systems, pp. 121\u2013128, 2008.\n\nNguyen, X. Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618\u2013646, 02 2015.\n\nPritchard, J. K., Stephens, M., and Donnelly, P. Inference of population structure using multilocus genotype data. Genetics, 155(2):945\u2013959, 2000.\n\nTang, J., Meng, Z., Nguyen, X., Mei, Q., and Zhang, M. Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the 31st International Conference on Machine Learning, pp. 190\u2013198. ACM, 2014.\n\nTeh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. 
Journal of the American Statistical Association, 101(476), 2006.\n\nXu, W., Liu, X., and Gong, Y. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR \u201903, pp. 267\u2013273. ACM, 2003.\n", "award": [], "sourceid": 1308, "authors": [{"given_name": "Mikhail", "family_name": "Yurochkin", "institution": "University of Michigan"}, {"given_name": "XuanLong", "family_name": "Nguyen", "institution": "University of Michigan"}]}