{"title": "Anchor-Free Correlated Topic Modeling: Identifiability and Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1786, "page_last": 1794, "abstract": "In topic modeling, many algorithms that guarantee identifiability of the topics have been developed under the premise that there exist anchor words -- i.e., words that only appear (with positive probability) in one topic. Follow-up work has resorted to three or higher-order statistics of the data corpus to relax the anchor word assumption. Reliable estimates of higher-order statistics are hard to obtain, however, and the identification of topics under those models hinges on uncorrelatedness of the topics, which can be unrealistic. This paper revisits topic modeling based on second-order moments, and proposes an anchor-free topic mining framework. The proposed approach guarantees the identification of the topics under a much milder condition compared to the anchor-word assumption, thereby exhibiting much better robustness in practice. The associated algorithm only involves one eigen-decomposition and a few small linear programs. This makes it easy to implement and scale up to very large problem instances. Experiments using the TDT2 and Reuters-21578 corpus demonstrate that the proposed anchor-free approach exhibits very favorable performance (measured using coherence, similarity count, and clustering accuracy metrics) compared to the prior art.", "full_text": "Anchor-Free Correlated Topic Modeling:\n\nIdenti\ufb01ability and Algorithm\n\nKejun Huang\u2217\n\nXiao Fu\u2217\n\nDepartment of Electrical and Computer Engineering\n\nNicholas D. 
Sidiropoulos\n\nUniversity of Minnesota\n\nMinneapolis, MN 55455, USA\n\nhuang663@umn.edu\n\nxfu@umn.edu\n\nnikos@ece.umn.edu\n\nAbstract\n\nIn topic modeling, many algorithms that guarantee identi\ufb01ability of the topics have\nbeen developed under the premise that there exist anchor words \u2013 i.e., words that\nonly appear (with positive probability) in one topic. Follow-up work has resorted\nto three or higher-order statistics of the data corpus to relax the anchor word\nassumption. Reliable estimates of higher-order statistics are hard to obtain, however,\nand the identi\ufb01cation of topics under those models hinges on uncorrelatedness of\nthe topics, which can be unrealistic. This paper revisits topic modeling based on\nsecond-order moments, and proposes an anchor-free topic mining framework. The\nproposed approach guarantees the identi\ufb01cation of the topics under a much milder\ncondition compared to the anchor-word assumption, thereby exhibiting much\nbetter robustness in practice. The associated algorithm only involves one eigen-\ndecomposition and a few small linear programs. This makes it easy to implement\nand scale up to very large problem instances. Experiments using the TDT2 and\nReuters-21578 corpus demonstrate that the proposed anchor-free approach exhibits\nvery favorable performance (measured using coherence, similarity count, and\nclustering accuracy metrics) compared to the prior art.\n\nIntroduction\n\n1\nGiven a large collection of text data, e.g., documents, tweets, or Facebook posts, a natural question is\nwhat are the prominent topics in these data. Mining topics from a text corpus is motivated by a number\nof applications, from commercial design, news recommendation, document classi\ufb01cation, content\nsummarization, and information retrieval, to national security. Topic mining, or topic modeling, has\nattracted signi\ufb01cant attention in the broader machine learning and data mining community [1].\nIn 2003, Blei et al. 
proposed a Latent Dirichlet Allocation (LDA) model for topic mining [2], where the topics are modeled as probability mass functions (PMFs) over a vocabulary and each document is a mixture of the PMFs. Therefore, a word-document text data corpus can be viewed as a matrix factorization model. Under this model, posterior inference-based methods and approximations were proposed [2, 3], but identifiability issues -- i.e., whether the matrix factors are unique -- were not considered. Identifiability, however, is essential for topic modeling since it prevents the mixing of topics that confounds interpretation.\nIn recent years, considerable effort has been invested in designing identifiable models and estimation criteria as well as polynomial-time solvable algorithms for topic modeling [4, 5, 6, 7, 8, 9, 10, 11]. Essentially, these algorithms are based on the so-called separable nonnegative matrix factorization (NMF) model [12]. The key assumption is that every topic has an 'anchor word' that only appears in that particular topic. Based on this assumption, two classes of algorithms are usually employed, namely linear programming-based methods [5, 7] and greedy pursuit approaches [11, 6, 8, 10].\n\n*These authors contributed equally.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe former class has a serious complexity issue, as it lifts the number of variables to the square of the size of the vocabulary (or the number of documents); the latter, although computationally very efficient, usually suffers from error propagation if at some point one anchor word is incorrectly identified. Furthermore, since all the anchor word-based approaches essentially convert topic identification to the problem of seeking the vertices of a simplex, most of the above algorithms require normalizing each data column (or row) by its ℓ1 norm.
However, normalization at the factorization stage is usually not desired, since it may\ndestroy the good conditioning of the data matrix brought by pre-processing and amplify noise [8].\nUnlike many NMF-based methods that work directly with the word-document data, the approach\nproposed by Arora et al. [9, 10] works with the pairwise word-word correlation matrix, which has\nthe advantage of suppressing sampling noise and also features better scalability. However, [9, 10]\ndid not relax the anchor-word assumption or the need for normalization, and did not explore the\nsymmetric structure of the co-occurrence matrix \u2013 i.e., the algorithms in [9, 10] are essentially the\nsame asymmetric separable NMF algorithms as in [4, 6, 8].\nThe anchor-word assumption is reasonable in some cases, but using models without it is more\nappealing in more critical scenarios, e.g., when some topics are closely related and many key words\noverlap. Identi\ufb01able models without anchor words have been considered in the literature; e.g.,\n[13, 14, 15] make use of third or higher-order statistics of the data corpus to formulate the topic\nmodeling problem as a tensor factorization problem. There are two major drawbacks with this\napproach: i) third- or higher-order statistics require a lot more samples for reliable estimation relative\nto their lower-order counterparts (e.g., second-order word correlation statistics); and ii) identi\ufb01ability\nis guaranteed only when the topics are uncorrelated \u2013 where a super-symmetric parallel factor analysis\n(PARAFAC) model can be obtained [13, 14]. Uncorrelatedness is a restrictive assumption [10]. 
When\nthe topics are correlated, the model becomes a Tucker model which is not identi\ufb01able in general;\nidenti\ufb01ability needs more assumptions, e.g., sparsity of topic PMFs [15].\nContributions.\nIn this work, our interest lies in topic mining using word-word correlation matrices\nlike in [9, 10], because of its potential scalability and noise robustness. We propose an anchor-free\nidenti\ufb01able model and a practically implementable companion algorithm. Our contributions are two-\nfold: First, we propose an anchor-free topic identi\ufb01cation criterion. The criterion aims at factoring\nthe word-word correlation matrix using a word-topic PMF matrix and a topic-topic correlation matrix\nvia minimizing the determinant of the topic-topic correlation matrix. We show that under a so-called\nsuf\ufb01ciently scattered condition, which is much milder than the anchor-word assumption, the two\nmatrices can be uniquely identi\ufb01ed by the proposed criterion. We emphasize that the proposed\napproach does not need to resort to higher-order statistics tensors to ensure topic identi\ufb01ability, and\nit can naturally deal with correlated topics, unlike what was previously available in topic modeling,\nto the best of our knowledge. Second, we propose a simple procedure for handling the proposed\ncriterion that only involves eigen-decomposition of a large but sparse matrix, plus a few small linear\nprograms \u2013 therefore highly scalable and well-suited for topic mining. Unlike greedy pursuit-based\nalgorithms, the proposed algorithm does not involve de\ufb02ation and is thus free from error propagation;\nit also does not require normalization of the data columns / rows. 
Carefully designed experiments using the TDT2 and Reuters text corpora showcase the effectiveness of the proposed approach.\n2 Background\nConsider a document corpus D ∈ R^{V×D}, where each column of D corresponds to a document and D(v, d) denotes a certain measurement of word v in document d, e.g., the word frequency of term v in document d or the term frequency-inverse document frequency (tf-idf) measurement that is often used in topic mining. A commonly used model is\n\nD ≈ CW,  (1)\n\nwhere C ∈ R^{V×F} is the word-topic matrix, whose f-th column C(:, f) represents the probability mass function (PMF) of topic f over a vocabulary of words, and W(f, d) denotes the weight of topic f in document d [2, 13, 10]. Since the matrices C and W are both nonnegative, (1) becomes a nonnegative matrix factorization (NMF) model -- and many early works tried to use NMF and its variants to deal with this problem [16]. However, NMF does not admit a unique solution in general, unless both C and W satisfy certain sparsity-related conditions [17]. In recent years, much effort has been put into devising polynomial-time solvable algorithms for NMF models that admit unique factorization. Such models and algorithms usually rely on an assumption called "separability" in the NMF literature [12]:\nAssumption 1 (Separability / Anchor-Word Assumption) There exists a set of indices Λ = {v1, ..., vF} such that C(Λ, :) = Diag(c), where c ∈ R^F.\nIn topic modeling, it turns out that the separability condition has a nice physical interpretation, i.e., every topic f for f = 1, ..., F has a 'special' word that has nonzero probability of appearing in topic f and zero probability of appearing in other topics. These words are called 'anchor words' in the topic modeling literature. Under Assumption 1, the task of matrix factorization boils down to finding these anchor words v1, ..., vF, since D(Λ, :) = Diag(c)W -- which is already a scaled version of W -- and then C can be estimated via (constrained) least squares.\n\nAlgorithm 1: Successive Projection Algorithm [6]\ninput: D, F.\nΣ = Diag(1^T D^T);\nX = D^T Σ^{-1} (normalization);\nΛ = ∅;\nfor f = 1, ..., F do\n  v̂_f ← arg max_{v ∈ {1,...,V}} ||X(:, v)||_2;\n  Λ ← [Λ, v̂_f];\n  Θ ← arg min_Θ ||X − X(:, Λ)Θ||_F^2;\n  X ← X − X(:, Λ)Θ;\nend\noutput: Λ\n\nMany algorithms have been proposed to tackle this index-picking problem in the context of separable NMF, hyperspectral unmixing, and text mining. The arguably simplest algorithm is the so-called successive projection algorithm (SPA) [6], presented in Algorithm 1. SPA-like algorithms first define a normalized matrix X = D^T Σ^{-1}, where Σ = Diag(1^T D^T) [11]. Note that X = GS, where G(:, f) = W(f, :)^T / ||W(f, :)||_1 and S(f, v) = C(v, f) ||W(f, :)||_1 / ||D(v, :)||_1. Consequently, we have 1^T S = 1^T if W ≥ 0, meaning that the columns of X all lie on the simplex spanned by the columns of G, and the vertices of this simplex correspond to the anchor words. Also, the columns of S all live in the unit simplex. After normalization, SPA sequentially identifies the vertices of the data simplex, in conjunction with a deflation procedure. The algorithms in [8, 10, 11] can also be considered variants of SPA, with different deflation procedures and pre-/post-processing. In particular, the algorithm in [8] avoids normalization -- for real-world data, normalization at the factorization stage may amplify noise and damage the good conditioning of the data matrix brought by pre-processing, e.g., the tf-idf procedure [8].
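For illustration, the SPA iteration described above can be sketched as follows. This is a minimal re-implementation for exposition, not the code used in the experiments; the function and variable names are ours.

```python
# Minimal sketch of the successive projection algorithm (SPA): normalize the
# rows of D to unit l1 norm (X = D^T Sigma^{-1}), then repeatedly pick the
# column of X with the largest 2-norm and project the remaining columns onto
# the orthogonal complement of the picked column (deflation).
import numpy as np

def spa(D, F):
    """Return F candidate anchor-word indices from a V x D term-document matrix."""
    row_sums = D.sum(axis=1)                # Sigma = Diag(1^T D^T)
    X = (D / row_sums[:, None]).T           # columns of X lie on a simplex
    anchors = []
    R = X.copy()
    for _ in range(F):
        v = int(np.argmax(np.linalg.norm(R, axis=0)))
        anchors.append(v)
        u = R[:, v] / np.linalg.norm(R[:, v])
        R = R - np.outer(u, u @ R)          # deflation: project out new vertex
    return anchors
```

Under Assumption 1, the columns of X associated with anchor words are the vertices of the simplex, so on noiseless separable data the returned indices are exactly the anchor rows.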
To pick out vertices, there are also algorithms using linear programming and sparse\noptimization [7, 5], but these have serious scalability issues and thus are less appealing.\nIn practice D may contain considerable noise, and this has been noted in the literature. In [9, 10,\n14, 15], the authors proposed to use second and higher-order statistics for topic mining. Particularly,\nArora et al. [9, 10] proposed to work with the following matrix:\nP = E{DDT} = CEC T ,\n\n(2)\nwhere E = E{W W T} can be interpreted as a topic-topic correlation matrix. The matrix P is by\nde\ufb01nition a word-word correlation matrix, but also has a nice interpretation: if D(v, d) denotes the\nfrequency of word v occurring in document d, P (i, j) is the likelihood that term i and j co-occur\nin a document [9, 10]. There are two advantages in using P : i) if there is zero-mean white noise, it\nwill be signi\ufb01cantly suppressed through the averaging process; and ii) the size of P does not grow\nwith the size of the data if the vocabulary is \ufb01xed. The latter is a desired property when the number\nof documents is very large, and we pick a (possibly limited but) manageable vocabulary to work\nwith. Problems with similar structure to that of P also arise in the context of graph models, where\ncommunities and correlations appear as the underlying factors. The algorithm proposed in [10] also\nmakes use of Assumption 1 and is conceptually close to Algorithm 1. The work in [13, 14, 15]\nrelaxed the anchor-word assumption. The methods there make use of three or higher-order statistics,\ne.g., P \u2208 RV \u00d7V \u00d7V whose (i, j, k)th entry represents the co-occurrence of three terms. The work in\n[13, 14] showed that P is a tensor satisfying the parallel factor analysis (PARAFAC) model and thus\nC is uniquely identi\ufb01able, if the topics are uncorrelated, which is a restrictive assumption (a counter\nexample would be politics and economy). 
When the topics are correlated, additional assumptions like sparsity are needed to restore identifiability [15]. Another important concern is that reliable estimates of higher-order statistics require much larger data sizes, and tensor decomposition is computationally cumbersome as well.\nRemark 1 Among all the aforementioned methods, the deflation-based methods are seemingly more efficient. However, if the deflation procedure in Algorithm 1 (the update of Θ) has constraints like in [8, 11], there is a serious complexity issue: solving a constrained least squares problem with F V variables is not an easy task. Data sparsity is destroyed after the first deflation step, and thus even first-order methods or coordinate descent as in [8, 11] do not really help. This point will be exemplified in our experiments.\n3 Anchor-Free Identifiable Topic Mining\nIn this work, we are primarily interested in mining topics from the matrix P because of its noise robustness and scalability. We will formulate topic modeling as an optimization problem, and show that the word-topic matrix C can be identified under a much more relaxed condition, which includes the relatively strict anchor-word assumption as a special case.\n3.1 Problem Formulation\nLet us begin with the model P = CEC^T, subject to the constraint that each column of C represents the PMF of words appearing in a specific topic, so that C^T 1 = 1, C ≥ 0. Such a symmetric matrix decomposition is in general not identifiable, as we can always pick a non-singular matrix A ∈ R^{F×F} such that A^T 1 = 1, A ≥ 0, and define C̃ = CA and Ẽ = A^{-1} E A^{-T}; then P = C̃ Ẽ C̃^T with C̃^T 1 = 1, C̃ ≥ 0. We wish to find an identification criterion such that under some mild conditions the corresponding solution can only be the ground-truth E and C up to some trivial ambiguities such as a common column permutation. To this end, we propose the following criterion:\n\nminimize_{E ∈ R^{F×F}, C ∈ R^{V×F}}  |det E|,  subject to  P = CEC^T, C^T 1 = 1, C ≥ 0.  (3)\n\nThe first observation is that if the anchor-word assumption is satisfied, the optimal solutions of the above identification criterion are the ground-truth C and E and their column-permuted versions. Formally, we show that:\nProposition 1 Let (C⋆, E⋆) be an optimal solution of (3). If the separability / anchor-word assumption (cf. Assumption 1) is satisfied and rank(P) = F, then C⋆ = CΠ and E⋆ = Π^T EΠ, where Π is a permutation matrix.\nThe proof of Proposition 1 can be found in the supplementary material. Proposition 1 is merely a 'sanity check' of the identification criterion in (3): it shows that the criterion is at least a sound one under the anchor-word assumption. Note that, when the anchor-word assumption is satisfied, SPA-type algorithms are in fact preferable over the identification criterion in (3), due to their simplicity. The point of the non-convex formulation in (3) is that it can guarantee identifiability of C and E even when the anchor-word assumption is grossly violated. To explain, we will need the following.\nAssumption 2 (sufficiently scattered) Let cone(C^T)* denote the polyhedral cone {x : Cx ≥ 0}, and K denote the second-order cone {x : ||x||_2 ≤ 1^T x}. Matrix C is called sufficiently scattered if it satisfies: (i) cone(C^T)* ⊆ K, and (ii) cone(C^T)* ∩ bd K = {λ e_f : λ ≥ 0, f = 1, ..., F}, where bd K denotes the boundary of K, i.e., bd K = {x : ||x||_2 = 1^T x}.\nOur main result is based on this assumption, whose first consequence is as follows:\nLemma 1 If C ∈ R^{V×F} is sufficiently scattered, then rank(C) = F. In addition, given rank(P) = F, any feasible solution Ẽ ∈ R^{F×F} of Problem (3) has full rank and thus |det Ẽ| > 0.\nLemma 1 ensures that any feasible solution pair (C̃, Ẽ) of Problem (3) has full rank F when the ground-truth C is sufficiently scattered, which is important from the optimization perspective -- otherwise |det Ẽ| could always be made zero, which would be a trivial optimal solution of (3). Based on Lemma 1, we further show that:\nTheorem 1 Let (C⋆, E⋆) be an optimal solution of (3). If the ground-truth C is sufficiently scattered (cf. Assumption 2) and rank(P) = F, then C⋆ = CΠ and E⋆ = Π^T EΠ, where Π is a permutation matrix.\nThe proof of Theorem 1 is relegated to the supplementary material. In words, for a sufficiently scattered C and an arbitrary square matrix E, given P = CEC^T, C and E can be identified up to permutation via solving (3). To understand the sufficiently scattered condition and Theorem 1, it is better to look at the dual cones. The notation cone(C^T)* = {x : Cx ≥ 0} comes from the fact that it is the dual cone of the conic hull of the row vectors of C, i.e., cone(C^T) = {C^T θ : θ ≥ 0}. A useful property of dual cones is that for two convex cones K1 and K2, if K1 ⊆ K2, then K2* ⊆ K1*, which means the first requirement of Assumption 2 is equivalent to\n\nK* ⊆ cone(C^T).  (4)\n\nNote that the dual cone of K is another second-order cone [12], i.e., K* = {x : x^T 1 ≥ sqrt(F−1) ||x||_2}, which is tangent to and contained in the nonnegative orthant.\n\nFigure 1: A graphical view of the rows of C (blue dots) and various cones in R^3, sliced at the plane 1^T x = 1. The triangle indicates the nonnegative orthant, the enclosing circle is K, and the smaller circle is K*. The shaded region is cone(C^T), and the polygon with dashed sides is cone(C^T)*. Panels: (a) separable / anchor word; (b) sufficiently scattered; (c) not identifiable. The matrix C can be identified up to column permutation in the left two cases, and clearly separability is more restrictive than (and a special case of) sufficiently scattered.\n\nEq. (4) and the definition of K* in fact give a straightforward comparison between the proposed sufficiently scattered condition and the existing anchor-word assumption. An illustration of Assumptions 1 and 2 is shown in Fig. 1(a)-(b) using an F = 3 case, where one can see that sufficiently scattered is much more relaxed compared to the anchor-word assumption: if the rows of the word-topic matrix C are geometrically scattered enough so that cone(C^T) contains the inner circle (i.e., the second-order cone K*), then the identifiability of the criterion in (3) is guaranteed. However, the anchor-word assumption requires that cone(C^T) fill the entire triangle, i.e., the nonnegative orthant, which is far more restrictive. Fig. 1(c) shows a case where the rows of C are not "well scattered" in the nonnegative orthant, and indeed such a matrix C cannot be identified via solving (3).\nRemark 2 A salient feature of the criterion in (3) is that it does not need to normalize the data columns to a simplex -- all the arguments in Theorem 1 are cone-based. The upshot is clear: there is no risk of amplifying noise or changing the conditioning of P at the factorization stage. Furthermore, matrix E can be any symmetric matrix; it can contain negative values, which may cover more applications beyond topic modeling, where E is always nonnegative and positive semidefinite.
This shows the surprising effectiveness of the sufficiently scattered condition.\nThe sufficiently scattered assumption appeared in identifiability proofs of several matrix factorization models [17, 18, 19] with different identification criteria. Huang et al. [17] used this condition to show the identifiability of plain NMF, while Fu et al. [19] related the sufficiently scattered condition to the so-called volume-minimization criterion for blind source separation. Note that volume minimization also minimizes a determinant-related cost function. Like the SPA-type algorithms, volume minimization works with data that live in a simplex; therefore applying it still requires data normalization, which is not desired in practice. Theorem 1 can be considered a more natural application of the sufficiently scattered condition to co-occurrence/correlation-based topic modeling, which exploits the symmetry of the model and avoids normalization.\n3.2 AnchorFree: A Simple and Scalable Algorithm\nThe identification criterion in (3) poses an interesting yet challenging optimization problem. One way to tackle it is to consider the following approximation:\n\nminimize_{E, C}  ||P − CEC^T||_F^2 + µ |det E|,  subject to  C ≥ 0, C^T 1 = 1,  (5)\n\nwhere µ ≥ 0 balances the data fidelity and the minimal-determinant criterion. The difficulty is that the term CEC^T makes the problem tri-linear and not easily decoupled. In addition, tuning a good µ may also be difficult. In this work, we propose an easier procedure for handling the determinant-minimization problem in (3), which is summarized in Algorithm 2 and referred to as AnchorFree. To explain the procedure, first notice that P is symmetric and positive semidefinite. Therefore, one can apply a square-root decomposition P = BB^T, where B ∈ R^{V×F}. We can take advantage of well-established tools for eigen-decomposition of sparse matrices, and there is widely available software that can compute this very efficiently. Now, we have B = CE^{1/2}Q, where Q^T Q = QQ^T = I and E = E^{1/2}E^{1/2}; i.e., the representing coefficients of CE^{1/2} in the range space of B must be orthonormal because of the symmetry of P. We also notice that\n\nminimize_{E, C, Q}  |det E^{1/2}Q|,  subject to  B = CE^{1/2}Q, C^T 1 = 1, C ≥ 0, Q^T Q = I,  (6)\n\nhas the same optimal solutions as (3). Since Q is unitary, it does not affect the determinant, so we further let M = Q^T E^{-1/2} and obtain the following optimization problem:\n\nmaximize_M  |det M|,  subject to  M^T B^T 1 = 1, BM ≥ 0.  (7)\n\nBy this reformulation, C has been marginalized and only F^2 variables are left, which is significantly smaller than the variable size of the original problem, V F + F^2, where V is the vocabulary size. Problem (7) is still non-convex, but can be handled very efficiently. Here, we propose to employ the solver proposed in [18], where the same subproblem (7) was used to solve a dynamical system identification problem. The idea is to apply the co-factor expansion to deal with the determinant objective function, first proposed in the context of non-negative blind source separation [20]: if we fix all the columns of M except the f-th one, det M becomes a linear function with respect to M(:, f), i.e., det M = Σ_{k=1}^{F} (−1)^{f+k} M(k, f) det M̄_{k,f} = a^T M(:, f), where a = [a_1, ..., a_F]^T, a_k = (−1)^{f+k} det M̄_{k,f} for all k = 1, ..., F, and M̄_{k,f} is the matrix obtained by removing the k-th row and f-th column of M.
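This co-factor linearization can be checked numerically. The sketch below is purely illustrative (our own naming, 0-based indices, so the sign pattern is (−1)^(f+k) with f and k starting at 0):

```python
# Numerical check of the co-factor expansion: with all columns of M fixed
# except column f, det(M) is linear in M[:, f], i.e. det(M) == a @ M[:, f],
# where a_k = (-1)^(f+k) * det(Mbar_{k,f}) and Mbar_{k,f} is M with its
# k-th row and f-th column removed.
import numpy as np

def cofactor_vector(M, f):
    """Coefficients a such that det(M) == a @ M[:, f] (0-based f)."""
    F = M.shape[0]
    a = np.empty(F)
    for k in range(F):
        Mbar = np.delete(np.delete(M, k, axis=0), f, axis=1)
        a[k] = (-1.0) ** (f + k) * np.linalg.det(Mbar)
    return a
```

Each column update of the AO procedure then only needs this vector a, since the objective restricted to one column is the linear function a^T M(:, f).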
Maximizing |a^T x| subject to linear constraints is still a non-convex problem, but we can solve it via maximizing both a^T x and −a^T x, followed by picking the solution that gives the larger absolute objective. Then, cyclically updating the columns of M results in an alternating optimization (AO) algorithm. The algorithm is computationally lightweight: each linear program only involves F variables, leading to a worst-case complexity of O(F^3.5) flops even when an interior-point method is employed, and empirically it takes 5 or fewer AO iterations to converge. In the supplementary material, simulations on synthetic data are given, showing that Algorithm 2 can indeed recover the ground-truth matrices C and E even when C grossly violates the separability / anchor-word assumption.\n\nAlgorithm 2: AnchorFree\ninput: D, F.\nP ← Co-Occurrence(D);\nP = BB^T; M ← I;\nrepeat\n  for f = 1, ..., F do\n    a_k = (−1)^{f+k} det M̄_{k,f}, for all k = 1, ..., F;  // remove the k-th row and f-th column of M to obtain M̄_{k,f}\n    m_max = arg max_x a^T x s.t. Bx ≥ 0, 1^T Bx = 1;\n    m_min = arg min_x a^T x s.t. Bx ≥ 0, 1^T Bx = 1;\n    M(:, f) ← whichever of m_max, m_min gives the larger |a^T x|;\n  end\nuntil convergence;\nC⋆ = BM;\nE⋆ = (C⋆^T C⋆)^{-1} C⋆^T P C⋆ (C⋆^T C⋆)^{-1};\noutput: C⋆, E⋆\n\n4 Experiments\nData. In this section, we apply the proposed algorithm and the baselines to two popular text mining datasets, namely, the NIST Topic Detection and Tracking (TDT2) and the Reuters-21578 corpora. We use a subset of the TDT2 corpus consisting of 9,394 documents which are single-category articles belonging to the largest 30 categories. The Reuters-21578 corpus is the ModApte version, where 8,293 single-category documents are kept.
The original vocabulary sizes of the TDT2 and\nthe Reuters dataset are 36, 771 and 18, 933, respectively, and stop words are removed for each trial\nof the experiments. We use the standard tf-idf data as the D matrix, and estimate the correlation\nmatrix using the biased estimator suggested in [9]. A standard pre-processing technique, namely,\nnormalized-cut weighted (NCW) [21], is applied to D; NCW is a well-known trick for handling the\nunbalanced-cluster-size problem. For each trial of our experiment, we randomly draw F categories\nof documents, form the P matrix, and apply the proposed algorithm and the baselines.\nBaselines We employ several popular anchor word-based algorithms as baselines. Speci\ufb01cally,\nthe successive projection algorithm (SPA) [6], the successive nonnegative projection algorithm\n(SNPA) [11], the XRAY algorithm [8], and the fast anchor words (FastAnchor) [10] algorithm.\nSince we are interested in word-word correlation/co-occurrence based mining, all the algorithms are\n\n6\n\n\fTable 1: Experiment results on the TDT2 corpus.\n\nSPA\n\nSNPA XRAY AnchorFree FastAchor SPA SNPA XRAY AnchorFree FastAchor SPA SNPA XRAY AnchorFree\n\nF FastAchor\n-612.72\n3\n-648.20\n4\n-641.79\n5\n-654.18\n6\n-668.92\n7\n8\n-681.35\n-688.54\n9\n-732.39\n10\n-734.13\n15\n-756.90\n20\n25\n-792.92\n\nF FastAchor\n-652.67\n3\n-633.69\n4\n-650.49\n5\n6\n-654.74\n-733.73\n7\n-735.23\n8\n-761.27\n9\n-764.18\n10\n-800.51\n15\n20\n-859.48\n-889.55\n25\n\n-613.43 -613.43 -597.16\n-648.04 -648.04 -657.51\n-643.91 -643.91 -665.20\n-645.68 -645.68 -674.30\n-665.55 -665.55 -664.38\n-674.45 -674.45 -657.78\n-671.81 -671.81 -690.39\n-724.64 -724.64 -698.59\n-730.19 -730.19 -773.17\n-747.99 -747.99 -819.36\n-792.29 -792.29 -876.28\n\n-433.87\n-430.07\n-405.19\n-432.96\n-397.77\n-450.63\n-416.44\n-421.25\n-445.30\n-461.64\n-473.95\n\nCoh\n\nCoh\n\n-647.28 -647.28 -574.72\n-637.89 -637.89 -586.41\n-652.53 -652.53 -581.73\n-644.34 -644.34 -586.00\n-732.01 -732.01 
-612.97\n-738.54 -738.54 -616.32\n-755.46 -755.46 -640.36\n-759.40 -759.40 -656.71\n-801.17 -801.17 -585.18\n-860.70 -860.70 -615.62\n-890.16 -890.16 -633.75\n\n-830.24\n-741.35\n-762.64\n-705.60\n-692.12\n-726.37\n-713.81\n-709.48\n-688.39\n-683.64\n-672.44\n\nSimCount\n\n7.98\n\n7.98\n8.94\n11.18 11.18 13.70\n13.36 13.36 22.56\n18.10 18.10 31.56\n18.84 18.84 39.06\n25.14 25.14 40.30\n29.10 29.10 53.68\n29.86 29.86 53.16\n52.62 52.62 59.96\n65.00 65.00 82.92\n66.00 66.00 101.52\n\nSimCount\n\n11.02\n16.92\n21.66\n39.54\n45.24\n83.86\n\n3.86\n11.02\n9.92\n16.92\n13.06\n21.66\n27.42\n39.54\n34.64\n45.24\n83.86\n82.52\n118.98 118.98 119.28\n121.74 121.74 130.82\n309.7\n309.7 227.02\n538.54 538.54 502.82\n650.96\n\n673\n\n673\n\n7.98\n10.60\n13.06\n18.94\n20.14\n24.82\n27.50\n31.08\n51.62\n66.26\n69.46\n\n10.98\n16.74\n21.74\n39.9\n47.02\n85.04\n117.48\n119.54\n307.86\n539.58\n674.78\n\n1.84\n2.88\n4.40\n7.18\n4.48\n9.12\n9.70\n13.02\n41.88\n79.60\n133.42\n\n7.36\n12.66\n15.48\n19.98\n35.62\n62.02\n72.38\n86.02\n124.6\n225.6\n335.24\n\nClustAcc\n\n0.74 0.75\n0.69 0.69\n0.63 0.62\n0.58 0.59\n0.60 0.59\n0.56 0.58\n0.58 0.58\n0.55 0.54\n0.50 0.50\n0.47 0.47\n0.47 0.47\n\n0.69 0.69\n0.61 0.61\n0.55 0.55\n0.49 0.50\n0.57 0.57\n0.53 0.54\n0.56 0.56\n0.52 0.52\n0.40 0.40\n0.36 0.36\n0.33 0.32\n\n0.73\n0.69\n0.64\n0.60\n0.58\n0.57\n0.53\n0.49\n0.42\n0.38\n0.37\n\n0.66\n0.60\n0.52\n0.46\n0.54\n0.47\n0.47\n0.42\n0.42\n0.38\n0.37\n\nClustAc\n\n0.71\n0.70\n0.63\n0.65\n0.62\n0.57\n0.61\n0.59\n0.51\n0.47\n0.46\n\n0.66\n0.51\n0.51\n0.47\n0.43\n0.40\n0.37\n0.35\n0.33\n0.31\n0.26\n\n0.98\n0.94\n0.92\n0.91\n0.90\n0.87\n0.86\n0.85\n0.80\n0.77\n0.74\n\n0.79\n0.73\n0.65\n0.64\n0.65\n0.61\n0.59\n0.59\n0.53\n0.52\n0.47\n\nTable 2: Experiment results on the Reuters-21578 corpus.\n\nSPA\n\nSNPA XRAY AnchorFree FastAchor SPA SNPA XRAY AnchorFree FastAchor SPA SNPA XRAY AnchorFree\n\n(cid:80)\n\ncombined with the framework provided in [10] and the ef\ufb01cient RecoverL2 process is employed 
for estimating the topics after the anchors are identified.
Evaluation To evaluate the results, we employ several metrics. First, coherence (Coh) is used to measure single-topic quality. For a set of words V, the coherence is defined as Coh = Σ_{v1,v2 ∈ V} log((freq(v1, v2) + ε)/freq(v2)), where v1 and v2 denote the indices of two words in the vocabulary, freq(v2) and freq(v1, v2) denote the numbers of documents in which v2 appears and in which v1 and v2 co-occur, respectively, and ε = 0.01 is used to prevent taking the log of zero. Coherence is considered well-aligned with human judgment when evaluating a single topic: a higher coherence score means better quality of a mined topic. However, coherence does not evaluate the relationship between different mined topics; e.g., if the F mined topics are all identical, the coherence score can still be high but meaningless. To alleviate this, we also use the similarity count (SimCount) that was adopted in [10]: for each topic, the similarity count is obtained simply by adding up the words that overlap with the other topics within the leading N words, so a smaller SimCount means the mined topics are more distinguishable. When the topics are highly correlated (but different), the leading words of the topics may overlap with each other, and thus SimCount alone may still not suffice to evaluate the results. We therefore also include clustering accuracy (ClustAcc), obtained by using the mined C⋆ matrix to estimate the weights W of the documents and applying k-means to W. Since the ground-truth labels of TDT2 and Reuters are known, clustering accuracy can be calculated, and it serves as a good indicator of topic mining results.
Table 1 shows the experiment results on the TDT2 corpus. From F = 3 to 25, the proposed algorithm (AnchorFree) gives very promising results: for all three metrics considered, AnchorFree consistently gives better results than the baselines.
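As a concrete reference for the evaluation metrics described above, the coherence and similarity-count computations can be sketched in a few lines. This is a minimal Python sketch on a hypothetical toy corpus; summing over unordered word pairs and reading SimCount as pairwise leading-word overlap are our assumptions about the exact conventions.

```python
from itertools import combinations
from math import log

# Hypothetical toy corpus: each document is represented by its set of distinct words.
docs = [
    {"shuttle", "space", "nasa", "crew"},
    {"shuttle", "space", "astronauts"},
    {"bulls", "jazz", "nba", "game"},
    {"bulls", "nba", "jordan", "game"},
]

def coherence(topic_words, docs, eps=0.01):
    """Coh = sum over word pairs (v1, v2) of log((freq(v1, v2) + eps) / freq(v2)),
    where freq(.) counts the documents containing the given word(s)."""
    score = 0.0
    for v1, v2 in combinations(topic_words, 2):
        co = sum(1 for d in docs if v1 in d and v2 in d)  # freq(v1, v2)
        f2 = sum(1 for d in docs if v2 in d)              # freq(v2)
        score += log((co + eps) / f2)
    return score

def sim_count(topics, N=20):
    """Total overlap among the leading N words over all pairs of mined topics;
    a smaller value means the topics are more distinguishable."""
    return sum(len(set(t1[:N]) & set(t2[:N])) for t1, t2 in combinations(topics, 2))

# A topically coherent word set scores higher than a mixed one:
print(coherence(["shuttle", "space"], docs) > coherence(["shuttle", "bulls"], docs))
# Identical topics have maximal overlap; disjoint topics have none:
print(sim_count([["bulls", "nba"], ["bulls", "nba"]]), sim_count([["shuttle"], ["bulls"]]))
```

The third metric, ClustAcc, has no closed form: it estimates the document weights W from the mined word-topic matrix, runs k-means on W, and matches the resulting clusters against the known TDT2/Reuters labels.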
In particular, the ClustAcc values obtained by AnchorFree are at least 30% higher than those of the baselines in all cases. In addition, the single-topic quality of the topics mined by AnchorFree is the highest in terms of coherence scores, and the overlaps between topics are the smallest except for F = 20 and 25.
Table 2 shows the results on the Reuters-21578 corpus. In this experiment, we can see that XRAY is best in terms of single-topic quality, while AnchorFree is second best when F > 6. For SimCount, AnchorFree gives the lowest values when F > 6. In terms of clustering accuracy, the topics obtained by AnchorFree again lead to much higher clustering accuracies in all cases.
In terms of runtime performance, one can see from Fig. 2(a) that FastAnchor, SNPA, XRAY, and AnchorFree perform similarly on the TDT2 dataset. SPA is the fastest algorithm since it has a recursive update [6]. SNPA and XRAY both perform nonnegative least squares-based deflation, which is computationally heavy when the vocabulary size is large, as mentioned in Remark 1. AnchorFree uses AO and small-scale linear programming, which is conceptually more involved compared to SNPA and XRAY.
However, since the linear programs involved only have F variables and the number of AO iterations is usually small (fewer than 5 in practice), the runtime performance is quite satisfactory and is close to those of SNPA and XRAY, which are greedy algorithms. The runtime performance on the Reuters dataset is shown in Fig. 2(b), where one can see that the deflation-based methods are faster. The reason is that the vocabulary size of the Reuters corpus is much smaller than that of the TDT2 corpus (18,933 vs. 36,771).

Figure 2: Runtime performance of the algorithms under various settings. (a) TDT2; (b) Reuters-21578.

Table 3 shows the leading words of the topics mined by FastAnchor and AnchorFree from an F = 5 case using the TDT2 corpus. We only present the result of FastAnchor since it gives qualitatively the best benchmark; the complete results given by all baselines can be found in the supplementary material. We see that the topics given by AnchorFree show clear diversity: the Lewinsky scandal, the General Motors strike, Space Shuttle Columbia, the 1997 NBA finals, and a school shooting in Jonesboro, Arkansas. FastAnchor, on the other hand, exhibits great overlap between its first and second mined topics. Lewinsky also shows up in the fifth topic mined by FastAnchor, which is mainly about the 1997 NBA finals. This showcases the clear advantage of our proposed criterion in terms of giving more meaningful and interpretable results, compared to the anchor-word based approaches.

Table 3: Twenty leading words of mined topics from an F = 5 case of the TDT2 experiment.

FastAnchor (anchor word of each topic in parentheses):
Topic 1 (predicts): allegations, lewinsky, clinton, lady, white, hillary, monica, starr, house, husband, dissipate, president, intern, affair, infidelity, grand, jury, sexual, justice, obstruction
Topic 2 (slipping): poll, cnnusa, gallup, allegations, clinton, presidents, rating, lewinsky, president, approval, starr, white, monica, house, hurting, slipping, americans, public, sexual, affair
Topic 3 (cleansing): columbia, shuttle, space, crew, astronauts, nasa, experiments, mission, stories, fix, repair, rats, unit, aboard, brain, system, broken, nervous, cleansing, dioxide
Topic 4 (strangled): gm, motors, plants, workers, michigan, flint, strikes, auto, plant, strike, gms, idled, production, walkouts, north, union, assembly, talks, shut, striking
Topic 5 (tenday): bulls, jazz, nba, utah, finals, game, chicago, jordan, series, malone, michael, tonight, lakers, win, karl, lewinsky, games, basketball, night, championship

AnchorFree:
Topic 1: lewinsky, monica, starr, grand, white, jury, house, clinton, counsel, intern, independent, president, investigation, affair, lewinskys, relationship, sexual, ken, former, starrs
Topic 2: gm, motors, plants, flint, workers, michigan, auto, plant, strikes, gms, strike, union, idled, north, shut, talks, assembly, production, walkouts, autoworkers
Topic 3: shuttle, space, columbia, astronauts, nasa, crew, experiments, rats, mission, nervous, brain, aboard, system, earth, mice, animals, fish, weightlessness, neurological, seven
Topic 4: bulls, jazz, nba, chicago, game, utah, finals, jordan, malone, michael, series, championship, karl, pippen, basketball, win, night, sixth, games, title
Topic 5: jonesboro, arkansas, school, shooting, boys, teacher, students, westside, middle, 11year, fire, girls, mitchell, shootings, suspects, funerals, children, killed, 13year, johnson

5 Conclusion
In this paper, we considered identifiable anchor-free correlated topic modeling. A topic estimation criterion based on the word-word co-occurrence/correlation matrix was proposed, and its identifiability conditions were proven.
The proposed approach features a topic identifiability guarantee under much milder conditions than the anchor-word assumption, and thus exhibits better robustness to model mismatch. A simple procedure that only involves one eigen-decomposition and a few small linear programs was proposed to deal with the formulated criterion. Experiments on real text corpus data showcased the effectiveness of the proposed approach.
Acknowledgment
This work is supported in part by the National Science Foundation (NSF) under project numbers NSF-ECCS 1608961 and NSF IIS-1247632, and in part by the Digital Technology Initiative (DTI) Seed Grant, University of Minnesota.
References
[1] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[4] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization – provably. In ACM Symposium on Theory of Computing, pages 145–162. ACM, 2012.
[5] B. Recht, C. Re, J. Tropp, and V. Bittorf. Factoring nonnegative matrices with linear programs. In Proc. NIPS 2012, pages 1214–1222, 2012.
[6] N. Gillis and S. A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):698–714, April 2014.
[7] N. Gillis. Robustness analysis of hottopixx, a linear programming model for factoring nonnegative matrices. SIAM Journal on Matrix Analysis and Applications, 34(3):1189–1212, 2013.
[8] A. Kumar, V.
Sindhwani, and P. Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In Proc. ICML-12, 2012.
[9] S. Arora, R. Ge, and A. Moitra. Learning topic models – going beyond SVD. In Proc. FOCS 2012, pages 1–10. IEEE, 2012.
[10] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In Proc. ICML-13, 2013.
[11] N. Gillis. Successive nonnegative projection algorithm for robust nonnegative blind source separation. SIAM Journal on Imaging Sciences, 7(2):1420–1450, 2014.
[12] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Proc. NIPS 2003, volume 16, 2003.
[13] A. Anandkumar, Y.-K. Liu, D. J. Hsu, D. P. Foster, and S. M. Kakade. A spectral algorithm for latent Dirichlet allocation. In Proc. NIPS 2012, pages 917–925, 2012.
[14] A. Anandkumar, S. M. Kakade, D. P. Foster, Y.-K. Liu, and D. Hsu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. Technical report, 2012.
[15] A. Anandkumar, D. J. Hsu, M. Janzamin, and S. M. Kakade. When are overcomplete topic models identifiable? Uniqueness of tensor Tucker decompositions with structured sparsity. In Proc. NIPS 2013, pages 1986–1994, 2013.
[16] D. Cai, X. He, and J. Han. Locally consistent concept factorization for document clustering. IEEE Trans. Knowl. Data Eng., 23(6):902–913, 2011.
[17] K. Huang, N. Sidiropoulos, and A. Swami. Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Trans. Signal Process., 62(1):211–224, 2014.
[18] K. Huang, N. D. Sidiropoulos, E. E. Papalexakis, C. Faloutsos, P. P. Talukdar, and T. M. Mitchell. Principled neuro-functional connectivity discovery. In Proc.
SIAM Conference on Data Mining (SDM), 2015.
[19] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos. Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain. IEEE Trans. Signal Process., 63(9):2306–2320, May 2015.
[20] W.-K. Ma, T.-H. Chan, C.-Y. Chi, and Y. Wang. Convex analysis for non-negative blind source separation with application in imaging. In D. P. Palomar and Y. Eldar, editors, Convex Optimization in Signal Processing and Communications, chapter 7, pages 229–265. Cambridge University Press, 2010.
[21] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273. ACM, 2003.