{"title": "Analysis of Variational Bayesian Latent Dirichlet Allocation: Weaker Sparsity Than MAP", "book": "Advances in Neural Information Processing Systems", "page_first": 1224, "page_last": 1232, "abstract": "Latent Dirichlet allocation (LDA) is a popular generative model of various objects such as texts and images, where an object is expressed as a mixture of latent topics. In this paper, we theoretically investigate variational Bayesian (VB) learning in LDA. More specifically, we analytically derive the leading term of the VB free energy under an asymptotic setup, and show that there exist transition thresholds in Dirichlet hyperparameters around which the sparsity-inducing behavior drastically changes. Then we further theoretically reveal the notable phenomenon that VB tends to induce weaker sparsity than MAP in the LDA model, which is opposed to other models. We experimentally demonstrate the practical validity of our asymptotic theory on real-world Last.FM music data.", "full_text": "Analysis of Variational Bayesian Latent Dirichlet\n\nAllocation: Weaker Sparsity than MAP\n\nShinichi Nakajima\n\nBerlin Big Data Center, TU Berlin\n\nBerlin 10587 Germany\n\nnakajima@tu-berlin.de\n\nIssei Sato\n\nUniversity of Tokyo\nTokyo 113-0033 Japan\n\nsato@r.dl.itc.u-tokyo.ac.jp\n\nMasashi Sugiyama\nUniversity of Tokyo\n\nTokyo 113-0033, Japan\n\nsugi@k.u-tokyo.ac.jp\n\nKazuho Watanabe\n\nToyohashi University of Technology\n\nAichi 441-8580 Japan\n\nwkazuho@cs.tut.ac.jp\n\nHiroko Kobayashi\nNikon Corporation\n\nKanagawa 244-8533 Japan\n\nhiroko.kobayashi@nikon.com\n\nAbstract\n\nLatent Dirichlet allocation (LDA) is a popular generative model of various objects\nsuch as texts and images, where an object is expressed as a mixture of latent top-\nics. In this paper, we theoretically investigate variational Bayesian (VB) learning\nin LDA. 
More specifically, we analytically derive the leading term of the VB free energy under an asymptotic setup, and show that there exist transition thresholds in Dirichlet hyperparameters around which the sparsity-inducing behavior drastically changes. Then we further theoretically reveal the notable phenomenon that VB tends to induce weaker sparsity than MAP in the LDA model, which is opposed to other models. We experimentally demonstrate the practical validity of our asymptotic theory on real-world Last.FM music data.\n\n1 Introduction\n\nLatent Dirichlet allocation (LDA) [5] is a generative model successfully used in various applications such as text analysis [5], image analysis [15], genometrics [6, 4], human activity analysis [12], and collaborative filtering [14, 20].\u00b9 Given word occurrences of documents in a corpus, LDA expresses each document as a mixture of multinomial distributions, each of which is expected to capture a topic. The extracted topics provide bases of a low-dimensional feature space, in which each document is compactly represented. This topic expression was shown to be useful for solving various tasks including classification [15], retrieval [26], and recommendation [14].\nSince rigorous Bayesian inference is computationally intractable in the LDA model, various approximation techniques such as variational Bayesian (VB) learning [3, 7] are used. Previous theoretical studies on VB learning revealed that VB tends to produce sparse solutions, e.g., in mixture models [24, 25, 13], hidden Markov models [11], Bayesian networks [23], and fully-observed matrix factorization [17]. Here, we mean by sparsity that VB exhibits the automatic relevance determination\n\u00b9 For simplicity, we use the terminology of text analysis below. 
However, the range of application of the theory given in this paper is not limited to texts.\n\n(ARD) effect [19], which automatically prunes irrelevant degrees of freedom under non-informative or weakly sparse priors. Therefore, it is naturally expected that VB-LDA also produces a sparse solution (in terms of topics). However, it is often observed that VB-LDA does not generally give sparse solutions.\nIn this paper, we attempt to clarify this gap by theoretically investigating the sparsity-inducing mechanism of VB-LDA. More specifically, we first analytically derive the leading term of the VB free energy in some asymptotic limits, and show that there exist transition thresholds in Dirichlet hyperparameters around which the sparsity-inducing behavior changes drastically. We then analyze the behavior of MAP and its variants in a similar way, and show that the VB solution is less sparse than the MAP solution in the LDA model. This phenomenon is completely opposite to other models such as mixture models [24, 25, 13], hidden Markov models [11], Bayesian networks [23], and fully-observed matrix factorization [17], where VB tends to induce stronger sparsity than MAP. We numerically demonstrate the practical validity of our asymptotic theory using artificial and real-world Last.FM music data for collaborative filtering, and further discuss the peculiarity of the LDA model in terms of sparsity.\nThe free energy of VB-LDA was previously analyzed in [16], which evaluated the advantage of collapsed VB [21] over the original VB learning. However, that work focused on the difference between VB and collapsed VB, and neither the absolute free energy nor the sparsity was investigated. The update rules of VB were compared with those of MAP in [2]. However, that work is based on an approximation, and a rigorous analysis was not made. 
To the best of our knowledge, our paper is the first work that theoretically elucidates the sparsity-inducing mechanism of VB-LDA.\n\n2 Formulation\n\nIn this section, we introduce the latent Dirichlet allocation model and variational Bayesian learning.\n\n2.1 Latent Dirichlet Allocation\n\nSuppose that we observe M documents, each of which consists of N^{(m)} words. Each word is included in a vocabulary with size L. We assume that each word is associated with one of the H topics, which is not observed. We express the word occurrence by an L-dimensional indicator vector w, where one of the entries is equal to one and the others are equal to zero. Similarly, we express the topic occurrence as an H-dimensional indicator vector z. We define the following functions that give the item numbers chosen by w and z, respectively:\n\n\u00b4l(w) = l if w_l = 1 and w_{l'} = 0 for l' \u2260 l,    \u00b4h(z) = h if z_h = 1 and z_{h'} = 0 for h' \u2260 h.\n\nIn the latent Dirichlet allocation (LDA) model [5], the word occurrence w^{(n,m)} of the n-th position in the m-th document is assumed to follow the multinomial distribution:\n\np(w^{(n,m)} | \u0398, B) = \u03a0_{l=1}^{L} [(B\u0398^\u22a4)_{l,m}]^{w_l^{(n,m)}} = (B\u0398^\u22a4)_{\u00b4l(w^{(n,m)}),m},    (1)\n\nwhere \u0398 \u2208 [0,1]^{M\u00d7H} and B \u2208 [0,1]^{L\u00d7H} are parameter matrices to be estimated. The rows of \u0398 and the columns of B are probability mass vectors that sum up to one. We denote a column vector of a matrix by a bold lowercase letter, and a row vector by a bold lowercase letter with a tilde, i.e.,\n\n\u0398 = (\u03b8_1, ..., \u03b8_H) = (~\u03b8_1, ..., ~\u03b8_M)^\u22a4,    B = (\u03b2_1, ..., \u03b2_H) = (~\u03b2_1, ..., ~\u03b2_L)^\u22a4.\n\nWith this notation, ~\u03b8_m denotes the topic distribution of the m-th document, and \u03b2_h denotes the word distribution of the h-th topic.\nGiven the topic occurrence latent variable z^{(n,m)}, the complete likelihood is written as\n\np(w^{(n,m)}, z^{(n,m)} | \u0398, B) = p(w^{(n,m)} | z^{(n,m)}, B) p(z^{(n,m)} | \u0398),    (2)\n\nwhere p(w^{(n,m)} | z^{(n,m)}, B) = \u03a0_{l=1}^{L} \u03a0_{h=1}^{H} (B_{l,h})^{w_l^{(n,m)} z_h^{(n,m)}},    p(z^{(n,m)} | \u0398) = \u03a0_{h=1}^{H} (\u0398_{m,h})^{z_h^{(n,m)}}.\n\nWe assume the Dirichlet prior on \u0398 and B:\n\np(\u0398|\u03b1) \u221d \u03a0_{m=1}^{M} \u03a0_{h=1}^{H} (\u0398_{m,h})^{\u03b1\u22121},    p(B|\u03b7) \u221d \u03a0_{h=1}^{H} \u03a0_{l=1}^{L} (B_{l,h})^{\u03b7\u22121},    (3)\n\nFigure 1: Graphical model of LDA.\n\nwhere \u03b1 and \u03b7 are hyperparameters that control the prior sparsity. We can make \u03b1 dependent on m and/or h, and \u03b7 dependent on l and/or h, and they can be estimated from observation. However, we fix those hyperparameters as given constants for simplicity in our analysis below. Figure 1 shows the graphical model of LDA.\n\n2.2 Variational Bayesian Learning\n\nThe Bayes posterior of LDA is written as\n\np(\u0398, B, {z^{(n,m)}} | {w^{(n,m)}}, \u03b1, \u03b7) = p({w^{(n,m)}}, {z^{(n,m)}} | \u0398, B) p(\u0398|\u03b1) p(B|\u03b7) / p({w^{(n,m)}}),    (4)\n\nwhere p({w^{(n,m)}}) = \u222b p({w^{(n,m)}}, {z^{(n,m)}} | \u0398, B) p(\u0398|\u03b1) p(B|\u03b7) d\u0398 dB d{z^{(n,m)}} is intractable to compute and thus requires some approximation method. In this paper, we focus on the variational Bayesian (VB) approximation and investigate its behavior theoretically.\nIn the VB approximation, we assume that our approximate posterior is factorized as\n\nq(\u0398, B, {z^{(n,m)}}) = q(\u0398, B) q({z^{(n,m)}}),    (5)\n\nand minimize the free energy:\n\nF = \u27e8 log [ q(\u0398, B, {z^{(n,m)}}) / ( p({w^{(n,m)}}, {z^{(n,m)}} | \u0398, B) p(\u0398|\u03b1) p(B|\u03b7) ) ] \u27e9_{q(\u0398,B,{z^{(n,m)}})},    (6)\n\nwhere \u27e8\u00b7\u27e9_p denotes the expectation over the distribution p. 
This amounts to finding the distribution that is closest to the Bayes posterior (4) under the constraint (5). Using the variational method, we can obtain the following stationary conditions:\n\nq(\u0398) \u221d p(\u0398|\u03b1) exp\u27e8 log p({w^{(n,m)}}, {z^{(n,m)}} | \u0398, B) \u27e9_{q(B) q({z^{(n,m)}})},    (7)\nq(B) \u221d p(B|\u03b7) exp\u27e8 log p({w^{(n,m)}}, {z^{(n,m)}} | \u0398, B) \u27e9_{q(\u0398) q({z^{(n,m)}})},    (8)\nq({z^{(n,m)}}) \u221d exp\u27e8 log p({w^{(n,m)}}, {z^{(n,m)}} | \u0398, B) \u27e9_{q(\u0398) q(B)}.    (9)\n\nFrom this, we can confirm that {q(~\u03b8_m)} and {q(\u03b2_h)} follow the Dirichlet distribution and {q(z^{(n,m)})} follows the multinomial distribution:\n\nq(\u0398) \u221d \u03a0_{m=1}^{M} \u03a0_{h=1}^{H} (\u0398_{m,h})^{\u02d8\u0398_{m,h}\u22121},    q(B) \u221d \u03a0_{h=1}^{H} \u03a0_{l=1}^{L} (B_{l,h})^{\u02d8B_{l,h}\u22121},    (10)\nq({z^{(n,m)}}) = \u03a0_{m=1}^{M} \u03a0_{n=1}^{N^{(m)}} \u03a0_{h=1}^{H} (\u27e8z_h^{(n,m)}\u27e9)^{z_h^{(n,m)}},    (11)\n\nwhere, for \u03a8(\u00b7) denoting the digamma function, the variational parameters satisfy\n\n\u02d8\u0398_{m,h} = \u03b1 + \u03a3_{n=1}^{N^{(m)}} \u27e8z_h^{(n,m)}\u27e9,    \u02d8B_{l,h} = \u03b7 + \u03a3_{m=1}^{M} \u03a3_{n=1}^{N^{(m)}} w_l^{(n,m)} \u27e8z_h^{(n,m)}\u27e9,    (12)\n\n\u27e8z_h^{(n,m)}\u27e9 = exp( \u03a8(\u02d8\u0398_{m,h}) + \u03a3_{l=1}^{L} w_l^{(n,m)} ( \u03a8(\u02d8B_{l,h}) \u2212 \u03a8(\u03a3_{l'=1}^{L} \u02d8B_{l',h}) ) ) / \u03a3_{h'=1}^{H} exp( \u03a8(\u02d8\u0398_{m,h'}) + \u03a3_{l=1}^{L} w_l^{(n,m)} ( \u03a8(\u02d8B_{l,h'}) \u2212 \u03a8(\u03a3_{l'=1}^{L} \u02d8B_{l',h'}) ) ).    (13)\n\n2.3 Partially Bayesian Learning and MAP Estimation\n\nWe can partially apply VB learning by approximating the posterior of \u0398 or B by the delta function. This approach is called partially Bayesian (PB) learning [18], whose behavior was analyzed and compared with VB in fully-observed matrix factorization. We call it PBA learning if \u0398 is marginalized and B is point-estimated, and PBB learning if B is marginalized and \u0398 is point-estimated. Note that the original VB algorithm for LDA proposed by [5] corresponds to PBA in our terminology. 
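The coupled updates (12)-(13) can be iterated directly. The following is a minimal pure-Python sketch (not the authors' code): the corpus layout counts[m][l], the function names, and the series-based digamma approximation are our own choices for illustration; since the responsibilities in (13) are identical for all occurrences of the same word in a document, the sketch weights each word type by its count.

```python
import math

def digamma(x):
    # Psi(x) via the recurrence Psi(x) = Psi(x + 1) - 1/x and an asymptotic series.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x ** 2) + 1.0 / (120 * x ** 4)

def vb_lda_step(counts, theta_t, b_t, alpha, eta):
    # One sweep of the stationary conditions (12)-(13).
    # counts[m][l]: word counts per document; theta_t: M x H; b_t: L x H.
    M, L, H = len(counts), len(counts[0]), len(theta_t[0])
    new_theta = [[alpha] * H for _ in range(M)]
    new_b = [[eta] * H for _ in range(L)]
    b_col = [digamma(sum(b_t[l][h] for l in range(L))) for h in range(H)]
    for m in range(M):
        t_sum = digamma(sum(theta_t[m]))
        for l in range(L):
            if counts[m][l] == 0:
                continue
            # Responsibilities <z_h^{(n,m)}>, Eq. (13), computed once per word type.
            logit = [digamma(theta_t[m][h]) - t_sum
                     + digamma(b_t[l][h]) - b_col[h] for h in range(H)]
            mx = max(logit)
            z = [math.exp(v - mx) for v in logit]
            s = sum(z)
            for h in range(H):
                new_theta[m][h] += counts[m][l] * z[h] / s   # Eq. (12), left
                new_b[l][h] += counts[m][l] * z[h] / s       # Eq. (12), right
    return new_theta, new_b
```

Iterating such sweeps amounts to coordinate descent on the free energy (6); one would stop when the change in F falls below a tolerance.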
We also analyze the behavior of MAP estimation, where both \u0398 and B are point-estimated. This corresponds to the probabilistic latent semantic analysis (pLSA) model [10], if we assume the flat prior \u03b1 = \u03b7 = 1 [8].\n\n3 Theoretical Analysis\n\nIn this section, we first give an explicit form of the free energy in the LDA model. We then investigate its asymptotic behavior for VB learning, and further conduct similar analyses for the PBA, PBB, and MAP methods. Finally, we discuss the sparsity-inducing mechanism of these learning methods, and the relation to previous theoretical studies.\n\n3.1 Explicit Form of Free Energy\n\nWe first express the free energy (6) as a function of the variational parameters \u02d8\u0398 and \u02d8B:\n\nF = R + Q,    (14)\n\nwhere\n\nR = \u27e8 log [ q(\u0398) q(B) / ( p(\u0398|\u03b1) p(B|\u03b7) ) ] \u27e9_{q(\u0398,B)}\n  = \u03a3_{m=1}^{M} [ log( \u0393(\u03a3_{h'=1}^{H} \u02d8\u0398_{m,h'}) \u0393(\u03b1)^H / ( \u0393(H\u03b1) \u03a0_{h'=1}^{H} \u0393(\u02d8\u0398_{m,h'}) ) ) + \u03a3_{h=1}^{H} (\u02d8\u0398_{m,h} \u2212 \u03b1)( \u03a8(\u02d8\u0398_{m,h}) \u2212 \u03a8(\u03a3_{h'=1}^{H} \u02d8\u0398_{m,h'}) ) ]\n  + \u03a3_{h=1}^{H} [ log( \u0393(\u03a3_{l'=1}^{L} \u02d8B_{l',h}) \u0393(\u03b7)^L / ( \u0393(L\u03b7) \u03a0_{l'=1}^{L} \u0393(\u02d8B_{l',h}) ) ) + \u03a3_{l=1}^{L} (\u02d8B_{l,h} \u2212 \u03b7)( \u03a8(\u02d8B_{l,h}) \u2212 \u03a8(\u03a3_{l'=1}^{L} \u02d8B_{l',h}) ) ],    (15)\n\nQ = \u27e8 log [ q({z^{(n,m)}}) / p({w^{(n,m)}}, {z^{(n,m)}} | \u0398, B) ] \u27e9_{q(\u0398,B,{z^{(n,m)}})}\n  = \u2212 \u03a3_{m=1}^{M} N^{(m)} \u03a3_{l=1}^{L} V_{l,m} log[ \u03a3_{h=1}^{H} exp( \u03a8(\u02d8\u0398_{m,h}) \u2212 \u03a8(\u03a3_{h'=1}^{H} \u02d8\u0398_{m,h'}) ) exp( \u03a8(\u02d8B_{l,h}) \u2212 \u03a8(\u03a3_{l'=1}^{L} \u02d8B_{l',h}) ) ].    (16)\n\nHere, V \u2208 R^{L\u00d7M} is the empirical word distribution matrix with its entries given by V_{l,m} = (1/N^{(m)}) \u03a3_{n=1}^{N^{(m)}} w_l^{(n,m)}. Note that we have eliminated the variational parameters {\u27e8z^{(n,m)}\u27e9} for the topic occurrence latent variables by using the stationary condition (13).\n\n3.2 Asymptotic Analysis of VB Solution\n\nBelow, we investigate the leading term of the free energy in the asymptotic limit when N \u2261 min_m N^{(m)} \u2192 \u221e. Unlike the previous analysis for latent variable models [24], we do not assume L, M \u226a N, but 1 \u226a L, M, N at this point. This amounts to considering the asymptotic limit when L, M, N \u2192 \u221e with a fixed mutual ratio, or equivalently, assuming L, M \u223c O(N). Throughout the paper, H is set at H = min(L, M) (i.e., the matrix B\u0398^\u22a4 can express any multinomial distribution). We assume that the word distribution matrix V is a sample from the multinomial distribution with the true parameter U* \u2208 R^{L\u00d7M} whose rank is H* \u223c O(1), i.e., U* = B*\u0398*^\u22a4 where \u0398* \u2208 R^{M\u00d7H*} and B* \u2208 R^{L\u00d7H*}.\u00b2 We assume that \u03b1, \u03b7 \u223c O(1).\nThe stationary condition (12) leads to the following lemma (the proof is given in Appendix A):\n\nLemma 1 Let \u02c6B\u02c6\u0398^\u22a4 = \u27e8B\u0398^\u22a4\u27e9_{q(\u0398,B)}. Then, it holds that\n\n\u27e8 (B\u0398^\u22a4 \u2212 \u02c6B\u02c6\u0398^\u22a4)\u00b2_{l,m} \u27e9_{q(\u0398,B)} = O_p(N^{\u22122}),    (17)\nQ = \u2212 \u03a3_{m=1}^{M} N^{(m)} \u03a3_{l=1}^{L} V_{l,m} log(\u02c6B\u02c6\u0398^\u22a4)_{l,m} + O_p(M),    (18)\n\nwhere O_p(\u00b7) denotes the order in probability.\n\n\u00b2 More precisely, U* = B*\u0398*^\u22a4 + O(N^{\u22121}) is sufficient.\n\nEq. (17) implies the convergence of the posterior. Let\n\n\u02c6J = \u03a3_{l=1}^{L} \u03a3_{m=1}^{M} \u03ba( (\u02c6B\u02c6\u0398^\u22a4)_{l,m} \u2260 (B*\u0398*^\u22a4)_{l,m} + O_p(N^{\u22121}) )    (19)\n\nbe the number of entries of \u02c6B\u02c6\u0398^\u22a4 that do not converge to the true values. Here, we denote by \u03ba(\u00b7) the indicator function equal to one if the event is true, and zero otherwise. Then, Eq. (18) leads to the following lemma:\n\nLemma 2 Q is minimized when \u02c6B\u02c6\u0398^\u22a4 = B*\u0398*^\u22a4 + O_p(N^{\u22121}), and it holds that\n\nQ = S + O_p(\u02c6J N + M),    where    S = \u2212 log p({w^{(n,m)}}, {z^{(n,m)}} | \u0398*, B*) = \u2212 \u03a3_{m=1}^{M} N^{(m)} \u03a3_{l=1}^{L} V_{l,m} log(B*\u0398*^\u22a4)_{l,m}.\n\nLemma 2 simply states that Q/N converges to the normalized entropy S/N of the true distribution (which is the lowest achievable value with probability 1), if and only if VB converges to the true distribution (i.e., \u02c6J = 0).\nLet \u02c6H = \u03a3_{h=1}^{H} \u03ba( (1/M) \u03a3_{m=1}^{M} \u02c6\u0398_{m,h} \u223c O_p(1) ) be the number of topics used in the whole corpus, \u02c6M^{(h)} = \u03a3_{m=1}^{M} \u03ba( \u02c6\u0398_{m,h} \u223c O_p(1) ) be the number of documents that contain the h-th topic, and \u02c6L^{(h)} = \u03a3_{l=1}^{L} \u03ba( \u02c6B_{l,h} \u223c O_p(1) ) be the number of words of which the h-th topic consists. We have the following lemma (the proof is given in Appendix B):\n\nLemma 3 R is written as follows:\n\nR = [ M(H\u03b1 \u2212 1/2) + \u02c6H(L\u03b7 \u2212 1/2) \u2212 \u03a3_{h=1}^{H} ( \u02c6M^{(h)}(\u03b1 \u2212 1/2) + \u02c6L^{(h)}(\u03b7 \u2212 1/2) ) ] log N + (H \u2212 \u02c6H)(L\u03b7 \u2212 1/2) log L + O_p(H(M + L)).    (20)\n\nSince we assumed that the true matrices \u0398* and B* are of rank H*, \u02c6H = H* \u223c O(1) is sufficient for the VB posterior to converge to the true distribution. 
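Each per-document and per-topic term of R in Eq. (15) is the Kullback-Leibler divergence from a Dirichlet posterior to the symmetric Dirichlet prior. A minimal pure-Python sketch of one such summand (the function names and the series-based digamma approximation are our own, not from the paper; math.lgamma is the standard-library log-Gamma):

```python
import math

def digamma(x):
    # Psi(x) via recurrence plus asymptotic series; adequate for x > 0.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x ** 2) + 1.0 / (120 * x ** 4)

def kl_dirichlet_to_symmetric(post, c):
    # KL( Dir(post_1, ..., post_K) || Dir(c, ..., c) ): one summand of Eq. (15).
    k = len(post)
    s = sum(post)
    val = math.lgamma(s) - math.lgamma(k * c)          # log Gamma(sum) / Gamma(Kc)
    val += sum(math.lgamma(c) - math.lgamma(a) for a in post)
    val += sum((a - c) * (digamma(a) - digamma(s)) for a in post)
    return val
```

R is then the sum of such terms over the M rows of \u02d8\u0398 (with c = \u03b1) plus the H columns of \u02d8B (with c = \u03b7); the divergence vanishes when the posterior equals the prior and is positive otherwise.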
However, \u02c6H can be much larger than H* with \u27e8B\u0398^\u22a4\u27e9_{q(\u0398,B)} unchanged because of the non-identifiability of matrix factorization: duplicating topics with divided weights, for example, does not change the distribution.\nBased on Lemma 2 and Lemma 3, we obtain the following theorem (the proof is given in Appendix C):\n\nTheorem 1 In the limit when N \u2192 \u221e with L, M \u223c O(1), it holds that \u02c6J = 0 with probability 1, and\n\nF = S + [ M(H\u03b1 \u2212 1/2) + \u02c6H(L\u03b7 \u2212 1/2) \u2212 \u03a3_{h=1}^{H} ( \u02c6M^{(h)}(\u03b1 \u2212 1/2) + \u02c6L^{(h)}(\u03b7 \u2212 1/2) ) ] log N + O_p(1).\n\nIn the limit when N, M \u2192 \u221e with M/N, L \u223c O(1), it holds that \u02c6J = o_p(log N), and\n\nF = S + [ M(H\u03b1 \u2212 1/2) \u2212 \u03a3_{h=1}^{H} \u02c6M^{(h)}(\u03b1 \u2212 1/2) ] log N + o_p(N log N).\n\nIn the limit when N, L \u2192 \u221e with L/N, M \u223c O(1), it holds that \u02c6J = o_p(log N), and\n\nF = S + HL\u03b7 log N + o_p(N log N).\n\nIn the limit when N, L, M \u2192 \u221e with L/N, M/N \u223c O(1), it holds that \u02c6J = o_p(N log N), and\n\nF = S + H(M\u03b1 + L\u03b7) log N + o_p(N\u00b2 log N).\n\nSince Eq. (17) was shown to hold, the predictive distribution converges to the true distribution if \u02c6J = 0. Accordingly, Theorem 1 states that consistency holds in the limit when N \u2192 \u221e with L, M \u223c O(1).\nTheorem 1 also implies that, in the asymptotic limits with small L \u223c O(1), the leading term depends on \u02c6H, meaning that it dominates the topic sparsity of the VB solution. We have the following corollary (the proof is given in Appendix D):\n\nCorollary 1 Let M*^{(h)} = \u03a3_{m=1}^{M} \u03ba(\u0398*_{m,h} \u223c O(1)) and L*^{(h)} = \u03a3_{l=1}^{L} \u03ba(B*_{l,h} \u223c O(1)). Consider the limit when N \u2192 \u221e with L, M \u223c O(1). When 0 < \u03b7 \u2264 1/(2L), the VB solution is sparse (i.e., \u02c6H \u226a H = min(L, M)) if \u03b1 < 1/2 \u2212 (1/2 \u2212 L\u03b7)/min_h M*^{(h)}, and dense (i.e., \u02c6H \u2248 H) if \u03b1 > 1/2 \u2212 (1/2 \u2212 L\u03b7)/max_h M*^{(h)}. When 1/(2L) < \u03b7 \u2264 1/2, the VB solution is sparse if \u03b1 < 1/2 + (L\u03b7 \u2212 1/2)/max_h M*^{(h)}, and dense if \u03b1 > 1/2 + (L\u03b7 \u2212 1/2)/min_h M*^{(h)}. When \u03b7 > 1/2, the VB solution is sparse if \u03b1 < 1/2, and dense if \u03b1 > 1/2. In the limit when N, M \u2192 \u221e with M/N, L \u223c O(1), the VB solution is sparse if \u03b1 < 1/2, and dense if \u03b1 > 1/2.\n\nTable 1: Sparsity thresholds of the VB, PBA, PBB, and MAP methods (see Theorem 2). The first four columns show the thresholds (\u03b1_sparse, \u03b1_dense), whose functional forms depend on the range of \u03b7 (0 < \u03b7 \u2264 1/(2L), 1/(2L) < \u03b7 \u2264 1/2, 1/2 < \u03b7 < 1, and 1 \u2264 \u03b7 < \u221e), in the limit when N \u2192 \u221e with L, M \u223c O(1); a single value is shown if \u03b1_sparse = \u03b1_dense. The last column shows the threshold \u03b1^{M\u2192\u221e} in the limit when N, M \u2192 \u221e with M/N, L \u223c O(1). The VB thresholds are those stated in Corollary 1 (with \u03b1^{M\u2192\u221e} = 1/2), and the PBB and MAP thresholds exceed those of VB and PBA, respectively, by 1/2.\n\nIn the case when L, M \u226a N and in the case when L \u226a M, N, Corollary 1 provides information on the sparsity of the VB solution, which will be compared with other methods in Section 3.3. On the other hand, although we have successfully derived the leading term of the free energy also in the case when M \u226a L, N and in the case when 1 \u226a L, M, N, it unfortunately provides no information on the sparsity of the solution.\n\n3.3 Asymptotic Analysis of PBA, PBB, and MAP\n\nBy applying a similar analysis to PBA learning, PBB learning, and MAP estimation, we can obtain the following theorem (the proof is given in Appendix E):\n\nTheorem 2 In the limit when N \u2192 \u221e with L, M \u223c O(1), the solution is sparse if \u03b1 < \u03b1_sparse, and dense if \u03b1 > \u03b1_dense. In the limit when N, M \u2192 \u221e with M/N, L \u223c O(1), the solution is sparse if \u03b1 < \u03b1^{M\u2192\u221e}, and dense if \u03b1 > \u03b1^{M\u2192\u221e}. Here, \u03b1_sparse, \u03b1_dense, and \u03b1^{M\u2192\u221e} are given in Table 1.\n\nA notable finding from Table 1 is that the threshold that determines the topic sparsity of PBB-LDA is (in most cases exactly) 1/2 larger than the threshold of VB-LDA. The same relation is observed between MAP-LDA and PBA-LDA. From these, we can conclude that point-estimating \u0398, instead of integrating it out, increases the threshold by 1/2 in the LDA model. We will validate this observation by numerical experiments in Section 4.\n\n3.4 Discussion\n\nThe above theoretical analysis (Theorem 2) showed that VB tends to induce weaker sparsity than MAP in the LDA model\u00b3, i.e., VB requires a sparser prior (smaller \u03b1) than MAP to give a sparse solution (mean of the posterior). 
This phenomenon is completely opposite to other models such as mixture models [24, 25, 13], hidden Markov models [11], Bayesian networks [23], and fully-observed matrix factorization [17], where VB tends to induce stronger sparsity than MAP. This phenomenon might be partly explained as follows: In the case of mixture models, the sparsity\n\n\u00b3 Although this tendency was previously pointed out [2] by using the approximation exp(\u03c8(n)) \u2248 n \u2212 1/2 and comparing the stationary conditions, our result has first clarified the sparsity behavior of the solution based on the asymptotic free energy analysis without using such an approximation.\n\nFigure 2: Estimated number \u02c6H of topics by (a) VB, (b) PBA, (c) PBB, and (d) MAP, for the artificial data with L = 100, M = 100, H* = 20, and N \u223c 10000.\n\nFigure 3: Estimated number \u02c6H of topics for the Last.FM data with L = 100, M = 100, and N \u223c 700.\n\nthreshold depends 
on the degree of freedom of a single component [24]. This is reasonable because adding a single component increases the model complexity by this amount. Also, in the case of LDA, adding a single topic requires L + 1 additional parameters. However, the added topic is shared over the M documents, which could discount the increased model complexity relative to the increased data fidelity. Corollary 1, which implies the dependency of the threshold for \u03b1 on L and M, might support this conjecture. However, the same applies to matrix factorization, where VB was shown to give a sparser solution than MAP [17]. Investigation of related models, e.g., Poisson MF [9], would help us fully explain this phenomenon.\nTechnically, our theoretical analysis is based on the previous asymptotic studies on VB learning conducted for latent variable models [24, 25, 13, 11, 23]. However, our analysis is not just a straightforward extension of those works to the LDA model. For example, the previous analyses either implicitly [24] or explicitly [13] assumed the consistency of VB learning, while we also analyzed the consistency of VB-LDA, and showed that consistency does not always hold (see Theorem 1). Moreover, we derived a general form of the asymptotic free energy, which can be applied to different asymptotic limits. Specifically, the standard asymptotic theory requires a large number N of words per document, compared to the number M of documents and the vocabulary size L. This may be reasonable for some collaborative filtering data such as the Last.FM data used in our experiments in Section 4. However, L and/or M would be comparable to or larger than N in standard text analysis. Our general form of the asymptotic free energy also allowed us to elucidate the behavior of the VB free energy when L and/or M diverges with the same order as N. 
This attempt successfully revealed the sparsity of the solution for the case when M diverges while L \u223c O(1). However, when L diverges, we found that the leading term of the free energy does not contain interesting insight into the sparsity of the solution. Higher-order asymptotic analysis will be necessary to further understand the sparsity-inducing mechanism of the LDA model with a large vocabulary.\n\n4 Numerical Illustration\n\nIn this section, we conduct numerical experiments on artificial and real data for collaborative filtering.\nThe artificial data were created as follows. We first sample the true document matrix \u0398* of size M \u00d7 H* and the true topic matrix B* of size L \u00d7 H*. We assume that each row ~\u03b8*_m of \u0398* follows the Dirichlet distribution with \u03b1* = 1/H*, while each column \u03b2*_h of B* follows the Dirichlet distribution with \u03b7* = 1/L. The document length N^{(m)} is sampled from the Poisson distribution with its mean N. The word histogram N^{(m)} v_m for each document is sampled from the multinomial\n\nFigure 4: Estimated number \u02c6H of topics by VB-LDA for the artificial data with H* = 20 and N \u223c 10000. 
For the case when L = 500, M = 1000, the maximum estimated rank is limited to 100 for computational reasons.\n\ndistribution with the parameter specified by the m-th column vector of B*\u0398*^\u22a4. Thus, we obtain the L \u00d7 M matrix V, which corresponds to the empirical word distribution over the M documents.\nAs a real-world dataset, we used the Last.FM dataset.\u2074 Last.FM is a well-known social music web site, and the dataset includes the triples (\u201cuser,\u201d \u201cartist,\u201d \u201cFreq\u201d), collected from the playlists of users in the community by using a plug-in in users\u2019 media players. Each triple means that \u201cuser\u201d played \u201cartist\u201d music \u201cFreq\u201d times, which indicates the user\u2019s preferred artists. A user and a played artist are analogous to a document and a word, respectively. We randomly chose L artists from the top 1000 frequent artists, and M users who live in the United States. To find a better local solution (which hopefully is close to the global solution), we adopted a split and merge strategy [22], and chose the local solution giving the lowest free energy among different initialization schemes.\nFigure 2 shows the estimated number \u02c6H of topics obtained by the different approximation methods, i.e., VB, PBA, PBB, and MAP, for the artificial data with L = 100, M = 100, H* = 20, and N \u223c 10000. We can clearly see that the sparsity threshold in PBB and MAP, where \u0398 is point-estimated, is larger than that in VB and PBA, where \u0398 is marginalized. This result supports the statement of Theorem 2. Figure 3 shows results on the Last.FM data with L = 100, M = 100, and N \u223c 700. We see a similar tendency to Figure 2, except in the region where \u03b7 < 1 for PBA, in which our theory does not predict the estimated number of topics.\nFinally, we investigate how different asymptotic settings affect the topic sparsity. 
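The artificial-data generation described above can be sketched with the Python standard library alone: Dirichlet draws via normalized Gamma variates and Poisson lengths via Knuth's multiplication method. The function names and the small default sizes below are our own choices, not the paper's.

```python
import math
import random

def dirichlet(conc, k, rng):
    # Symmetric Dirichlet(conc) sample as normalized Gamma(conc, 1) draws.
    g = [rng.gammavariate(conc, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def poisson(lam, rng):
    # Knuth's multiplication method; fine for moderate lam.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def make_corpus(L, M, H_true, N_mean, seed=0):
    rng = random.Random(seed)
    theta = [dirichlet(1.0 / H_true, H_true, rng) for _ in range(M)]  # rows of Theta*
    beta = [dirichlet(1.0 / L, L, rng) for _ in range(H_true)]        # columns of B*
    docs = []
    for m in range(M):
        # Word distribution of document m: the m-th column of B* Theta*^T.
        p = [sum(theta[m][h] * beta[h][l] for h in range(H_true)) for l in range(L)]
        n_m = max(1, poisson(N_mean, rng))
        counts = [0] * L
        for l in rng.choices(range(L), weights=p, k=n_m):
            counts[l] += 1
        docs.append(counts)  # the word histogram N^{(m)} v_m
    return docs
```

The returned histograms form the columns of the L x M matrix V (after dividing by the document lengths), matching the setup used in the experiments.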
Figure 4 shows the sparsity dependence on L and M for the artificial data. The graphs correspond to the four cases mentioned in Theorem 1, i.e., (a) L, M \u226a N, (b) L \u226a N, M, (c) M \u226a N, L, and (d) 1 \u226a N, L, M. Corollary 1 explains the behavior in (a) and (b), and further analysis is required to explain the behavior in (c) and (d).\n\n5 Conclusion\n\nIn this paper, we considered variational Bayesian (VB) learning in the latent Dirichlet allocation (LDA) model and analytically derived the leading term of the asymptotic free energy. When the vocabulary size is small, our result theoretically explains the phase-transition phenomenon. On the other hand, when the vocabulary size is as large as the number of words per document, the leading term tells us nothing about sparsity; more accurate analysis is needed to clarify the sparsity in such cases.\nThroughout the paper, we assumed that the hyperparameters \u03b1 and \u03b7 are pre-fixed. However, \u03b1 would often be estimated for each topic h, which is one of the advantages of using the LDA model in practice [5]. In future work, we will extend the current line of analysis to the empirical Bayesian setting where the hyperparameters are also learned, and further elucidate the behavior of the LDA model.\n\nAcknowledgments\nThe authors thank the reviewers for helpful comments. Shinichi Nakajima thanks the support from Nikon Corporation, MEXT Kakenhi 23120004, and the Berlin Big Data Center project (FKZ 01IS14013A). Masashi Sugiyama thanks the support from the JST CREST program. Kazuho Watanabe thanks the support from JSPS Kakenhi 23700175 and 25120014.\n\n\u2074http://mtg.upf.edu/node/1671\n\nReferences\n[1] H. Alzer. On some inequalities for the Gamma and Psi functions. Mathematics of Computation, 66(217):373\u2013389, 1997.\n[2] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In Proc. of UAI, pages 27\u201334, 2009.\n[3] H. 
Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proc. of UAI, pages 21\u201330, 1999.\n[4] M. Bicego, P. Lovato, A. Ferrarini, and M. Delledonne. Biclustering of expression microarray data with topic models. In Proc. of ICPR, pages 2728\u20132731, 2010.\n[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n[6] X. Chen, X. Hu, X. Shen, and G. Rosen. Probabilistic topic modeling for genomic data interpretation. In 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 149\u2013152, 2010.\n[7] Z. Ghahramani and M. J. Beal. Graphical models and variational methods. In Advanced Mean Field Methods, pages 161\u2013177. MIT Press, 2001.\n[8] M. Girolami and A. Kaban. On an equivalence between PLSI and LDA. In Proc. of SIGIR, pages 433\u2013434, 2003.\n[9] P. Gopalan, J. M. Hofman, and D. M. Blei. Scalable recommendation with Poisson factorization. arXiv:1311.1704 [cs.IR], 2013.\n[10] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42:177\u2013196, 2001.\n[11] T. Hosino, K. Watanabe, and S. Watanabe. Stochastic complexity of hidden Markov models on the variational Bayesian learning. IEICE Trans. on Information and Systems, J89-D(6):1279\u20131287, 2006.\n[12] T. Huynh, M. Fritz, and B. Schiele. Discovery of activity patterns using topic models. In International Conference on Ubiquitous Computing (UbiComp), 2008.\n[13] D. Kaji, K. Watanabe, and S. Watanabe. Phase transition of variational Bayes learning in Bernoulli mixture. Australian Journal of Intelligent Information Processing Systems, 11(4):35\u201340, 2010.\n[14] R. Krestel, P. Fankhauser, and W. Nejdl. Latent Dirichlet allocation for tag recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, pages 61\u201368, 2009.\n[15] F.-F. Li and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. of CVPR, pages 524\u2013531, 2005.\n[16] I. Mukherjee and D. M. Blei. Relative performance guarantees for approximate inference in latent Dirichlet allocation. In Advances in NIPS, 2008.\n[17] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12:2579\u20132644, 2011.\n[18] S. Nakajima, M. Sugiyama, and S. D. Babacan. On Bayesian PCA: Automatic dimensionality selection and analytic solution. In Proc. of ICML, pages 497\u2013504, 2011.\n[19] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.\n[20] S. Purushotham, Y. Liu, and C. C. J. Kuo. Collaborative topic regression with social matrix factorization for recommendation systems. In Proc. of ICML, 2012.\n[21] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in NIPS, 2007.\n[22] N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. SMEM algorithm for mixture models. Neural Computation, 12(9):2109\u20132128, 2000.\n[23] K. Watanabe, M. Shiga, and S. Watanabe. Upper bound for variational free energy of Bayesian networks. Machine Learning, 75(2):199\u2013215, 2009.\n[24] K. Watanabe and S. Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7:625\u2013644, 2006.\n[25] K. Watanabe and S. Watanabe. Stochastic complexities of general mixture models in variational Bayesian learning. Neural Networks, 20(2):210\u2013219, 2007.\n[26] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proc. of SIGIR, pages 178\u2013185, 2006.\n", "award": [], "sourceid": 694, "authors": [{"given_name": "Shinichi", "family_name": "Nakajima", "institution": "Technische Universit\u00e4t Berlin"}, {"given_name": "Issei", "family_name": "Sato", "institution": "University of Tokyo"}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "The University of Tokyo"}, {"given_name": "Kazuho", "family_name": "Watanabe", "institution": "Toyohashi Tech"}, {"given_name": "Hiroko", "family_name": "Kobayashi", "institution": "Nikon Corporation"}]}