{"title": "Supervised Topic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 121, "page_last": 128, "abstract": "We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.", "full_text": "Supervised topic models\n\nDavid M. Blei\n\nDepartment of Computer Science\n\nPrinceton University\n\nPrinceton, NJ\n\nblei@cs.princeton.edu\n\nJon D. McAuliffe\n\nDepartment of Statistics\n\nUniversity of Pennsylvania,\n\nWharton School\nPhiladelphia, PA\n\nmcjon@wharton.upenn.edu\n\nAbstract\n\nWe introduce supervised latent Dirichlet allocation (sLDA), a statistical model of\nlabelled documents. The model accommodates a variety of response types. We\nderive a maximum-likelihood procedure for parameter estimation, which relies on\nvariational approximations to handle intractable posterior expectations. Prediction\nproblems motivate this research: we use the \ufb01tted model to predict response values\nfor new documents. 
We test sLDA on two real-world problems: movie ratings\npredicted from reviews, and web page popularity predicted from text descriptions.\nWe illustrate the bene\ufb01ts of sLDA versus modern regularized regression, as well\nas versus an unsupervised LDA analysis followed by a separate regression.\n\n1 Introduction\n\nThere is a growing need to analyze large collections of electronic text. The complexity of document\ncorpora has led to considerable interest in applying hierarchical statistical models based on what are\ncalled topics. Formally, a topic is a probability distribution over terms in a vocabulary. Informally,\na topic represents an underlying semantic theme; a document consisting of a large number of words\nmight be concisely modelled as deriving from a smaller number of topics. Such topic models provide\nuseful descriptive statistics for a collection, which facilitates tasks like browsing, searching, and\nassessing document similarity.\nMost topic models, such as latent Dirichlet allocation (LDA) [4], are unsupervised: only the words\nin the documents are modelled. The goal is to infer topics that maximize the likelihood (or the pos-\nterior probability) of the collection. In this work, we develop supervised topic models, where each\ndocument is paired with a response. The goal is to infer latent topics predictive of the response.\nGiven an unlabeled document, we infer its topic structure using a \ufb01tted model, then form its pre-\ndiction. Note that the response is not limited to text categories. Other kinds of document-response\ncorpora include essays with their grades, movie reviews with their numerical ratings, and web pages\nwith counts of how many online community members liked them.\nUnsupervised LDA has previously been used to construct features for classi\ufb01cation. The hope was\nthat LDA topics would turn out to be useful for categorization, since they act to reduce data di-\nmension [4]. 
However, when the goal is prediction, \ufb01tting unsupervised topics may not be a good\nchoice. Consider predicting a movie rating from the words in its review. Intuitively, good predictive\ntopics will differentiate words like \u201cexcellent\u201d, \u201cterrible\u201d, and \u201caverage,\u201d without regard to genre.\nBut topics estimated from an unsupervised model may correspond to genres, if that is the dominant\nstructure in the corpus.\nThe distinction between unsupervised and supervised topic models is mirrored in existing\ndimension-reduction techniques. For example, consider regression on unsupervised principal com-\nponents versus partial least squares and projection pursuit [7], which both search for covariate linear\ncombinations most predictive of a response variable. These linear supervised methods have non-\n\n1\n\n\fparametric analogs, such as an approach based on kernel ICA [6]. In text analysis, McCallum et al.\ndeveloped a joint topic model for words and categories [8], and Blei and Jordan developed an LDA\nmodel to predict caption words from images [2]. In chemogenomic pro\ufb01ling, Flaherty et al. [5]\nproposed \u201clabelled LDA,\u201d which is also a joint topic model, but for genes and protein function\ncategories. It differs fundamentally from the model proposed here.\nThis paper is organized as follows. We \ufb01rst develop the supervised latent Dirichlet allocation model\n(sLDA) for document-response pairs. We derive parameter estimation and prediction algorithms for\nthe real-valued response case. Then we extend these techniques to handle diverse response types,\nusing generalized linear models. We demonstrate our approach on two real-world problems. First,\nwe use sLDA to predict movie ratings based on the text of the reviews. Second, we use sLDA to\npredict the number of \u201cdiggs\u201d that a web page will receive in the www.digg.com community, a\nforum for sharing web content of mutual interest. 
The digg count prediction for a page is based\non the page\u2019s description in the forum. In both settings, we \ufb01nd that sLDA provides much more\npredictive power than regression on unsupervised LDA features. The sLDA approach also improves\non the lasso, a modern regularized regression technique.\n\n2 Supervised latent Dirichlet allocation\n\nIn topic models, we treat the words of a document as arising from a set of latent topics, that is, a\nset of unknown distributions over the vocabulary. Documents in a corpus share the same set of K\ntopics, but each document uses a mix of topics unique to itself. Thus, topic models are a relaxation\nof classical document mixture models, which associate each document with a single unknown topic.\nHere we build on latent Dirichlet allocation (LDA) [4], a topic model that serves as the basis for\nmany others. In LDA, we treat the topic proportions for a document as a draw from a Dirichlet\ndistribution. We obtain the words in the document by repeatedly choosing a topic assignment from\nthose proportions, then drawing a word from the corresponding topic.\nIn supervised latent Dirichlet allocation (sLDA), we add to LDA a response variable associated\nwith each document. As mentioned, this variable might be the number of stars given to a movie, a\ncount of the users in an on-line community who marked an article interesting, or the category of a\ndocument. We jointly model the documents and the responses, in order to \ufb01nd latent topics that will\nbest predict the response variables for future unlabeled documents.\nWe emphasize that sLDA accommodates various types of response: unconstrained real values, real\nvalues constrained to be positive (e.g., failure times), ordered or unordered class labels, nonnegative\nintegers (e.g., count data), and other types. However, the machinery used to achieve this generality\ncomplicates the presentation. 
So we first give a complete derivation of sLDA for the special case of an unconstrained real-valued response. Then, in Section 2.3, we present the general version of sLDA, and explain how it handles diverse response types.
Focus now on the case y ∈ R. Fix for a moment the model parameters: the K topics β_{1:K} (each β_k a vector of term probabilities), the Dirichlet parameter α, and the response parameters η and σ². Under the sLDA model, each document and response arises from the following generative process:

1. Draw topic proportions θ | α ∼ Dir(α).
2. For each word
   (a) Draw topic assignment z_n | θ ∼ Mult(θ).
   (b) Draw word w_n | z_n, β_{1:K} ∼ Mult(β_{z_n}).
3. Draw response variable y | z_{1:N}, η, σ² ∼ N(η^T z̄, σ²).

Here we define z̄ := (1/N) Σ_{n=1}^N z_n. The family of probability distributions corresponding to this generative process is depicted as a graphical model in Figure 1.
Notice the response comes from a normal linear model. The covariates in this model are the (unobserved) empirical frequencies of the topics in the document. The regression coefficients on those frequencies constitute η. Note that a linear model usually includes an intercept term, which amounts to adding a covariate that always equals one. Here, such a term is redundant, because the components of z̄ always sum to one.
By regressing the response on the empirical topic frequencies, we treat the response as non-exchangeable with the words. The document (i.e., words and their topic assignments) is generated first, under full word exchangeability; then, based on the document, the response variable is generated. In contrast, one could formulate a model in which y is regressed on the topic proportions θ. This treats the response and all the words as jointly exchangeable. 
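As a concrete illustration, the generative process above can be simulated directly. The following is a minimal numpy sketch with made-up dimensions; the function and variable names are ours, not from any released implementation:

```python
import numpy as np

def simulate_slda_document(alpha, beta, eta, sigma2, N, rng):
    """Sample one (document, response) pair from the sLDA generative process.

    alpha: Dirichlet parameter (length K); beta: K x V topic-word matrix;
    eta: length-K regression coefficients; sigma2: response variance;
    N: number of words in the document.
    """
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                  # 1. topic proportions
    z = rng.choice(K, size=N, p=theta)            # 2(a). topic assignments
    words = np.array([rng.choice(V, p=beta[k]) for k in z])  # 2(b). words
    z_bar = np.bincount(z, minlength=K) / N       # empirical topic frequencies
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))  # 3. y ~ N(eta' z_bar, sigma2)
    return words, y

rng = np.random.default_rng(0)
K, V = 3, 10
beta = rng.dirichlet(np.ones(V), size=K)          # illustrative topics
words, y = simulate_slda_document(np.ones(K), beta,
                                  np.array([1.0, -1.0, 0.0]), 0.1, 50, rng)
```

Note that the response is drawn from z̄, the realized topic frequencies, rather than from θ, which is the non-exchangeability discussed next.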
But as a practical matter, our chosen formulation seems more sensible: the response depends on the topic frequencies which actually occurred in the document, rather than on the mean of the distribution generating the topics. Moreover, estimating a fully exchangeable model with enough topics allows some topics to be used entirely to explain the response variables, and others to be used to explain the word occurrences. This degrades predictive performance, as demonstrated in [2].
We treat α, β_{1:K}, η, and σ² as unknown constants to be estimated, rather than random variables. We carry out approximate maximum-likelihood estimation using a variational expectation-maximization (EM) procedure, which is the approach taken in unsupervised LDA as well [4].

2.1 Variational E-step

Given a document and response, the posterior distribution of the latent variables is

p(θ, z_{1:N} | w_{1:N}, y, α, β_{1:K}, η, σ²)
  = p(θ | α) ( Π_{n=1}^N p(z_n | θ) p(w_n | z_n, β_{1:K}) ) p(y | z_{1:N}, η, σ²)
    / ∫ dθ p(θ | α) Σ_{z_{1:N}} ( Π_{n=1}^N p(z_n | θ) p(w_n | z_n, β_{1:K}) ) p(y | z_{1:N}, η, σ²) .   (1)

The normalizing value is the marginal probability of the observed data, i.e., the document w_{1:N} and response y. This normalizer is also known as the likelihood, or the evidence. As with LDA, it is not efficiently computable. Thus, we appeal to variational methods to approximate the posterior.
Variational objective function. 
We maximize the evidence lower bound (ELBO) L(·), which for a single document has the form

log p(w_{1:N}, y | α, β_{1:K}, η, σ²) ≥ L(γ, φ_{1:N}; α, β_{1:K}, η, σ²)
  = E[log p(θ | α)] + Σ_{n=1}^N E[log p(Z_n | θ)] + Σ_{n=1}^N E[log p(w_n | Z_n, β_{1:K})]
    + E[log p(y | Z_{1:N}, η, σ²)] + H(q) .   (2)

Here the expectation is taken with respect to a variational distribution q. We choose the fully factorized distribution,

q(θ, z_{1:N} | γ, φ_{1:N}) = q(θ | γ) Π_{n=1}^N q(z_n | φ_n),   (3)

where γ is a K-dimensional Dirichlet parameter vector and each φ_n parametrizes a categorical distribution over K elements. Notice E[Z_n] = φ_n.

Figure 1: (Left) A graphical model representation of supervised latent Dirichlet allocation. (Bottom) The topics of a 10-topic sLDA model fit to the movie review data of Section 3.

The first three terms and the entropy of the variational distribution are identical to the corresponding terms in the ELBO for unsupervised LDA [4]. The fourth term is the expected log probability of the response variable given the latent topic assignments,

E[log p(y | Z_{1:N}, η, σ²)] = −(1/2) log(2πσ²) − ( y² − 2y η^T E[Z̄] + η^T E[Z̄ Z̄^T] η ) / (2σ²).   (4)

The first expectation is E[Z̄] = φ̄ := (1/N) Σ_{n=1}^N φ_n, and the second expectation is

E[Z̄ Z̄^T] = (1/N²) ( Σ_{n=1}^N Σ_{m≠n} φ_n φ_m^T + Σ_{n=1}^N diag{φ_n} ).   (5)

To see (5), notice that for m ≠ n, E[Z_n Z_m^T] = E[Z_n] E[Z_m]^T = φ_n φ_m^T because the variational distribution is fully factorized. On the other hand, E[Z_n Z_n^T] = diag(E[Z_n]) = diag(φ_n) because Z_n is an indicator vector.
For a single document-response pair, we maximize (2) with respect to φ_{1:N} and γ to obtain an estimate of the posterior. We use block coordinate-ascent variational inference, maximizing with respect to each variational parameter vector in turn.
Optimization with respect to γ. The terms that involve the variational Dirichlet γ are identical to those in unsupervised LDA, i.e., they do not involve the response variable y. Thus, the coordinate ascent update is as in [4],

γ_new ← α + Σ_{n=1}^N φ_n.   (6)

Optimization with respect to φ_j. Fix j ∈ {1, . . . , N} and define φ_{−j} := Σ_{n≠j} φ_n. In [3], we maximize the Lagrangian of the ELBO, which incorporates the constraint that the components of φ_j sum to one, and obtain the coordinate update

φ_j_new ∝ exp{ E[log θ | γ] + E[log p(w_j | β_{1:K})] + ( y / (Nσ²) ) η − [ 2(η^T φ_{−j}) η + (η ∘ η) ] / (2N²σ²) }.   (7)

Note that E[log θ_i | γ] = Ψ(γ_i) − Ψ(Σ_j γ_j), where Ψ(·) is the digamma function. Exponentiating a vector means forming the vector of exponentials. The proportionality symbol means the components of φ_j_new are computed according to (7), then normalized to sum to one.
The central difference between LDA and sLDA lies in this update. As in LDA, the jth word's variational distribution over topics depends on the word's topic probabilities under the actual model (determined by β_{1:K}). But w_j's variational distribution, and those of all other words, affect the probability of the response, through the expected residual sum of squares (RSS), which is the second term in (4). The end result is that the update (7) also encourages φ_j to decrease this expected RSS.
The update (7) depends on the variational parameters φ_{−j} of all other words. Thus, unlike LDA, the φ_j cannot be updated in parallel. Distinct occurrences of the same term are treated separately.

2.2 M-step and prediction

The corpus-level ELBO lower bounds the joint log likelihood across documents, which is the sum of the per-document log-likelihoods. In the E-step, we estimate the approximate posterior distribution for each document-response pair using the variational inference algorithm described above. In the M-step, we maximize the corpus-level ELBO with respect to the model parameters β_{1:K}, η, and σ². For our purposes, it suffices simply to fix α to 1/K times the ones vector. In this section, we add document indexes to the previous section's quantities, so y becomes y_d and Z̄ becomes Z̄_d.
Estimating the topics. 
The M-step updates of the topics β_{1:K} are the same as for unsupervised LDA, where the probability of a word under a topic is proportional to the expected number of times that it was assigned to that topic [4],

β̂_{k,w}_new ∝ Σ_{d=1}^D Σ_{n=1}^N 1(w_{d,n} = w) φ_{d,n}^k.   (8)

Here again, proportionality means that each β̂_k_new is normalized to sum to one.
Estimating the regression parameters. The only terms of the corpus-level ELBO involving η and σ² come from the corpus-level analog of (4). Define y = y_{1:D} as the vector of response values across documents. Let A be the D × (K + 1) matrix whose rows are the vectors Z̄_d^T. Then the corpus-level version of (4) is

E[log p(y | A, η, σ²)] = −(D/2) log(2πσ²) − (1/(2σ²)) E[ (y − Aη)^T (y − Aη) ].   (9)

Here the expectation is over the matrix A, using the variational distribution parameters chosen in the previous E-step. Expanding the inner product, using linearity of expectation, and applying the first-order condition for η, we arrive at an expected-value version of the normal equations:

E[A^T A] η = E[A]^T y   ⇒   η̂_new ← ( E[A^T A] )^{−1} E[A]^T y .   (10)

Note that the dth row of E[A] is just φ̄_d, and all these average vectors were fixed in the previous E-step. Also, E[A^T A] = Σ_d E[Z̄_d Z̄_d^T], with each term having a fixed value from the previous E-step as well, given by (5). We caution again: formulas in the previous section, such as (5), suppress the document indexes which appear here.
We now apply the first-order condition for σ² to (9) and evaluate the solution at η̂_new, obtaining:

σ̂²_new ← (1/D) { y^T y − y^T E[A] ( E[A^T A] )^{−1} E[A]^T y } .   (11)

Prediction. Our focus in applying sLDA is prediction. Specifically, we wish to compute the expected response value, given a new document w_{1:N} and a fitted model {α, β_{1:K}, η, σ²}:

E[Y | w_{1:N}, α, β_{1:K}, η, σ²] = η^T E[Z̄ | w_{1:N}, α, β_{1:K}].   (12)

The identity follows easily from iterated expectation. We approximate the posterior mean of Z̄ using the variational inference procedure of the previous section. But here, the terms depending on y are removed from the φ_j update in (7). Notice this is the same as variational inference for unsupervised LDA: since we averaged the response variable out of the right-hand side in (12), what remains is the standard unsupervised LDA model for Z_{1:N} and θ.
Thus, given a new document, we first compute E_q[Z_{1:N}], the variational posterior distribution of the latent variables Z_n. Then, we estimate the response with

E[Y | w_{1:N}, α, β_{1:K}, η, σ²] ≈ η^T E_q[Z̄] = η^T φ̄.   (13)

2.3 Diverse response types via generalized linear models

Up to this point, we have confined our attention to an unconstrained real-valued response variable. In many applications, however, we need to predict a categorical label, or a non-negative integral count, or a response with other kinds of constraints. 
Sometimes it is reasonable to apply a normal linear model to a suitably transformed version of such a response. When no transformation results in approximate normality, statisticians often make use of a generalized linear model, or GLM [9]. In this section, we describe sLDA in full generality, replacing the normal linear model of the earlier exposition with a GLM formulation. As we shall see, the result is a generic framework which can be specialized in a straightforward way to supervised topic models having a variety of response types.
There are two main ingredients in a GLM: the "random component" and the "systematic component." For the random component, one takes the distribution of the response to be an exponential dispersion family with natural parameter ζ and dispersion parameter δ:

p(y | ζ, δ) = h(y, δ) exp{ ( ζy − A(ζ) ) / δ } .   (14)

For each fixed δ, (14) is an exponential family, with base measure h(y, δ), sufficient statistic y, and log-normalizer A(ζ). The dispersion parameter provides additional flexibility in modeling the variance of y. Note that (14) need not be an exponential family jointly in (ζ, δ).
In the systematic component of the GLM, we relate the exponential-family parameter ζ of the random component to a linear combination of covariates – the so-called linear predictor. For sLDA, the linear predictor is η^T z̄. In fact, we simply set ζ = η^T z̄. 
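For concreteness, the normal distribution is an instance of the dispersion form (14), with h(y, δ) = (1/√(2πδ)) exp{−y²/(2δ)} and A(ζ) = ζ²/2, so that µ = ζ and σ² = δ. A quick numeric check of this identity (a sketch in Python; the function names are ours):

```python
import math

def normal_as_dispersion_family(y, zeta, delta):
    """Density of y under (14) with the normal choices of h(y, delta) and A(zeta)."""
    h = math.exp(-y**2 / (2 * delta)) / math.sqrt(2 * math.pi * delta)
    A = zeta**2 / 2
    return h * math.exp((zeta * y - A) / delta)

def normal_pdf(y, mu, var):
    """Ordinary N(mu, var) density for comparison."""
    return math.exp(-(y - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The two parameterizations agree, with mu = zeta and sigma^2 = delta.
p1 = normal_as_dispersion_family(0.7, zeta=1.2, delta=0.5)
p2 = normal_pdf(0.7, mu=1.2, var=0.5)
```

Completing the square in the exponent shows the agreement is exact, not approximate.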
Thus, in the general version of sLDA, the previous specification in step 3 of the generative process is replaced with

y | z_{1:N}, η, δ ∼ GLM(z̄, η, δ) ,   (15)

so that

p(y | z_{1:N}, η, δ) = h(y, δ) exp{ ( η^T (z̄ y) − A(η^T z̄) ) / δ } .   (16)

The reader familiar with GLMs will recognize that our choice of systematic component means sLDA uses only canonical link functions. In future work, we will relax this constraint.
We now have the flexibility to model any type of response variable whose distribution can be written in exponential dispersion form (14). As is well known, this includes many commonly used distributions: the normal; the binomial (for binary response); the Poisson and negative binomial (for count data); the gamma, Weibull, and inverse Gaussian (for failure time data); and others. Each of these distributions corresponds to a particular choice of h(y, δ) and A(ζ). For example, it is easy to show that the normal distribution corresponds to h(y, δ) = (1/√(2πδ)) exp{−y²/(2δ)} and A(ζ) = ζ²/2. In this case, the usual parameters µ and σ² just equal ζ and δ, respectively.
Variational E-step. The distribution of y appears only in the cross-entropy term (4). Its form under the GLM is

E[log p(y | Z_{1:N}, η, δ)] = log h(y, δ) + (1/δ) [ η^T (E[Z̄] y) − E[A(η^T Z̄)] ] .   (17)

This changes the coordinate ascent step for each φ_j, but the variational optimization is otherwise unaffected. In particular, the gradient of the ELBO with respect to φ_j becomes

∂L/∂φ_j = E[log θ | γ] + E[log p(w_j | β_{1:K})] − log φ_j + 1 + ( y / (Nδ) ) η − (1/δ) (∂/∂φ_j) E[A(η^T Z̄)] .   (18)

Thus, the key to variational inference in sLDA is obtaining the gradient of the expected GLM log-normalizer. Sometimes there is an exact expression, such as the normal case of Section 2. As another example, the Poisson GLM leads to an exact gradient, which we omit for brevity.
Other times, no exact gradient is available. In a longer paper [3], we study two methods for this situation. First, we can replace −E[A(η^T Z̄)] with an adjustable lower bound whose gradient is known exactly; then we maximize over the original variational parameters plus the parameter controlling the bound. Alternatively, an application of the multivariate delta method for moments [1], plus standard exponential family theory, shows

E[A(η^T Z̄)] ≈ A(η^T φ̄) + Var_GLM(Y | ζ = η^T φ̄) · η^T Var_q(Z̄) η .   (19)

Here, Var_GLM denotes the response variance under the GLM, given a specified value of the natural parameter—in all standard cases, this variance is a closed-form function of φ_j. The variance-covariance matrix of Z̄ under q is already known in closed form from E[Z̄] and (5). Thus, computing ∂/∂φ_j of (19) exactly is mechanical. However, using this approximation gives up the usual guarantee that the ELBO lower bounds the marginal likelihood. We forgo details and further examples due to space constraints.
The GLM contribution to the gradient determines whether the φ_j coordinate update itself has a closed form, as it does in the normal case (7) and the Poisson case (omitted). If the update is not closed-form, we use numerical optimization, supplying a gradient obtained from one of the methods described in the previous paragraph.
Parameter estimation (M-step). The topic parameter estimates are given by (8), as before. For the corpus-level ELBO, the gradient with respect to η becomes

(∂/∂η) (1/δ) Σ_{d=1}^D { η^T φ̄_d y_d − E[A(η^T Z̄_d)] } = (1/δ) { Σ_{d=1}^D φ̄_d y_d − Σ_{d=1}^D E_q[ µ(η^T Z̄_d) Z̄_d ] } .   (20)

The appearance of µ(·) = E_GLM[Y | ζ = ·] follows from exponential family properties. This GLM mean response is a known function of η^T Z̄_d in all standard cases. However, E_q[µ(η^T Z̄_d) Z̄_d] has an exact solution only in some cases (e.g. normal, Poisson). In other cases, we approximate the expectation with methods similar to those applied for the φ_j coordinate update. Reference [3] has details, including estimation of δ and prediction, where we encounter the same issues.
The derivative with respect to δ, evaluated at η̂_new, is

Σ_{d=1}^D ( ∂h(y_d, δ)/∂δ ) / h(y_d, δ) − (1/δ²) η̂_new^T { Σ_{d=1}^D φ̄_d y_d − Σ_{d=1}^D E_q[ µ(η̂_new^T Z̄_d) Z̄_d ] } .   (21)

Given that the rightmost summation has been evaluated, exactly or approximately, during the η optimization, (21) has a closed form. Depending on h(y, δ) and its partial with respect to δ, we obtain δ̂_new either in closed form or via one-dimensional numerical optimization.

Figure 2: Predictive R2 and per-word likelihood for the movie and Digg data (see Section 3).

Prediction. We form predictions just as in Section 2.2. 
The difference is that we now approximate\nthe expected response value of a test document as\n\nE[Y | w1:N , \u03b1, \u03b21:K , \u03b7, \u03b4] \u2248 Eq[\u00b5(\u03b7> \u00afZ )].\n\n(22)\n\nAgain, this follows from iterated expectation plus the variational approximation. When the varia-\ntional expectation cannot be computed exactly, we apply the approximation methods we relied on\nfor the GLM E-step and M-step. We defer speci\ufb01cs to [3].\n\n3 Empirical results\n\nWe evaluated sLDA on two prediction problems. First, we consider \u201csentiment analysis\u201d of news-\npaper movie reviews. We use the publicly available data introduced in [10], which contains movie\nreviews paired with the number of stars given. While Pang and Lee treat this as a classi\ufb01cation\nproblem, we treat it as a regression problem. With a 5000-term vocabulary chosen by tf-idf, the\ncorpus contains 5006 documents and comprises 1.6M words.\nSecond, we introduce the problem of predicting web page popularity on Digg.com. Digg is a com-\nmunity of users who share links to pages by submitting them to the Digg homepage, with a short\ndescription. Once submitted, other users \u201cdigg\u201d the links they like. Links are sorted on the Digg\nhomepage by the number of diggs they have received. Our Digg data set contains a year of link\ndescriptions, paired with the number of diggs each received during its \ufb01rst week on the homepage.\n(This corpus will be made publicly available at publication.) We restrict our attention to links in the\ntechnology category. After trimming the top ten outliers, and using a 4145-term vocabulary chosen\nby tf-idf, the Digg corpus contains 4078 documents and comprises 94K words.\nFor both sets of response variables, we transformed to approximate normality by taking logs. This\nmakes the data amenable to the continuous-response model of Section 2; for these two problems,\ngeneralized linear modeling turned out to be unnecessary. 
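Since the responses are modelled on the log scale, a point prediction from (13) must be exponentiated to return to the original scale. A minimal sketch with illustrative numbers (not fitted values):

```python
import numpy as np

# Hypothetical fitted quantities: eta would come from the M-step, and phi_bar
# from running the unsupervised-style variational inference of Section 2.2
# on a new document. Both vectors here are made up for illustration.
eta = np.array([0.8, -0.5, 0.1])     # regression coefficients on topic frequencies
phi_bar = np.array([0.6, 0.3, 0.1])  # averaged variational topic weights, sums to 1

log_y_hat = eta @ phi_bar            # eq. (13): expected response on the log scale
y_hat = np.exp(log_y_hat)            # undo the log transform of Section 3
```

The same inversion applies to both corpora, since star ratings and digg counts were each log-transformed before fitting.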
We initialized β_{1:K} to uniform topics, σ² to the sample variance of the response, and η to a grid on [−1, 1] in increments of 2/K. We ran EM until the relative change in the corpus-level likelihood bound was less than 0.01%. In the E-step, we ran coordinate-ascent variational inference for each document until the relative change in the per-document ELBO was less than 0.01%. For the movie review data set, we illustrate in Figure 1 a matching of the top words from each topic to the corresponding coefficient η_k.
We assessed the quality of the predictions with "predictive R2." In our 5-fold cross-validation (CV), we defined this quantity as the fraction of variability in the out-of-fold response values which is captured by the out-of-fold predictions: pR2 := 1 − ( Σ (y − ŷ)² ) / ( Σ (y − ȳ)² ).
We compared sLDA to linear regression on the φ̄_d from unsupervised LDA. This is the regression equivalent of using LDA topics as classification features [4]. Figure 2 (L) illustrates that sLDA provides improved predictions on both data sets. 
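The predictive R2 defined above can be computed as follows. This is a sketch; we take ȳ to be the mean of the held-out responses, which the text leaves implicit:

```python
import numpy as np

def predictive_r2(y_true, y_pred):
    """pR2 = 1 - sum((y - yhat)^2) / sum((y - ybar)^2), per Section 3.

    ybar is taken as the mean of y_true (the held-out responses); this is an
    assumption on our part, since the paper does not spell it out.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Perfect predictions give pR2 = 1; predicting the mean gives pR2 = 0.
y = [1.0, 2.0, 3.0, 4.0]
```

In the cross-validation above, this quantity would be computed on each held-out fold's responses and out-of-fold predictions.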
Moreover, this improvement does not come at the cost of document model quality. The per-word hold-out likelihood comparison in Figure 2 (R) shows that sLDA fits the document data as well as or better than LDA. Note that Digg prediction is significantly harder than the movie review sentiment prediction, and that the homogeneity of Digg technology content leads the model to favor a small number of topics.
Finally, we compared sLDA to the lasso, which is L1-regularized least-squares regression. The lasso is a widely used prediction method for high-dimensional problems. We used each document's empirical distribution over words as its lasso covariates, setting the lasso complexity parameter with 5-fold CV. On Digg data, the lasso's optimal model complexity yielded a CV pR2 of 0.088. The best sLDA pR2 was 0.095, an 8.0% relative improvement. On movie data, the best lasso pR2 was 0.457, versus 0.500 for sLDA, a 9.4% relative improvement. Note moreover that the lasso provides only a prediction rule, whereas sLDA models latent structure useful for other purposes.

4 Discussion

We have developed sLDA, a statistical model of labelled documents. The model accommodates the different types of response variable commonly encountered in practice. We presented a variational procedure for approximate posterior inference, which we then incorporated in an EM algorithm for maximum-likelihood parameter estimation. We studied the model's predictive performance on two real-world problems. In both cases, we found that sLDA moderately improved on the lasso, a state-of-the-art regularized regression method. Moreover, the topic structure recovered by sLDA had higher hold-out likelihood than LDA on one problem, and equivalent hold-out likelihood on the other. These results illustrate the benefits of supervised dimension reduction when prediction is the ultimate goal.

Acknowledgments

David M. 
Blei is supported by grants from Google and the Microsoft Corporation.\n\nReferences\n\n[1] P. Bickel and K. Doksum. Mathematical Statistics. Prentice Hall, 2000.\n[2] D. Blei and M. Jordan. Modeling annotated data. In SIGIR, pages 127\u2013134. ACM Press, 2003.\n[3] D. Blei and J. McAuliffe. Supervised topic models. In preparation, 2007.\n[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n[5] P. Flaherty, G. Giaever, J. Kumm, M. Jordan, and A. Arkin. A latent variable model for\n\nchemogenomic pro\ufb01ling. Bioinformatics, 21(15):3286\u20133293, 2005.\n\n[6] K. Fukumizu, F. Bach, and M. Jordan. Dimensionality reduction for supervised learning with\n\nreproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73\u201399, 2004.\n\n[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. 2001.\n[8] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: Genera-\n\ntive/discriminative training for clustering and classi\ufb01cation. In AAAI, 2006.\n\n[9] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, 1989.\n[10] B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization\n\nwith respect to rating scales. In Proceedings of the ACL, 2005.\n\n8\n\n\f", "award": [], "sourceid": 893, "authors": [{"given_name": "Jon", "family_name": "Mcauliffe", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}