{"title": "DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 897, "page_last": 904, "abstract": "Probabilistic topic models (and their extensions) have become popular as models of latent structures in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood estimation, an approach which may be suboptimal in the context of an overall classification problem. In this paper, we describe DiscLDA, a discriminative learning framework for such models as Latent Dirichlet Allocation (LDA) in the setting of dimensionality reduction with supervised side information. In DiscLDA, a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood using Monte Carlo EM. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task.", "full_text": "DiscLDA: Discriminative Learning for\n\nDimensionality Reduction and Classi\ufb01cation\n\nSimon Lacoste-Julien\n\nComputer Science Division\n\nUC Berkeley\n\nBerkeley, CA 94720\n\nFei Sha\n\nDept. of Computer Science\n\nUniversity of Southern California\n\nLos Angeles, CA 90089\n\nMichael I. Jordan\n\nDept. of EECS and Statistics\n\nUC Berkeley\n\nBerkeley, CA 94720\n\nAbstract\n\nProbabilistic topic models have become popular as methods for dimensionality\nreduction in collections of text documents or images. These models are usually\ntreated as generative models and trained using maximum likelihood or Bayesian\nmethods. 
In this paper, we discuss an alternative: a discriminative framework in\nwhich we assume that supervised side information is present, and in which we\nwish to take that side information into account in \ufb01nding a reduced dimensional-\nity representation. Speci\ufb01cally, we present DiscLDA, a discriminative variation on\nLatent Dirichlet Allocation (LDA) in which a class-dependent linear transforma-\ntion is introduced on the topic mixture proportions. This parameter is estimated\nby maximizing the conditional likelihood. By using the transformed topic mix-\nture proportions as a new representation of documents, we obtain a supervised\ndimensionality reduction algorithm that uncovers the latent structure in a docu-\nment collection while preserving predictive power for the task of classi\ufb01cation.\nWe compare the predictive power of the latent structure of DiscLDA with unsu-\npervised LDA on the 20 Newsgroups document classi\ufb01cation task and show how\nour model can identify shared topics across classes as well as class-dependent\ntopics.\n\n1 Introduction\n\nDimensionality reduction is a common and often necessary step in most machine learning appli-\ncations and high-dimensional data analyses. There is a rich history and literature on the subject,\nranging from classical linear methods such as principal component analysis (PCA) and Fisher dis-\ncriminant analysis (FDA) to a variety of nonlinear procedures such as kernelized versions of PCA\nand FDA as well as manifold learning algorithms.\n\nA recent trend in dimensionality reduction is to focus on probabilistic models. These models, which\ninclude generative topological mapping, factor analysis, independent component analysis and prob-\nabilistic latent semantic analysis (pLSA), are generally speci\ufb01ed in terms of an underlying indepen-\ndence assumption or low-rank assumption. The models are generally \ufb01t with maximum likelihood,\nalthough Bayesian methods are sometimes used. 
In particular, Latent Dirichlet Allocation (LDA) is\na Bayesian model in the spirit of pLSA that models each data point (e.g., a document) as a collec-\ntion of draws from a mixture model in which each mixture component is known as a topic [3]. The\nmixing proportions across topics are document-speci\ufb01c, and the posterior distribution across these\nmixing proportions provides a reduced representation of the document. This model has been used\nsuccessfully in a number of applied domains, including information retrieval, vision and bioinfor-\nmatics [8, 1].\n\nThe dimensionality reduction methods that we have discussed thus far are entirely unsupervised.\nAnother branch of research, known as suf\ufb01cient dimension reduction (SDR), aims at making use of\n\n\fsupervisory data in dimension reduction [4, 7]. For example, we may have class labels or regression\nresponses at our disposal. The goal of SDR is then to identify a subspace or other low-dimensional\nobject that retains as much information as possible about the supervisory signal. Having reduced di-\nmensionality in this way, one may wish to subsequently build a classi\ufb01er or regressor in the reduced\nrepresentation. But there are other goals for the dimension reduction as well, including visualization,\ndomain understanding, and domain transfer (i.e., predicting a different set of labels or responses).\n\nIn this paper, we aim to combine these two lines of research and consider a supervised form of LDA.\nIn particular, we wish to incorporate side information such as class labels into LDA, while retain-\ning its favorable unsupervised dimensionality reduction abilities. The goal is to develop parameter\nestimation procedures that yield LDA topics that characterize the corpus and maximally exploit the\npredictive power of the side information.\n\nAs a parametric generative model, parameters in LDA are typically estimated with maximum like-\nlihood estimation or Bayesian posterior inference. 
Such estimates are not necessarily optimal for\nyielding representations for prediction and regression. In this paper, we use a discriminative learn-\ning criterion\u2014conditional likelihood\u2014to train a variant of the LDA model. Moreover, we augment\nthe LDA parameterization by introducing class-label-dependent auxiliary parameters that can be\ntuned by the discriminative criterion. By retaining the original LDA parameters and introducing\nthese auxiliary parameters, we are able to retain the advantages of the likelihood-based training\nprocedure and provide additional freedom for tracking the side information.\n\nThe paper is organized as follows. In Section 2, we introduce the discriminatively trained LDA (Dis-\ncLDA) model and contrast it to other related variants of LDA models. In Section 3, we describe our\napproach to parameter estimation for the DiscLDA model. In Section 4, we report empirical results\non applying DiscLDA to model text documents. Finally, in Section 5 we present our conclusions.\n\n2 Model\n\nWe start by reviewing the LDA model [3] for topic modeling. We then describe our extension to\nLDA that incorporates class-dependent auxiliary parameters. These parameters are to be estimated\nbased on supervised information provided in the training data set.\n\n2.1 LDA\n\nThe LDA model is a generative process where each document in the text corpus is modeled as a set\nof draws from a mixture distribution over a set of hidden topics. A topic is modeled as a probability\ndistribution over words. Let the vector wd be the bag-of-words representation of document d. The\ngenerative process for this vector is illustrated in Fig. 
1 and has three steps: 1) the document\nis \ufb01rst associated with a K-dimensional topic mixing vector \u03b8d which is drawn from a Dirichlet\ndistribution, \u03b8d \u223c Dir(\u03b1); 2) each word wdn in the document is then assigned to a single topic zdn\ndrawn from the multinomial variable, zdn \u223c Multi(\u03b8d); 3) \ufb01nally, the word wdn is drawn from a\nV -dimensional multinomial variable, wdn \u223c Multi(\u03c6zdn), where V is the size of the vocabulary.\n\nGiven a set of documents, {wd}D\nd=1, the principal task is to estimate the parameters {\u03c6k}K\nk=1. This\ncan be done by maximum likelihood, \u03a6\u2217 = arg max\u03a6 p({wd}; \u03a6), where \u03a6 \u2208 \u211cV \u00d7K is a matrix\nparameter whose columns {\u03c6k}K\nk=1 are constrained to be members of a probability simplex. It is\nalso possible to place a prior probability distribution on the word probability vectors {\u03c6k}K\nk=1\u2014e.g.,\na Dirichlet prior, \u03c6k \u223c Dir(\u03b2)\u2014and treat the parameter \u03a6 as well as the hyperparameters \u03b1 and \u03b2\nvia Bayesian methods. In both the maximum likelihood and Bayesian frameworks it is necessary to\nintegrate over \u03b8d to obtain the marginal likelihood, and this is accomplished either using variational\ninference or Gibbs sampling [3, 8].\n\n2.2 DiscLDA\n\nIn our setting, each document is additionally associated with a categorical variable or class la-\nbel yd \u2208 {1, 2, . . . , C} (encoding, for example, whether a message was posted in the newsgroup\nalt.atheism vs. talk.religion.misc). To model this labeling information, we introduce\na simple extension to the standard LDA model. 
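For concreteness, the three-step LDA generative process of Section 2.1 can be sketched in code. This is a minimal sketch; the dimensions and the symmetric hyperparameter values are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 4, 9, 50          # number of topics, vocabulary size, document length
alpha, beta = 0.5, 0.5      # placeholder symmetric hyperparameters

# Topic-word matrix Phi: one column phi_k on the V-simplex per topic.
Phi = rng.dirichlet(beta * np.ones(V), size=K).T    # shape (V, K)

# 1) draw the document's topic proportions theta_d ~ Dir(alpha)
theta = rng.dirichlet(alpha * np.ones(K))
# 2) assign each word position a topic z_dn ~ Multi(theta_d)
z = rng.choice(K, size=N, p=theta)
# 3) draw each word w_dn ~ Multi(phi_{z_dn})
w = np.array([rng.choice(V, p=Phi[:, zn]) for zn in z])
```

The bag-of-words vector wd of the text is simply the histogram of the sampled `w`.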
Speci\ufb01cally, for each class label y, we introduce a\nlinear transformation T y : \u211cK \u2192 \u211cL\n+, which transforms a K-dimensional Dirichlet variable \u03b8d to\na mixture of Dirichlet distributions: T y\u03b8d \u2208 \u211cL. To generate a word wdn, we draw its topic zdn\nfrom T yd\u03b8d. Note that T y is constrained to have its columns sum to one to ensure the normalization\nof the transformed variable T y\u03b8d and is thus a stochastic matrix. Intuitively, every document in the\ntext corpus is represented through \u03b8d as a point in the topic simplex {\u03b8 | \u2211k \u03b8k = 1}, and we hope\nthat the linear transformation {T y} will be able to reposition these points such that documents with\nthe same class labels are represented by points nearby to each other. Note that these points can not\nbe placed arbitrarily, as all documents\u2014whether they have the same class labels or they do not\u2014\nshare the parameter \u03a6 \u2208 \u211cV \u00d7L.\n\nFigure 1: LDA model. Figure 2: DiscLDA. Figure 3: DiscLDA with auxiliary variable u.\n\n
The graphical model in Figure 2 shows the new generative process.\nCompared to standard LDA, we have added the nodes for the variable yd (and its prior distribution\n\u03c0), the transformation matrices T y and the corresponding edges.\n\nAn alternative to DiscLDA would be a model in which there are class-dependent topic parameters\n\u03c6y\nk which determine the conditional distribution of the words:\n\nwdn | zdn, yd, \u03a6 \u223c Multi(\u03c6yd\nzdn).\n\nThe problem with this approach is that the posterior p(y|w, \u03a6) is a highly non-convex function of \u03a6\nwhich makes its optimization very challenging given the high dimensionality of the parameter space\nin typical applications. Our approach circumvents this dif\ufb01culty by learning a low-dimensional\ntransformation of the \u03c6k\u2019s in a discriminative manner instead. Indeed, transforming the topic mix-\nture vector \u03b8 is actually equivalent to transforming the \u03a6 matrix. To see this, note that by marginal-\nizing out the hidden topic vector z, we get the following distribution for the word wdn given \u03b8:\n\nwdn | yd, \u03b8d, T \u223c Multi(\u03a6T y\u03b8d).\n\nBy the associativity of the matrix product, we see that we obtain an equivalent probabilistic model\nby applying the linear transformation to \u03a6 instead, and, in effect, de\ufb01ning the class-dependent topic\nparameters as follows:\n\n\u03c6y\nk = \u2211l \u03c6l T y\nlk.\n\nAnother motivation for our approach is that it gives the model the ability to distinguish topics which\nare shared across different classes versus topics which are class-speci\ufb01c. For example, this separa-\ntion can be accomplished by using the following transformations (for binary classi\ufb01cation):\n\nT 1 = ( IK 0 ; 0 0 ; 0 IK ),   T 2 = ( 0 0 ; IK 0 ; 0 IK ),   (1)\n\nwith blocks listed row by row, where IK stands for the identity matrix with K rows and columns. 
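As a sketch of the binary transformations in (1) and of the equivalence \u03c6y = \u03a6T y, the following can be verified numerically. The sizes K and V are placeholders, not values from the paper.

```python
import numpy as np

# Blocks of Eq. (1): T^1 and T^2 are (3K x 2K) column-stochastic matrices.
K, V = 3, 7
I, O = np.eye(K), np.zeros((K, K))
T1 = np.block([[I, O], [O, O], [O, I]])
T2 = np.block([[O, O], [I, O], [O, I]])
assert np.allclose(T1.sum(axis=0), 1.0) and np.allclose(T2.sum(axis=0), 1.0)

rng = np.random.default_rng(0)
Phi = rng.dirichlet(np.ones(V), size=3 * K).T       # shared V x 3K topic matrix
theta = rng.dirichlet(np.ones(2 * K))               # topic proportions in R^{2K}

# Transforming theta is equivalent to transforming Phi (associativity):
Phi_y = Phi @ T1                                    # class-dependent topics, V x 2K
assert np.allclose(Phi_y @ theta, Phi @ (T1 @ theta))
assert np.allclose((Phi @ T1 @ theta).sum(), 1.0)   # a proper word distribution
```

Each column of `Phi_y` is \u03c6y\nk = \u2211l \u03c6l T y\nlk, and remains on the probability simplex because both \u03a6 and T y are column-stochastic.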
In this case, the last K topics\nare shared by both classes, whereas the \ufb01rst two groups of K topics are exclusive to one class or the\nother. We will explore this parametric structure later in our experiments.\n\nNote that we can give a generative interpretation to the transformation by augmenting the model\nwith a hidden topic vector variable u, as shown in Fig. 3, where\n\np(u = k|z = l, T , y) = T y\nkl.\n\nIn this augmented model, T can be interpreted as the probability transition matrix from z-topics to\nu-topics.\n\nBy including a Dirichlet prior on the T parameters, the DiscLDA model can be related to the author-\ntopic model [10], if we restrict to the special case in which there is only one author per document.\nIn the author-topic model, the bag-of-words representation of a document is augmented by a list\nof the authors of the document. To generate a word in a document, one \ufb01rst picks at random the\nauthor associated with this document. Given the author (y in our notation), a topic is chosen accord-\ning to corpus-wide author-speci\ufb01c topic-mixture proportions (which is a column vector T y in our\nnotation). The word is then generated from the corresponding topic distribution as usual. Accord-\ning to this analogy, we see that our model not only enables us to predict the author of a document\n(assuming a small set of possible authors), but we also capture the content of documents (using \u03b8)\nas well as the corpus-wide class properties (using T ). The focus of the author-topic model was to\nmodel the interests of authors, not the content of documents, explaining why there was no need to\nadd document-speci\ufb01c topic-mixture proportions. Because we want to predict the class for a speci\ufb01c\ndocument, it is crucial that we also model the content of a document.\n\nRecently, there has been growing interest in topic modeling with supervised information. 
Blei and\nMcAuliffe [2] proposed a supervised LDA model where the empirical topic vector z (sampled from\n\u03b8) is used as a covariate for a regression on y (see also [6]). Mimno and McCallum [9] proposed a\nDirichlet-multinomial regression which can handle various types of side information, including the\ncase in which this side information is an indicator variable of the class (y)1. Our work differs from\ntheirs, however, in that we train the transformation parameter by maximum conditional likelihood\ninstead of a generative criterion.\n\n3 Inference and learning\n\nGiven a corpus of documents and their labels, we estimate the parameters {T y} by maximizing\nthe conditional likelihood \u2211d log p(yd | wd; {T y}, \u03a6) while holding \u03a6 \ufb01xed. To estimate the pa-\nrameters \u03a6, we hold the transformation matrices \ufb01xed and maximize the posterior of the model, in\nmuch the same way as in standard LDA models. Intuitively, the two different training objectives\nhave two effects on the model: the optimization of the posterior with respect to \u03a6 captures the topic\nstructure that is shared in documents throughout a corpus, while the optimization of the conditional\nlikelihood with respect to {T y} \ufb01nds a transformation of the topics that discriminates between the\ndifferent classes within the corpus.\n\nWe use the Rao-Blackwellized version of Gibbs sampling presented in [8] to obtain samples of z\nand u with \u03a6 and \u03b8 marginalized out. Those samples can be used to estimate the likelihood of\np(w|y, T ), and thus the posterior p(y|w, T ) for prediction, by using the harmonic mean estima-\ntor [8]. Even though this estimator can be unstable in general model selection problems, we found\nthat it gave reasonably stable estimates for our purposes.\n\nWe maximize the conditional likelihood objective with respect to T by using gradient ascent, for a\n\ufb01xed \u03a6. 
The gradient can be estimated by Monte Carlo EM, with samples from the Gibbs sampler.\nMore speci\ufb01cally, we use the matching property of gradients in EM to write the gradient as:\n\n\u2202/\u2202T log p(y|w, T , \u03a6) = E_{q_t^y(z)}[\u2202/\u2202T log p(w, z|y, T , \u03a6)] \u2212 E_{r_t(z)}[\u2202/\u2202T log p(w, z|T , \u03a6)],   (2)\n\nwhere q_t^y(z) = p(z|w, y, T t, \u03a6), r_t(z) = p(z|w, T t, \u03a6) and the derivatives are evaluated at T =\nT t. We can approximate those expectations using the relevant Gibbs samples. After a few gradient\nupdates, we re\ufb01t \u03a6 by its MAP estimate from Gibbs samples.\n\n3.1 Dimensionality reduction\n\nWe can obtain a supervised dimensionality reduction method by using the average transformed topic\nvector as the reduced representation of a test document. We estimate it using E[T y\u03b8|\u03a6, w, T ] =\n\u2211y p(y|\u03a6, w, T ) E[T y\u03b8|y, \u03a6, w, T ]. The \ufb01rst term on the right-hand side of this equation can\nbe estimated using the harmonic mean estimator and the second term can be approximated from\nMCMC samples of z. This new representation can be used as a feature vector for another classi\ufb01er\nor for visualization purposes.\n\n1In this case, their model is actually the same as Model 1 in [5] with an additional prior on the class-\ndependent parameters for the Dirichlet distribution on the topics.\n\nFigure 4: t-SNE 2D embedding of the E[T y\u03b8|\u03a6, w, T ] representation of Newsgroups documents,\nafter \ufb01tting to the DiscLDA model (T was \ufb01xed).\n\nFigure 5: t-SNE 2D embedding of the E[\u03b8|\u03a6, w, T ] representation of Newsgroups documents, after\n\ufb01tting to the standard unsupervised LDA model.\n\n4 Experimental results\n\nWe evaluated the DiscLDA model empirically on text modeling and classi\ufb01cation tasks. Our ex-\nperiments aimed to demonstrate the bene\ufb01ts of discriminative training of LDA for discovering a\ncompact latent representation that contains both predictive and shared components across different\ntypes of data. We evaluated the performance of our model by contrasting it to standard LDA models\nthat were not trained discriminatively.\n\n4.1 Text modeling\n\nThe 20 Newsgroups dataset contains postings to Usenet newsgroups. The postings are organized by\ncontent into 20 related categories and are therefore well suited for topic modeling. In this section,\nwe investigate how DiscLDA can exploit the labeling information\u2014the category\u2014in discovering\nmeaningful hidden structures that differ from those found using unsupervised techniques.\n\nWe \ufb01t the dataset to both a standard 110-topic LDA model and a DiscLDA model with restricted\nforms of the transformation matrices {T y}20\ny=1. Speci\ufb01cally, the transformation matrix T y for class\nlabel y is \ufb01xed and given by the following blocked matrix:\n\nT y = ( 0 0 ; ... ; IK0 0 ; ... ; 0 0 ; 0 IK1 )   (3)\n\n(row-blocks listed top to bottom, separated by semicolons). This matrix has (C + 1) rows and two\ncolumns of block matrices. All but two block matrices\nare zero matrices. 
At the \ufb01rst column and the row y, the block matrix is an identity matrix with\ndimensionality of K0 \u00d7 K0. The last element of T y is another identity matrix with dimensionality\nK1. When applying the transformation to a topic vector \u03b8 \u2208 \u211cK0+K1, we obtain a transformed\ntopic vector \u03b8tr = T y\u03b8 whose nonzero elements partition the components of \u03b8tr into (C + 1) disjoint\nsets: one set of K0 elements for each class label that does not overlap with the others, and a set\nof K1 components that is shared by all class labels. Intuitively, the shared components should use\nall class labels to model common latent structures, while nonoverlapping components should model\nspeci\ufb01c characteristics of data from each class.\n\nalt.atheism: atheism, religion, bible, god, system, moral, atheists, keith, jesus, islam\ncomp.graphics: \ufb01les, color, images, \ufb01le, image, format, software, graphics, jpeg, gif\ncomp.os.ms-windows.misc: card, \ufb01les, mouse, \ufb01le, dos, drivers, win, ms, windows, driver\ncomp.sys.ibm.pc.hardware: drive, card, drives, bus, mb, os, disk, scsi, controller, ide\ncomp.sys.mac.hardware: drive, apple, mac, speed, monitor, mb, quadra, mhz, lc, scsi\ncomp.windows.x: server, entry, display, \ufb01le, program, output, window, motif, widget, lib\nmisc.forsale: price, mail, interested, offer, cover, condition, dos, sale, cd, shipping\nrec.autos: cars, price, drive, car, driving, speed, engine, oil, ford, dealer\nrec.motorcycles: ca, ride, riding, dog, bmw, helmet, dod, bike, motorcycle, bikes\nrec.sport.baseball: games, baseball, year, game, runs, team, hit, players, season, braves\nrec.sport.hockey: ca, period, play, games, game, team, win, players, season, hockey\nsci.crypt: government, key, public, security, chip, clipper, keys, db, privacy, encryption\nsci.electronics: current, power, ground, wire, output, circuit, audio, wiring, voltage, amp\nsci.med: gordon, food, disease, pitt, doctor, medical, pain, health, msg, patients\nsci.space: earth, space, moon, nasa, orbit, henry, launch, shuttle, satellite, lunar\nsoc.religion.christian: christians, bible, church, truth, god, faith, christian, christ, jesus, rutgers\ntalk.politics.guns: people, gun, guns, government, \ufb01le, \ufb01re, fbi, weapons, militia, \ufb01rearms\ntalk.politics.mideast: people, turkish, government, jews, israel, israeli, turkey, armenian, armenians, armenia\ntalk.politics.misc: american, men, war, mr, tax, government, president, health, cramer, stephanopoulos\ntalk.religion.misc: religion, christians, bible, god, christian, christ, morality, objective, sandvik, jesus\nShared topics: ca, people, post, wrote, group, system, world, work, ll, make, true, university, great, case, number, read, day, mail, information, send, back, article, writes, question, \ufb01nd, things, put, don, cs, didn, good, end, ve, long, point, years, doesn, part, time, state, fact, thing, made, problem, real, david, apr, give, lot, news\n\nTable 1: Most popular words from each group of class-dependent topics or a bucket of \u201cshared\u201d\ntopics learned in the 20 Newsgroups experiment with \ufb01xed T matrix.\n\nIn a \ufb01rst experiment, we examined whether the DiscLDA model can exploit the structure for T y\ngiven in (3). In this experiment, we \ufb01rst obtained an estimate of the \u03a6 matrix by setting it to the\nMAP estimate from Gibbs samples as explained in Section 3. We then estimated a new represen-\ntation for test documents by taking the conditional expectation of T y\u03b8 with y marginalized out as\nexplained in Section 3.1. Finally, we computed a 2D-embedding of this K1-dimensional rep-\nresentation of documents. 
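The blocked transformation in (3) can be constructed concretely as follows. This is a sketch (the helper name is ours), using the experiment's values C = 20, K0 = 5, K1 = 10.

```python
import numpy as np

def make_T(y, C=20, K0=5, K1=10):
    """Sketch of Eq. (3): T^y has an identity block I_{K0} in the first
    block-column at row-block y and an identity block I_{K1} in the second
    block-column at the last row-block; all other blocks are zero."""
    T = np.zeros((C * K0 + K1, K0 + K1))
    T[y * K0:(y + 1) * K0, :K0] = np.eye(K0)
    T[C * K0:, K0:] = np.eye(K1)
    return T

theta = np.random.default_rng(0).dirichlet(np.ones(15))   # theta in R^{K0+K1}
theta_tr = make_T(y=3) @ theta                            # transformed topic vector
assert theta_tr.shape == (110,)                           # C*K0 + K1 = 110 topics
assert np.isclose(theta_tr.sum(), 1.0)                    # columns of T sum to one
# Nonzero components fall in class 3's block and in the shared block only.
nz = set(np.flatnonzero(theta_tr))
assert nz <= set(range(15, 20)) | set(range(100, 110))
```

The assertions check the partition described above: the transformed vector stays on the simplex and its support is the union of one class-specific block and the shared block.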
To obtain an embedding, we \ufb01rst tried standard multidimensional scaling\n(MDS), using the symmetrized KL divergence between pairs of \u03b8tr topic vectors as a dissimilarity\nmetric, but the results were hard to visualize. A more interpretable embedding was obtained using\na modi\ufb01ed version of the t-SNE stochastic neighborhood embedding presented by van der Maaten\nand Hinton [11]. Fig. 4 shows a scatter plot of the 2D embedding of the topic representation of\nthe 20 Newsgroups test documents, where the colors of the dots, each corresponding to a document,\nencode class labels. Clearly, the documents are well separated in this space. In contrast, the em-\nbedding computed from standard LDA, shown in Fig. 5, does not show a clear separation. In this\nexperiment, we have set K0 = 5 and K1 = 10 for DiscLDA, yielding 110 possible topics; hence\nwe set K = 110 for the standard LDA model for proper comparison.\n\nIt is also instructive to examine in detail the topic structures of the \ufb01tted DiscLDA model. Given\nthe speci\ufb01c setup of our transformation matrix T , each component of the topic vector u is either\nassociated with a class label or shared across all class labels. For each component, we can compute\nthe most popular words from the word-topic distribution \u03a6. In Table 1, we list these\nwords and group them under each class label and a special bucket \u201cshared.\u201d We see that the words\nare highly indicative of their associated class labels. Additionally, the words in the \u201cshared\u201d category\nare \u201cneutral,\u201d neither positively nor negatively suggesting proper class labels where they are likely\nto appear. In fact, these words con\ufb01rm the intuition of the DiscLDA model: they re\ufb02ect common\nEnglish usage underlying different documents.\n\nLDA+SVM: 20%   DiscLDA+SVM: 17%   DiscLDA alone: 17%\n\nTable 2: Binary classi\ufb01cation error rates for two newsgroups.\n\n
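For illustration, the symmetrized KL dissimilarity used in the MDS attempt can be computed as follows. This is a minimal sketch; the smoothing constant `eps` is our assumption to guard against log(0), not a detail from the paper.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence KL(p||q) + KL(q||p) between two probability
    vectors (e.g. transformed topic proportions theta_tr)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Pairwise dissimilarity matrix over a few documents' topic vectors,
# as would be fed to MDS.
rng = np.random.default_rng(0)
thetas = rng.dirichlet(np.ones(10), size=4)
D = np.array([[sym_kl(a, b) for b in thetas] for a in thetas])
assert np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0)
```

Unlike the plain KL divergence, this quantity is symmetric, which is what a dissimilarity-based method such as MDS expects.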
We note that we had already taken out a standard list\nof stop words from the documents.\n\n4.2 Document classi\ufb01cation\n\nIt is also of interest to consider the classi\ufb01cation problem more directly and ask whether the features\ndelivered by DiscLDA are more useful for classi\ufb01cation than those delivered by LDA. Of course, we\ncan also use DiscLDA as a classi\ufb01cation method per se, by marginalizing over the latent variables\nand computing the probability of the label y given the words in a test document. Our focus in this\nsection, however, is its featural representation. We thus use a different classi\ufb01cation method (the\nSVM) to compare the features obtained by DiscLDA to those obtained from LDA.\n\nIn a \ufb01rst experiment, we returned to the \ufb01xed T setting studied in Section 4.1 and considered the\nfeatures obtained by DiscLDA for the 20 Newsgroups problem. Speci\ufb01cally, we constructed mul-\nticlass linear SVM classi\ufb01ers using the expected topic proportion vectors from unsupervised LDA\nand DiscLDA models as features, as described in Section 3.1. The results were as follows. Using the\ntopic vectors from standard LDA, the error rate of classi\ufb01cation was 25%. When the topic vectors\nfrom the DiscLDA model were used, we obtained an error rate of 20%. Clearly the DiscLDA features\nhave retained information useful for classi\ufb01cation.\n\nWe also computed the MAP estimate of the class label y\u2217 = arg maxy p(y|w) from DiscLDA and\nused this estimate directly as a classi\ufb01er. The error rate was again 20%.\n\nIn a second experiment, we considered the fully adaptive setting in which the transformation matrix\nT y is learned in a discriminative fashion as described in Section 3. We initialized the matrix T\nto a smoothed block diagonal matrix having a pattern similar to (1), with 20 shared topics and 20\nclass-dependent topics per class. 
We then sampled u and z for 300 Gibbs steps to obtain an initial\nestimate of the \u03a6 matrix. This was followed by the discriminative learning process in which we\niteratively ran batch gradient updates (in the log domain, so that T remained normalized) using\nMonte Carlo EM with a constant step size for 10 epochs. We then re-estimated \u03a6 by sampling u\nconditioned on (\u03a6, T ). This discriminative learning process was repeated until there was no\nimprovement on a validation data set. The step size was chosen by grid search.\n\nIn this experiment, we considered the binary classi\ufb01cation problem of distinguishing postings of the\nnewsgroup alt.atheism from postings of the newsgroup talk.religion.misc, a dif\ufb01cult\ntask due to the similarity in content between these two groups.\n\nTable 2 summarizes the results of our experiment, where we have used topic vectors from unsuper-\nvised LDA and DiscLDA as input features to binary linear SVM classi\ufb01ers. We also computed the\nprediction of the label of a document directly with DiscLDA. As shown in the table, the DiscLDA\nmodel clearly generates topic vectors with better predictive power than unsupervised LDA.\n\nIn Table 3 we present the ten most probable words for a subset of topics learned using the discrim-\ninative DiscLDA approach. We found that the learned T had a block-diagonal structure similar\nto (3), though differing signi\ufb01cantly in some ways. In particular, although we started with 20 shared\ntopics, the learned T had only 12 shared topics. We have grouped the topics in Table 3 according to\nwhether they were class-speci\ufb01c or shared, uncovering an interesting latent structure which appears\nmore discriminating than the topics presented in Table 1.\n\n5 Discussion\n\nWe have presented DiscLDA, a variation on LDA in which the LDA parametrization is augmented\nto include a transformation matrix and in which this matrix is learned via a conditional likelihood\ncriterion. 
This approach allows DiscLDA to retain the ability of the LDA approach to \ufb01nd useful\nlow-dimensional representations of documents, but also to make use of discriminative side informa-\ntion (labels) in forming these representations.\n\nTopics for alt.atheism: (1) god, atheism, religion, atheists, religious, atheist, belief, existence,\nstrong; (2) argument, true, conclusion, fallacy, arguments, valid, form, false, logic, proof; (3) peace,\numd, mangoe, god, thing, language, cs, wingate, contradictory, problem.\n\nTopics for talk.religion.misc: (1) evil, group, light, read, stop, religions, muslims, understand,\nexcuse; (2) back, gay, convenient, christianity, homosexuality, long, nazis, love, homosexual, david;\n(3) bible, ra, jesus, true, christ, john, issue, church, lds, robert.\n\nShared topics: (1) things, bobby, men, makes, bad, mozumder, bill, ultb, isc, rit; (2) system, don,\nmoral, morality, murder, natural, isn, claim, order, animals; (3) evidence, truth, statement, simply,\naccept, claims, explain, science, personal, left.\n\nTable 3: Ten most popular words from a random selection of different types of topics learned in the\ndiscriminative learning experiment on the binary dataset.\n\nAlthough we have focused on LDA, we view our strategy as more broadly useful. A virtue of the\nprobabilistic modeling framework is that it can yield complex models that are modular and can be\ntrained effectively with unsupervised methods. Given the high dimensionality of such models, it may\nbe intractable to train all of the parameters via a discriminative criterion such as conditional likelihood. 
In this case it may be desirable to pursue a mixed strategy in which we retain the unsupervised\ncriterion for the full parameter space but augment the model with a carefully chosen transformation\nso as to obtain an auxiliary low-dimensional optimization problem for which conditional likelihood\nmay be more effective.\n\nAcknowledgements We thank the anonymous reviewers as well as Percy Liang, Iain Murray,\nGuillaume Obozinski and Erik Sudderth for helpful suggestions. Our work was supported by Grant\n0509559 from the National Science Foundation and by a grant from Google.\n\nReferences\n\n[1] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. Learned-Miller, and\nD. A. Forsyth. Names and faces in the news. In Proceedings of IEEE Conference on Computer\nVision and Pattern Recognition, Washington, DC, 2004.\n\n[2] D. Blei and J. McAuliffe. Supervised topic models. In J. Platt, D. Koller, Y. Singer, and\nS. Roweis, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA,\n2008. MIT Press.\n\n[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine\nLearning Research, 3:993\u20131022, 2003.\n\n[4] F. Chiaromonte and R. D. Cook. Suf\ufb01cient dimension reduction and graphics in regression.\nAnnals of the Institute of Statistical Mathematics, 54(4):768\u2013795, 2002.\n\n[5] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories.\nIn Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Diego,\nCA, 2005.\n\n[6] P. Flaherty, G. Giaever, J. Kumm, M. I. Jordan, and A. P. Arkin. A latent variable model for\nchemogenomic pro\ufb01ling. Bioinformatics, 21:3286\u20133293, 2005.\n\n[7] K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. Annals\nof Statistics, 2008. To appear.\n\n[8] T. Grif\ufb01ths and M. Steyvers. Finding scienti\ufb01c topics. Proceedings of the National Academy\nof Sciences, 101:5228\u20135235, 2004.\n\n[9] D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-\nmultinomial regression. In Proceedings of the 24th Annual Conference on Uncertainty in\nArti\ufb01cial Intelligence, Helsinki, Finland, 2008.\n\n[10] M. Rosen-Zvi, T. Grif\ufb01ths, M. Steyvers, and P. Smyth. The author-topic model for authors\nand documents. In Proceedings of the 20th Annual Conference on Uncertainty in Arti\ufb01cial\nIntelligence, Banff, Canada, 2004.\n\n[11] L. J. P. van der Maaten and G. E. Hinton. Visualizing data using t-SNE. Journal of Machine\nLearning Research, 9:2579\u20132605, 2008.\n", "award": [], "sourceid": 993, "authors": [{"given_name": "Simon", "family_name": "Lacoste-Julien", "institution": null}, {"given_name": "Fei", "family_name": "Sha", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}