{"title": "Prediction and Semantic Association", "book": "Advances in Neural Information Processing Systems", "page_first": 11, "page_last": 18, "abstract": null, "full_text": "Prediction and Semantic Association \n\nThomas L. Griffiths & Mark Steyvers \n\nDepartment of Psychology \n\nStanford University, Stanford, CA 94305-2130 \n{gruffydd,msteyver}@psych.stanford.edu \n\nAbstract \n\nWe explore the consequences of viewing semantic association as \nthe result of attempting to predict the concepts likely to arise in a \nparticular context. We argue that the success of existing accounts \nof semantic representation comes as a result of indirectly addressing \nthis problem, and show that a closer correspondence to human data \ncan be obtained by taking a probabilistic approach that explicitly \nmodels the generative structure of language. \n\n1 \n\nIntroduction \n\nMany cognitive capacities, such as memory and categorization, can be analyzed as \nsystems for efficiently predicting aspects of an organism's environment [1]. Previ(cid:173)\nously, such analyses have been concerned with memory for facts or the properties \nof objects, where the prediction task involves identifying when those facts might \nbe needed again, or what properties novel objects might possess. However, one of \nthe most challenging tasks people face is linguistic communication. Engaging in \nconversation or reading a passage of text requires retrieval of a variety of concepts \nfrom memory in response to a stream of information. This retrieval task can be \nfacilitated by predicting which concepts are likely to be needed from their context, \nhaving efficiently abstracted and stored the cues that support these predictions. 
\n\nIn this paper, we examine how understanding the problem of predicting words \nfrom their context can provide insight into human semantic association, exploring \nthe hypothesis that the association between words is at least partially affected \nby their statistical relationships. Several researchers have argued that semantic \nassociation can be captured using high-dimensional spatial representations, with \nthe most prominent such approach being Latent Semantic Analysis (LSA) [5]. We \nwill describe this procedure, which indirectly addresses the prediction problem. We \nwill then suggest an alternative approach which explicitly models the way language \nis generated and show that this approach provides a better account of human word \nassociation data than LSA, although the two approaches are closely related. The \ngreat promise of this approach is that it illustrates how we might begin to relax some \nof the strong assumptions about language made by many corpus-based methods. \nWe will provide an example of this, showing results from a generative model that \nincorporates both sequential and contextual information. \n\n\f2 Latent Semantic Analysis \n\nLatent Semantic Analysis addresses the prediction problem by capturing similarity \nin word usage: seeing a word suggests that we should expect to see other words \nwith similar usage patterns. Given a corpus containing W words and D documents, \nthe input to LSA is a W x D word-document co-occurrence matrix F in which fwd \ncorresponds to the frequency with which word w occurred in document d. This \nmatrix is transformed to a matrix G via some function involving the term frequency \nfwd and its frequency across documents fw .. Many applications of LSA in cognitive \nscience use the transformation \n\ngwd = IOg{fwd + 1}(1 - Hw) \n\nH \n\n- _ d-l f w \n\nw -\n\n2:D_ Wlog{W} \n' \n\nlogD \n\nf w . \n\n(1) \n\nwhere Hw is the normalized entropy of the distribution over documents for each \nword. 
Singular value decomposition (SVD) is applied to G to extract a lower-dimensional linear subspace that captures much of the variation in usage across words. The output of LSA is a vector for each word, locating it in the derived subspace. The association between two words is typically assessed using the cosine of the angle between their vectors, a measure that appears to produce psychologically accurate results on a variety of tasks [5]. For the tests presented in this paper, we ran LSA on a subset of the TASA corpus, which contains excerpts from texts encountered by children between first grade and the first year of college. Our subset used all D = 37651 documents, and the W = 26414 words that occurred at least ten times in the whole corpus, with stop words removed. From this we extracted a 500-dimensional representation, which we will use throughout the paper.¹ \n\n3 The topic model \n\nLatent Semantic Analysis gives results that seem consistent with human judgments and extracts information relevant to predicting words from their contexts, although it was not explicitly designed with prediction in mind. This relationship suggests that a closer correspondence to human data might be obtained by directly attempting to solve the prediction task. In this section, we outline an alternative approach that involves learning a probabilistic model of the way language is generated. One generative model that has been used to outperform LSA on information retrieval tasks views documents as being composed of sets of topics [2,4]. 
If we assume that the words that occur in different documents are drawn from T topics, where each topic is a probability distribution over words, then we can model the distribution over words in any one document as a mixture of those topics \n\nP(w_i) = \sum_{j=1}^{T} P(w_i | z_i = j) P(z_i = j) \quad (2) \n\nwhere z_i is a latent variable indicating the topic from which the ith word was drawn and P(w_i | z_i = j) is the probability of the ith word under the jth topic. The words likely to be used in a new context can be determined by estimating the distribution over topics for that context, corresponding to P(z_i). \nIntuitively, P(w | z = j) indicates which words are important to a topic, while P(z) is the prevalence of those topics within a document. For example, imagine a world where the only topics of conversation are love and research. We could then express the probability distribution over words with two topics, one relating to love and the other to research. The content of the topics would be reflected in P(w | z = j): the love topic would give high probability to words like JOY, PLEASURE, or HEART, while the research topic would give high probability to words like SCIENCE, MATHEMATICS, or EXPERIMENT. Whether a particular conversation concerns love, research, or the love of research would depend upon its distribution over topics, P(z), which determines how these topics are mixed together in forming documents. \n\n¹The dimensionality of the representation is an important parameter for both models in this paper. LSA performed best on the word association task with around 500 dimensions, so we used the same dimensionality for the topic model. \n\nHaving defined a generative model, learning topics becomes a statistical problem. The data consist of words w = {w_1, ..., w_n}, where each w_i belongs to some document d_i, as in a word-document co-occurrence matrix. 
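The mixture in Equation 2 can be made concrete with the two-topic "love vs. research" world described above; the probability values below are invented for illustration.

```python
# A toy instance of Equation 2, assuming the two-topic world described above;
# the topic-word probabilities and the document's mixture are made up.
phi = {
    "love":     {"JOY": 0.5, "HEART": 0.4, "SCIENCE": 0.1},
    "research": {"JOY": 0.05, "HEART": 0.05, "SCIENCE": 0.9},
}

def p_word(w, theta):
    # P(w_i) = sum_j P(w_i | z_i = j) P(z_i = j)   (Equation 2)
    return sum(phi[z][w] * theta[z] for z in phi)

doc_theta = {"love": 0.3, "research": 0.7}   # P(z) for one document
```

A document mostly about research (as in `doc_theta`) assigns SCIENCE a probability dominated by the research topic, while the full distribution over the toy vocabulary still sums to one.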
For each document we have a multinomial distribution over the T topics, with parameters θ^{(d)}, so for a word in document d, P(z_i = j) = θ^{(d)}_j. The jth topic is represented by a multinomial distribution over the W words in the vocabulary, with parameters φ^{(j)}, so P(w_i | z_i = j) = φ^{(j)}_{w_i}. To make predictions about new documents, we need to assume a prior distribution on the parameters θ. Existing parameter estimation algorithms make different assumptions about θ, with varying results [2,4]. Here, we present a novel approach to inference in this model, using Markov chain Monte Carlo with a symmetric Dirichlet(α) prior on θ^{(d_i)} for all documents and a symmetric Dirichlet(β) prior on φ^{(j)} for all topics. In this approach we do not need to explicitly represent the model parameters: we can integrate out θ and φ, defining the model simply in terms of the assignments of words to topics indicated by the z_i. \nMarkov chain Monte Carlo is a procedure for obtaining samples from complicated probability distributions, allowing a Markov chain to converge to the target distribution and then drawing samples from the states of that chain (see [3]). We use Gibbs sampling, where each state is an assignment of values to the variables being sampled, and the next state is reached by sequentially sampling all variables from their distribution when conditioned on the current values of all other variables and the data. We will sample only the assignments of words to topics, z_i. The conditional posterior distribution for z_i is given by \n\nP(z_i = j | z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(.)}_{-i,j} + W\beta} \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,.} + T\alpha} \quad (3) \n\nwhere z_{-i} is the assignment of all z_k such that k ≠ i, and n^{(w_i)}_{-i,j} is the number of words assigned to topic j that are the same as w_i, n^{(.)}_{-i,j} is the total number of words assigned to topic j, n^{(d_i)}_{-i,j} is the number of words from document d assigned to topic j, and n^{(d_i)}_{-i,.} 
is the total number of words in document d, all not counting the assignment of the current word w_i. α and β are free parameters that determine how heavily these distributions are smoothed. \n\nWe applied this algorithm to our subset of the TASA corpus, which contains n = 5628867 word tokens. Setting α = 0.1, β = 0.01, we obtained 100 samples of 500 topics, with 10 samples from each of 10 runs with a burn-in of 1000 iterations and a lag of 100 iterations between samples.² Each sample consists of an assignment of every word token to a topic, giving a value to each z_i. A subset of the 500 topics found in a single sample are shown in Table 1. For each sample we can compute the posterior predictive distribution (and posterior mean for φ^{(j)}): \n\nP(w | z = j, z, w) = \int P(w | z = j, φ^{(j)}) P(φ^{(j)} | z, w) dφ^{(j)} = \frac{n^{(w)}_j + \beta}{n^{(.)}_j + W\beta} \quad (4) \n\n²Random numbers were generated with the Mersenne Twister, which has an extremely deep period [6]. For each run, the initial state of the Markov chain was found using an on-line version of Equation 3. \n\nFEEL FEELINGS FEELING ANGRY WAY THINK SHOW FEELS PEOPLE FRIENDS THINGS MIGHT HELP HAPPY FELT LOVE ANGER BEING WAYS FEAR \nMUSIC PLAY DANCE PLAYS STAGE PLAYED BAND AUDIENCE MUSICAL DANCING RHYTHM PLAYING THEATER DRUM ACTORS SHOW BALLET ACTOR DRAMA SONG \nBALL GAME TEAM PLAY BASEBALL FOOTBALL PLAYERS GAMES PLAYING FIELD PLAYED PLAYER COACH BASKETBALL SPORTS HIT BAT TENNIS TEAMS SOCCER \nSCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK CHEMISTRY RESEARCH BIOLOGY MATHEMATICS LABORATORY STUDYING SCIENTIST PHYSICS FIELD STUDIES UNDERSTAND STUDIED SCIENCES MANY \nWORKERS WORK LABOR JOBS WORKING WORKER WAGES FACTORY JOB WAGE SKILLED PAID CONDITIONS PAY FORCE MANY HOURS EMPLOYMENT EMPLOYED EMPLOYERS \nFORCE FORCES MOTION BODY GRAVITY MASS PULL NEWTON OBJECT LAW DIRECTION MOVING REST FALL ACTING MOMENTUM DISTANCE GRAVITATIONAL PUSH VELOCITY \n\nTable 1: Each column (shown here as a row) shows the 20 most probable words in one of the 500 topics obtained from a single sample. The organization of the columns and use of boldface displays the way in which polysemy is captured by the model. \n\n4 Predicting word association \n\nWe used both LSA and the topic model to predict the association between pairs of words, comparing these results with human word association norms collected by Nelson, McEvoy and Schreiber [7]. These word association norms were established by presenting a large number of participants with a cue word and asking them to name an associated word in response. A total of 4544 of the words in these norms appear in the set of 26414 taken from the TASA corpus. \n\n4.1 Latent Semantic Analysis \n\nIn LSA, the association between two words is usually measured using the cosine of the angle between their vectors. We ordered the associates of each word in the norms by their frequencies, making the first associate the word most commonly given as a response to the cue. For example, the first associate of NEURON is BRAIN. We evaluated the cosine between each word and the other 4543 words in the norms, and then computed the rank of the cosine of each of the first ten associates, or all of the associates for words with fewer than ten. The results are shown in Figure 1. Small ranks indicate better performance, with a rank of one meaning that the target word had the highest cosine. The median rank of the first associate was 32, and LSA correctly predicted the first associate for 507 of the 4544 words. 
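Returning to the inference algorithm of Section 3, the conditional distribution in Equation 3 can be sketched as a collapsed Gibbs update. The count structures, their names, and the parameter values below are toy assumptions for illustration, not the paper's implementation.

```python
import random

# A minimal sketch of the collapsed Gibbs update in Equation 3: n_wt[w][j]
# counts copies of word w assigned to topic j, n_tot[j] counts all words in
# topic j, and n_dt[d][j] counts words of document d in topic j, all already
# excluding the word being resampled.
def sample_topic(w, d, n_wt, n_tot, n_dt, T, W, alpha, beta, rng):
    weights = []
    for j in range(T):
        p_w = (n_wt[w][j] + beta) / (n_tot[j] + W * beta)  # first factor of Eq. 3
        p_d = n_dt[d][j] + alpha  # second factor; its denominator is constant in j
        weights.append(p_w * p_d)
    r = rng.random() * sum(weights)           # draw from the unnormalized weights
    for j, wgt in enumerate(weights):
        r -= wgt
        if r < 0:
            return j
    return T - 1
```

One full Gibbs sweep resamples z_i for every token this way, decrementing the counts for token i before the draw and incrementing them for the sampled topic afterwards.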
\n\n4.2 The topic model \n\nThe probabilistic nature of the topic model makes it easy to predict the words likely to occur in a particular context. If we have seen word w_1 in a document, then we can determine the probability that word w_2 occurs in that document by computing P(w_2 | w_1). The generative model allows documents to contain multiple topics, which is extremely important to capturing the complexity of large collections of words and computing the probability of complete documents. However, when comparing individual words it is more effective to assume that they both come from a single topic. This assumption gives us \n\nP_1(w_2 | w_1) = \sum_j P(w_2 | z = j) P(z = j | w_1) \quad (5) \n\nwhere we use Equation 4 for P(w | z) and P(z) is uniform, consistent with the symmetric prior on θ, and the subscript in P_1(w_2 | w_1) indicates the restriction to a single topic. This estimate can be computed for each sample separately, and an overall estimate obtained by averaging over samples. We computed P_1(w_2 | w_1) for the 4544 words in the norms, and then assessed the rank of the associates in the resulting distribution using the same procedure as for LSA. The results are shown in Figure 1. The median rank for the first associate was 32, with 585 of the 4544 first associates exactly correct. The probabilistic model performed better than LSA, with the improved performance becoming more apparent for the later associates. \n\n[Figure 1: bar chart of rank by associate number (1-10) for LSA - cosine, LSA - inner product, and the topic model.] Figure 1: Performance of different methods of prediction on the word association task. Error bars show one standard error, estimated with 1000 bootstrap samples. \n\n4.3 Discussion \n\nThe central problem in modeling semantic association is capturing the interaction between word frequency and similarity of word usage. 
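The single-topic conditional of Equation 5 can be sketched from smoothed topic-word counts, using Equation 4 for P(w | z) and a uniform P(z). The counts and vocabulary below are toy assumptions.

```python
# A hedged sketch of Equation 5: P_1(w2 | w1) from toy topic-word counts.
def p1(w2, w1, n, beta, W):
    # n[j][w]: count of word w in topic j; Equation 4 gives P(w | z = j)
    def p_w_given_z(w, j):
        return (n[j].get(w, 0) + beta) / (sum(n[j].values()) + W * beta)
    T = len(n)
    # P(z) uniform, so P(z = j | w1) is proportional to P(w1 | z = j)
    joint = [p_w_given_z(w1, j) * (1.0 / T) for j in range(T)]
    total = sum(joint)
    return sum(p_w_given_z(w2, j) * joint[j] for j in range(T)) / total

topics = [{"NEURON": 8, "BRAIN": 10, "CELL": 2},   # a "biology"-like topic
          {"BRAIN": 1, "STORM": 9, "RAIN": 10}]    # a "weather"-like topic
```

With these counts, the cue NEURON concentrates P(z | w_1) on the first topic, so BRAIN receives far more conditional probability than RAIN, mirroring the asymmetric, frequency-sensitive behavior discussed above.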
Word frequency is an important factor in a variety of cognitive tasks, and one reason for its importance is its predictive utility. A higher observed frequency means that a word should be predicted to occur more often. However, this effect of frequency should be tempered by the relationship between a word and its semantic context. The success of the topic model is a consequence of naturally combining frequency information with semantic similarity: when a word is very diagnostic of a small number of topics, semantic context is used in prediction. Otherwise, word frequency plays a larger role. \n\nThe effect of word frequency in the topic model can be seen in the rank-order correlation of the predicted ranks of the first associates with the ranks predicted by word frequency alone, which is ρ = 0.49. In contrast, the cosine is used in LSA because it explicitly removes the effect of word frequency, with the corresponding correlation being ρ = -0.01. The cosine is purely a measure of semantic similarity, which is useful in situations where word frequency is misleading, such as in tests of English fluency or other linguistic tasks, but not necessarily consistent with human performance. This measure reflects the origins of LSA in information retrieval, but other measures that do incorporate word frequency have been used for modeling psychological data. We consider one such measure in the next section. \n\n5 Relating LSA and the topic model \n\nThe decomposition of a word-document co-occurrence matrix provided by the topic model can be written in a matrix form similar to that of LSA. Given a word-document co-occurrence matrix F, we can convert the columns into empirical estimates of the distribution over words in each document by dividing each column by its sum. 
Calling this matrix P, the topic model approximates it with the nonnegative matrix factorization P ≈ φθ, where column j of φ gives φ^{(j)}, and column d of θ gives θ^{(d)}. The inner product matrix PP^T is proportional to the empirical estimate of the joint distribution over words P(w_1, w_2). We can write PP^T ≈ φθθ^Tφ^T, corresponding to P(w_1, w_2) = \sum_{z_1, z_2} P(w_1 | z_1) P(w_2 | z_2) P(z_1, z_2), with θθ^T an empirical estimate of P(z_1, z_2). The theoretical distribution for P(z_1, z_2) is proportional to I + α, where I is the identity matrix, so θθ^T should be close to diagonal. The single topic assumption removes the off-diagonal elements, replacing θθ^T with I to give P_1(w_1, w_2) ∝ φφ^T. \nBy comparison, LSA transforms F to a matrix G via Equation 1, then the SVD gives G ≈ UDV^T for some low-rank diagonal D. The locations of the words along the extracted dimensions are X = UD. If the column sums do not vary extensively, the empirical estimate of the joint distribution over words specified by the entries in G will be approximately P(w_1, w_2) ∝ GG^T. The properties of the SVD guarantee that XX^T, the matrix of inner products among the word vectors, is the best low-rank approximation to GG^T in terms of squared error. The transformations in Equation 1 are intended to reduce the effects of word frequency in the resulting representation, making XX^T more similar to φφ^T. \nWe used the inner product between word vectors to predict the word association norms, exactly as for the cosine. The results are shown in Figure 1. The inner product initially shows worse performance than the cosine, with a median rank of 34 for the first associate and 500 exactly correct, but performs better for later associates. The rank-order correlation with the predictions of word frequency for the first associate was ρ = 0.46, similar to that for the topic model. 
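The matrix view above can be illustrated with small, hand-picked (not fitted) factors: the key structural point is that when the columns of φ and θ are probability distributions, so are the columns of the product φθ, matching the column-normalized matrix P.

```python
import numpy as np

# Illustrative factors for P ≈ phi @ theta: column j of phi is a topic's word
# distribution, column d of theta is a document's topic distribution. The
# numbers are invented; nothing here is estimated from data.
F = np.array([[4.0, 1.0, 0.0],
              [3.0, 1.0, 1.0],
              [0.0, 2.0, 5.0],
              [1.0, 2.0, 4.0]])
P = F / F.sum(axis=0, keepdims=True)          # each column is P(w | d)

phi = np.array([[0.5, 0.0],
                [0.4, 0.1],
                [0.0, 0.5],
                [0.1, 0.4]])                  # W x T, columns sum to 1
theta = np.array([[0.9, 0.5, 0.1],
                  [0.1, 0.5, 0.9]])           # T x D, columns sum to 1
approx = phi @ theta                          # columns are also distributions
```

The single-topic joint estimate φφ^T is symmetric by construction, which is what makes it directly comparable to the inner product matrix XX^T from LSA.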
The rank-order correlation between the ranks given by the inner product and the topic model was ρ = 0.81, while the cosine and the topic model correlate at ρ = 0.69. The inner product and P_1(w_2 | w_1) in the topic model seem to give quite similar results, despite being obtained by very different procedures. This similarity is emphasized by choosing to assess the models with separate ranks for each cue word, since this measure does not discriminate between joint and conditional probabilities. While the inner product is related to the joint probability of w_1 and w_2, P_1(w_2 | w_1) is a conditional probability and thus allows reasonable comparisons of the probability of w_2 across choices of w_1, as well as having properties like asymmetry that are exhibited by word association. \n\n\"syntax\" \nHE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN \nON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE \nBE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME \nSAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED \n\n\"semantics\" \nMAP NORTH EARTH SOUTH POLE MAPS WEST LINES EAST EQUATOR AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS \nDOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE \n\nTable 2: Each column 
shows the 20 most probable words in one of the 48 \"syntactic\" states of the hidden Markov model (four columns on the left) or one of the 150 \"semantic\" topics (two columns on the right) obtained from a single sample. \n\n6 Exploring more complex generative models \n\nThe topic model, which explicitly addresses the problem of predicting words from their contexts, seems to show a closer correspondence to human word association than LSA. A major consequence of this analysis is the possibility that we may be able to gain insight into some of the associative aspects of human semantic memory by exploring statistical solutions to this prediction problem. In particular, it may be possible to develop more sophisticated generative models of language that can capture some of the important linguistic distinctions that influence our processing of words. The close relationship between LSA and the topic model makes the latter a good starting point for an exploration of semantic association, but perhaps the greatest potential of the statistical approach is that it illustrates how we might go about relaxing some of the strong assumptions made by both of these models. \n\nOne such assumption is the treatment of a document as a \"bag of words\", in which sequential information is irrelevant. Semantic information is likely to influence only a small subset of the words used in a particular context, with the majority of the words playing functional syntactic roles that are consistent across contexts. Syntax is just as important as semantics for predicting words, and may be an effective means of deciding if a word is context-dependent. In a preliminary exploration of the consequences of combining syntax and semantics in a generative model for language, we applied a simple model combining the syntactic structure of a hidden Markov model (HMM) with the semantic structure of the topic model. 
Specifically, we used a third-order HMM with 50 states in which one state marked the start or end of a sentence, 48 states each emitted words from a different multinomial distribution, and one state emitted words from a document-dependent multinomial distribution corresponding to the topic model with T = 150. We estimated parameters for this model using Gibbs sampling, integrating out the parameters for both the HMM and the topic model and sampling a state and a topic for each of the 11821091 word tokens in the corpus.³ Some of the state and topic distributions from a single sample after 1000 iterations are shown in Table 2. The states of the HMM accurately picked out many of the functional classes of English syntax, while the state corresponding to the topic model was used to capture the context-specific distributions over nouns. \n\n³This larger number is a result of including low frequency and stop words. \n\nCombining the topic model with the HMM seems to have advantages for both: no function words are absorbed into the topics, and the HMM does not need to deal with the context-specific variation in nouns. The model also seems to do a good job of generating topic-specific text - we can clamp the distribution over topics to pick out those of interest, and then use the model to generate phrases. For example, we can generate phrases on the topics of research (\"the chief wicked selection of research in the big months\", \"astronomy peered upon your scientist's door\", or \"anatomy established with principles expected in biology\"), language (\"he expressly wanted that better vowel\"), and the law (\"but the crime had been severely polite and confused\", or \"custody on enforcement rights is plentiful\"). While these phrases are somewhat nonsensical, they are certainly topical. 
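The composite generative process described in this section can be sketched in miniature: an HMM walks through syntactic states, each emitting from its own distribution, except one designated state that emits from a document-specific topic mixture. All states, words, and probabilities below are invented for illustration; the paper's model is far larger (50 states, T = 150) and its parameters are learned, not hand-set.

```python
import random

# Toy composite HMM + topic generator. "SEM" is the one topic-model state;
# clamping doc_theta to a single topic yields topic-specific output, echoing
# the phrase-generation demonstration above.
rng = random.Random(0)

states = {
    "PRON": {"emit": {"he": 0.5, "she": 0.5}, "next": ["VERB"]},
    "VERB": {"emit": {"studied": 0.6, "saw": 0.4}, "next": ["SEM"]},
    "SEM":  {"emit": "TOPIC", "next": ["END"]},
}
topics = {"science": {"biology": 0.7, "physics": 0.3},
          "medicine": {"patients": 0.6, "doctors": 0.4}}

def draw(dist):
    r, acc = rng.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r <= acc:
            return item
    return item

def generate(doc_theta):
    s, words = "PRON", []
    while s != "END":
        emit = states[s]["emit"]
        if emit == "TOPIC":
            words.append(draw(topics[draw(doc_theta)]))  # topic-model state
        else:
            words.append(draw(emit))                     # syntactic state
        s = states[s]["next"][0]
    return words
```

Calling `generate({"science": 1.0})` clamps the topic distribution, so the sentence frame stays syntactic (pronoun, verb) while the content word comes from the chosen topic.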
\n\n7 Conclusion \n\nViewing memory and categorization as systems involved in the efficient prediction of an organism's environment can provide insight into these cognitive capacities. Likewise, it is possible to learn about human semantic association by considering the problem of predicting words from their contexts. Latent Semantic Analysis addresses this problem, and provides a good account of human semantic association. Here, we have shown that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language, consistent with the hypothesis that the association between words reflects their probabilistic relationships. The great promise of this approach is the potential to explore how more sophisticated statistical models of language, such as those incorporating both syntax and semantics, might help us understand cognition. \n\nAcknowledgments \n\nThis work was generously supported by the NTT Communications Sciences Laboratories. We used Mersenne Twister code written by Shawn Cokus, and are grateful to Touchstone Applied Science Associates for making available the TASA corpus, and to Josh Tenenbaum for extensive discussions on this topic. \n\nReferences \n\n[1] J. R. Anderson. The Adaptive Character of Thought. Erlbaum, Hillsdale, NJ, 1990. \n\n[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds, Advances in Neural Information Processing Systems 14, 2002. \n\n[3] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds. Markov Chain Monte Carlo in Practice. Chapman and Hall, Suffolk, 1996. \n\n[4] T. Hofmann. Probabilistic Latent Semantic Indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999. \n\n[5] T. K. Landauer and S. T. Dumais. 
A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104:211-240, 1997. \n\n[6] M. Matsumoto and T. Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 8:3-30, 1998. \n\n[7] D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The University of South Florida word association norms. http://www.usf.edu/FreeAssociation, 1999. \n", "award": [], "sourceid": 2153, "authors": [{"given_name": "Thomas", "family_name": "Griffiths", "institution": null}, {"given_name": "Mark", "family_name": "Steyvers", "institution": null}]}