{"title": "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 914, "page_last": 920, "abstract": null, "full_text": "Learning the Similarity of Documents: \n\nAn Information-Geometric Approach to \nDocument Retrieval and Categorization \n\nThomas Hofmann \n\nDepartment of Computer Science \nBrown University, Providence, RI \n\nhofmann@cs.brown.edu, www.cs.brown.edu/people/th \n\nAbstract \n\nThe project pursued in this paper is to develop from first \ninformation-geometric principles a general method for learning \nthe similarity between text documents. Each individual docu(cid:173)\nment is modeled as a memoryless information source. Based on \na latent class decomposition of the term-document matrix, a low(cid:173)\ndimensional (curved) multinomial subfamily is learned. From this \nmodel a canonical similarity function - known as the Fisher kernel \n- is derived. Our approach can be applied for unsupervised and \nsupervised learning problems alike. This in particular covers inter(cid:173)\nesting cases where both, labeled and unlabeled data are available. \nExperiments in automated indexing and text categorization verify \nthe advantages of the proposed method. \n\n1 \n\nIntroduction \n\nThe computer-based analysis and organization of large document repositories is one \noftoday's great challenges in machine learning, a key problem being the quantitative \nassessment of document similarities. A reliable similarity measure would provide \nanswers to questions like: How similar are two text documents and which documents \nmatch a given query best? In a time, where searching in huge on-line (hyper-)text \ncollections like the World Wide Web becomes more and more popular, the relevance \nof these and related questions needs not to be further emphasized. 
\n\nThe focus of this work is on data-driven methods that learn a similarity function from a training corpus of text documents without requiring domain-specific knowledge. Since we do not assume that labels for text categories, document classes, topics, etc. are given at this stage, this is by definition an unsupervised learning problem. In fact, the general problem of learning object similarities precedes many \"classical\" unsupervised learning methods like data clustering that already presuppose the availability of a metric or similarity function. In this paper, we develop a framework for learning similarities between text documents from first principles. In doing so, we attempt to build a bridge from the foundations of statistics in information geometry [13, 1] to real-world applications in information retrieval and text learning, namely ad hoc retrieval and text categorization. Although the general methodology we develop is not limited to text documents, for the sake of concreteness we restrict our attention exclusively to this domain. \n\n2 Latent Class Decomposition \n\nMemoryless Information Sources Assume we have available a set of documents V = {d_1, ..., d_N} over some fixed vocabulary of words (or terms) W = {w_1, ..., w_M}. In an information-theoretic perspective, each document d_i can be viewed as an information source, i.e. a probability distribution over word sequences. Following common practice in information retrieval, we focus on the more restricted case where text documents are modeled on the level of single word occurrences. This means that we adopt the bag-of-words view and treat documents as memoryless information sources. [1] \n\nA. Modeling assumption: Each document is a memoryless information source. 
\n\nThis assumption implies that each document can be represented by a multinomial probability distribution P(w_j|d_i), which denotes the (unigram) probability that a generic word occurrence in document d_i will be w_j. Correspondingly, the data can be reduced to simple sufficient statistics, namely counts n(d_i, w_j) of how often a word w_j occurred in a document d_i. The rectangular N x M matrix with coefficients n(d_i, w_j) is also called the term-document matrix. \n\nLatent Class Analysis Latent class analysis is a decomposition technique for contingency tables (cf. [5, 3] and the references therein) that has been applied to language modeling [15] (\"aggregate Markov model\") and to information retrieval [7] (\"probabilistic latent semantic analysis\"). In latent class analysis, an unobserved class variable z_k ∈ Z = {z_1, ..., z_K} is associated with each observation, i.e. with each word occurrence (d_i, w_j). The joint probability distribution over V x W is a mixture model that can be parameterized in two equivalent ways: \n\nP(d_i, w_j) = Σ_{k=1}^{K} P(z_k) P(d_i|z_k) P(w_j|z_k) = P(d_i) Σ_{k=1}^{K} P(w_j|z_k) P(z_k|d_i).   (1) \n\nThe latent class model (1) introduces a conditional independence assumption, namely that d_i and w_j are independent conditioned on the state of the associated latent variable. Since the cardinality of Z is typically smaller than the number of documents/words in the collection, z_k acts as a bottleneck variable in predicting words conditioned on the context of a particular document. \n\nTo give the reader a more intuitive understanding of the latent class decomposition, we have visualized a representative subset of 16 \"factors\" from a K = 64 latent class model fitted to the Reuters21578 collection (cf. Section 4) in Figure 1. Intuitively, the learned parameters appear to be very meaningful in that they represent identifiable topics and capture the corresponding vocabulary quite well. 
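\n\nTo make the mixture (1) concrete, the following minimal sketch evaluates its second parameterization for invented toy parameters (the numbers and variable names are ours, purely for illustration); each row of the result is a document-specific multinomial obtained as a convex combination of the class-conditional word distributions.

```python
import numpy as np

# Toy sketch of the latent class mixture, Eq. (1), second parameterization:
#   P(w_j|d_i) = sum_k P(w_j|z_k) P(z_k|d_i)
# All numbers are invented; K = 2 classes, M = 3 words, N = 2 documents.
P_w_given_z = np.array([[0.7, 0.2, 0.1],   # P(w_j | z_1)
                        [0.1, 0.3, 0.6]])  # P(w_j | z_2)
P_z_given_d = np.array([[0.9, 0.1],        # P(z_k | d_1)
                        [0.2, 0.8]])       # P(z_k | d_2)

# Each document distribution is a convex combination of the K class
# conditionals, i.e. a point in the subfamily spanned by them.
P_w_given_d = P_z_given_d @ P_w_given_z
```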
\n\nBy using the latent class decomposition to model a collection of memoryless sources, we implicitly assume that the overall collection will help in estimating parameters for individual sources, an assumption which has been validated in our experiments. \n\nB. Modeling assumption: Parameters for a collection of memoryless information sources are estimated by latent class decomposition. \n\nParameter Estimation The latent class model has an important geometrical interpretation: the parameters φ_k, with φ_kj = P(w_j|z_k), define a low-dimensional subfamily of the multinomial family, S(φ) = {π ∈ [0;1]^M : π_j = Σ_k ψ_k φ_kj for some ψ ∈ [0;1]^K with Σ_k ψ_k = 1}, i.e. all multinomials π that can be obtained as convex combinations of the set of \"basis\" vectors {φ_k : 1 ≤ k ≤ K}. For given φ-parameters, each ψ^i, with ψ^i_k = P(z_k|d_i), defines a unique multinomial distribution π^i ∈ S(φ). Since S(φ) defines a submanifold of the multinomial simplex, it corresponds to a curved exponential subfamily. [2] We would like to emphasize that we propose to learn both the parameters within the family (the ψ's or mixing proportions P(z_k|d_i)) and the parameters that define the subfamily (the φ's or class-conditionals P(w_j|z_k)). \n\n[1] Extensions to the more general case are possible, but beyond the scope of this paper. \n\n[Figure 1: 16 columns of ten words each; word lists omitted here, see caption.] \n\nFigure 1: 16 selected factors from a 64-factor decomposition of the Reuters21578 collection. The displayed terms are the 10 most probable words in the class-conditional distribution P(w_j|z_k) for 16 selected states z_k after the exclusion of stop words. \n\nThe standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm. 
In the E-step one computes posterior probabilities for the latent class variables, \n\nP(z_k|d_i, w_j) = P(z_k) P(d_i|z_k) P(w_j|z_k) / Σ_l P(z_l) P(d_i|z_l) P(w_j|z_l).   (2) \n\nThe M-step formulae can be written compactly as \n\n{P(d_i|z_k), P(w_j|z_k), P(z_k)} ∝ Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) P(z_k|d_n, w_m) x {δ_in, δ_jm, 1},   (3) \n\nwhere δ denotes the Kronecker delta. \n\nRelated Models As demonstrated in [7], the latent class model can be viewed as a probabilistic variant of Latent Semantic Analysis [2], a dimension reduction technique based on Singular Value Decomposition. It is also closely related to the non-negative matrix decomposition discussed in [12], which uses a Poisson sampling model and was motivated by imposing non-negativity constraints on a decomposition by PCA. The relationship of the latent class model to clustering models like distributional clustering [14] has been investigated in [8]. [6] presents yet another approach to dimension reduction for multinomials, based on spherical models - a different type of curved exponential subfamily than the one presented here, which is affine in the mean-value parameterization. \n\n[2] Notice that graphical models with latent variables are in general stratified exponential families [4], yet in our case the geometry is simpler. The geometrical view also illustrates the well-known identifiability problem in latent class analysis; the interested reader is referred to [3]. As a practical remedy, we have used a Bayesian approach with conjugate (Dirichlet) prior distributions over all multinomials, which for the sake of clarity is not described in this paper since it is rather technical, though straightforward. 
\n\n3 Fisher Kernel and Information Geometry \n\nThe Fisher Kernel We follow the work of [9] to derive kernel functions (and hence similarity functions) from generative data models. This approach yields a uniquely defined and intrinsic (i.e. coordinate-invariant) kernel, called the Fisher kernel. One important implication is that yardsticks used for statistical models carry over to the selection of appropriate similarity functions. In spite of the purely unsupervised manner in which a Fisher kernel can be learned, it is also very useful in supervised learning, where it provides a way to take advantage of additional unlabeled data. This is important in text learning, where digital document databases and the World Wide Web offer a huge background text repository. \n\nAs a starting point, we partition the data log-likelihood into contributions from the various documents. The average log-probability of a document d_i, i.e. the probability of all the word occurrences in d_i normalized by document length, is given by \n\nl(d_i) = Σ_{j=1}^{M} P̂(w_j|d_i) log Σ_{k=1}^{K} P(w_j|z_k) P(z_k|d_i),  where  P̂(w_j|d_i) = n(d_i, w_j) / Σ_m n(d_i, w_m),   (4) \n\nwhich is, up to constants, the negative Kullback-Leibler divergence between the empirical distribution P̂(w_j|d_i) and the model distribution represented by (1). In order to derive the Fisher kernel, we have to compute the Fisher scores u(d_i; θ), i.e. the gradient of l(d_i) with respect to θ, as well as the Fisher information I(θ) in some parameterization θ [13]. The Fisher kernel at θ̂ is then given by [9] \n\nK(d_i, d_n) = u(d_i; θ̂)^T I(θ̂)^{-1} u(d_n; θ̂).   (5) \n\nComputational Considerations For computational reasons we propose to approximate the (inverse) information matrix by the identity matrix, thereby making additional assumptions about information orthogonality. 
More specifically, we use a variance-stabilizing parameterization for multinomials - the square-root parameterization - which yields an isometric embedding of multinomial families on the positive part of a hypersphere [11]. In this parameterization, the above approximation is exact for the multinomial family (disregarding the normalization constraint). We conjecture that it also provides a reasonable approximation in the case of the subfamily defined by the latent class model. \n\nC. Simplifying assumption: The Fisher information in the square-root parameterization can be approximated by the identity matrix. \n\nInterpretation of Results Instead of going through the details of the derivation, which is postponed to the end of this section, it is revealing to relate the results back to our main problem of defining a similarity function between text documents. We will have a closer look at the two contributions resulting from the different sets of parameters. The contribution which stems from the (square-root transformed) parameters P(z_k) is (in a simplified version) given by \n\nK(d_i, d_n) = Σ_k P(z_k|d_i) P(z_k|d_n) / P(z_k).   (6) \n\nK is a weighted inner product in the low-dimensional factor representation of the documents by mixing weights P(z_k|d_i). This part of the kernel thus computes a \"topical\" overlap between documents and is thereby able to capture synonyms, i.e. words with an identical or similar meaning, as well as words referring to the same topic. Notice that it is not required that d_i and d_n actually have (many) terms in common in order to get a high similarity score. \nThe contribution due to the parameters P(w_j|z_k) is of a very different type. 
Again using the approximation of the Fisher matrix, we arrive at the inner product \n\nK~(d_i, d_n) = Σ_j P̂(w_j|d_i) P̂(w_j|d_n) Σ_k P(z_k|d_i, w_j) P(z_k|d_n, w_j) / P(w_j|z_k).   (7) \n\nK~ also has a very appealing interpretation: it essentially computes an inner product between the empirical distributions of d_i and d_n, a scheme that is very popular in the context of information retrieval in the vector space model. However, common words contribute only if they are explained by the same factor(s), i.e., if the respective posterior probabilities overlap. This makes it possible to capture words with multiple meanings, so-called polysemes. For example, in the factors displayed in Figure 1 the term \"president\" occurs twice (as the president of a company and as the president of the US). Depending on the document the word occurs in, the posterior probability will be high for one of the factors, but typically not for both. Hence, the same term used in different contexts with different meanings will generally not increase the similarity between documents, a distinction that is absent in the naive inner product, which corresponds to the degenerate case of K = 1. \n\nSince the choice of K determines the coarseness of the identified \"topics\" and different resolution levels possibly contribute useful information, we have combined models by a simple additive combination of the derived inner products. This combination scheme has experimentally proven to be very effective and robust. \n\nD. Modeling assumption: Similarities derived from latent class decompositions at different levels of resolution are additively combined. 
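\n\nGiven fitted parameters, the two kernel contributions (6) and (7) can be computed directly. The following sketch is our own illustration; the function name, the toy numbers, and the eps guard are not from the paper.

```python
import numpy as np

def fisher_kernel_parts(Phat, P_w_z, P_z_d, P_z, eps=1e-12):
    # Sketch of the two Fisher-kernel contributions, Eqs. (6) and (7),
    # computed from fitted latent class parameters.
    #   Phat  : (N, M) empirical word distributions per document
    #   P_w_z : (K, M) class-conditionals P(w_j | z_k)
    #   P_z_d : (N, K) mixing proportions P(z_k | d_i)
    #   P_z   : (K,)   class priors P(z_k)
    # Word-level posteriors P(z_k | d_i, w_j), normalized over k.
    post = P_z_d[:, :, None] * P_w_z[None, :, :]           # (N, K, M)
    post = post / (post.sum(axis=1, keepdims=True) + eps)
    # Eq. (6): topical overlap sum_k P(z_k|d_i) P(z_k|d_n) / P(z_k).
    K1 = (P_z_d / P_z) @ P_z_d.T
    # Eq. (7): word overlap weighted by posterior agreement, 1/P(w_j|z_k).
    A = Phat[:, None, :] * post / (P_w_z[None, :, :] + eps)
    B = Phat[:, None, :] * post
    K2 = np.einsum('ikj,nkj->in', A, B)
    return K1, K2

# Tiny example with 2 documents, 2 classes, 3 words (invented numbers).
Phat = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5]])
P_w_z = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.4, 0.6]])
P_z_d = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
P_z = np.array([0.5, 0.5])
K1, K2 = fisher_kernel_parts(Phat, P_w_z, P_z_d, P_z)
```

Note that K1 rewards documents with similar mixing proportions even when they share no terms, while K2 rewards shared terms only where the word-level posteriors agree.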
\n\nIn summary, the emergence of important language phenomena like synonymy and polysemy from information-geometric principles is very satisfying and demonstrates, in our opinion, that interesting similarity functions can be rigorously derived without specific domain knowledge, based on a few explicitly stated assumptions (A-D). \n\nTechnical Derivation Define ρ_jk = 2 sqrt(P(w_j|z_k)). Then \n\n∂l(d_i)/∂ρ_jk = ∂l(d_i)/∂P(w_j|z_k) x ∂P(w_j|z_k)/∂ρ_jk = sqrt(P(w_j|z_k)) P̂(w_j|d_i) P(z_k|d_i) / P(w_j|d_i) = P̂(w_j|d_i) P(z_k|d_i, w_j) / sqrt(P(w_j|z_k)). \n\nSimilarly we define ρ_k = 2 sqrt(P(z_k)). Applying Bayes' rule to substitute P(z_k|d_i) in l(d_i) (i.e. P(z_k|d_i) = P(z_k) P(d_i|z_k) / P(d_i)) yields \n\n∂l(d_i)/∂ρ_k = ∂l(d_i)/∂P(z_k) x ∂P(z_k)/∂ρ_k = sqrt(P(z_k)) (P(d_i|z_k)/P(d_i)) Σ_j (P̂(w_j|d_i)/P(w_j|d_i)) P(w_j|z_k) = Σ_j P̂(w_j|d_i) P(z_k|d_i, w_j) / sqrt(P(z_k)) ≈ P(z_k|d_i) / sqrt(P(z_k)). \n\nThe last (optional) approximation step makes sense whenever P̂(w_j|d_i) ≈ P(w_j|d_i). Notice that we have ignored the normalization constraints, which would yield an additional term that is constant for each multinomial. Experimentally, we have observed no deterioration in performance from making these additional simplifications. \n\n         Medline  Cranfield  CACM  CISI \nVSM      44.3     29.9       17.9  12.7 \nVSM++    67.2     37.9       27.5  20.3 \n\nTable 1: Average precision results for the vector space baseline method (VSM) and the Fisher kernel approach (VSM++) for 4 standard test collections: Medline, Cranfield, CACM, and CISI. 
\n\n                    earn  acq   money  grain  crude  average  improv. \n20x sub   SVM       5.51  3.25  7.67   2.06   2.50   4.20     - \n          SVM++     4.56  2.08  5.37   1.71   1.53   3.05     +27.4% \n          kNN       5.91  3.24  9.64   2.54   2.42   4.75     - \n          kNN++     5.05  3.11  7.80   2.35   1.95   4.05     +14.7% \n10x sub   SVM       4.88  2.38  5.54   1.71   1.88   3.27     - \n          SVM++     4.11  2.08  4.84   1.42   1.45   2.78     +15.0% \n          kNN       5.51  2.64  9.23   2.55   2.42   4.47     - \n          kNN++     4.94  2.42  7.47   2.28   1.88   3.79     +15.2% \n5x sub    SVM       4.09  2.10  4.40   1.32   1.46   2.67     - \n          SVM++     3.64  1.78  4.15   0.98   1.19   2.35     +12.1% \n          kNN       5.13  2.27  8.70   2.40   2.23   4.14     - \n          kNN++     4.74  2.22  6.99   2.18   1.74   3.57     +13.7% \nall data  SVM       2.92  1.20  3.21   0.77   0.92   1.81     - \n(10x cv)  SVM++     2.98  1.21  3.15   0.76   0.86   1.79     +0.6% \n          kNN       4.17  1.78  6.69   1.73   1.42   3.16     - \n          kNN++     4.07  1.73  5.34   1.58   1.18   2.78     +12.0% \n\nTable 2: Classification errors for k-nearest neighbors (kNN) and SVMs (SVM) with the naive kernel and with the Fisher kernel (++, derived from K = 1 and K = 64 models) on the 5 most frequent categories of the Reuters21578 corpus (earn, acq, money-fx, grain, and crude) at different subsampling levels. \n\n4 Experimental Results \n\nWe have applied the proposed method to ad hoc information retrieval, where the goal is to return a list of documents, ranked with respect to a given query. This obviously involves computing similarities between documents and queries. In a follow-up series of experiments to the ones reported in [7] - where kernels K(d_i, d_n) = Σ_k P(z_k|d_i) P(z_k|d_n) and K~(d_i, d_n) = Σ_j P(w_j|d_i) P(w_j|d_n) had been proposed in an ad hoc manner - we have been able to obtain a rigorous theoretical justification as well as some additional improvements. 
Average precision-recall values for four \nstandard test collections reported in Table 1 show that substantial performance \ngains can be achieved with the help of a generative model (cf. [7] for details on the \nconducted experiments). \n\nTo demonstrate the utility of our method for supervised learning problems, we have \napplied it to text categorization, using a standard data set in the evaluation, the \nReuters21578 collections of news stories. We have tried to boost the performance \nof two classifiers that are known to be highly competitive for text categorization: \nthe k- nearest neighbor method and Support Vector Machines (SVMs) with a linear \nkernel [10]. Since we are particularly interested in a setting, where the generative \nmodel is trained on a larger corpus of unlabeled data, we have run experiments where \nthe classifier was only trained on a subsample (at subsampling factors 20x,10x,5x). \nThe results are summarized in Table 2. Free parameters of the base classifiers have \nbeen optimized in extensive simulations with held-out data. The results indicate \n\n\f920 \n\nT. Hofmann \n\nthat substantial performance gains can be achieved over the standard k-nearest \nneighbor method at all subsampling levels. For SVMs the gain is huge on the \nsubsampled data collections, but insignificant for SVMs trained on all data. This \nseems to indicate that the generative model does not provide any extra information, \nif the SVM classifier is trained on the same data. However, notice that many \ninteresting applications in text categorization operate in the small sample limit \nwith lots of unlabeled data. Examples include the definition of personalized news \ncategories by just a few example, the classification and/or filtering of email, on-line \ntopic spotting and tracking, and many more. \n\n5 Conclusion \nWe have presented an approach to learn the similarity of text documents from \nfirst principles. 
Based on a latent class model, we have been able to derive a similarity function that is theoretically satisfying, intuitively appealing, and shows substantial performance gains in the conducted experiments. Finally, we have made a contribution to the relationship between unsupervised and supervised learning, as initiated in [9], by showing that generative models can help to exploit unlabeled data for classification problems. \n\nReferences \n\n[1] S. Amari. Differential-Geometrical Methods in Statistics. Springer-Verlag, Berlin, New York, 1985. \n\n[2] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990. \n\n[3] M. J. Evans, Z. Gilula, and I. Guttman. Latent class analysis of two-way contingency tables by Bayesian methods. Biometrika, 76(3):557-563, 1989. \n\n[4] D. Geiger, D. Heckerman, H. King, and C. Meek. Stratified exponential families: Graphical models and model selection. Technical Report MSR-TR-98-31, Microsoft Research, 1998. \n\n[5] Z. Gilula and S. J. Haberman. Canonical analysis of contingency tables by maximum likelihood. Journal of the American Statistical Association, 81(395):780-788, 1986. \n\n[6] A. Gous. Exponential and Spherical Subfamily Models. PhD thesis, Stanford University, Statistics Department, 1998. \n\n[7] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR), pages 50-57, 1999. \n\n[8] T. Hofmann, J. Puzicha, and M. I. Jordan. Unsupervised learning from dyadic data. In Advances in Neural Information Processing Systems 11. MIT Press, 1999. \n\n[9] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11. MIT Press, 1999. \n\n[10] T. Joachims. 
Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML), 1998. \n\n[11] R. E. Kass and P. W. Vos. Geometrical Foundations of Asymptotic Inference. Wiley, New York, 1997. \n\n[12] D. Lee and S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999. \n\n[13] M. K. Murray and J. W. Rice. Differential Geometry and Statistics. Chapman & Hall, London, New York, 1993. \n\n[14] F. C. N. Pereira, N. Z. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the ACL, pages 183-190, 1993. \n\n[15] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the 2nd International Conference on Empirical Methods in Natural Language Processing, 1997. \n", "award": [], "sourceid": 1654, "authors": [{"given_name": "Thomas", "family_name": "Hofmann", "institution": null}]}