{"title": "The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity", "book": "Advances in Neural Information Processing Systems", "page_first": 430, "page_last": 436, "abstract": null, "full_text": "The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity \n\nDavid Cohn \nBurning Glass Technologies \n201 South Craig St, Suite 2W \nPittsburgh, PA 15213 \ndavid.cohn@burning-glass.com \n\nThomas Hofmann \nDepartment of Computer Science \nBrown University \nProvidence, RI 02912 \nth@cs.brown.edu \n\nAbstract \n\nWe describe a joint probabilistic model of the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows us to identify the principal topics of a collection as well as authoritative documents within those topics. Furthermore, the relationships between topics are mapped out in order to build a predictive model of link content. Among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis. \n\n1 Introduction \n\nNo text, no paper, no book can be isolated from the all-embracing corpus of documents it is embedded in. Ideas, thoughts, and work described in a document inevitably relate to and build upon previously published material.1 Traditionally, this interdependency has been represented by citations, which allow authors to make explicit references to related documents. More recently, a vast number of documents have been \"published\" electronically on the world wide web; here, interdependencies between documents take the form of hyperlinks, which allow instant access to the referenced material. 
We would like some way of modeling these interdependencies, to understand the structure implicit in the contents and connections of a given document base without resorting to manual clustering, classification and ranking of documents. \n\nThe main goal of this paper is to present a joint probabilistic model of document content and connectivity, i.e., a parameterized stochastic process which mimics the generation of documents as part of a larger collection, and which can make accurate predictions about the existence of hyperlinks and citations. More precisely, we present an extension of our work on Probabilistic Latent Semantic Analysis (PLSA) [4, 7] and Probabilistic HITS (PHITS) [3, 8] and propose a mixture model to perform a simultaneous decomposition of the contingency tables associated with word occurrences and citations/links into \"topic\" factors. Such a model can be extremely useful in many applications, a few of which are: \n\n1 Although the weakness of our memory might make us forget this at times. \n\n\u2022 Identifying topics and common subjects covered by documents. Representing documents in a low-dimensional space can aid understanding of the relations between documents and the topics they cover. Combining evidence from terms and links yields potentially more meaningful and stable factors and better predictions. \n\n\u2022 Identifying authoritative documents on a given topic. The authority of a document is correlated with how frequently it is cited, and by whom. Identifying topic-specific authorities is a key problem for search engines [2]. \n\n\u2022 Predictive navigation. By predicting what content might be found \"behind\" a link, a content/connectivity model directly supports navigation in a document collection, either through interaction with human users or for intelligent spidering. \n\n\u2022 Web authoring support. 
Predictions about links based on document contents can support authoring and maintenance of hypertext documents, e.g., by (semi-)automatically improving and updating link structures. \n\nThese applications address facets of one of the most pressing challenges of the \"information age\": how to locate useful information in a semi-structured environment like the world wide web. Much of this difficulty, which has led to the emergence of an entire new industry, is due to the impoverished explicit structure of the web as a whole. Manually created hyperlinks and citations are limited in scope - the annotator can only add links and pointers to other documents they are aware of and have access to. Moreover, these links are static; once the annotator creates a link between documents, it is unchanging. If a different, more relevant document appears (or if the cited document disappears), the link may not get updated appropriately. These and other deficiencies make the web inherently \"noisy\" - links between relevant documents may not exist, and existing links might sometimes be more or less arbitrary. Our model is a step towards a technology that will allow us to dynamically infer more reliable inter-document structure from the impoverished structure we observe. \n\nIn the following section, we first review PLSA and PHITS. In Section 3, we show how these two models can be combined into a joint probabilistic term-citation model. Section 4 describes some of the applications of this model, along with preliminary experiments in several areas. In Section 5 we consider future directions and related research. \n\n2 PLSA and PHITS \n\nPLSA [7] is a statistical variant of Latent Semantic Analysis (LSA) [4] that builds a factored multinomial model based on the assumption of an underlying document generation process. 
The starting point of (P)LSA is the term-document matrix N of word counts, i.e., N_ij denotes how often a term (single word or phrase) t_i occurs in document d_j. In LSA, N is decomposed by an SVD and factors are identified with the left/right principal eigenvectors. In contrast, PLSA performs a probabilistic decomposition which is closely related to the non-negative matrix decomposition presented in [9]. Each factor is identified with a state z_k (1 <= k <= K) of a latent variable with associated relative frequency estimates P(t_i|z_k) for each term in the corpus. A document d_j is then represented as a convex combination of factors with mixing weights P(z_k|d_j), i.e., the predictive probabilities for terms in a particular document are constrained to be of the functional form P(t_i|d_j) = \u03a3_k P(t_i|z_k) P(z_k|d_j), with non-negative probabilities and two sets of normalization constraints: \u03a3_i P(t_i|z_k) = 1 for all k and \u03a3_k P(z_k|d_j) = 1 for all j. \n\nBoth the factors and the document-specific mixing weights are learned by maximizing the likelihood of the observed term frequencies. More formally, PLSA aims at maximizing L = \u03a3_{i,j} N_ij log \u03a3_k P(t_i|z_k) P(z_k|d_j). Since the factors z_k can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence), the Expectation-Maximization algorithm can be applied to find a local maximum of L. \n\nPLSA has been demonstrated to be effective for ad hoc information retrieval, language modeling and clustering. Empirically, different factors usually capture distinct \"topics\" of a document collection; by clustering documents according to their dominant factors, useful topic-specific document clusters often emerge (using the Gaussian factors of LSA, this approach is known as \"spectral clustering\"). \n\nIt is important to distinguish the factored model used here from standard probabilistic mixture models. 
In a mixture model, each object (such as a document) is usually assumed to come from one of a set of latent sources (e.g., a document is either from z_1 or z_2). Credit for the object may be distributed among several sources because of ambiguity, but the model insists that only one of the candidate sources is the true origin of the object. In contrast, a factored model assumes that each object comes from a mixture of sources - without ambiguity, it can assert that a document is half z_1 and half z_2. This is because the latent variables are associated with each observation and not with each document (set of observations). \n\nPHITS [3] performs a probabilistic factoring of the document citations used for bibliometric analysis. Bibliometrics attempts to identify topics in a document collection, as well as influential authors and papers on those topics, based on patterns in citation frequency. This analysis has traditionally been applied to references in printed literature, but the same techniques have proven successful in analyzing hyperlink structure on the world wide web [8]. In traditional bibliometrics, one begins with a matrix A of document-citation pairs. Entry A_ij is nonzero if and only if document d_i is cited by document d_j or, equivalently, if d_j contains a hyperlink to d_i.2 The principal eigenvectors of AA' are then extracted, with each eigenvector corresponding to a \"community\" of roughly similar citation patterns. The coefficient of a document in one of these eigenvectors is interpreted as the \"authority\" of that document within the community - how likely it is to be cited within that community. A document's coefficient in the principal eigenvectors of A'A is interpreted as its \"hub\" value in the community - how many authoritative documents it cites within the community. 
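The eigenvector analysis just described can be sketched in a few lines of NumPy. This is an illustrative toy example, not the paper's code: the citation matrix A and the power-iteration helper are our own assumptions.

```python
import numpy as np

# Toy citation matrix (our own example): A[i, j] > 0 iff document d_i
# is cited by document d_j (equivalently, d_j links to d_i).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

def principal_eigenvector(M, iters=100):
    """Power iteration for the principal eigenvector of a symmetric,
    non-negative matrix, normalized to unit length."""
    v = np.ones(M.shape[0]) / np.sqrt(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

# Coefficients in the principal eigenvector of AA' give authority scores;
# coefficients in the principal eigenvector of A'A give hub values.
authority = principal_eigenvector(A @ A.T)
hub = principal_eigenvector(A.T @ A)
```

Power iteration suffices here because AA' and A'A are symmetric and non-negative; a library eigensolver would serve equally well. In the toy matrix, d_3 is never cited, so its authority is zero, and d_0 cites nothing, so its hub value is zero.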
\nIn PHITS, a probabilistic model replaces the eigenvector analysis, yielding a model that has clear statistical interpretations. PHITS is mathematically identical to PLSA, with one distinction: instead of modeling the citations contained within a document (corresponding to PLSA's modeling of terms in a document), PHITS models \"inlinks,\" the citations to a document. It substitutes a citation-source probability estimate P(c_l|z_k) for PLSA's term probability estimate. As with PLSA and spectral clustering, the principal factors of the model are interpreted as indicating the principal citation communities (and, by inference, the principal topics). For a given factor/topic z_k, the probability that a document is cited, P(d_j|z_k), is interpreted as the document's authority with respect to that topic. \n\n3 A Joint Probabilistic Model for Content and Connectivity \n\nLinked and hyperlinked documents are generally composed of terms and citations; as such, both term-based PLSA and citation-based PHITS analyses are applicable. Rather than applying each separately, it is reasonable to merge the two analyses into a joint probabilistic model, explaining terms and citations in terms of a common set of underlying factors. \n\nSince both PLSA and PHITS are based on a similar decomposition, one can define the following joint model for predicting citations/links and terms in documents: \n\nP(t_i|d_j) = \u03a3_k P(t_i|z_k) P(z_k|d_j),   P(c_l|d_j) = \u03a3_k P(c_l|z_k) P(z_k|d_j).   (1) \n\nNotice that both decompositions share the same document-specific mixing proportions P(z_k|d_j). This couples the conditional probabilities for terms and citations: each \"topic\" has some probability P(c_l|z_k) of linking to document d_l as well as some probability P(t_i|z_k) of containing an occurrence of term t_i. \n\n2 In fact, since multiple citations/links may exist, we treat A_ij as a count variable. 
The advantage of this joint modeling approach is that it integrates content and link information in a principled manner. Since the mixing proportions are shared, the learned decomposition must be consistent with both content and link statistics. In particular, this coupling allows the model to take evidence about link structure into account when making predictions about document content, and vice versa. Once a decomposition is learned, the model may be used to address questions like \"What words are likely to be found in a document with this link structure?\" or \"What link structure is likely to go with this document?\" by simple probabilistic inference. \n\nThe relative importance one assigns to predicting terms and links will depend on the specific application. In general, we propose maximizing the following (normalized) log-likelihood function with a relative weight \u03b1: \n\nL = \u03a3_j [ \u03b1 \u03a3_i (N_ij / \u03a3_i' N_i'j) log \u03a3_k P(t_i|z_k) P(z_k|d_j) + (1 - \u03b1) \u03a3_l (A_lj / \u03a3_l' A_l'j) log \u03a3_k P(c_l|z_k) P(z_k|d_j) ]   (2) \n\nThe normalization by term/citation counts ensures that each document is given the same weight in the decomposition, regardless of the number of observations associated with it. \n\nFollowing the EM approach, it is straightforward to derive a set of re-estimation equations. For the E-step, one gets formulae for the posterior probabilities of the latent variables associated with each observation:3 \n\nP(z_k|d_j,t_i) = P(t_i|z_k) P(z_k|d_j) / \u03a3_k' P(t_i|z_k') P(z_k'|d_j),   P(z_k|d_j,c_l) = P(c_l|z_k) P(z_k|d_j) / \u03a3_k' P(c_l|z_k') P(z_k'|d_j)   (3) \n\n4 Experiments \n\nIn the introduction, we described many potential applications of the joint probabilistic model. Some, like classification, are simply extensions of the individual PHITS and PLSA models, relying on the increased power of the joint model to improve their performance. Others, such as intelligent web crawling, are unique to the joint model and require its simultaneous modeling of a document's contents and connections. 
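The re-estimation procedure for the weighted likelihood of Equation 2 can be sketched as follows. This is a rough, hypothetical implementation, not the authors' code: the toy counts, variable names, and 1e-12 smoothing floors are our own assumptions, and the tempering used in the paper's experiments is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def _norm_cols(M):
    """Normalize each column of M to sum to one."""
    return M / M.sum(axis=0, keepdims=True)

def joint_model_em(N, A, K, alpha=0.5, iters=50):
    """EM sketch for the joint content/link model of Equation 2.
    N: term-document counts, A: citation-document counts."""
    nT, nD = N.shape
    nC = A.shape[0]
    # Per-document normalization gives every document equal weight, as in
    # Equation 2; alpha trades off term evidence against citation evidence.
    Nn = alpha * N / np.maximum(N.sum(axis=0, keepdims=True), 1e-12)
    An = (1.0 - alpha) * A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
    Pt = _norm_cols(rng.random((nT, K)))  # P(t_i | z_k)
    Pc = _norm_cols(rng.random((nC, K)))  # P(c_l | z_k)
    Pz = _norm_cols(rng.random((K, nD)))  # P(z_k | d_j)
    for _ in range(iters):
        # E-step: posterior over factors for each (term, doc) / (cite, doc) pair
        qt = Pt[:, :, None] * Pz[None, :, :]            # shape (nT, K, nD)
        qt /= np.maximum(qt.sum(axis=1, keepdims=True), 1e-12)
        qc = Pc[:, :, None] * Pz[None, :, :]            # shape (nC, K, nD)
        qc /= np.maximum(qc.sum(axis=1, keepdims=True), 1e-12)
        # M-step: re-estimate parameters from expected normalized counts
        Pt = _norm_cols(np.maximum((Nn[:, None, :] * qt).sum(axis=2), 1e-12))
        Pc = _norm_cols(np.maximum((An[:, None, :] * qc).sum(axis=2), 1e-12))
        Pz = _norm_cols(np.maximum((Nn[:, None, :] * qt).sum(axis=0)
                                   + (An[:, None, :] * qc).sum(axis=0), 1e-12))
    return Pt, Pc, Pz

# Toy block-structured data (hypothetical): docs 0-1 and docs 2-3 form
# two topics in both their terms and their citations.
N = np.array([[4, 3, 0, 0],
              [5, 2, 0, 0],
              [0, 0, 3, 4],
              [0, 0, 2, 5]], dtype=float)
A = np.array([[0, 2, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 3],
              [0, 0, 1, 0]], dtype=float)
Pt, Pc, Pz = joint_model_em(N, A, K=2)
```

Setting alpha to 1 recovers plain PLSA on the term counts, and alpha of 0 recovers PHITS on the citation counts, matching the endpoints examined in the experiments.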
\n\nIn this section, we first describe experiments verifying that the joint model does yield improved classification compared with the individual models. We then describe a quantity called \"reference flow\" which can be computed from the joint model, and demonstrate its use in guiding a web crawler to pages of interest. \n\n3 Our experiments used a tempered version of Equation 3 to minimize overfitting; see [7] for details. \n\nFigure 1: Classification accuracy on the WebKB and Cora data sets for PHITS (\u03b1 = 0), PLSA (\u03b1 = 1) and the joint model (0 < \u03b1 < 1). \n\nWe used two data sets in our experiments. The WebKB data set [11] consists of approximately 6000 web pages from computer science departments, classified by school and category (student, course, faculty, etc.). The Cora data set [10] consists of the abstracts and references of approximately 34,000 computer science research papers; of these, we used the approximately 2000 papers categorized into one of seven subfields of machine learning. \n\n4.1 Classification \n\nAlthough the joint probabilistic model performs unsupervised learning, there are a number of ways it may be used for classification. One way is to associate each document with its dominant factor, in a form of spectral clustering. Each factor is then given the label of the dominant class among its associated documents. Test documents are judged by whether their dominant factor shares their label. \n\nAnother approach to classification (but one that forgoes clustering) is a factored nearest neighbor approach. 
Test documents are judged against the label of their nearest neighbor, but the \"nearest\" neighbor is determined by the cosines of their projections in factor space. This is the method we used for our experiments. \n\nFor the Cora and WebKB data, we used seven factors and six factors respectively, arbitrarily selecting the number to correspond to the number of human-derived classes. We compared the power of the joint model with that of the individual models by varying \u03b1 from zero to one, with the lower and upper extremes corresponding to PHITS and PLSA, respectively. For each value of \u03b1, a randomly selected 15% of the data were reserved as a test set. The models were tempered (as per [7]) with a lower limit of \u03b2 = 0.8, decreasing \u03b2 by a factor of 0.9 each time the data likelihood stopped increasing. \n\nFigure 1 illustrates several results. First, the accuracy of the joint model (where \u03b1 is neither 0 nor 1) is greater than that of either model in isolation, indicating that the contents and link structure of a document collection do indeed corroborate each other. Second, the increase in accuracy is robust across a wide range of mixing proportions. \n\n4.2 Reference Flow \n\nThe previous subsection demonstrated how the joint model amplifies abilities found in the individual models. But the joint model also provides features found in neither of its progenitors. \n\nA document d may be thought of as occupying a point z = {P(z_1|d), ..., P(z_K|d)} in the joint model's space of factor mixtures. The terms in d act as \"signposts\" describing z, and the links act as directed connections between that point and others. Together, they provide a reference flow, indicating a referential connection between one topic and another. 
This reference flow exists between arbitrary points in the factor space, even in the absence of documents that map directly to those points. \n\nConsider a reference from document d_i to document d_j, and two points in factor space z_m and z_n, not particularly associated with d_i or d_j. Our model allows us to compute P(d_i|z_m) and P(d_j|z_n), the probabilities that the combinations of factors at z_m and z_n are responsible for d_i and d_j respectively. Their product P(d_i|z_m) P(d_j|z_n) is then the probability that the observed link represents a reference between those two points in factor space. By integrating over all links in the corpus, we can compute f_mn = \u03a3_{i,j: A_ij \u2260 0} P(d_i|z_m) P(d_j|z_n), an unnormalized \"reference flow\" between z_m and z_n. Figure 2 shows the principal reference flow between several topics in the WebKB archive. \n\nFigure 2: Principal reference flow between the primary topics identified in the examined subset of the WebKB archive. \n\n4.3 Intelligent Web Crawling with Reference Flow \n\nLet us suppose that we want to find new web pages on a certain topic, described by a set of words composed into a target pseudodocument d_t. We can project d_t into our model to identify the point z_t in factor space that represents that topic. Now, when we explore web pages, we want to follow links that will lead us to new documents that also project to z_t. \n\nTo do so, we can use reference flow. Consider a web page d_s (or a section of a web page4). Although we don't know where its links point, we do know what words it contains. We can project them as a pseudodocument to find z_s, the point in factor space that the page/section occupies, prior to any information about its links. 
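Restricting the points z_m and z_n to the model's K pure factors, and assuming a uniform document prior so that P(d|z_k) can be obtained from P(z_k|d) by Bayes' rule, the reference-flow sum f_mn reduces to a pair of matrix products. A hypothetical sketch with toy values of our own:

```python
import numpy as np

def reference_flow(Pz_d, A):
    """Unnormalized reference flow f_mn between pure factors.
    Pz_d: (K, nD) matrix of P(z_k | d_j); A: citation count matrix whose
    entry A[i, j] counts links between d_i and d_j. Under a uniform
    document prior (our assumption), P(d_j | z_k) is Pz_d renormalized
    over documents, and f_mn = sum_{i,j} A_ij P(d_i|z_m) P(d_j|z_n).
    """
    Pd_z = Pz_d / np.maximum(Pz_d.sum(axis=1, keepdims=True), 1e-12)
    return Pd_z @ A @ Pd_z.T

# Toy example (hypothetical): docs 0-1 lie near factor 0, docs 2-3 near
# factor 1; the single observed link joins the pair (d_0, d_2).
Pz_d = np.array([[0.9, 0.9, 0.1, 0.1],
                 [0.1, 0.1, 0.9, 0.9]])
A = np.zeros((4, 4))
A[0, 2] = 1.0
flow = reference_flow(Pz_d, A)
```

Because A is treated as a count matrix (per footnote 2), links that appear multiple times contribute proportionally more to the flow.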
\nWe can then use our model to compute the reference flow f_st, indicating the (unnormalized) probability that a document at z_s would contain a link to one at z_t. \n\nAs a greedy solution, we could simply follow links in documents or sections that have the highest reference flow toward the target topic. Or, if computation is no barrier, we could (in theory) use reference flow as state transition probabilities and find an optimal link to follow by treating the system as a continuous-state Markov decision process. \n\nFigure 3: When ranked according to magnitude of reference flow to a designated target, a \"true source\" scores much higher than a placebo source document drawn at random. \n\n4 Though not described here, we have had success using our model for document segmentation, following an approach similar to that of [6]. By projecting successive n-sentence windows of a document into the factored model, we can observe its trajectory through \"topic space.\" A large jump in the factor mixture between successive windows indicates a probable topic boundary in the document. \n\nTo test our model's utility in intelligent web crawling, we conducted experiments on the WebKB data set using the greedy solution. On each trial, a \"target page\" d_t was selected at random from the corpus. One \"source page\" d_s containing a link to the target was identified, and the reference flow f_st computed. The larger the reference flow, the stronger our model's expectation that there is a directed link from the source to the target. \n\nWe ranked this flow against the reference flow to the target from 100 randomly chosen \"distractor\" pages d_r1, d_r2, ..., d_r100. 
As seen in Figure 3, reference flow provides significant \npredictive power. Based on 2400 runs, the median rank for the \"true source\" was 27/100, \nversus a median rank of 50/100 for a \"placebo\" distractor chosen at random. Note that the \ndis tractors were not screened to ensure that they did not also contain links to the target; as \nsuch, some of the high-ranking dis tractors may also have been valid sources for the target \nin question. \n\n5 Discussion and Related Work \n\nThere have been many attempts to combine link and term information on web pages, though \nmost approaches are ad hoc and have been aimed at increasing the retrieval of authoritative \ndocuments relevant to a given query. Bharat and Henzinger [1] provide a good overview \nof research in that area, as well as an algorithm that computes bibliometric authority after \nweighting links based on the relevance of the neighboring terms. The machine learning \ncommunity has also recently taken an interest in the sort of relational models studied by \nBibliometrics. Getoor et al. [5] describe a general framework for learning probabilistic \nrelational models from a database, and present experiments in a variety of domains. \n\nIn this paper, we have described a specific probabilistic model which attempts to explain \nboth the contents and connections of documents in an unstructured document base. While \nwe have demonstrated preliminary results in several application areas, this paper only \nscratches the surface of potential applications of a joint probabilistic document model. \n\nReferences \n[1] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in hyperlinked envi(cid:173)\n\nronments. In Proceedings of the 21 st Annual International ACM SIGIR Conference on Research \nand Development in Information Retrieval, 1998. \n\n[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. 
Technical report, Computer Science Department, Stanford University, 1998. \n[3] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning, 2000. \n[4] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990. \n[5] L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In S. Dzeroski and N. Lavrac, editors, Relational Data Mining. Springer-Verlag, 2001. \n[6] M. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of ACL, June 1994. \n[7] T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in AI, pages 289-296, 1999. \n[8] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. \n[9] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, pages 788-791, 1999. \n[10] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3:127-163, 2000. \n[11] Web->KB. Available electronically at http://www.cs.cmu.edu/~WebKB/. \n", "award": [], "sourceid": 1846, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}, {"given_name": "Thomas", "family_name": "Hofmann", "institution": null}]}