{"title": "Recursive Attribute Factoring", "book": "Advances in Neural Information Processing Systems", "page_first": 297, "page_last": 304, "abstract": "", "full_text": "Recursive Attribute Factoring\n\nDavid Cohn\nGoogle Inc.,\n\n1600 Amphitheatre Parkway\nMountain View, CA 94043\n\ncohn@google.com\n\nDeepak Verma\n\nDept. of CSE, Univ. of Washington,\n\nSeattle WA- 98195-2350\n\ndeepak@cs.washington.edu\n\nKarl P\ufb02eger\nGoogle Inc.,\n\n1600 Amphitheatre Parkway\nMountain View, CA 94043\nkpfleger@google.com\n\nAbstract\n\nClustering, or factoring of a document collection attempts to \u201cexplain\u201d each ob-\nserved document in terms of one or a small number of inferred prototypes. Prior\nwork demonstrated that when links exist between documents in the corpus (as is\nthe case with a collection of web pages or scienti\ufb01c papers), building a joint model\nof document contents and connections produces a better model than that built from\ncontents or connections alone.\nMany problems arise when trying to apply these joint models to corpus at the\nscale of the World Wide Web, however; one of these is that the sheer overhead\nof representing a feature space on the order of billions of dimensions becomes\nimpractical.\nWe address this problem with a simple representational shift inspired by proba-\nbilistic relational models: instead of representing document linkage in terms of\nthe identities of linking documents, we represent it by the explicit and inferred at-\ntributes of the linking documents. Several surprising results come with this shift:\nin addition to being computationally more tractable, the new model produces fac-\ntors that more cleanly decompose the document collection. 
We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm.\n\n1 Introduction\n\nThere is a long and successful history of decomposing collections of documents into factors or clusters to identify “similar” documents and principal themes. Collections have been factored on the basis of their textual contents [1, 2, 3], the connections between the documents [4, 5, 6], or both together [7].\n\nA factored corpus model is usually composed of a small number of “prototype” documents along with a set of mixing coefficients (one for each document in the corpus). Each prototype corresponds to an abstract document whose features are, in some mathematical sense, “typical” of some subset of the corpus documents. The mixing coefficients for a document d indicate how the model’s prototypes can best be combined to approximate d.\n\nMany useful applications arise from factored models:\n\n• Model prototypes may be used as “topics” or cluster centers in spectral clustering [8], serving as “typical” documents for a class or cluster.\n\n• Given a topic, factored models of link corpora allow identifying authoritative documents on that topic [4, 5, 6].\n\n• By exploiting correlations and “projecting out” uninformative terms, the space of a factored model’s mixing coefficients can provide a measure of semantic similarity between documents, regardless of the overlap in their actual terms [1].\n\nThe remainder of this paper is organized as follows: Below, we first review the vector space model, formalize the factoring problem, and describe how factoring is applied to linked document collections. In Section 2 we point out limitations of current approaches and introduce Attribute Factoring (AF) to address them. 
In the following two sections, we identify limitations of AF and describe Recursive Attribute Factoring and several other variations to overcome them, before summarizing our conclusions in Section 5.\n\nThe Vector Space Model: The vector space model is a convention for representing a document corpus (ordinarily sets of strings of arbitrary length) as a matrix, in which each document is represented as a column vector.\nLet the number of documents in the corpus be N and the size of the vocabulary M. Then T denotes the M × N term-document matrix such that column j represents document dj, and Tij indicates the number of times term ti appears in document dj. Geometrically, the columns of T can also be viewed as points in an M-dimensional space, where each dimension i indexes the number of times term ti appears in the corresponding document.\nA link-based corpus may also be represented as a vector space, defining an N × N matrix L where Lij = 1 if there is a link from document i to j and 0 otherwise. It is sometimes preferable to work with P, a normalized version of L in which Pij = Lij / Σj′ Lij′; that is, each document’s outlinks sum to 1.\n\nFactoring: Let A represent a matrix to be factored (usually T, or T augmented with some other matrix) into K factors. Factoring decomposes A into two matrices U and V (each of rank K) such that A ≈ U V.1 In the geometric interpretation, columns of U contain the K prototypes, while columns of V indicate what mixture of prototypes best approximates the columns of the original matrix.\n\nThe definition of what constitutes a “best approximation” leads to the many different factoring algorithms in use today. 
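To make the shapes concrete, here is a minimal numpy sketch of one such factorization, truncated SVD, which minimizes squared reconstruction error; the toy matrix and the choice K = 2 are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 5))            # toy M x N document matrix (M = 8 terms, N = 5 docs)
K = 2                             # number of factors

# Rank-K factorization minimizing squared reconstruction error (truncated SVD)
W, s, Vt = np.linalg.svd(A, full_matrices=False)
U = W[:, :K] * s[:K]              # M x K prototype matrix
V = Vt[:K, :]                     # K x N mixing coefficients

err_K = np.linalg.norm(A - U @ V) # residual of the best rank-K approximation
```

By the Eckart–Young theorem, this residual equals the norm of the discarded singular values, so no other rank-K factorization can do better in the squared-error sense.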
Latent Semantic Analysis [1] minimizes the sum squared reconstruction error of A, PLSA [2] maximizes the log-likelihood that a generative model using U as prototypes would produce the observed A, and Non-Negative Matrix Factorization [3] adds the constraint that all components of U and V must be greater than or equal to zero.\nFor the purposes of this paper, however, we are agnostic as to the factorization method used — our main concern is how A, the document matrix to be factored, is generated.\n\nFigure 1: Factoring decomposes matrix A into matrices U and V\n\n1.1 Factoring Text and Link Corpora\n\nWhen factoring a text corpus (e.g. via LSA [1], PLSA [2], NMF [3] or some other technique), we directly factor the matrix T. Columns of the resulting M × K matrix U are often interpreted as the K “principal topics” of the corpus, while columns of the K × N matrix V are “topic memberships” of the corpus documents.\n\n1In general, A ≈ f(U, V), where f can be any function which takes in the weights for a document and the document prototypes to generate the original vector.\n\nWhen factoring a link corpus (e.g. via ACA [4] or PHITS [6]), we factor L or the normalized link matrix P. Columns of the resulting N × K matrix U are often interpreted as the K “citation communities” of the corpus, and columns of the K × N matrix V indicate to what extent each document belongs to the corresponding community. Additionally, Uij, the degree of citation that community j accords to document di, can be interpreted as the “authority” of di in that community.\n\n1.2 Factoring Text and Links Together\n\nMany interesting corpora, such as scientific literature and the World Wide Web, contain both text content and links. 
Prior work [7] has demonstrated that building a single factored model of the joint term-link matrix produces a better model than that produced by using text or links alone.\nThe naive way to produce such a joint model is to append L or P below T, and factor the joint matrix:\n\n[ T ; L ] ≈ [ U_T ; U_L ] × V .    (1)\n\nWhen factored, the resulting U matrix can be seen as having two components, representing the two distinct types of information in [T ; L]. Column i of U_T indicates the expected term distribution of factor i, while the corresponding column of U_L indicates the distribution of documents that typically link to documents represented by that factor.\nIn practice, L should be scaled by some factor λ to control the relative importance of the two types of information, but empirical evidence [7] suggests that performance is somewhat insensitive to its exact value. For clarity, we omit reference to λ in the equations below.\n\nFigure 2: The naive joint model concatenates term and link matrices\n\n2 Beyond the Naive Joint Model\n\nJoint models provide a systematic way of incorporating information from both the terms and link structure present in a corpus. But the naive approach described above does not scale up to web-sized corpora, which may have millions of terms and tens of billions of documents. The matrix resulting from a naive representation of a web-scale problem would have N + M features with N ≈ 10^10 and M ≈ 10^6. Simply representing this matrix (let alone factoring it) is impractical on a modern workstation.\n\nWork on Probabilistic Relational Models (PRMs) [9] suggests another approach. The terms in a document are explicit attributes; links to the document provide additional attributes, represented (in the naive case) as the identities of the inlinking documents. In a PRM, however, entities are represented by their attributes, rather than their identities. 
By taking a similar tack, we arrive at Attribute Factoring — the approach of representing link information in terms of the attributes of the inlinking documents, rather than by their explicit identities.\n\n2.1 Attribute Factoring\n\nEach document dj, along with an attribute for each term, has an attribute for each other document di in the corpus, signifying the presence (or absence) of a link from di to dj. When N ≈ 10^10, keeping each document identity as a separate attribute is prohibitive. To create a more economical representation, we propose replacing the link attributes by a smaller set of attributes that “summarize” the information from the link matrix L, possibly in combination with the term matrix T.\nThe most obvious attributes of a document are the terms it contains. Therefore, one simple way to represent the “attributes” of a document’s inlinks is to aggregate the terms in the documents that link to it. There are many possible ways to aggregate these terms, including Dirichlet and more sophisticated models. For computational and representational simplicity in this paper, however, we replace inlink identities with a sum of the terms in the inlinking documents. In matrix notation, this is just\n\n[ T ; T × L ] ≈ [ U_T ; U_{T×L} ] × V .    (2)\n\nColloquially, we can look at this representation as saying that a document has “some distribution of terms” (T) and is linked to by documents that have “some other term distribution” (T × L).\nBy substituting the aggregated attributes of the inlinks for their identities, we can reduce the size of the representation down from (M + N) × N to a much more manageable 2M × N. 
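As a shape check, the Attribute Factoring representation can be assembled in a few numpy lines; the toy sizes and random matrices are purely illustrative, and the λ scaling discussed earlier is omitted, as in the equations:

```python
import numpy as np

M, N = 5, 4                                    # toy vocabulary and corpus sizes
rng = np.random.default_rng(0)
T = rng.integers(0, 3, (M, N)).astype(float)   # term-document counts
L = (rng.random((N, N)) < 0.3).astype(float)   # L[i, j] = 1 iff doc i links to doc j

# Column j of T @ L is the sum of the term vectors of the documents linking to d_j
A = np.vstack([T, T @ L])                      # 2M x N, instead of (M + N) x N
```

The augmented matrix A is what gets handed to the factoring algorithm in place of T.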
What is surprising is that, on the domains tested, this more compact representation actually improves factoring performance.\n\nFigure 3: Representation for Attribute Factoring\n\n2.2 Attribute Factoring Experiments\n\nWe tested Attribute Factoring on two publicly-available corpora of interlinked text documents. The Cora dataset [10] consists of abstracts and references of approximately 34,000 computer science research papers; of these we used the approximately 2000 papers categorized into the seven subfields of machine learning. The WebKB dataset [11] consists of approximately 6000 web pages from computer science departments, classified by school and category (student, course, faculty, etc.).\n\nFor both datasets, we factored the content-only, naive joint, and AF joint representations using PLSA [2]. We varied K, the number of computed factors, from 2 to 16, and performed 10 factoring runs for each value of K tested. The factored models were evaluated by clustering each document to its dominant factor and measuring cluster precision: the fraction of documents in a cluster sharing the majority label.\n\nFigure 4: Attribute Factoring outperforms the content-only and naive joint representations\n\nFigure 4 illustrates a typical result: adding explicit link information improves cluster precision, but abstracting the link information with Attribute Factoring improves it even more.\n\n3 Beyond Simple Attribute Factoring\n\nAttribute Factoring reduces the number of attributes from N + M to 2M, allowing existing factoring techniques to scale to web-sized corpora. This reduction in the number of attributes, however, comes at a cost. Since the identity of the document itself is replaced by its attributes, it is possible for unscrupulous authors (spammers) to “pose” as a legitimate page with high PageRank.\n\nConsider the example shown in Figure 5, showing two subgraphs present in the web. 
On the right is a legitimate page like the Yahoo! homepage, linked to by many pages, and linking to page RYL (Real Yahoo Link). A link from the Yahoo! homepage to RYL imparts a lot of authority and hence is highly desired by spammers. Failing that, a spammer might try to create a counterfeit copy of the Yahoo! homepage, boost its PageRank by means of a “link farm”, and create a link from it to his page FYL (Fake Yahoo Link).\n\nFigure 5: Attribute Factoring can be “spammed” by mirroring one level back\n\nWithout link information, our factoring cannot distinguish the counterfeit homepage from the real one. Using AF or the naive joint model allows us to distinguish them based on the distribution of documents that link to each. But with AF, that real/counterfeit distinction is not propagated to the documents that they point to. All that AF tells us is that RYL and FYL are pointed to by pages that look a lot like the Yahoo! homepage.\n\n3.1 Recursive Attribute Factoring\n\nSpamming AF was simple because it only looks one link behind: attributes for a document are either explicit terms in that document or explicit terms in documents linking to it. Such one-level inference lets us conclude that the fake Yahoo! homepage is counterfeit, but provides no way to propagate that conclusion on to the pages it links to. This suggests that we need to propagate not only the explicit attributes of a document (its component terms), but its inferred attributes as well.\n\nA ready source of inferred attributes comes from the factoring process itself. 
Recall that when factoring T ≈ U × V, if we interpret the columns of U as factors or prototypes, then each column of V can be interpreted as the inferred factor memberships of its corresponding document. Therefore, we can propagate the inferred attributes of inlinking documents by aggregating the columns of V they correspond to (Figure 6). Numerically, this replaces T (the explicit document attributes) in the bottom half of the left matrix with V (the inferred document attributes):\n\nFigure 6: Recursive Attribute Factoring aggregates the inferred attributes (columns of V) of inlinking documents\n\n[ T ; V × L ] ≈ [ U_T ; U_{V×L} ] × V .    (3)\n\nThere are some worrying aspects of this representation: the document representation is no longer statically defined, and the equation itself is recursive. In practice, there is a simple iterative procedure for solving the equation (see Algorithm 1), but it is computationally expensive, and carries no convergence guarantees. The “inferred” attributes (IA) are set initially to random values, which are then updated until they converge. 
Note that we need to use the normalized version of L, namely P.2\n\nAlgorithm 1 Recursive Attribute Factoring\n1: Initialize IA_0 with random entries.\n2: while not converged do\n3:   Factor A_t = [ T ; IA_t ] ≈ [ U_T ; U_IA ] × V\n4:   Update IA_{t+1} = V × P.\n5: end while\n\n2We use L and P interchangeably to represent the contribution from inlinking documents, distinguishing them only in the case of “recursive” equations, where it is important to normalize L to facilitate convergence.\n\n3.2 Recursive Attribute Factoring Experiments\n\nTo evaluate RAF, we used the same data sets and procedures as in Section 2.2, with results plotted in Figure 7. It is perhaps not surprising that RAF by itself does not perform as well as AF on the domains tested3; when available, explicit information is arguably more powerful than inferred information.\n\nIt’s important to realize, however, that AF and RAF are in no way exclusive of each other; when we combine the two and propagate both explicit and implicit attributes, our performance is (satisfyingly) better than with either alone (top lines in Figures 7(a) and (b)).\n\nFigure 7: RAF and AF+RAF results on the (a) Cora and (b) WebKB datasets\n\n4 Discussion: Other Forms of Attribute Factoring\n\nBoth Attribute Factoring and Recursive Attribute Factoring involve augmenting the term matrix with a matrix (call it IA) containing attributes of the inlinking documents, and then factoring the augmented matrix:\n\nA = [ T ; IA ] ≈ [ U_T ; U_IA ] × V .    (4)\n\nThe traditional joint model set IA = L; in Attribute Factoring we set IA = T × L, and in Recursive Attribute Factoring IA = V × P. 
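To make the recursion concrete, here is a minimal numpy sketch of the Algorithm 1 loop; the inner NMF step, the toy matrices, and the convergence tolerance are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def nmf(A, K, iters=100, seed=0):
    """Inner factoring step: tiny NMF via multiplicative updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((A.shape[0], K)) + 1e-3
    V = rng.random((K, A.shape[1])) + 1e-3
    for _ in range(iters):
        V *= (U.T @ A) / (U.T @ U @ V + 1e-9)
        U *= (A @ V.T) / (U @ V @ V.T + 1e-9)
    return U, V

def raf(T, L, K, outer_iters=20, tol=1e-4, seed=0):
    """Recursive Attribute Factoring: alternate factoring and IA = V x P."""
    M, N = T.shape
    P = L / np.maximum(L.sum(axis=1, keepdims=True), 1)  # outlinks sum to 1
    IA = np.random.default_rng(seed).random((K, N))      # step 1: random IA_0
    for _ in range(outer_iters):                         # step 2: until converged
        U, V = nmf(np.vstack([T, IA]), K, seed=seed)     # step 3: factor [T; IA_t]
        IA_new = V @ P                                   # step 4: IA_{t+1} = V x P
        if np.linalg.norm(IA_new - IA) < tol:
            break
        IA = IA_new
    return U, V

# toy corpus: 6 terms, 5 documents, sparse links
rng = np.random.default_rng(1)
T = rng.integers(0, 3, (6, 5)).astype(float)
L = (rng.random((5, 5)) < 0.4).astype(float)
np.fill_diagonal(L, 0)                                   # no self-links
U, V = raf(T, L, K=2)
```

Each outer iteration re-factors the augmented matrix from scratch, which is what makes the procedure expensive; as noted above, nothing here guarantees the loop converges.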
In general, though, we can set IA to be any matrix that aggregates attributes of a document’s inlinks.4 For AF we can replace the N-dimensional inlink vector with an M-dimensional inferred vector d′i such that d′i = Σ_{j : Lji=1} wj dj; IA is then the matrix of inferred attributes for each document, i.e., the ith column of IA is d′i. Different choices for wj lead to different weightings of the aggregated attributes from the incoming documents; some variations are summarized in Table 1.\n\nVariation | wj | IA\nAttribute Factoring | 1 | T × L\nOutdegree-normalized Attribute Factoring | Pji | T × P\nPageRank-weighted Attribute Factoring | Pj | T × diag(P) × L\nPageRank- and outdegree-normalized | Pj Pji | T × diag(P) × P\n\nTable 1: Variations on attribute weighting for Attribute Factoring. (Pj is the PageRank of document j)\n\n3It is somewhat surprising (and disappointing) that RAF performs worse than the content-only model, but other work [7] has posited situations in which this may be expected.\n\n4This approach can, of course, be extended to also include attributes of the outlinked documents, but bibliometric analysis has historically found that inlinks are more informative about the nature of a document than outlinks (echoing the Hollywood adage that “It’s not who you know that matters - it’s who knows you”).\n\nExtended Attribute Factoring: Recursive Attribute Factoring was originally motivated by the “Fake Yahoo!” problem described in Section 3. While useful in conjunction with ordinary Attribute Factoring, its recursive nature and lack of convergence guarantees are troubling. One way to simulate the desired effect of RAF in closed form is to explicitly model the inlink attributes more than just one level back.5 For example, ordinary AF looks back one level at the (explicit) attributes of inlinking documents by setting IA = T × L. 
We can extend that “lookback” to two levels by defining IA = [T × L ; T × L × L]. The IA matrix would have 2M features (M attributes for inlinking documents and another M for attributes of documents that linked to the inlinking documents). Still, it would be possible, albeit difficult, for a determined spammer to fool this Extended Attribute Factoring (EAF) by mimicking two levels of the web’s linkage. This can be combated by adding a third level to the model (IA = [T × L ; T × L^2 ; T × L^3]), which increases the model complexity by only a linear factor, but (due to the web’s high branching) vastly increases the number of pages a spammer would need to duplicate. It should be pointed out that these extended attributes rapidly converge to the stationary distribution of terms on the web: T × L^∞ = T × eig(L), equivalent to weighting inlinking attributes by a version of PageRank that omits random restarts. (As in Algorithm 1, P needs to be used instead of L to achieve convergence.)\n\nAnother PageRank Connection: While vanilla RAF(+AF) gives good results, one can imagine many variations with interesting properties; one in particular is worth mentioning. A smoothed version of the recursive equation can be written as\n\n[ T ; ε + γ · V × P ] ≈ [ U_T ; U_{V×L} ] × V .    (5)\n\nThis is the same basic equation as RAF, but with V × P multiplied by a damping factor γ. This smoothed RAF gives further insight into the workings of RAF itself once we look at a simpler version of it. Starting from the original equation, let us first remove the explicit attributes. This reduces the equation to ε + γ · V × P ≈ U_{V×L} × V. 
For the case where U_{V×L} has a single dimension, the above equation further simplifies to ε + γ · V × P ≈ u × V.\nFor some constrained values of ε and γ, we get ε + (1 − ε) · V × P ≈ V, which is just the equation for PageRank [12]. This means that, in the absence of T’s term data, the inferred attributes V produced by smoothed RAF represent a sort of generalized, multi-dimensional PageRank, where each dimension corresponds to authority on one of the inferred topics of the corpus.6 With the terms of T added, the intuition is that V and the inferred attributes IA = V × P converge to a trade-off between the generalized PageRank of the link structure and factor values for T in terms of the prototypes U_T capturing term information.\n\n5 Summary\n\nWe have described a representational methodology for factoring web-scale corpora, incorporating both content and link information. The main idea is to represent link information with attributes of the inlinking documents rather than their explicit identities. Preliminary results on small datasets demonstrate that the technique not only makes the computation more tractable but also significantly improves the quality of the resulting factors.\n\nWe believe that we have only scratched the surface of this approach; many issues remain to be addressed, and undoubtedly many more remain to be discovered. We have no principled basis for weighting the different kinds of attributes in AF and EAF; while RAF seems to converge reliably in practice, we have no theoretical guarantees that it will always do so. Finally, in spite of our motivating example being the ability to factor very large corpora, we have only tested our algorithms on small “academic” data sets; applying AF, RAF and EAF to a web-scale corpus remains the real (and as yet untried) criterion for success.\n\n5Many thanks to Daniel D. 
Lee for this insight.\n6This is related to, but distinct from, the generalization of PageRank described by Richardson and Domingos [13], which is computed as a scalar quantity over each of the (manually-specified) lexical topics of the corpus.\n\nReferences\n\n[1] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.\n\n[2] Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI’99, Stockholm, 1999.\n\n[3] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 12, pages 556–562. MIT Press, 2000.\n\n[4] H.D. White and B.C. Griffith. Author cocitation: A literature measure of intellectual structure. Journal of the American Society for Information Science, 1981.\n\n[5] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.\n\n[6] David Cohn and Huan Chang. Learning to probabilistically identify authoritative documents. In Proc. 17th International Conf. on Machine Learning, pages 167–174. Morgan Kaufmann, San Francisco, CA, 2000.\n\n[7] David Cohn and Thomas Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Neural Information Processing Systems 13, 2001.\n\n[8] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2002.\n\n[9] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), pages 1300–1309, Stockholm, Sweden, 1999. Morgan Kaufmann.\n\n[10] Andrew K. 
McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.\n\n[11] T. Mitchell et al. The World Wide Knowledge Base Project (Available at http://cs.cmu.edu/∼WebKB). 1998.\n\n[12] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.\n\n[13] Matthew Richardson and Pedro Domingos. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.\n", "award": [], "sourceid": 2969, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}, {"given_name": "Deepak", "family_name": "Verma", "institution": null}, {"given_name": "Karl", "family_name": "Pfleger", "institution": null}]}