{"title": "Improving a Page Classifier with Anchor Extraction and Link Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1505, "page_last": 1512, "abstract": null, "full_text": "Improving a Page Classifier with Anchor Extraction and Link Analysis\n\nWilliam W. Cohen\n\nCenter for Automated Learning and Discovery, Carnegie-Mellon University\n5000 Forbes Ave, Pittsburgh, PA 15213\nwilliam@wcohen.com\n\nAbstract\n\nMost text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier's performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique significantly and substantially improves the accuracy of a bag-of-words classifier, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classifier; the results are used by a restricted wrapper-learner to propose potential \u201cmain-category anchor wrappers\u201d; and finally, these wrappers are used as features by a third learner to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.\n\n1 Introduction\n\nMost text categorization systems use simple models of documents and document collections. For instance, it is common to model documents as \u201cbags of words\u201d, and to model a collection as a set of documents drawn from some fixed distribution. 
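(As a concrete aside for the reader: the bag-of-words representation referred to throughout can be sketched in a few lines. This is a minimal illustration, not code from the paper; tokenization here is assumed to be just lowercasing and whitespace splitting.)

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, and count term frequencies.
    # Word order and document structure are discarded entirely.
    return Counter(text.lower().split())

# Two pages with the same words in different orders get identical vectors.
v1 = bag_of_words('Executive biography of the chief executive')
v2 = bag_of_words('The Chief Executive executive biography of')
```

Because the representation ignores structure, two very differently laid-out pages look identical to the classifier, which is exactly the limitation the rest of the paper addresses.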
An interesting question is how to exploit more detailed information about the structure of individual documents, or the structure of a collection of documents.\n\nFor web page categorization, a frequently-used approach is to use hyperlink information to improve classification accuracy (e.g., [7, 9, 15]). Often hyperlink structure is used to \u201csmooth\u201d the predictions of a learned classifier, so that documents that (say) are pointed to by the same \u201chub\u201d page will be more likely to have the same classification after smoothing. This smoothing can be done either explicitly [15] or implicitly (for instance, by representing examples so that the distance between examples depends on hyperlink connectivity [7, 9]).\n\nThe structure of individual pages, as represented by HTML markup structure or linguistic structure, is less commonly used in web page classification; however, page structure is often used in extracting information from web pages. Page structure seems to be particularly important in finding site-specific extraction rules (\u201cwrappers\u201d), since on a given site, formatting information is frequently an excellent indication of content [6, 10, 12].\n\nThis paper is based on two practical observations about web page classification. The first is that for many categories of economic interest (e.g., product pages, job-posting pages, and press releases) many sites contain \u201chub\u201d or index pages that point to essentially all pages in that category on a site. These hubs rarely link exclusively to pages of a single category\u2014instead the hubs will contain a number of additional links, such as links back to a home page and links to related hubs. 
However, the page structure of a hub page often gives strong indications of which links are to pages from the \u201cmain\u201d category associated with the hub, and which are ancillary links that exist for other (e.g., navigational) purposes. As an example, refer to Figure 1. Links to pages in the main category associated with this hub (previous NIPS conference homepages) are in the left-hand column of the table, and hence can be easily identified by the page structure.\n\nThe second observation is that it is relatively easy to learn to extract links from hub pages to main-category pages using existing wrapper-learning methods [6, 8]. Wrapper-learning techniques interactively learn to extract data of some type from a single site using user-provided training examples. Our experience in a number of domains indicates that main-category links on hub pages (like the NIPS-homepage links from Figure 1) can almost always be learned from two or three positive examples.\n\nExploiting these observations, we describe in this paper a web page categorization system that exploits link structure within a site, as well as page structure within hub pages, to improve classification accuracy of a traditional bag-of-words classifier on pages from a previously unseen site. The system uses a variant of co-training [3] to exploit unlabeled data from a new, previously unseen site. Specifically, pages are labeled using a simple bag-of-words classifier, and the results are used by a restricted wrapper-learner to propose potential \u201cmain-category link wrappers\u201d. 
These wrappers are then used as features by a decision tree learner to find a categorization of the pages on the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.\n\n2 One-step co-training and hyperlink structure\n\nConsider a binary bag-of-words classifier f that has been learned from some set of labeled web pages D\u2113. We wish to improve the performance of f on pages from an unknown web site S, by smoothing its predictions in a way that is plausible given the hyperlink structure of S, and the page structure of potential hub pages in S. As background for the algorithm, let us consider first co-training, a well-studied approach for improving classifier performance using unlabeled data [3].\n\nIn co-training one assumes a concept learning problem where every instance x can be written as a pair (x1, x2) such that x1 is conditionally independent of x2 given the class y. One also assumes that both x1 and x2 are sufficient for classification, in the sense that the target function f(x) can be written either as a function of x1 or x2, i.e., that there exist functions f1(x1) = f(x) and f2(x2) = f(x). Finally one assumes that both f1 and f2 are learnable, i.e., that f1 \u2208 H1 and f2 \u2208 H2, and noise-tolerant learning algorithms A1 and A2 exist for H1 and H2.\n\n...\n\nWebpages and Papers for Recent NIPS Conferences\n\nA. David Redish (dredish@cs.cmu.edu) created and maintained these web pages from 1994 until 1996. L. Douglas Baker (ldbapp+nips@cs.cmu.edu) maintained these web pages from 1997 until 1999. They were maintained in 2000 by L. Douglas Baker and Alexander Gray (agray+nips@cs.cmu.edu).\n\nNIPS*2000\n\nNIPS 13, the conference proceedings for 2000 (\u201cAdvances in Neural Information Processing Systems 13\u201d, edited by Leen, Todd K., Dietterich, Thomas G. 
and Tresp, Volker) will be available to all attendees in June 2001.\n\n* Abstracts and papers from this forthcoming volume are available on-line.\n* BibTeX entries for all papers from this forthcoming volume are available on-line.\n\nNIPS 12 is available from MIT Press. Abstracts and papers from this volume are available on-line.\n\nNIPS 11 is available from MIT Press. Abstracts and (some) papers from this volume are available on-line.\n\nNIPS*99\n\nNIPS*98\n\n...\n\nFigure 1: Part of a \u201chub\u201d page. Links to pages in the main category associated with this hub are in the left-hand column of the table.\n\nIn this setting, a large amount of unlabeled data Du can be used to improve the accuracy of a classifier learned from a small set of labeled data D\u2113, as follows. First, use A1 to learn an approximation f'1 to f1 using D\u2113. Then, use f'1 to label the examples in Du, and use A2 to learn from this training set. Given the assumptions above, f'1's errors on Du will appear to A2 as random, uncorrelated noise, and A2 can in principle learn an arbitrarily good approximation to f, given enough unlabeled data in Du. We call this process one-step co-training using A1, A2, and Du.\n\nNow, consider a set DS of unlabeled pages from an unseen web site S. It seems not unreasonable to assume that the words x1 on a page x \u2208 S and the hub pages x2 \u2208 S that hyperlink to x are independent, given the class of x. This suggests that one-step co-training could be used to improve a learned bag-of-words classifier f'1, using the following algorithm:\n\nAlgorithm 1 (One-step co-training):\n\n1. Parameters. Let S be a web site, f'1 be a bag-of-words page classifier, and DS be the pages on the site S.\n\n2. Instance generation and labeling. For each page x_i \u2208 DS, represent x_i as a vector of all pages in S that hyperlink to x_i. Call this vector x2_i. Let y_i = f'1(x_i).\n\n3. Learning. 
Use a learner A2 to learn f'2 from the labeled examples D2 = {(x2_i, y_i)}.\n\n4. Labeling. Use f'2(x) as the final label for each page x \u2208 DS.\n\nThis \u201cone-step\u201d use of co-training is consistent with the theoretical results underlying co-training. In experimental studies, co-training is usually done iteratively, alternating between using f'1 and f'2 for tagging the unlabeled data. The one-step version seems more appropriate in this setting, in which there are a limited number of unlabeled examples over which each x2 is defined.\n\n3 Anchor Extraction and Page Classification\n\n3.1 Learning to extract anchors from web pages\n\nAlgorithm 1 has some shortcomings. Co-training assumes a large pool of unlabeled data; however, if the informative hubs for pages on S are mostly within S (a very plausible assumption), then the amount of useful unlabeled data is limited by the size of S. With limited amounts of unlabeled data, it is very important that A2 has a strong (and appropriate) statistical bias, and that A2 has some effective method for avoiding overfitting.\n\nAs suggested by Figure 1, the informativeness of hub features can be improved by using knowledge of the structure of hub pages themselves. To make use of hub page structure, we used a wrapper-learning system called WL2, which has experimentally proven to be effective at learning substructures of web pages [6]. The output of WL2 is an extraction predicate: a binary relation p between pages x and substrings a within x. As an example, WL2 might output p = {(x, a) : x is the page of Figure 1 and a is an anchor appearing in the first column of the table}. (An anchor is a substring of a web page that defines a hyperlink.)\n\nThis suggests a modification of Algorithm 1, in which one-step co-training is carried out on the problem of extracting anchors rather than the problem of labeling web pages. 
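(Algorithm 1 can be made concrete with a toy sketch. In the code below, which is my illustration and not the paper's implementation, A2 is replaced by a simple per-hub vote: each hub is scored by the fraction of its targets that f'1 tentatively labeled positive, and a page's final label is a vote over its inbound hubs, smoothing labels across pages that share a hub.)

```python
def one_step_cotrain(inlinks, f1_labels):
    # inlinks: page -> set of hub pages that hyperlink to it (the x2 view).
    # f1_labels: page -> tentative 0/1 label from the bag-of-words classifier.
    hub_votes = {}
    for page, hubs in inlinks.items():
        for hub in hubs:
            hub_votes.setdefault(hub, []).append(f1_labels[page])
    # A hub's score is the fraction of its targets tentatively labeled positive.
    hub_score = {hub: sum(v) / len(v) for hub, v in hub_votes.items()}
    final = {}
    for page, hubs in inlinks.items():
        if hubs:
            mean = sum(hub_score[h] for h in hubs) / len(hubs)
            final[page] = int(mean > 0.5)
        else:
            final[page] = f1_labels[page]  # no hub evidence: keep f'1's label
    return final

# Three pages share one hub; f'1 mislabels p3, and the hub vote corrects it.
labels = one_step_cotrain(
    {'p1': {'hub'}, 'p2': {'hub'}, 'p3': {'hub'}, 'p4': set()},
    {'p1': 1, 'p2': 1, 'p3': 0, 'p4': 0})
```

The example shows the smoothing effect: p3's tentative negative label is overridden because the hub that points to it mostly points to positives.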
Specifically, one might map f'1's predictions from web pages to anchors, by giving a positive label to anchor a iff a links to a page x such that f'1(x) = 1; then use the WL2 algorithm as A2 to learn a predicate p'2; and finally, map the predictions of p'2 from anchors back to web pages.\n\nOne problem with this approach is that WL2 was designed for user-provided data sets, which are small and noise-free. Another problem is that it is unclear how to map class labels from anchors back to web pages, since a page might be pointed to by many different anchors.\n\n3.2 Bridging the gap between anchors and pages\n\nBased on these observations we modified Algorithm 1 as follows. As suggested, we map the predictions about page labels made by f'1 to anchors. Using these anchor labels, we then produce many small training sets that are passed to WL2. The intuition here is that some of these training sets will be noise-free, and hence similar to those that might be provided by a user. Finally, we use the many wrappers produced by WL2 as features in a representation of a page x, and again use a learner to combine the wrapper-features and produce a single classification for a page.\n\nAlgorithm 2:\n\n1. Parameters. Let S be a web site, f'1 be a bag-of-words page classifier, and DS be the pages on the site.\n\n2. Link labeling. For each anchor a on a page x \u2208 S, label a as tentatively-positive if a points to a page x' such that x' \u2208 S and f'1(x') = 1.\n\n3. Wrapper proposal. Let P be the set of all pairs (x, a) where a is a tentatively-positive link and x is the page on which a is found. Generate a number of small sets D1, ..., Dk containing such pairs, and for each subset Di, use WL2 to produce a number of possible extraction predicates p_i,1, ..., p_i,ki. (See appendix for details.)\n\n4. Instance generation and labeling. 
We will say that the \u201cwrapper predicate\u201d pij links to x iff pij includes some pair (x', a) such that x' \u2208 DS and a is a hyperlink to page x. For each page x_i \u2208 DS, represent x_i as a vector of all wrappers pij that link to x_i. Call this vector x2_i. Let y_i = f'1(x_i).\n\n5. Learning. Use a learner A2 to learn f'2 from the labeled examples D2 = {(x2_i, y_i)}.\n\n6. Labeling. Use f'2(x) as the final label for each page x \u2208 DS.\n\nA general problem in building learning systems for new problems is exploiting existing knowledge about these problems. In this case, in building a page classifier, one would like to exploit knowledge about the related problem of link extraction. Unfortunately this knowledge is not in any particularly convenient form (e.g., a set of well-founded parametric assumptions about the data): instead, we only know that, experimentally, a certain learning algorithm works well on the problem. In general, it is often the case that this sort of experimental evidence is available, even when a learning problem is not formally well-understood.\n\nThe advantage of Algorithm 2 is that one need make no parametric assumptions about the anchor-extraction problem. The bagging-like approach of \u201cfeeding\u201d WL2 many small training sets, and the use of a second learning algorithm to aggregate the results of WL2, are a means of exploiting prior experimental results, in lieu of more precise statistical assumptions.\n\n4 Experimental results\n\nTo evaluate the technique, we used the task of categorizing web pages from company sites as executive biography or other. We selected nine company web sites with non-trivial hub structures. These were crawled using a heuristic spidering strategy intended to find executive biography pages with high recall.\u00b9 The crawl found 879 pages, of which 128 were labeled positive. 
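(Section 4 below uses Winnow as one choice of A2 over these sparse wrapper-features. A minimal sketch of Littlestone's classic Winnow update over binary features is given here; the parameter values and the toy target concept are illustrative assumptions, not the paper's implementation.)

```python
def winnow_train(examples, n_features, alpha=2.0, epochs=10):
    # examples: list of (binary feature vector, 0/1 label) pairs.
    theta = n_features / 2.0          # fixed prediction threshold
    w = [1.0] * n_features            # all weights start at 1
    for _ in range(epochs):
        for x, y in examples:
            pred = int(sum(wi for wi, xi in zip(w, x) if xi) > theta)
            if pred != y:
                for i, xi in enumerate(x):
                    if xi:            # multiplicative update on active features
                        w[i] = w[i] * alpha if y == 1 else w[i] / alpha
    return w, theta

def winnow_predict(w, theta, x):
    return int(sum(wi for wi, xi in zip(w, x) if xi) > theta)

# Learn the target concept y = x[0] in the presence of irrelevant features.
data = [([1, 0, 0, 0], 1), ([1, 1, 0, 0], 1),
        ([0, 1, 0, 0], 0), ([0, 0, 1, 0], 0), ([0, 0, 0, 1], 0)]
w, theta = winnow_train(data, 4)
```

Winnow's mistake bound grows only logarithmically with the number of irrelevant attributes, which is one reason it is a natural fit for the many, mostly-irrelevant wrapper-features produced in Step 3.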
A simple bag-of-words classifier f'1 was trained using a disjoint set of sites (different from the nine above), obtaining an average accuracy of 91.6% (recall 82.0%, precision 61.8%) on the nine held-out sites. Using an implementation of Winnow [2, 11] as A2, Algorithm 2 obtained an average accuracy of 96.4% on the nine held-out sites. Algorithm 2 improves over the baseline classifier f'1 on six of the nine sites, and obtains the same accuracy on two more. This difference is significant at the 98% level with a 2-tailed paired sign test, and at the 95% level with a 2-tailed paired t test.\n\nSimilar results were also obtained using a sparse-feature implementation of a C4.5-like decision tree learning algorithm [14] for learner A2. (Note that both Winnow and C4.5 are known to work well when data is noisy, irrelevant attributes are present, and the underlying concept is \u201csimple\u201d.) These results are summarized in Table 1.\n\n\u00b9The author wishes to thank Vijay Boyaparti for assembling this data set.\n\nSite   Classifier f'1    Algorithm 2 (C4.5)   Algorithm 2 (Winnow)\n       Accuracy (SE)     Accuracy (SE)        Accuracy (SE)\n1      1.000 (0.000)     0.960 (0.028)        0.960 (0.028)\n2      0.932 (0.027)     0.955 (0.022)        0.955 (0.022)\n3      0.813 (0.028)     0.934 (0.018)        0.939 (0.017)\n4      0.904 (0.029)     0.962 (0.019)        0.962 (0.019)\n5      0.939 (0.024)     0.960 (0.020)        0.960 (0.020)\n6      1.000 (0.000)     1.000 (0.000)        1.000 (0.000)\n7      0.918 (0.028)     0.990 (0.010)        0.990 (0.010)\n8      0.788 (0.044)     0.882 (0.035)        0.929 (0.028)\n9      0.948 (0.029)     0.948 (0.029)        0.983 (0.017)\navg    0.916             0.954                0.964\n\nTable 1: Experimental results with Algorithm 2. 
Paired tests indicate that both versions of Algorithm 2 significantly improve on the baseline classifier.\n\n5 Related work\n\nThe introduction discusses the relationship between this work and a number of previous techniques for using hyperlink structure in web page classification [7, 9, 15]. The WL2-based method for finding document structure has antecedents in other techniques for learning [10, 12] and automatically detecting [4, 5] structure in web pages.\n\nIn concurrent work, Blei et al. [1] introduce a probabilistic model called \u201cscoped learning\u201d which gives a generative model for the situation described here: collections of examples in which some subsets (documents from the same site) share common \u201clocal\u201d features, and all documents share common \u201ccontent\u201d features. Blei et al. do not address the specific problem considered here, of using both page structure and hyperlink structure in web page classification. However, they do apply their technique to two closely related problems: they augment a page classification method with local features based on the page's URL, and also augment content-based classification of \u201ctext nodes\u201d (specific substrings of a web page) with page-structure-based local features.\n\nWe note that Algorithm 2 could be adapted to operate in Blei et al.'s setting: specifically, the x2 vectors produced in Steps 2-4 could be viewed as \u201clocal features\u201d. (In fact, Blei et al. generated page-structure-based features for their extraction task in exactly this way; the only difference is that WL2 was parameterized differently.) The co-training framework adopted here clearly makes different assumptions than those adopted by Blei et al. 
More experimentation is needed to determine which is preferable\u2014current experimental evidence [13] is ambiguous as to when probabilistic approaches should be preferred to co-training.\n\n6 Conclusions\n\nWe have described a technique that improves a simple web page classifier by exploiting link structure within a site, as well as page structure within hub pages. The system uses a variant of co-training called \u201cone-step co-training\u201d to exploit unlabeled data from a new site. First, pages are labeled using the base classifier. Next, results of this labeling are propagated to links to labeled pages, and these labeled links are used by a wrapper-learner called WL2 to propose potential \u201cmain-category link wrappers\u201d. Finally, these wrappers are used as features by another learner A2 to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier. Experiments suggest the choice of A2 is not critical.\n\nOn a real-world benchmark problem, this technique substantially improved the accuracy of a simple bag-of-words classifier, reducing error rate by about half. This improvement is statistically significant.\n\nAcknowledgments\n\nThe author wishes to thank his former colleagues at Whizbang Labs for many helpful discussions and useful advice.\n\nAppendix A: Details on \u201cWrapper Proposal\u201d\n\nExtraction predicates are constructed by WL2 using a rule-learning algorithm and a configurable set of components called builders. Each builder B corresponds to a language LB of extraction predicates. Builders support a certain set of operations relative to LB, in particular, the least general generalization (LGG) operation. Given a set of pairs D = {(x_i, a_i)} such that each a_i is a substring of x_i, LGG_B(D) is the least general p \u2208 LB such that (x, a) \u2208 D \u21d2 (x, a) \u2208 p. 
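(To make the LGG operation concrete, here is a toy builder language, an assumption for illustration only: a 'predicate' is a conjunction of token features describing an anchor, and the LGG is simply the intersection of the examples' feature sets. WL2's real builder languages are far richer.)

```python
def features(anchor):
    # Toy feature set for an anchor: the tokens describing it
    # (e.g. its HTML tag path and display properties).
    return set(anchor.split())

def lgg(examples):
    # Least general generalization in this toy language: the largest
    # conjunction of features shared by every positive example.
    return set.intersection(*[features(a) for a in examples])

def covers(predicate, anchor):
    # A predicate covers an anchor iff the anchor has all its features.
    return predicate <= features(anchor)

# Two anchors from the same table column generalize to their shared features.
p = lgg(['table col1 anchor bold', 'table col1 anchor italic'])
```

Applying each builder's LGG to a small, hopefully noise-free subset Di, as in Step 3 of Algorithm 2, yields one candidate extraction predicate per builder.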
Intuitively, LGG_B(D) encodes common properties of the (positive) examples in D. Depending on B, these properties might be membership in a particular syntactic HTML structure (e.g., a specific table column), common visual properties (e.g., being rendered in boldface), etc.\n\nTo generate subsets Di in Step 3 of Algorithm 2, we used every pair of links that pointed to the two most confidently labeled examples; every pair of adjacent tentatively-positive links; and every triple and every quadruple of tentatively-positive links that were separated by at most 10 intervening tokens. These heuristics were based on the observation that in most extraction tasks, the items to be extracted are close together. Careful implementation allows the subsets Di to be generated in time linear in the size of the site. (We also note that these heuristics were initially developed to support a different set of experiments [1], and were not substantially modified for the experiments in this paper.)\n\nNormally, WL2 is parameterized by a list B of builders, which are called by a \u201cmaster\u201d rule-learning algorithm. In our use of WL2, we simply applied each builder Bj to a dataset Di, to get the set of predicates {pij} = {LGG_Bj(Di)}, instead of running the full WL2 learning algorithm.\n\nReferences\n\n[1] David M. Blei, J. Andrew Bagnell, and Andrew K. McCallum. Learning with scope, with application to information extraction and classification. In Proceedings of UAI-2002, Edmonton, Alberta, 2002.\n\n[2] Avrim Blum. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373\u2013386, 1992.\n\n[3] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory, Madison, WI, 1998.\n\n[4] William W. Cohen. Automatically extracting features for concept learning from the web. 
In Machine Learning: Proceedings of the Seventeenth International Conference, Palo Alto, California, 2000. Morgan Kaufmann.\n\n[5] William W. Cohen and Wei Fan. Learning page-independent heuristics for extracting data from web pages. In Proceedings of The Eighth International World Wide Web Conference (WWW-99), Toronto, 1999.\n\n[6] William W. Cohen, Lee S. Jensen, and Matthew Hurst. A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of The Eleventh International World Wide Web Conference (WWW-2002), Honolulu, Hawaii, 2002.\n\n[7] David Cohn and Thomas Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.\n\n[8] Lee S. Jensen and William W. Cohen. A structured wrapper induction system for extracting information from semi-structured documents. In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, Seattle, WA, 2001.\n\n[9] T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorisation. In Proceedings of the International Conference on Machine Learning (ICML-2001), 2001.\n\n[10] N. Kushmerick. Wrapper induction: efficiency and expressiveness. Artificial Intelligence, 118:15\u201368, 2000.\n\n[11] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988.\n\n[12] Ion Muslea, Steven Minton, and Craig Knoblock. Wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems, 16(12), 1999.\n\n[13] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM-2000), 2000.\n\n[14] J. Ross Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 1994.\n\n[15] S. 
Slattery and T. Mitchell. Discovering test set regularities in relational domains. In Proceedings of the 17th International Conference on Machine Learning (ICML-2000), June 2000.\n", "award": [], "sourceid": 2291, "authors": [{"given_name": "William", "family_name": "Cohen", "institution": null}]}