Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)
William W. Cohen
Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that im- proves a simple web page classiﬁer’s performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique signiﬁcantly and substantially improves the accuracy of a bag-of-words classiﬁer, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classiﬁer; the results are used by a restricted wrapper-learner to propose potential “main-category anchor wrappers”; and ﬁnally, these wrappers are used as features by a third learner to ﬁnd a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classiﬁer.