Part of Advances in Neural Information Processing Systems 16 (NIPS 2003)

*Philip Derbeko, Ran El-Yaniv, Ron Meir*

This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering.

1 Introduction and Related Work

In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2], who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer a substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of which learning principles are provably effective for transduction is far from resolved.

In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. Altogether, this added flexibility of using multiple data-dependent priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms.
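For background, one commonly quoted form of McAllester’s inductive PAC-Bayesian bound [7, 8], which the technique above extends to transduction, is sketched below. The exact constants and logarithmic terms vary across versions of the theorem, so this should be read as an illustrative statement rather than the form used in this paper: for a prior distribution P over hypotheses fixed before seeing the i.i.d. sample of size m, with probability at least 1 − δ, simultaneously for all posteriors Q,

```latex
R(Q) \;\le\; \hat{R}(Q) \;+\;
\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{m}{\delta}}{2(m-1)}},
```

where R(Q) and R̂(Q) denote the true and empirical Gibbs risks of Q. The key point exploited in the transductive setting is that the union bound over polynomially many priors only adds a logarithmic term, while priors chosen after seeing the unlabeled full sample can have very small support, making KL(Q‖P) small.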

We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent, depending on the number of decision nodes, the margins at each node, and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error; however, the advantage is usually statistically insignificant. Refining the algorithm of [2], a transductive algorithm based on SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. An error bound for transduction, based on the effective VC dimension, is given in [10]. More recently, Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2].

2 The Transduction Setup

We consider the following setting proposed by Vapnik ([2] Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1, . . . , xm+u} be a set of points from X each of which is chosen i.i.d. according to some unknown distribution µ(x). We call Xm+u the full sample. Let Xm = {x1, . . . , xm} and Ym = {y1, . . . , ym}, where Xm is drawn uniformly from Xm+u and yi ∈ {±1}. The set Sm = {(x1, y1), . . . , (xm, ym)} is referred to as a training sample. In this paper we assume that yi = φ(xi) for some unknown function φ. The remaining subset Xu = Xm+u \ Xm is referred to as the unlabeled sample. Based on Sm and Xu our goal is to choose h ∈ H which predicts the labels of points in Xu as accurately as possible. For each h ∈ H and a set Z = {x1, . . . , x|Z|} of samples define
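The sampling protocol above can be sketched in code. This is a minimal illustration, not part of the paper: the function name, the toy target φ, and the uniform marginal µ are all hypothetical choices made for the demo.

```python
import random

def make_transduction_split(m, u, phi, sample_x, rng):
    """Draw a full sample X_{m+u} of m+u i.i.d. points, then pick m of
    them uniformly at random (without replacement) as the labeled
    training sample S_m; the remaining u points form the unlabeled
    sample X_u whose labels the learner must transduce."""
    full = [sample_x(rng) for _ in range(m + u)]                  # X_{m+u}, i.i.d. from mu
    train_idx = set(rng.sample(range(m + u), m))                  # X_m drawn uniformly
    S_m = [(full[i], phi(full[i])) for i in sorted(train_idx)]    # labeled via y_i = phi(x_i)
    X_u = [full[i] for i in range(m + u) if i not in train_idx]   # X_{m+u} \ X_m
    return S_m, X_u

# Toy instantiation: points on the real line, labels given by the sign.
# phi is "unknown" to the learner; we only use it to label S_m here.
rng = random.Random(0)
S_m, X_u = make_transduction_split(
    m=5, u=10,
    phi=lambda x: 1 if x >= 0 else -1,    # target function phi
    sample_x=lambda r: r.uniform(-1, 1),  # mu(x): uniform on [-1, 1]
    rng=rng,
)
print(len(S_m), len(X_u))  # 5 10
```

Note that, unlike induction, the learner here sees X_u (the unlabeled points themselves) before committing to a hypothesis, which is exactly what allows priors to depend on the full sample.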
