{"title": "Error Bounds for Transductive Learning via Compression and Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1085, "page_last": 1092, "abstract": "", "full_text": "Error Bounds for Transductive Learning via Compression and Clustering\n\nPhilip Derbeko, Ran El-Yaniv and Ron Meir\n\nTechnion - Israel Institute of Technology\n\n{philip,rani}@cs.technion.ac.il, rmeir@ee.technion.ac.il\n\nAbstract\n\nThis paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering.\n\n1 Introduction and Related Work\n\nIn contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2], who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer a substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). 
However, unlike the current state of affairs in induction, the question of which learning principles are provably effective for transduction is far from resolved.\n\nIn this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially many (in m, the training sample size) different priors simultaneously. Altogether, this added flexibility of using multiple data-dependent priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms.\n\nWe briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent, depending on the number of decision nodes, the margins at each node and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error; however, the advantage is usually statistically insignificant. 
Refining the algorithm of [2], a transductive algorithm based on SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. An error bound for transduction, based on the effective VC dimension, is given in [10]. More recently, Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2].\n\n2 The Transduction Setup\n\nWe consider the following setting proposed by Vapnik ([2], Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let X_{m+u} = {x_1, . . . , x_{m+u}} be a set of points from X, each of which is chosen i.i.d. according to some unknown distribution µ(x). We call X_{m+u} the full sample. Let X_m = {x_1, . . . , x_m} and Y_m = {y_1, . . . , y_m}, where X_m is drawn uniformly from X_{m+u} and y_i ∈ {±1}. The set S_m = {(x_1, y_1), . . . , (x_m, y_m)} is referred to as a training sample. In this paper we assume that y_i = φ(x_i) for some unknown function φ. The remaining subset X_u = X_{m+u} \\ X_m is referred to as the unlabeled sample. Based on S_m and X_u our goal is to choose h ∈ H which predicts the labels of points in X_u as accurately as possible. For each h ∈ H and a set Z = x_1, . . . , x_{|Z|} of samples define\n\nR_h(Z) = (1/|Z|) Σ_{i=1}^{|Z|} ℓ(h(x_i), y_i),   (1)\n\nwhere in our case ℓ(·,·) is the zero-one loss function. Our goal in transduction is to learn an h such that R_h(X_u) is as small as possible. 
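As a concrete reading of the empirical risk (1), here is a minimal Python sketch; the threshold hypothesis and the labels are illustrative stand-ins, not anything from the paper:

```python
def zero_one(y_pred, y_true):
    # zero-one loss used in equation (1)
    return 0.0 if y_pred == y_true else 1.0

def risk(h, Z, labels):
    # R_h(Z) = (1/|Z|) * sum over i of loss(h(x_i), y_i)
    return sum(zero_one(h(x), y) for x, y in zip(Z, labels)) / len(Z)

# illustrative: a threshold classifier on the real line
h = lambda x: 1 if x >= 0 else -1
Z = [-2.0, -1.0, 0.5, 3.0]
labels = [-1, -1, -1, 1]      # h errs only on the point 0.5
print(risk(h, Z, labels))     # -> 0.25
```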
This problem setup is summarized by the following transduction “protocol” introduced in [2] and referred to as Setting 1:\n\n(i) A full sample X_{m+u} = {x_1, . . . , x_{m+u}} consisting of arbitrary m + u points is given.¹\n\n(ii) We then choose uniformly at random the training sample X_m ⊆ X_{m+u} and receive its labeling Y_m; the resulting training set is S_m = (X_m, Y_m) and the remaining set X_u is the unlabeled sample, X_u = X_{m+u} \\ X_m;\n\n(iii) Using both S_m and X_u we select a classifier h ∈ H whose quality is measured by R_h(X_u).\n\nVapnik [2] also considers another formulation of transduction, referred to as Setting 2:\n\n(i) We are given a training set S_m = (X_m, Y_m) selected i.i.d. according to µ(x, y).\n\n(ii) An independent test set S_u = (X_u, Y_u) of u samples is then selected in the same manner.\n\n(iii) We are required to choose our best h ∈ H based on S_m and X_u so as to minimize\n\nR_{m,u}(h) = ∫ (1/u) Σ_{i=m+1}^{m+u} ℓ(h(x_i), y_i) dµ(x_1, y_1) · · · dµ(x_{m+u}, y_{m+u}).   (2)\n\n¹The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x).\n\nEven though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1.\n\nWe make use of the following quantities, which are all instances of (1). The quantity R_h(X_{m+u}) is called the full sample risk of the hypothesis h, R_h(X_u) is referred to as the transduction risk (of h), and R_h(X_m) is the training error (of h). Thus, R_h(X_m) is the standard training error, denoted by R̂_h(S_m). 
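The sampling step of Setting 1 can be sketched directly; this is only an illustration of the protocol (the helper name, the fixed seed, and the integer stand-in points are ours):

```python
import random

def setting1_split(full_sample, m, seed=0):
    # Setting 1: X_m is drawn uniformly at random, without replacement,
    # from the fixed full sample X_{m+u}; X_u is the remainder.
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(full_sample)), m))
    X_m = [x for i, x in enumerate(full_sample) if i in idx]
    X_u = [x for i, x in enumerate(full_sample) if i not in idx]
    return X_m, X_u

X_full = list(range(10))          # toy full sample with m + u = 10 points
X_m, X_u = setting1_split(X_full, m=6)
print(len(X_m), len(X_u))         # -> 6 4
```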
While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize R_h(X_u)), it turns out that it is much easier to derive error bounds for the full sample risk. The following simple lemma translates an error bound on R_h(X_{m+u}), the full sample risk, to an error bound on the transduction risk R_h(X_u).\n\nLemma 2.1 For any h ∈ H and any C,\n\nR_h(X_{m+u}) ≤ R̂_h(S_m) + C   ⇔   R_h(X_u) ≤ R̂_h(S_m) + ((m + u)/u) · C.   (3)\n\nProof: For any h,\n\nR_h(X_{m+u}) = (m R_h(X_m) + u R_h(X_u)) / (m + u).   (4)\n\nSubstituting R̂_h(S_m) for R_h(X_m) in (4) and then substituting the result for the left-hand side of (3) we get\n\n(m R̂_h(S_m) + u R_h(X_u)) / (m + u) ≤ R̂_h(S_m) + C.\n\nThe equivalence (3) is now obtained by isolating R_h(X_u) on the left-hand side. □\n\n3 General Error Bounds for Transduction\n\nConsider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this, note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2^{m+u} possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set X_{m+u} of cardinality m + u. This corresponds to a selection of m points without replacement from a set of m + u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. 
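Lemma 2.1 is a direct consequence of the decomposition (4); a quick numerical check in exact rational arithmetic (the risk values and constants below are arbitrary illustrative choices):

```python
from fractions import Fraction as F

m, u = F(80), F(20)
r_m, r_u = F(1, 10), F(3, 10)     # training and transduction risks (arbitrary)
C = F(1, 20)

# equation (4): the full sample risk is a weighted mix of the two risks
r_full = (m * r_m + u * r_u) / (m + u)

# the two sides of the equivalence (3) agree for these values
lhs = r_full <= r_m + C
rhs = r_u <= r_m + (m + u) / u * C
print(lhs, rhs)   # -> True True
```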
Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14].\n\nTheorem 3.1 Let C = {c_1, . . . , c_N}, c_i ∈ {0, 1}, be a finite set of binary numbers, and set c̄ = (1/N) Σ_{i=1}^N c_i. Let Z_1, . . . , Z_m be random variables obtaining their values by sampling C uniformly at random without replacement. Set Z = (1/m) Σ_{i=1}^m Z_i and β = m/N. Then, if² ε ≤ min{1 - c̄, c̄(1 - β)/β},\n\nPr{Z - EZ > ε} ≤ exp{ -m D(c̄ + ε ‖ c̄) - (N - m) D((c̄ - βε)/(1 - β) ‖ c̄) + 7 log(N + 1) },\n\nwhere D(p‖q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)), p, q ∈ [0, 1], is the binary Kullback-Leibler divergence.\n\nUsing this result we obtain the following error bound for transductive classification.\n\nTheorem 3.2 Let X_{m+u} = X_m ∪ X_u be the full sample and let p = p(X_{m+u}) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 - δ over choices of S_m (from the full sample) the following bound holds for any h ∈ H,\n\nR_h(X_u) ≤ R̂_h(S_m) + √( (2 R̂_h(S_m)(m + u)/u) · (log(1/p(h)) + ln(m/δ) + 7 log(m + u + 1)) / (m - 1) ) + 2 · (log(1/p(h)) + ln(m/δ) + 7 log(m + u + 1)) / (m - 1).   (5)\n\nProof: (sketch) In our transduction setting the set X_m (and therefore S_m) is obtained by sampling the full sample X_{m+u} uniformly at random without replacement. 
We first claim that\n\nE_{Σm} R̂_h(S_m) = R_h(X_{m+u}),   (6)\n\nwhere E_{Σm}(·) is the expectation with respect to a random choice of S_m from X_{m+u} without replacement. This is shown as follows:\n\nE_{Σm} R̂_h(S_m) = (1 / (m+u choose m)) Σ_{X_m ⊆ X_{m+u}} (1/m) Σ_{x ∈ S_m} ℓ(h(x), φ(x)).\n\nBy symmetry, all points x ∈ X_{m+u} are counted on the right-hand side an equal number of times; this number is precisely (m+u-1 choose m-1). The equality (6) is obtained by considering the definition of R_h(X_{m+u}) and noting that (m+u-1 choose m-1) / (m+u choose m) = m/(m+u).\n\nThe remainder of the proof combines Theorem 3.1 and the techniques presented in [15]. The details will be provided in the full paper. □\n\nNotice that when R̂_h(S_m) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample X_{m+u} in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 - δ with respect to the random selection of sub-samples of size m from the fixed set X_{m+u}. This should be contrasted with the standard inductive setting results, where the probabilities are with respect to a random choice of m training points chosen i.i.d. from µ(x, y).\n\nThe next bound we present is analogous to McAllester’s Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. 
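The claim (6) in the proof sketch above can be verified exactly on a toy instance by averaging the training error over every sub-sample of size m; the loss list below stands in for the fixed values ℓ(h(x_i), φ(x_i)) on a full sample of six points:

```python
from fractions import Fraction as F
from itertools import combinations

losses = [0, 1, 0, 0, 1, 0]   # zero-one losses of a fixed h on X_{m+u}
m = 3                          # so u = 3 in this toy instance

full_risk = F(sum(losses), len(losses))
subs = list(combinations(losses, m))
# exact expectation of the training error over all sub-samples of size m
expected = sum(F(sum(s), m) for s in subs) / len(subs)
print(expected == full_risk)   # -> True, matching identity (6)
```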
For any distribution q over H, denote by G_q the Gibbs classifier, which classifies an instance (in X_u) by randomly choosing, according to q, one hypothesis h ∈ H. For Gibbs classifiers we now extend definition (1) as follows. Let Z = x_1, . . . , x_{|Z|} be any set of samples and let G_q be a Gibbs classifier over H. The risk of G_q over Z is R_{G_q}(Z) = E_{h∼q} { (1/|Z|) Σ_{i=1}^{|Z|} ℓ(h(x_i), φ(x_i)) }. As before, when Z = X_m (the training set) we use the standard notation R̂_{G_q}(S_m) = R_{G_q}(X_m). Due to space limitations, the proof of the following theorem will appear in the full paper.\n\n²The second condition, ε ≤ c̄(1 - β)/β, simply guarantees that the number of ‘ones’ in the sub-sample does not exceed their number in the original sample.\n\nTheorem 3.3 Let X_{m+u} be the full sample. Let p be a distribution over H that may depend on X_{m+u} and let q be a (posterior) distribution over H that may depend on both S_m and X_u. Let δ ∈ (0, 1) be given. With probability at least 1 - δ over the choices of S_m, for any distribution q,\n\nR_{G_q}(X_u) ≤ R̂_{G_q}(S_m) + √( (2 R̂_{G_q}(S_m)(m + u)/u) · (D(q‖p) + ln(m/δ) + 7 log(m + u + 1)) / (m - 1) ) + 2 · (D(q‖p) + ln(m/δ) + 7 log(m + u + 1)) / (m - 1).\n\nIn the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of “compact priors”. Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds. 
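The right-hand side of (5) is directly computable once p(h) is fixed. A sketch follows; all numbers are illustrative, and natural logarithms are assumed throughout (the statement above mixes log and ln notation):

```python
from math import log, sqrt

def bound_rhs(emp_risk, m, u, log_inv_p, delta):
    # complexity term of (5): (log(1/p(h)) + ln(m/delta) + 7 log(m+u+1)) / (m-1)
    c = (log_inv_p + log(m / delta) + 7 * log(m + u + 1)) / (m - 1)
    # bound (5): empirical risk + square-root term + linear term
    return emp_risk + sqrt(2 * emp_risk * (m + u) / u * c) + 2 * c

b = bound_rhs(emp_risk=0.05, m=200, u=800, log_inv_p=7.5, delta=0.05)
print(0 < b < 1)   # -> True for these illustrative values
```

Under the multiple-prior extension discussed next (Theorem 3.4), one would pass log_inv_p = min_i ln(1/p_i(h)) and replace ln(m/δ) by ln(km/δ). Note also that with emp_risk = 0 the square-root term vanishes, which is the faster rate mentioned above.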
The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PAC-Bayesian framework, we observe that one can use all PAC-Bayesian bounds with a number of priors p_1, . . . , p_k and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by min_i ln(1/p_i(h)), at a cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with min_i D(q‖p_i). The penalty for using k priors is logarithmic in k (specifically, the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second “extension” is simply the feature of our transduction bounds (Theorems 3.2 and 3.3) which allows the priors to depend on the full sample X_{m+u}. The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms. Suppose that instead of a single prior p over H we want to utilize k priors, p_1, . . . , p_k, and in retrospect choose the best among the k corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner.\n\nTheorem 3.4 Let the conditions of Theorem 3.2 hold, except that we now have k prior distributions p_1, . . . , p_k defined over H, each of which may depend on X_{m+u}. Let δ ∈ (0, 1) be given. 
Then, with probability at least 1 - δ over random choices of sub-samples of size m from the full sample, for all h ∈ H, (5) holds with p(h) replaced by min_{1≤i≤k} p_i(h) and log(1/δ) replaced by log(k/δ).\n\nRemark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the k priors intersect (i.e. there is at least one pair of priors p_i and p_j with overlapping support), then one can do better by utilizing the “super prior” p = (1/k) Σ_i p_i within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. In the applications below we utilize non-intersecting priors.\n\n4 Bounds for Compression Algorithms\n\nHere we propose a technique for bounding the error of “compression” algorithms based on appropriate construction of prior probabilities. Let A be a learning algorithm. Intuitively, A is a “compression scheme” if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a sub-sample Z′, Z′ ⊂ Z, such that A(Z′) = A(Z). Observe that the SVM approach is a compression scheme, with Z′ being determined by the set of support vectors.\n\nLet A be a deterministic compression scheme and consider the full sample X_{m+u}. For each integer τ = 1, . . . , m, consider all subsets of X_{m+u} of size τ, and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). 
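The hypothesis pool produced by this construction is large but only logarithmically penalized: at most one hypothesis per dichotomy of each size-τ subset gives 2^τ · (m+u choose τ) hypotheses, whose logarithm is at most τ ln(2e(m+u)/τ). This closed-form term is the one used in the compression corollary below; a numerical check (sizes illustrative, natural logs assumed):

```python
from math import comb, log, e

def pool_log_size(tau, m, u):
    # ln |H_tau| in the worst case: one hypothesis per dichotomy of each subset
    return log(2**tau * comb(m + u, tau))

def complexity_term(tau, m, u):
    # closed-form upper bound: tau * ln(2e(m+u)/tau)
    return tau * log(2 * e * (m + u) / tau)

m, u = 100, 400
for tau in (1, 5, 20):
    assert pool_log_size(tau, m, u) <= complexity_term(tau, m, u)
print('ok')
```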
A deterministic algorithm A uniquely determines at most one hypothesis h ∈ H for each dichotomy.³ For each τ, let the set of hypotheses generated by this procedure be denoted by H_τ. For the rest of this discussion we assume the worst case, where |H_τ| = 2^τ (m+u choose τ) (i.e. if H_τ does not contain one hypothesis for each dichotomy, the bounds improve). The prior p_τ is then defined to be a uniform distribution over H_τ. In this way we have m priors, p_1, . . . , p_m, which are constructed using only X_{m+u} (and are independent of S_m). Any hypothesis selected by the learning algorithm A based on the labeled sample S_m and on the test set X_u belongs to ∪_{τ=1}^m H_τ. The motivation for this construction is as follows. Each τ can be viewed as our “guess” for the maximal number of compression points that will be utilized by a resulting classifier. For each such τ the prior p_τ is constructed over all possible classifiers that use τ compression points. By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. Thus, each prior p_τ represents one such guess. Using Theorem 3.4 we are later allowed to choose in retrospect the bound corresponding to the best “guess”. The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier.\n\nCorollary 4.1 Let the conditions of Theorem 3.4 hold. Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. 
Then with probability at least 1 - δ, for all h ∈ H, (5) holds with log(1/p(h)) replaced by s log(2e(m + u)/s) and ln(m/δ) replaced by ln(m²/δ).\n\nProof: Recall that H_s ⊆ H is the support set of p_s and that p_s(h) = 1/|H_s| for all h ∈ H_s, implying that ln(1/p_s(h)) = ln |H_s|. Using the inequality (m+u choose s) ≤ (e(m + u)/s)^s we have that |H_s| = 2^s (m+u choose s) ≤ (2e(m + u)/s)^s. Substituting this result in Theorem 3.4, while restricting the minimum over i to be over i ≥ s, leads to the desired result. □\n\nThe bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample S_m is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with s being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3].⁴\n\n³It might be that for some dichotomies the algorithm will fail. For example, an SVM in feature space without soft margin will fail to classify non linearly-separable dichotomies of X_{m+u}.\n\n⁴Note, however, that our bounds are optimized with a “minimum number of support vectors” approach rather than “maximum margin”.\n\n5 Bounds for Clustering Algorithms\n\nSome learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. 
the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is a lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification. Let A be any (deterministic) clustering algorithm which, given the full sample X_{m+u}, can cluster this sample into any desired number of clusters. We use A to cluster X_{m+u} into 2, 3, . . . , c clusters, where c ≤ m. Thus, the algorithm generates a collection of partitions of X_{m+u} into τ = 2, 3, . . . , c clusters, where each partition is denoted by C_τ. For each value of τ, let H_τ consist of those hypotheses which assign an identical label to all points in the same cluster of partition C_τ, and define the prior p_τ(h) = 1/2^τ for each h ∈ H_τ and zero otherwise (note that there are 2^τ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample S_m = (X_m, Y_m), for each of the clusterings C_2, . . . , C_c constructed above, it assigns a label to each cluster based on the majority vote from the labels Y_m of points falling within the cluster (in case of ties, or if no points from X_m belong to the cluster, choose a label arbitrarily). Doing this leads to c - 1 classifiers h_τ, τ = 2, . . . , c. For each h_τ there is a valid error bound as given by Theorem 3.4, and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1.\n\nCorollary 5.1 Let A be any clustering algorithm and let h_τ, τ = 2, . . . 
, c be classifications of the test set X_u as determined by clustering of the full sample X_{m+u} (into τ clusters) and the training set S_m, as described above. Let δ ∈ (0, 1) be given. Then with probability at least 1 - δ, for all τ, (5) holds with log(1/p(h)) replaced by τ and ln(m/δ) replaced by ln(mc/δ).\n\nError bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters).\n\nCorollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply k clustering algorithms (using each algorithm to cluster the data into τ = 2, . . . , c clusters). We thus obtain kc hypotheses (partitions of X_{m+u}). By a simple application of the union bound we can replace ln(cm/δ) by ln(kcm/δ) in Corollary 5.1 and guarantee that kc bounds hold simultaneously for all kc hypotheses (with probability at least 1 - δ). We thus choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since typically, without prior knowledge, we do not know which clustering algorithm will be effective for the dataset at hand.\n\n6 Concluding Remarks\n\nWe presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the “slacks” in the inequalities we use in our analysis. 
Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur.\n\nAcknowledgments The work of R.E. and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff center of the department of Electrical Engineering at the Technion is also acknowledged. We also thank the anonymous referees for their useful comments.\n\nReferences\n\n[1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.\n\n[2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.\n\n[3] T. Joachims. Transductive inference for text classification using support vector machines. In European Conference on Machine Learning, 1999.\n\n[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001.\n\n[5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001.\n\n[6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.\n\n[7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.\n\n[8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.\n\n[9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. 
In International Conference on Machine Learning, 1999.\n\n[10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T, 1994.\n\n[11] G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of California, Berkeley, Computer Science Division, 2002.\n\n[12] A. Blum and J. Langford. PAC-MDL bounds. In COLT, pages 344–357, 2003.\n\n[13] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.\n\n[14] A. Dembo and O. Zeitouni. Large Deviation Techniques and Applications. Springer, New York, second edition, 1998.\n\n[15] D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203–215, 2003.\n\n[16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Boston, 2002.", "award": [], "sourceid": 2345, "authors": [{"given_name": "Philip", "family_name": "Derbeko", "institution": null}, {"given_name": "Ran", "family_name": "El-Yaniv", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}]}