{"title": "On Transductive Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 305, "page_last": 312, "abstract": null, "full_text": "On Transductive Regression\n\nCorinna Cortes Google Research 76 Ninth Avenue New York, NY 10011 corinna@google.com\n\nMehryar Mohri Courant Institute of Mathematical Sciences and Google Research 251 Mercer Street New York, NY 10012 mohri@cs.nyu.edu\n\nAbstract\nIn many modern large-scale learning applications, the amount of unlabeled data far exceeds that of labeled data. A common instance of this problem is the transductive setting where the unlabeled test points are known to the learning algorithm. This paper presents a study of regression problems in that setting. It presents explicit VC-dimension error bounds for transductive regression that hold for all bounded loss functions and coincide with the tight classification bounds of Vapnik when applied to classification. It also presents a new transductive regression algorithm inspired by our bound that admits a primal and kernelized closedform solution and deals efficiently with large amounts of unlabeled data. The algorithm exploits the position of unlabeled points to locally estimate their labels and then uses a global optimization to ensure robust predictions. Our study also includes the results of experiments with several publicly available regression data sets with up to 20,000 unlabeled examples. The comparison with other transductive regression algorithms shows that it performs well and that it can scale to large data sets.\n\n1 Introduction\nIn many modern large-scale learning applications, the amount of unlabeled data far exceeds that of labeled data. Large amounts of digitized data are widely available but the cost of labeling is often prohibitive since it typically requires human assistance. 
Semi-supervised learning and transductive inference leverage unlabeled data to achieve better predictions and are thus particularly relevant to modern applications. Semi-supervised learning consists of using both labeled and unlabeled data to find a hypothesis that accurately labels unseen examples. Transductive inference uses the same information but only aims at predicting the labels of the known unlabeled examples. This paper deals with regression problems in the transductive setting, which arise in a variety of contexts. Examples include predicting the real-valued labels of the nodes of a known graph in computational biology, or the scores associated with known documents in information extraction problems. The problem of transductive inference was originally formulated and analyzed by Vapnik [1982] who described it as a simpler task than the traditional induction treated in machine learning. A number of recent publications have dealt with the topic of transductive inference [Vapnik, 1998, Joachims, 1999, Bennett and Demiriz, 1998, Chapelle et al., 1999, Graepel et al., 1999, Schuurmans and Southey, 2002, Corduneanu and Jaakkola, 2003, Zhu et al., 2004, Lanckriet et al., 2004, Derbeko et al., 2004, Belkin et al., 2004, Zhou et al., 2005]. But, with the exception of [Chapelle et al., 1999], [Schuurmans and Southey, 2002], and [Belkin et al., 2004], this work has primarily dealt with classification problems. We present a specific study of transductive regression. We give new error bounds for transductive regression that hold for all bounded loss functions and coincide with the tight classification bounds of Vapnik [1998] when applied to classification. Our results also include explicit VC-dimension bounds for transductive regression. 
This contrasts with the original regression bound given by Vapnik [1998], which assumes a specific condition of global regularity on the class of functions and is based on a complicated and implicit function of the sample sizes and the confidence parameter. As stated by Vapnik [1998], this function must be \"tabulated by a computer\".\n\nWe also present a new algorithm for transductive regression inspired by our bound, which first exploits the position of unlabeled points to locally estimate their labels, and then uses a global optimization to ensure robust predictions. We show that our algorithm admits both a primal and a kernelized closed-form solution. Existing algorithms for the transductive setting require the inversion of a matrix whose dimension is either the total number of unlabeled and labeled examples [Belkin et al., 2004], or the total number of unlabeled examples [Chapelle et al., 1999]. This may be prohibitive for many real-world applications with very large amounts of unlabeled examples. One of the original motivations for our work was to design algorithms dealing precisely with such situations. When the dimension of the feature space N is not too large, our algorithm provides a very efficient solution whose cost is dominated by the construction and inversion of an N \\times N matrix. Similarly, when the number of training points m is small compared to the number of unlabeled points, using an empirical kernel map, our algorithm requires only constructing and inverting an m \\times m matrix.\n\nOur study also includes the results of our experiments with several publicly available regression data sets with up to 20,000 unlabeled examples, limited only by the size of the data sets. We compared our algorithm with those of Belkin et al. [2004] and Chapelle et al. [1999], which are among the very few algorithms described in the literature dealing specifically with the problem of transductive regression. 
The results show that our algorithm performs well in several data sets compared to these algorithms and that it can scale to large data sets. The paper is organized as follows. Section 2 describes in more detail the transductive regression setting we are studying. New generalization error bounds for transductive regression are presented in Section 3. Section 4 describes and analyzes both the primal and dual versions of our algorithm and the experimental results of our study are reported in Section 5.\n\n2 Definition of the Problem\nAssume that a full sample X of m + u examples is given. The learning algorithm further receives the labels of a random subset of X of size m which serves as a training sample: (x_1, y_1), \\ldots, (x_m, y_m) \\in X \\times R. (1) The remaining u unlabeled examples, x_{m+1}, \\ldots, x_{m+u} \\in X, serve as test data. The learning problem that we consider consists of predicting accurately the labels y_{m+1}, \\ldots, y_{m+u} of the test examples. No other test examples will ever be considered. This is a transductive regression problem [Vapnik, 1998]. It differs from the standard (induction) regression estimation problem by the fact that the learning algorithm is given the unlabeled test examples beforehand. Thus, it may exploit that information and achieve a better result than via standard induction. In what follows, we consider a hypothesis space H of real-valued functions for regression estimation. For a hypothesis h \\in H, we denote by R_0(h) its mean squared error on the full sample, by \\widehat{R}(h) its error on the training data, and by R(h) the error of h on the test examples:\n\nR_0(h) = \\frac{1}{m+u} \\sum_{i=1}^{m+u} (h(x_i) - y_i)^2, \\quad \\widehat{R}(h) = \\frac{1}{m} \\sum_{i=1}^{m} (h(x_i) - y_i)^2, \\quad R(h) = \\frac{1}{u} \\sum_{i=m+1}^{m+u} (h(x_i) - y_i)^2. (2)\n\nFor convenience, we will sometimes denote by y_x = y_i the label of a point x = x_i \\in X.\n\n3 Transductive Regression Generalization Error\nThis section presents explicit generalization error bounds for transductive regression. 
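As a quick illustration of the definitions in Equation 2, the sketch below (our own naming, not from the paper) computes R_0(h), \widehat{R}(h), and R(h) on a toy sample and checks the mixture identity R_0(h) = \frac{m}{m+u}\widehat{R}(h) + \frac{u}{m+u}R(h) used later in the proof of Theorem 1:

```python
import numpy as np

def transductive_errors(h, X, y, m):
    """Mean squared errors of hypothesis h on the full sample of m + u
    points, on the m labeled training points, and on the u test points."""
    residuals = (h(X) - y) ** 2
    R0 = residuals.mean()            # R_0(h): error on the full sample
    R_train = residuals[:m].mean()   # \hat{R}(h): training error
    R_test = residuals[m:].mean()    # R(h): test error
    return R0, R_train, R_test
```

On a sample with m = 2 and u = 2, the full-sample error is exactly the (m, u)-weighted mixture of the training and test errors.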
Vapnik [1998] introduced and analyzed the problem of transduction and presented transductive inference bounds for both classification and regression. His regression bound assumes however a specific regularity condition on the hypothesis functions, leading in particular to a surprising bound where no error on the training data implies zero generalization error. The bound has the multiplicative form: R(h) \\leq \\Phi(m, u, d, \\delta) \\widehat{R}(h), where d is the VC-dimension of the class of hypotheses used and \\delta is the confidence parameter. Furthermore, for certain values of the parameters, for example larger d or smaller \\delta, \\Phi becomes infinite and the bound is ineffective [Vapnik, 1998, page 349]. \\Phi is also based on a complicated and implicit function of m, u, and \\delta, which makes its interpretation difficult. For example, it is hard to analyze the asymptotic behavior of the bound for large u. (This is in fact one of the two transduction settings discussed by Vapnik [1998], but, under some general conditions, the results proved with this setting carry over to the other.)\n\nInstead, our bounds simply hold for general bounded loss functions and, when applied to classification, coincide with the tight classification bounds of Vapnik [1998]. Our results also include explicit VC-dimension bounds for transductive regression. To the best of our knowledge, these are the first general explicit bounds for transductive regression. Our first bound uses the function \\Gamma defined as follows. For \\alpha \\geq 0 and k \\in N with u\\alpha \\leq k \\leq m(1 - \\alpha) + u, let \\Gamma(\\alpha, k) be defined by:\n\n\\Gamma(\\alpha, k) = \\sum_{r \\in I(m,u,k,\\alpha)} \\binom{k}{r} \\binom{m+u-k}{m-r} \\Big/ \\binom{m+u}{m}, (3)\n\nwhere I(m, u, k, \\alpha) is the set of integers r such that \\frac{k-r}{u} - \\frac{r}{m} > \\alpha and \\max(0, k - u) \\leq r \\leq \\min(m, k). \\Gamma(\\alpha, k) represents the probability of observing a difference in error rate of more than \\alpha between the training and test set when the total number of errors is k (see [Cortes and Mohri, 2006]). Then \\Gamma is defined as \\Gamma(\\alpha) = \\max_k \\Gamma(\\alpha, k). \\Gamma is used in the transductive classification bound of Vapnik [1998] (see [Cortes and Mohri, 2006, Theorem 2]). [Cortes and Mohri, 2006, Corollary 2] gives an upper bound on \\Gamma.\n\nFor any subset \\bar{X} \\subseteq X, any non-negative real number t \\geq 0, and hypothesis h \\in H, let \\alpha(h, t, \\bar{X}) denote the fraction of the points x_i \\in \\bar{X} such that (h(x_i) - y_i)^2 - t > 0. Thus, \\alpha(h, t, \\bar{X}) represents the error rate over the sample \\bar{X} of the classifier that associates to a point x the value zero if (h(x) - y_x)^2 \\leq t, one otherwise. Two classifiers associated in this way to (h, t, \\bar{X}) and (h', t', \\bar{X}) can be viewed as equivalent if they label \\bar{X} in an identical way. Since X is finite, there is a finite number of equivalence classes of such classifiers; we will denote that number by N(m + u).\n\nTheorem 1 Let \\delta > 0, let \\alpha_0 > 0 be the minimum value of \\alpha such that N(m+u)\\Gamma(\\alpha) \\leq \\delta, and assume that the loss function is bounded: for all h \\in H and x \\in X, (h(x) - y_x)^2 \\leq B^2, where B \\in R_+. Then, with probability at least 1 - \\delta, for all h \\in H,\n\nR(h) \\leq \\widehat{R}(h) + \\alpha_0 B \\sqrt{\\widehat{R}(h) + \\Big(\\frac{u \\alpha_0 B}{2(m+u)}\\Big)^2} + \\frac{u \\alpha_0^2 B^2}{2(m+u)}. (4)\n\nProof. For any h \\in H, let R_1(h) be defined by:\n\nR_1(h) = \\int_0^{B^2} \\sqrt{\\alpha(h, t, X)} \\, dt. (5)\n\nBy the Cauchy-Schwarz inequality,\n\nR_1(h) \\leq \\Big(\\int_0^{B^2} \\alpha(h, t, X) \\, dt\\Big)^{1/2} \\Big(\\int_0^{B^2} 1 \\, dt\\Big)^{1/2} = B \\Big(\\int_0^{B^2} \\alpha(h, t, X) \\, dt\\Big)^{1/2}. (6)\n\nLet D denote the uniform probability distribution associated to the sample X. Thus, D(x) = \\frac{1}{m+u} for all x \\in X. Let Pr_{x \\sim D}[E_x] denote the probability of event E_x when x is randomly drawn according to D. By definition of R_0 and the Lebesgue integral, for all h \\in H,\n\nR_0(h) = \\int_X (h(x) - y_x)^2 D(x) \\, dx = \\int_0^{B^2} Pr_{x \\sim D}[(h(x) - y_x)^2 > t] \\, dt = \\int_0^{B^2} \\alpha(h, t, X) \\, dt. (7)\n\nIn view of Equation 7, Inequality 6 can be rewritten as: R_1(h) \\leq B \\sqrt{R_0(h)}. Similarly, setting X_m = \\{x_i \\in X : i \\in [1, m]\\} and X_u = \\{x_i \\in X : i \\in [m+1, m+u]\\}, we have\n\n\\widehat{R}(h) = \\int_0^{B^2} \\alpha(h, t, X_m) \\, dt \\quad and \\quad R(h) = \\int_0^{B^2} \\alpha(h, t, X_u) \\, dt. (8)\n\nBy [Cortes and Mohri, 2006, Theorem 2], for all \\alpha > 0 and for any t \\geq 0,\n\nPr\\Big[\\sup_{h \\in H} \\frac{\\alpha(h, t, X_u) - \\alpha(h, t, X_m)}{\\sqrt{\\alpha(h, t, X)}} > \\alpha\\Big] \\leq N(m+u)\\Gamma(\\alpha). (9)\n\nFix \\alpha > 0. Then, with probability at least 1 - N(m+u)\\Gamma(\\alpha), for all integers n > 1 and i \\geq 0,\n\n\\alpha(h, \\tfrac{iB^2}{n}, X_u) - \\alpha(h, \\tfrac{iB^2}{n}, X_m) \\leq \\alpha \\sqrt{\\alpha(h, \\tfrac{iB^2}{n}, X)}. (10)\n\nThen, the convergence of the Riemann sums to the integral ensures that\n\nR(h) - \\widehat{R}(h) = \\lim_{n \\to \\infty} \\frac{B^2}{n} \\sum_{i=0}^{n-1} \\Big[\\alpha(h, \\tfrac{iB^2}{n}, X_u) - \\alpha(h, \\tfrac{iB^2}{n}, X_m)\\Big] (11)\n\n\\leq \\alpha \\lim_{n \\to \\infty} \\frac{B^2}{n} \\sum_{i=0}^{n-1} \\sqrt{\\alpha(h, \\tfrac{iB^2}{n}, X)} = \\alpha R_1(h) \\leq \\alpha B \\sqrt{R_0(h)}. (12)\n\nLet \\delta > 0 and select \\alpha = \\alpha_0 as the minimum value of \\alpha such that N(m+u)\\Gamma(\\alpha) \\leq \\delta; then with probability at least 1 - \\delta,\n\nR(h) - \\widehat{R}(h) \\leq \\alpha_0 B \\sqrt{R_0(h)}. (13)\n\nPlugging in the following expression of R_0(h) with respect to \\widehat{R}(h) and R(h),\n\nR_0(h) = \\frac{m}{m+u} \\widehat{R}(h) + \\frac{u}{m+u} R(h), (14)\n\nand solving the second-degree equation in R(h) yields directly the statement of the theorem.\n\nTheorem 1 provides a general bound on the regression error within the transduction setting. The theorem can also be used to derive a bound in the classification case by simply setting B = 1. The resulting bound coincides with the tight classification bound given by Vapnik [1998]. The bound given by Theorem 1 depends on the function \\Gamma and is implicit. The following provides a general and explicit error bound for transductive regression directly expressed in terms of the empirical error, the number of equivalence classes N(m + u) or the VC-dimension d, and the sample sizes m and u.\n\nCorollary 1 Let H be a set of hypotheses with VC-dimension d. Assume that the loss function is bounded: for all h \\in H and x \\in X, (h(x) - y_x)^2 \\leq B^2, where B \\in R_+. 
Then, with probability at least 1 - \\delta, for all h \\in H,\n\nR(h) \\leq \\widehat{R}(h) + \\tilde{\\alpha} B \\sqrt{\\widehat{R}(h) + \\Big(\\frac{u \\tilde{\\alpha} B}{2(m+u)}\\Big)^2} + \\frac{u \\tilde{\\alpha}^2 B^2}{2(m+u)}, (15)\n\nwith \\tilde{\\alpha} = \\sqrt{\\frac{2(m+u)}{mu} \\big(\\log N(m+u) + \\log \\frac{1}{\\delta}\\big)} \\leq \\sqrt{\\frac{2(m+u)}{mu} \\big(d \\log \\frac{(m+u)e}{d} + \\log \\frac{1}{\\delta}\\big)}.\n\nProof. By Theorem 1, Inequality 15 holds for any \\alpha \\geq \\alpha_0, that is for any \\alpha such that N(m+u)\\Gamma(\\alpha) \\leq \\delta. By [Cortes and Mohri, 2006, Corollary 2], \\log(N(m+u)\\Gamma(\\alpha)) \\leq \\log N(m+u) - \\frac{mu}{m+u} \\frac{\\alpha^2}{2}. Setting \\log \\delta to match this upper bound yields the expression of \\tilde{\\alpha} given above. Since N(m+u) is bounded by the shattering coefficient of H of order m + u, by Sauer's lemma, \\log N(m+u) \\leq d \\log \\frac{(m+u)e}{d}. This gives the upper bound on \\tilde{\\alpha} in terms of the VC-dimension.\n\nThe bound is explicit and can be readily used within the Structural Risk Minimization (SRM) framework, either by using the expression of \\tilde{\\alpha} in terms of the VC-dimension, or the tighter expression with respect to the number of equivalence classes N. In the latter case, a structure of increasing numbers of equivalence classes can be constructed as in [Vapnik, 1998, page 360]. A more practical algorithm inspired by these concepts is described in the next section.\n\n4 Transductive Regression Algorithm\nThis section presents an algorithm for the transductive regression problem. Before presenting this algorithm, let us first emphasize that the algorithms introduced for transductive classification problems, e.g., transductive SVMs [Vapnik, 1998, Joachims, 1999], cannot be readily used for regression. These algorithms typically select the hypothesis h, out of a hypothesis space H, that minimizes the following objective function:\n\n\\min_{h \\in H, \\, y_{m+i}, i=1,\\ldots,u} \\Omega(h) + C \\frac{1}{m} \\sum_{i=1}^{m} L(h(x_i), y_i) + C' \\frac{1}{u} \\sum_{i=1}^{u} L(h(x_{m+i}), y_{m+i}), (16)\n\nwhere \\Omega(h) is a capacity measure term, L is the loss function used, C \\geq 0 and C' \\geq 0 are regularization parameters, and where the minimum is taken over all possible labels y_{m+1}, \\ldots, y_{m+u} for the test points. In regression, this scheme would lead to a trivial solution not exploiting the transduction setting. 
Indeed, let h_0 be the hypothesis minimizing the first two terms, that is the solution of the induction problem. For the particular choice y_{m+i} = h_0(x_{m+i}), i = 1, \\ldots, u, the third term vanishes. Thus, h_0 also minimizes the sum of all three terms. In two-group classification, the trivial solution is typically not the solution of the minimization problem because in general h_0(x_{m+i}) is not in \\{0, 1\\}. The main idea behind the design of our algorithm is to exploit the additional information provided in transduction, that is the position of the unlabeled examples. Our algorithm has two stages. The first stage is based on the position of unlabeled points. For each unlabeled point x_i, i = m+1, \\ldots, m+u, a local estimate label \\tilde{y}_i is determined using the labeled points in the neighborhood of x_i. In the second stage, a global hypothesis h is found that best matches all labels, those of the training data and the estimate labels \\tilde{y}_i.\n\nThis second stage is critical and distinguishes our method from other suggested ones. While using local information to determine labels is important (see for example the discussion of Vapnik [1998]), it is not sufficient for a robust prediction. A global estimate of all labels is needed to make predictions less vulnerable to noise.\n\n4.1 Local Estimates\nLet \\Phi be a feature mapping from X to a vector space F provided with a norm. We fix a radius r \\geq 0 and consider for all x' \\in X_u the ball of radius r centered at \\Phi(x'), denoted by B(\\Phi(x'), r). This defines the neighborhood of the image of each unlabeled point. A single radius r is used for all neighborhoods to limit the number of parameters of the algorithm. Labeled points x \\in X_m whose images \\Phi(x) fall within the neighborhood of \\Phi(x'), x' \\in X_u, help determine an estimate label of x'. With a very large radius r, the labels of all training examples contribute to the definition of the local estimates. But, with smaller radii, only a limited number of computations are needed. 
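A minimal sketch of this first stage, assuming inverse-distance weights normalized by the radius r (the function and variable names are ours, not the paper's):

```python
import numpy as np

def local_estimates(Phi_train, y_train, Phi_test, r):
    """For each unlabeled point, average the labels of training points whose
    images fall in the ball of radius r, weighting by inverse distance.
    Points with an empty neighborhood get None and are disregarded later."""
    estimates = []
    for phi_x in Phi_test:
        d = np.linalg.norm(Phi_train - phi_x, axis=1)
        in_ball = d <= r
        if not in_ball.any():
            estimates.append(None)                 # disregarded in both stages
            continue
        w = r / np.maximum(d[in_ball], 1e-12)      # inverse-distance weights
        estimates.append(float(w @ y_train[in_ball] / w.sum()))
    return estimates
```

With a small radius, each estimate depends on only a handful of labeled neighbors, which keeps this stage cheap.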
There are many possible ways to define the estimate label of x' \\in X_u based on the neighborhood points. One simple way consists of defining it as the weighted average of the neighborhood labels y_x, where the weights may be defined as the inverse of the distances of \\Phi(x) to \\Phi(x'), or as similarity measures K(x, x') when a positive definite kernel K is associated to \\Phi. Thus, when the set of labeled points with images in the neighborhood of \\Phi(x') is not empty, I = \\{i \\in [1, m] : \\Phi(x_i) \\in B(\\Phi(x'), r)\\} \\neq \\emptyset, the estimate label \\tilde{y}_{x'} of x' \\in X_u can be given by:\n\n\\tilde{y}_{x'} = \\frac{\\sum_{i \\in I} w_i y_i}{\\sum_{i \\in I} w_i}, \\quad with \\quad w_i^{-1} = \\frac{\\|\\Phi(x') - \\Phi(x_i)\\|}{r} \\quad or \\quad w_i = K(x', x_i). (17)\n\nThe estimate labels can also be obtained as the solution of a local linear or kernel ridge regression, which is what we used in most of our experiments. In practice, with a relatively small radius r, the computation of an estimate label \\tilde{y}_i depends only on a limited number of labeled points and their labels, and is quite efficient. When no labeled point exists in the neighborhood of x' \\in X_u, which depends on the radius r selected, x' is disregarded in both training stages of the algorithm.\n\n4.2 Global Optimization\nThe second stage of our algorithm consists of selecting a hypothesis h that best fits the labels of the training points and the estimate labels provided in the first stage. As suggested by Corollary 1, hypothesis spaces with a smaller number of equivalence classes guarantee a better generalization error. The bound also suggests reducing the empirical error. This leads us to consider the following objective function:\n\nG = \\|w\\|^2 + C \\sum_{i=1}^{m} (h(x_i) - y_i)^2 + C' \\sum_{i=m+1}^{m+u} (h(x_i) - \\tilde{y}_i)^2, (18)\n\nwhere h is a linear function with weight vector w \\in F: \\forall x \\in X, h(x) = w \\cdot \\Phi(x), and where C \\geq 0 and C' \\geq 0 are regularization parameters. The first two terms of the objective function coincide with those used in standard (kernel) ridge regression. 
The third term, which restricts the estimate error, can be viewed as imposing a smaller number of equivalence classes on the hypothesis space as suggested by the error bound of Corollary 1. The constraint explicitly exploits knowledge about the location of all the test points, and limits the range of the hypothesis at these locations, thereby reducing the number of equivalence classes. Our algorithm can be viewed as a generalization of (kernel) ridge regression to the transductive setting. In the following, we will show that this generalized optimization problem admits a closed-form solution and a natural kernel-based solution.\n\n4.2.1 Primal solution\nLet N be the dimension of the feature space and let W \\in R^{N \\times 1} denote the column matrix whose components are the coordinates of w, Y \\in R^{m \\times 1} the column matrix whose components are the labels y_i of the training examples, and \\tilde{Y}' \\in R^{u \\times 1} the column matrix whose components are the estimate labels \\tilde{y}_i of the test examples. Let X = [\\Phi(x_1), \\ldots, \\Phi(x_m)] \\in R^{N \\times m} denote the matrix whose columns are the components of the images by \\Phi of the training examples, and similarly X' = [\\Phi(x_{m+1}), \\ldots, \\Phi(x_{m+u})] \\in R^{N \\times u} the matrix corresponding to the test examples. G can then be rewritten as:\n\nG = \\|W\\|^2 + C \\|X^\\top W - Y\\|^2 + C' \\|X'^\\top W - \\tilde{Y}'\\|^2. (19)\n\nG is convex and differentiable and its gradient is given by:\n\n\\nabla G = 2W + 2C X(X^\\top W - Y) + 2C' X'(X'^\\top W - \\tilde{Y}'). (20)\n\nThe matrix W minimizing G is the unique solution of \\nabla G = 0. Since (I_N + C XX^\\top + C' X'X'^\\top) is invertible, it is given by the following expression:\n\nW = (I_N + C XX^\\top + C' X'X'^\\top)^{-1} (C X Y + C' X' \\tilde{Y}'). (21)\n\nThis gives a closed-form solution in the primal space based on the inversion of a matrix in R^{N \\times N}. Let T(N) be the time complexity of computing the inverse of a matrix in R^{N \\times N}; T(N) = O(N^3) using standard methods or T(N) = O(N^{2.376}) with the method of Coppersmith and Winograd. The time complexity of the computation of W from X, X', Y, and \\tilde{Y}' is thus in O(T(N) + (m + u)N^2). 
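Equation 21 translates directly into a few lines of code; the sketch below (our naming) forms the N x N system and solves it, which dominates the cost:

```python
import numpy as np

def primal_solution(X, Y, Xp, Yp, C, Cp):
    """Closed-form primal solution of Eq. 21:
    W = (I_N + C X X^T + C' X' X'^T)^{-1} (C X Y + C' X' Y'),
    where the columns of X (resp. Xp) are the feature vectors of the labeled
    points (resp. of the unlabeled points, with local estimates Yp)."""
    N = X.shape[0]
    A = np.eye(N) + C * X @ X.T + Cp * Xp @ Xp.T   # N x N, always invertible
    b = C * X @ Y + Cp * Xp @ Yp
    return np.linalg.solve(A, b)                    # cost dominated by the N x N solve
```

Solving the linear system rather than explicitly inverting A is the standard, numerically safer choice and has the same asymptotic cost.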
When the dimension N of the feature space is small compared to the number of examples m + u, which is typical in modern learning applications where u is large, this method remains practical and leads to a very efficient computation. The use of the so-called empirical kernel map [Scholkopf and Smola, 2002] also makes this method very attractive. Given a kernel K, the empirical kernel feature vector associated to x is the m-dimensional vector \\Phi(x) = [K(x, x_1), \\ldots, K(x, x_m)]^\\top. Thus, the dimension of the feature space is then N = m. For relatively small m, even for very large values of u with respect to m, the solution is efficiently computable and yet benefits from the use of kernels. This computational advantage is not shared by other methods such as the manifold regularization techniques [Belkin et al., 2004], or even by the regression technique described by [Chapelle et al., 1999], even though the latter is based on a primal method (we have derived a dual version of that method as well, see Section 5), since it requires among other things the inversion of a matrix in R^{u \\times u}. Once W is computed, prediction can be done by computing X'^\\top W in time O(uN).\n\n4.2.2 Dual solution\nThe computation can also be done in the dual space, which is useful in the case of very high-dimensional feature spaces. Let M_X \\in R^{N \\times (m+u)} and M_Y \\in R^{(m+u) \\times 1} be the matrices defined by:\n\nM_X = [\\sqrt{C} X \\quad \\sqrt{C'} X'] \\quad and \\quad M_Y = \\begin{bmatrix} \\sqrt{C} Y \\\\ \\sqrt{C'} \\tilde{Y}' \\end{bmatrix}. (22)\n\nThen, Equation 21 can be rewritten as: W = (I_N + M_X M_X^\\top)^{-1} M_X M_Y. To determine the dual solution, observe that\n\nM_X^\\top (M_X M_X^\\top + I_N)^{-1} = (M_X^\\top M_X + I_{m+u})^{-1} M_X^\\top, (23)\n\nwhere I_{m+u} denotes the identity matrix of R^{(m+u) \\times (m+u)}. This can be derived without difficulty from a series expansion of (M_X M_X^\\top + I_N)^{-1}. Thus, W can also be computed via:\n\nW = M_X (I_{m+u} + K)^{-1} M_Y, (24)\n\nwhere K is the Gram matrix K = M_X^\\top M_X. 
Let K_{21} \\in R^{u \\times m} and K_{22} \\in R^{u \\times u} be the sub-matrices of the Gram matrix K defined by K_{21} = (K(x_{m+i}, x_j))_{1 \\leq i \\leq u, 1 \\leq j \\leq m} and K_{22} = (K(x_{m+i}, x_{m+j}))_{1 \\leq i, j \\leq u}, and let K_2 \\in R^{u \\times (m+u)} be the matrix defined by:\n\nK_2 = [\\sqrt{C} K_{21} \\quad \\sqrt{C'} K_{22}] = X'^\\top M_X. (25)\n\nDataset | No. of unlab. points | Our algorithm | Chapelle et al. [1999] | Belkin et al. [2004]\nBoston Housing [13] | 25 | 20.2±14.7 | 4.3±11.3 | 2.4±5.4\nCalifornia Housing [8] | 500 | 8.4±6.9 | 2.7±3.0 | 3.9±12.3\n | 2,500 | 25.9±8.3 | 0.2±0.3 | 0.0±0.0\n | 5,000 | 17.2±8.7 | 0.0±0.0 | 0.0±0.0\n | 20,000 | 22.0±11.0 | -- | --\nkin-32fh [32] | 2,500 | 9.4±3.7 | 2.2±2.6 | 2.7±3.1\n | 8,000 | 18.4±5.9 | 0.5±0.5 | 0.9±0.7\nElevators [18] | 500 | 14.4±10.4 | 1.5±2.7 | 2.6±7.7\n | 2,500 | 9.0±6.9 | 2.2±2.9 | 0.0±0.0\n | 15,000 | 9.7±5.8 | -- | --\n\nTable 1: Transductive regression experiments; entries are relative improvements in MSE (%) over the baseline, followed by standard deviations. The number in brackets after the name indicates the input dimensionality of the data set. The number of training examples was m = 481 for the Boston Housing data set and m = 25 for the other tasks. The number of unlabeled examples was u = 25 for the Boston Housing data set and varied from u = 500 up to the maximum of 20,000 examples for the California Housing data set. For u \\geq 10,000, the algorithms of Chapelle et al. [1999] and Belkin et al. [2004] did not terminate within the time period of our experiments.\n\nThen, predictions can be made using kernel functions alone since X'^\\top W can be computed by:\n\nX'^\\top W = X'^\\top M_X (I_{m+u} + K)^{-1} M_Y = K_2 (I_{m+u} + K)^{-1} M_Y. (26)\n\nWhen the dimension of the feature space N is very large with respect to the total number of examples, this can lead to a faster computation of the solution. (I_{m+u} + K)^{-1} M_Y can be computed in O(T(m+u) + (m+u)^2 t_K) and predictions are computed in time O(u(m+u)), where t_K is the time complexity of the computation of K(x, x'), x, x' \\in X. As already pointed out in the description of the local estimates, in practice, some unlabeled points are disregarded in the training phases because no labeled point falls in their neighborhood. 
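Putting Equations 22-26 together, the dual computation can be sketched from the raw kernel Gram matrix alone (our naming; the paper's K in Eq. 24 denotes the scaled Gram matrix M_X^\top M_X, built below from the unscaled kernel values):

```python
import numpy as np

def dual_predictions(K, m, u, C, Cp, Y, Yp):
    """Dual-space predictions on the u test points, K_2 (I + Kbar)^{-1} M_Y
    (Eqs. 24-26). K is the raw (m+u) x (m+u) Gram matrix, Y the m training
    labels, Yp the u local estimate labels."""
    c = np.concatenate([np.full(m, np.sqrt(C)), np.full(u, np.sqrt(Cp))])
    Kbar = c[:, None] * K * c[None, :]      # M_X^T M_X: sqrt(C)-scaled Gram
    MY = c * np.concatenate([Y, Yp])        # M_Y
    K2 = c[None, :] * K[m:, :]              # [sqrt(C) K21, sqrt(C') K22]
    return K2 @ np.linalg.solve(np.eye(m + u) + Kbar, MY)
```

On a linear kernel this agrees with the primal solution of Eq. 21, which is a useful sanity check when implementing both.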
Thus, instead of u, a smaller number of unlabeled examples u' \\leq u determines the computational cost.\n\n5 Experimental Results\nThis section reports the results of our experiments with the transductive regression algorithm just presented on several data sets. For comparison, we also implemented the algorithm of Chapelle et al. [1999] and that of Belkin et al. [2004], which are among the very few algorithms described in the literature dealing specifically with the problem of transductive regression. For the algorithm of Chapelle et al. [1999], we in fact derived and implemented a dual solution not described in the original paper. With the notation used in that paper, it can be shown that\n\n\\hat{C} = I - \\hat{K}\\hat{K}(\\hat{K}\\hat{K} + I)^{-1}. (27)\n\nOur comparisons were made using several publicly available regression data sets: Boston Housing, kin-32fh, a data set in the Kinematics family with high unpredictability or noise, California Housing, and Elevators [Torgo, 2006]. For the Boston Housing data set, we used the same partitioning of the training and test sets as in [Chapelle et al., 1999]: 481 training examples and 25 test examples. The input variables were normalized to have mean zero and variance one. For the kin-32fh, California Housing, and Elevators data sets, 25 training examples were used with varying (large) amounts of test examples: 2,500 and 8,000 for kin-32fh; from 500 up to 20,000 for California Housing; and from 500 to 15,000 for Elevators. The experiments were repeated for 100 random partitions of training and test sets. The kernels used with all algorithms were Gaussian kernels. To measure the improvement produced by the transductive inference algorithms, we used kernel ridge regression as a baseline. The optimal values for the width of the Gaussian kernels and the ridge parameter C were determined using cross-validation. These parameters were then fixed at these values. The remaining parameters for our algorithm, r and C', were determined using a grid search and cross-validation. 
The parameters of the algorithms of Chapelle et al. [1999] and Belkin et al. [2004] were determined in the same way. Alternatively, the parameters could be selected using the explicit VC-dimension generalization bound of Corollary 1. For our algorithm, we found the best values of r to be typically among the 2.5% smallest distances between training and test points. Thus, each estimate label was determined by only a small number of labeled points. For our algorithm, we experimented both with the dual solution using Gaussian kernels, and with the primal solution using an empirical Gaussian kernel map as described in Section 4.2.1. The results obtained were very similar; however, the primal method was dramatically faster since it required the inversion of relatively small-dimensional matrices even for a large number of unlabeled examples. For consistency, all the results reported for our method relate to the dual solution, except for those with very large u, e.g., u \\geq 10,000, where the dual method was too time-consuming. Table 1 shows the results of our experiments. For each data set and each algorithm, the relative improvement in mean squared error (MSE) with respect to the baseline averaged over the random partitions is indicated, followed by its standard deviation. Some improvements were small or not statistically significant. In general, we observed no significant performance improvement over the baseline on any of these data sets using the Laplacian regularized least squares method of Belkin et al. [2004]. We note that, while positive classification results have been previously reported for this algorithm, no transductive regression experimental result seems to have been published for it. Our results for the method of Chapelle et al. 
[1999] match those reported by the authors for the Boston Housing data set (both absolute and relative MSE).\n\nOur algorithm achieved a significant improvement of the MSE on all data sets and for different amounts of unlabeled data and was shown to be practical for large data sets of 20,000 test examples. This matches many real-world situations where the amount of unlabeled data is orders of magnitude larger than that of labeled data.\n\n6 Conclusion\nWe presented a general study of transductive regression. We gave new and general explicit error bounds for transductive regression and described a simple and general algorithm inspired by our bound that can scale to relatively large data sets. The results of experiments show that our algorithm achieves a smaller error in several tasks compared to other previously published algorithms for transductive regression. The problem of transductive regression arises in a variety of learning contexts, in particular for learning node labels of very large graphs such as the web graph. This leads to computational problems that may require approximations or new algorithms. We hope that our study will be useful for dealing with these and other similar transductive regression problems.\n\nReferences\nMikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: a geometric framework for learning from examples. Technical Report TR-2004-06, University of Chicago, 2004.\nKristin Bennett and Ayhan Demiriz. Semi-supervised support vector machines. NIPS 11, pages 368-374, 1998.\nOlivier Chapelle, Vladimir Vapnik, and Jason Weston. Transductive inference for estimating values of functions. NIPS 12, pages 421-427, 1999.\nAdrian Corduneanu and Tommi Jaakkola. On information regularization. In Christopher Meek and Uffe Kjaerulff, editors, Proceedings of the Nineteenth Annual Conference on Uncertainty in Artificial Intelligence, pages 151-158, 2003.\nCorinna Cortes and Mehryar Mohri. On Transductive Regression. 
Technical Report TR2006-883, Courant Institute of Mathematical Sciences, New York University, November 2006.\nPhilip Derbeko, Ran El-Yaniv, and Ron Meir. Explicit learning curves for transduction and application to clustering and compression algorithms. J. Artif. Intell. Res. (JAIR), 22:117-142, 2004.\nThore Graepel, Ralf Herbrich, and Klaus Obermayer. Bayesian transduction. NIPS 12, 1999.\nThorsten Joachims. Transductive inference for text classification using support vector machines. In Ivan Bratko and Saso Dzeroski, editors, Proceedings of ICML-99, 16th International Conference on Machine Learning, pages 200-209. Morgan Kaufmann Publishers, San Francisco, US, 1999.\nGert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004. ISSN 1533-7928.\nBernhard Scholkopf and Alex Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.\nDale Schuurmans and Finnegan Southey. Metric-based methods for adaptive model selection and regularization. Machine Learning, 48:51-84, 2002.\nLuis Torgo. Regression datasets, 2006. http://www.liacc.up.pt/~ltorgo/Regression/DataSets.html.\nVladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.\nVladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.\nDengyong Zhou, Jiayuan Huang, and Bernhard Scholkopf. Learning from labeled and unlabeled data on a directed graph. In L. De Raedt and S. Wrobel, editors, Proceedings of ICML-05, pages 1041-1048, 2005.\nXiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, and John Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. NIPS 17, 2004.\n", "award": [], "sourceid": 3074, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}]}