{"title": "Transductive Inference for Estimating Values of Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 421, "page_last": 427, "abstract": null, "full_text": "Transductive Inference for Estimating \n\nValues of Functions \n\nOlivier Chapelle*, Vladimir Vapnik*,t, Jason Westontt.t,* \n\n* AT&T Research Laboratories, Red Bank, USA. \n\nt Royal Holloway, University of London, Egham, Surrey, UK. \n\ntt Barnhill BioInformatics.com, Savannah, Georgia, USA. \n\n{ chapelle, vlad, weston} @research.att.com \n\nAbstract \n\nWe introduce an algorithm for estimating the values of a function \nat a set of test points Xe+!, ... , xl+m given a set of training points \n(XI,YI), ... ,(xe,Ye) without estimating (as an intermediate step) \nthe regression function . We demonstrate that this direct (transduc(cid:173)\nti ve) way for estimating values of the regression (or classification \nin pattern recognition) can be more accurate than the tradition(cid:173)\nalone based on two steps, first estimating the function and then \ncalculating the values of this function at the points of interest. \n\n1 \n\nIntroduction \n\nFollowing [6] we consider a general scheme of transductive inference. Suppose there \nexists a function y* = fo(x) from which we observe the measurements corrupted \nwith noise \n\n(1) \nFind an algorithm A that using both the given set of training data (1) and the given \nset of test data \n\n. (xe, Ye)), Yi = Y; + ~i' \n\n((Xl, YI),\" \n\nselects from a set of functions {x t--+ f (x)} a function \n\n(Xl+!,' .. , XHm) \n\nY = f(x) = fA(xlxl,YI, ... ,Xl,Yl,XHI\"\",XHm) \n\nand minimizes at the points of interest the functional \n\nR(A) = E (~ (y; -\n\ni=l+l \n\nfA(Xilxl,Yl, ... ,Xl,Ye,Xl+l, . .. ,Xl+m))2) \n\n(2) \n\n(3) \n\n(4) \n\nwhere expectation is taken over X and~. For the training data we are given the \nvector X and the value Y, for the test data we are only given x. 
\n\nUsually, the problem of estimating values of a function at points of interest is \nsol ved in two steps: first in a given set of functions f (x, a), a E A one estimates \nthe regression, i.e the function which minimizes the functional \n\nR(a) = J ((y -\n\nf(x, a))2dF(x, Y), \n\n(5) \n\n\f422 \n\n0. Chapelle, V. N. Vapnik and J. Weston \n\n(the inductive step) and then using the estimated function Y = f(x,al) we calculate \nthe values at points of interest \n\nyi = f(x;, ae) \n\n(6) \n\n(the deductive step). \n\nNote, however, that the estimation of a function is equivalent to estimating its val(cid:173)\nues in the continuum points of the domain of the function. Therefore, by solving \nthe regression problem using a restricted amount of information, we are looking \nfor a more general solution than is required. In [6] it is shown that using a di(cid:173)\nrect estimation method one can obtain better bounds than through the two step \nprocedure. \n\nIn this article we develop the idea introduced in [5] for estimating the values of a \nfunction only at the given points. \n\nThe material is organized as follows. In Section 1 we consider the classical (induc(cid:173)\ntive) Ridge Regression procedure, and the leave-one--out technique which is used to \nmeasure the quality of its solutions. Section 2 introduces the transductive method \nof inference for estimation of the values of a function based on this leave-one- out \ntechnique. \nIn Section 3 experiments which demonstrate the improvement given \nby transductive inference compared to inductive inference (in both regression and \npattern recognition) are presented. Finally, Section 4 summarizes the results. \n\n2 Ridge Regression and the Leave-One-Out procedure \n\nIn order to describe our transductive method, let us first discuss the classical two(cid:173)\nstep (inductive plus deductive) procedure of Ridge Regression. Consider the set of \nfunctions linear in their parameters \n\nn \n\nf(x, a) = L aicPi(x). 
\n\ni=1 \n\n(7) \n\nTo minimize the expected loss (5), where F(x, y) is unknown, we minimize the \nfollowing empirical functional (the so-called Ridge Regression functional [1]) \n\nRemp(a) = e L)Yi -\n1 ~ \n\nl \n\ni=1 \n\nf(Xi, a)) + 1'110.11 \n2 \n\n2 \n\n(8) \n\nwhere l' is a fixed positive constant, called the regularization parameter. The min(cid:173)\nimum is given by the vector of coefficients \n\nae = a(xl, Yl, ... , Xl, Yl) = (KT K + 1'1)-1 KTy \n\nwhere \n\nand K is a matrix with elements: \n\ny = (Y1, ... ,Ylf, \n\nKij=cPj(Xi), i=I, ... ,\u00a3, j=I, ... ,n. \n\n(9) \n\n(10) \n\n(11) \n\nThe problem is to choose the value l' which provides small expected loss for training \non a sample Sl = {(Xl,Yl), .. . ,(Xl,Yl)}. \nFor this purpose, we would like to choose l' such that f\"f minimizing (8) also mini(cid:173)\nmizes \n\nR = J (Y* - f\"f(x* ISl))2dF(x*, y*)dF(Se). \n\n(12) \n\n\fTransductive Inference for Estimating Values of Functions \n\n423 \n\nSince F(x, y) is unknown one cannot estimate this minimum directly. To solve this \nproblem we instead use the leave-one-out procedure, which is an almost unbiased \nestimator of (12). The leave-one-out error of an algorithm on the training sample \nSf. is \n\n(13) \n\nThe leave-one-out procedure consists of removing from the training data one el(cid:173)\nement (say (Xi, Yi)), constructing the regression function only on the basis of the \nremaining training data and then testing the removed element. In this fashion one \ntests all f elements of the training data using f different decision rules. The mini(cid:173)\nmum over, of (13) we consider as the minimum over, of (12) since the expectation \nof (13) coincides with (12) [2]. \n\nFor Ridge Regression, one can derive a dosed form expression for the leave- one- out \nerror. Denoting \n\nthe error incurred by the leave-one-out procedure is [6] \n\nT. \n\nloo(r) -\n\n1 \n-_ \n\nf L \n\n~=1 \n\nf. (Y'_kTA-1KTy)2 \n\n'Y \n\n~ \n\n~ \n\n1 _ kT A-1k. 
\n~ \n\n'Y \n\n~ \n\nwhere \n\nLet, = ,0 be the minimum of (15). Then the vector \n\nkt = (i>I(xd\u00b7\u00b7\u00b7 ,l/>n(Xt)f\u00b7 \n\nyO = K*(KT K +,0 I)-I KTy \n\nwhere \n\nK*-\n\n( \n\nI/>(XHI) \n\n. \n\n1/>1 (XHm) \n\n(14) \n\n(15) \n\n(16) \n\n(17) \n\n(18) \n\nis the Ridge Regression estimate of the unknown values (Ye+l' ... ,Ye+m)' \n\n3 Leave-One-Out Error for Transductive Inference \n\nIn transductive inference, our goal is to find an algorithm A which minimizes the \nfunctional (4) using both the training data (1) and the test data (2). We suggest the \nfollowing method: predict (Ye+l' ... 'Ye+m) by finding those values which minimize \nthe leave-one-out error of Ridge Regression training on the joint set \n\n(Xl, yd,\u00b7\u00b7 . ,(Xl, Yl), (Xl+l, ye+l),\u00b7\u00b7 ., (XHm, Ye+m)' \n\n(19) \nThis is achieved in the following way. Suppose we treat the unknown values \n(Ye+l\" .. ,Ye+m) as variables and for some fixed value of these variables we min(cid:173)\nimize the following empirical functional \n\nRemp(aly;, .. \u00b7, y~) = f: m ~(Yi - f(xi,a))2 + . L (y; -\n\nHm \n\nf. \n\n(\n\n) \n\nf(xi, a))2 +,llaI1 2 . \n\n~=l \n\n~=l+1 \n\n(20) \nThis functional differs only in the second term from the functional (8) and corre(cid:173)\nsponds to performing Ridge Regression with the extra pairs \n\n(21) \n\n\f424 \n\nO. Chapel/e, V. N. Vapnik and J. Weston \n\nSuppose that vector Y\" = (Yi, ... , y:n) is taken from some set Y\" E Y such that \nthe pairs (21) can be considered as a sample drawn from the same distribution as \nthe pairs (Xl, yi), ... , (Xl, yi)\u00b7 In this case the leave-one-out error of minimizing \n(20) over the set (19) approximates the functional (4). We can measure this leave(cid:173)\none-out error using the same technique as in Ridge Regression. Using the closed \nform (15) one obtains \n\n7loo(rly~, .. \u00b7,y~) = -f-- L \n1 \n+ m i=l \n\nl+m (Y:' _ kT A-I kTY) 2 \n\n~ \nt~T ~-1~ \n1 - ki A-y ki \n\nwhere we denote x = (Xl, ... 
, x_{ℓ+m}), Ỹ = (y_1, …, y_ℓ, y*_{ℓ+1}, …, y*_{ℓ+m})ᵀ, and K̃_ij = φ_j(x_i), i = 1, …, ℓ+m, j = 1, …, n.
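As a concrete illustration, the closed-form leave-one-out error (15) and the transductive choice of test labels via (22) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are ours, and the minimization over the unknown test labels is done by a coordinate-wise grid search over candidate values, a simplification of the joint minimization over the set 𝒴 described above.

```python
import numpy as np

def loo_error(K, y, gamma):
    """Closed-form leave-one-out error of Ridge Regression, eq. (15).

    K     -- (l, n) design matrix, K[i, j] = phi_j(x_i), as in eq. (11)
    y     -- (l,) vector of targets
    gamma -- regularization parameter
    """
    A_inv = np.linalg.inv(K.T @ K + gamma * np.eye(K.shape[1]))  # A_gamma^{-1}, eq. (14)
    H = K @ A_inv @ K.T           # H[i, i] = k_i^T A_gamma^{-1} k_i
    resid = y - H @ y             # y_i - k_i^T A_gamma^{-1} K^T Y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def transduce(K_train, y_train, K_test, gamma, grid):
    """Predict each test label as the candidate value minimizing the joint
    leave-one-out error -- a coordinate-wise simplification of eq. (22)."""
    preds = []
    for i in range(K_test.shape[0]):
        # Append one test row and score every candidate label for it.
        K_joint = np.vstack([K_train, K_test[i:i + 1]])
        best = min(grid, key=lambda v: loo_error(K_joint, np.append(y_train, v), gamma))
        preds.append(best)
    return np.array(preds)
```

The closed form (15) is what makes the search tractable: each candidate label is scored with one matrix inversion instead of ℓ + 1 separate refits of the regression.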