{"title": "Incorporating Test Inputs into Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 437, "page_last": 443, "abstract": "", "full_text": "Incorporating Test Inputs into Learning \n\nZehra Cataltepe \nLearning Systems Group \nDepartment of Computer Science \nCalifornia Institute of Technology \nPasadena, CA 91125 \nzehra@cs.caltech.edu \n\nMalik Magdon-Ismail \nLearning Systems Group \nDepartment of Electrical Engineering \nCalifornia Institute of Technology \nPasadena, CA 91125 \nmagdon@cco.caltech.edu \n\nAbstract \n\nIn many applications, such as credit default prediction and medical image recognition, test inputs are available in addition to the labeled training examples. We propose a method to incorporate the test inputs into learning. Our method results in solutions having smaller test errors than that of the simple training solution, especially for noisy problems or small training sets. \n\n1 Introduction \n\nWe introduce an estimator of the test error that takes into consideration the test inputs. The new estimator, the augmented error, is composed of the training error and an additional term computed using the test inputs. In some applications, such as credit default prediction and medical image recognition, we do have access to the test inputs. In our experiments, we found that the augmented error (which is computed without looking at the test outputs, only the test inputs and the training examples) can result in a smaller test error. In particular, it tends to increase when the test error increases (overtraining), even when the training error does not (see figure 1). \n\nIn this paper, we provide an analytic solution for incorporating test inputs into learning in the case of linear, noisy targets and linear hypothesis functions. We also show experimental results for the nonlinear case.
\n\nPrevious results on the use of unlabeled inputs include Castelli and Cover [2], who show that labeled examples are exponentially more valuable than unlabeled examples in reducing the classification error. For mixture models, Shahshahani and Landgrebe [7] and Miller and Uyar [6] investigate incorporating unlabeled examples into learning for classification problems using the EM algorithm, and show that unlabeled examples are useful especially when the input dimensionality is high and the number of examples is small. In our work, we concentrate only on estimating the test error better using the test inputs; our method extends to the case of unlabeled inputs or input distribution information. Our method is also applicable to both regression and classification problems. \n\n[Figure 1 appears here: training, test and augmented errors plotted against log(pass).] \n\nFigure 1: The augmented error, computed without looking at the test outputs at all, follows the test error as overtraining occurs. \n\nIn figure 1, we show the training, test and augmented errors while learning a nonlinear noisy target function with a nonlinear hypothesis. As overtraining occurs, the augmented error follows the test error. In section 2 we explain our method of incorporating test inputs into learning, and in section 3 we give the analytical solution for linear target and hypothesis functions. Section 4 includes theory about the existence and properties of the best augmentation parameter. Section 5 gives a method of finding the best augmentation parameter and discusses experimental results. Section 6 extends our solution to the case of knowing the input distribution, or knowing extra inputs that are not necessarily test inputs.
\n\n2 Incorporating Test Inputs into Learning \n\nIn learning-from-examples, we assume we have a training set \{(x_1, f_1), \ldots, (x_N, f_N)\} with inputs x_n and possibly noisy targets f_n. Our goal is to choose a hypothesis g_v, among a class of hypotheses G, minimizing the test error on an unknown test set \{(y_1, h_1), \ldots, (y_M, h_M)\}. \nUsing the sample mean square error as our error criterion, the training error of hypothesis g_v is: \n\nE_0(g_v) = \frac{1}{N} \sum_{n=1}^{N} (g_v(x_n) - f_n)^2 \n\nSimilarly, the test error of g_v is: \n\nE(g_v) = \frac{1}{M} \sum_{m=1}^{M} (g_v(y_m) - h_m)^2 \n\nExpanding the test error: \n\nE(g_v) = \frac{1}{M} \sum_{m=1}^{M} g_v^2(y_m) - \frac{2}{M} \sum_{m=1}^{M} g_v(y_m) h_m + \frac{1}{M} \sum_{m=1}^{M} h_m^2 \n\nThe main observation is that, when we know the test inputs, we know the first term exactly. Therefore we need only approximate the remaining terms using the training set: \n\n\frac{1}{M} \sum_{m=1}^{M} g_v^2(y_m) - \frac{2}{N} \sum_{n=1}^{N} g_v(x_n) f_n + \frac{1}{N} \sum_{n=1}^{N} f_n^2    (1) \n\nWe scale the addition to the training error by an augmentation parameter \alpha to obtain a more general error function that we call the augmented error: \n\nE_\alpha(g_v) = E_0(g_v) + \alpha \left( \frac{1}{M} \sum_{m=1}^{M} g_v^2(y_m) - \frac{1}{N} \sum_{n=1}^{N} g_v^2(x_n) \right) \n\nwhere \alpha = 0 corresponds to the training error E_0 and \alpha = 1 corresponds to equation (1). The best value of the augmentation parameter depends on a number of factors, including the target function, the noise distribution and the hypothesis class. In the following sections we investigate properties of the best augmentation parameter and give a method of finding the best augmentation parameter when the hypothesis is linear. \n\n3 Augmented Solution for the Linear Hypothesis \n\nIn this section we assume hypothesis functions of the form g_v(x) = v^T x. From here onwards we will denote a function by the vector that multiplies the inputs. When the hypothesis is linear we can find the minimum of the augmented error analytically. \n\nLet X_{d \times N} be the matrix of training inputs, Y_{d \times M} the matrix of test inputs, and f_{N \times 1} contain the training targets.
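The augmented error above can be computed directly from the training examples and the test inputs; the following is a minimal NumPy sketch on synthetic data (the synthetic setup and all variable names are ours, not from the paper):

```python
import numpy as np

# Synthetic linear problem: d-dimensional inputs, noisy targets.
rng = np.random.default_rng(0)
d, N, M = 3, 20, 50
X = rng.standard_normal((d, N))                   # training inputs, d x N
Y = rng.standard_normal((d, M))                   # test inputs, d x M
w_star = rng.standard_normal(d)
f = X.T @ w_star + 0.5 * rng.standard_normal(N)   # noisy training targets

def training_error(v):
    # E_0(v) = (1/N) sum_n (v^T x_n - f_n)^2
    return np.mean((X.T @ v - f) ** 2)

def augmented_error(v, alpha):
    # E_alpha(v) = E_0(v) + alpha * ( mean_m g_v(y_m)^2 - mean_n g_v(x_n)^2 )
    gy2 = np.mean((Y.T @ v) ** 2)
    gx2 = np.mean((X.T @ v) ** 2)
    return training_error(v) + alpha * (gy2 - gx2)

v = rng.standard_normal(d)
# alpha = 0 recovers the plain training error.
assert np.isclose(augmented_error(v, 0.0), training_error(v))
```

Note that only the test inputs Y enter the computation; the test outputs are never used.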
The solution w_0 minimizing the training error E_0 is the least squares solution [5]: \n\nw_0 = \left(\frac{XX^T}{N}\right)^{-1} \frac{Xf}{N} \n\nThe augmented error E_\alpha(v) = E_0(v) + \alpha v^T \left(\frac{YY^T}{M} - \frac{XX^T}{N}\right) v is minimized at the augmented solution w_\alpha: \n\nw_\alpha = (I - \alpha R)^{-1} w_0    (2) \n\nwhere R = I - \left(\frac{XX^T}{N}\right)^{-1} \frac{YY^T}{M}. When \alpha = 0, the augmented solution w_\alpha is equal to the least squares solution w_0. \n\n4 Properties of the Augmentation Parameter \n\nAssume a linear target and possibly noisy training outputs: f = w^{*T} X + e, where \langle ee^T \rangle = \sigma_e^2 I_{N \times N}. \nSince the specific realization of the noise e is unknown, instead of minimizing the test error directly, we focus on minimizing \langle E(w_\alpha) \rangle_e, the expected value of the test error of the augmented solution with respect to the noise distribution: \n\n\langle E(w_\alpha) \rangle_e = w^{*T} \left((I - \alpha R^T)^{-1} - I\right) \frac{YY^T}{M} \left((I - \alpha R)^{-1} - I\right) w^* + \frac{\sigma_e^2}{N} tr\left((I - \alpha R^T)^{-1} \frac{YY^T}{M} (I - \alpha R)^{-1} \left(\frac{XX^T}{N}\right)^{-1}\right)    (3) \n\nwhere we have used \langle e^T A e \rangle_e = \sigma_e^2 tr(A), and tr(A) denotes the trace of matrix A. When \alpha = 0, we have: \n\n\langle E(w_0) \rangle_e = \frac{\sigma_e^2}{N} tr\left(\frac{YY^T}{M} \left(\frac{XX^T}{N}\right)^{-1}\right)    (4) \n\nNow, we prove the existence of a nonzero augmentation parameter \alpha when the outputs are noisy. \nTheorem 1: If \sigma_e^2 > 0 and tr(R(I - R)) \neq 0, then there is an \alpha \neq 0 that minimizes the expected test error \langle E(w_\alpha) \rangle_e. \n\nProof: Since \frac{\partial B^{-1}(\alpha)}{\partial \alpha} = -B^{-1}(\alpha) \frac{\partial B(\alpha)}{\partial \alpha} B^{-1}(\alpha) for any matrix B whose elements are scalar functions of \alpha [3], the derivative of \langle E(w_\alpha) \rangle_e with respect to \alpha at \alpha = 0 is: \n\n\left. \frac{d \langle E(w_\alpha) \rangle_e}{d\alpha} \right|_{\alpha=0} = \frac{2\sigma_e^2}{N} tr\left(R \left(\frac{XX^T}{N}\right)^{-1} \frac{YY^T}{M}\right) = \frac{2\sigma_e^2}{N} tr(R(I - R)) \n\nIf the derivative is < 0 (> 0, respectively), then \langle E(w_\alpha) \rangle_e is minimized at some \alpha > 0 (\alpha < 0, respectively). \Box \n\nThe following theorem gives an approximate formula for the best \alpha.
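Equation (2) is straightforward to check numerically. The sketch below (synthetic data; names are ours) builds w_0, R and w_alpha, and verifies that w_alpha zeroes the gradient of the augmented error:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, M = 3, 40, 60
X = rng.standard_normal((d, N))
Y = rng.standard_normal((d, M))
w_star = rng.standard_normal(d)
f = X.T @ w_star + 0.3 * rng.standard_normal(N)

Sx = X @ X.T / N                      # XX^T / N
Sy = Y @ Y.T / M                      # YY^T / M
w0 = np.linalg.solve(Sx, X @ f / N)   # least squares solution w_0
R = np.eye(d) - np.linalg.solve(Sx, Sy)

def w_aug(alpha):
    # Augmented solution w_alpha = (I - alpha R)^{-1} w_0, equation (2).
    return np.linalg.solve(np.eye(d) - alpha * R, w0)

# alpha = 0 recovers least squares.
assert np.allclose(w_aug(0.0), w0)

# w_alpha zeroes the gradient of E_alpha(v) = E_0(v) + alpha v^T (Sy - Sx) v.
alpha = 0.3
wa = w_aug(alpha)
grad = 2 * (Sx @ wa - X @ f / N) + 2 * alpha * (Sy - Sx) @ wa
assert np.allclose(grad, 0.0, atol=1e-8)
```

The gradient check follows from setting the derivative of the quadratic augmented error to zero, which gives (Sx + alpha (Sy - Sx)) w = Xf/N, i.e., (I - alpha R) w = w_0.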
\nTheorem 2: If N and M are large, and the training and test inputs are drawn i.i.d. from an input distribution with covariance matrix \langle xx^T \rangle = \sigma_x^2 I, then the \alpha^* minimizing \langle E(w_\alpha) \rangle_{e,x,y}, the expected test error of the augmented solution with respect to both the noise and the inputs, is approximately: \n\n\alpha^* \approx \frac{\sigma_e^2 \, d}{N \sigma_x^2 \, w^{*T} w^*} \cdot \frac{2/N + 1/M}{1/N + 1/M}    (5) \n\nProof: is given in the appendix. \Box \n\nThis formula determines the behavior of the best \alpha. The best \alpha: \n\n\u2022 decreases as the signal-to-noise ratio increases; \n\u2022 increases as d/N increases, i.e., as we have fewer examples per input dimension. \n\n4.1 w_\alpha as an Estimator of w^* \n\nThe mean squared error (m.s.e.) of any estimator w of w^* can be written as [1]: \n\nm.s.e.(w) = \|w^* - \langle w \rangle_e\|^2 + \left\langle \|w - \langle w \rangle_e\|^2 \right\rangle_e = bias^2(w) + variance(w) \n\nWhen \alpha is independent of the specific realization e of the noise: \n\nm.s.e.(w_\alpha) = w^{*T} \left(I - (I - \alpha R^T)^{-1}\right) \left(I - (I - \alpha R)^{-1}\right) w^* + \frac{\sigma_e^2}{N} tr\left(\left(\frac{XX^T}{N}\right)^{-1} (I - \alpha R^T)^{-1} (I - \alpha R)^{-1}\right) \n\nHence the m.s.e. of the least squares estimator w_0 is: \n\nm.s.e.(w_0) = \frac{\sigma_e^2}{N} tr\left(\left(\frac{XX^T}{N}\right)^{-1}\right) \n\nw_0 is the minimum-variance unbiased linear estimator of w^*. Although w_\alpha is a biased estimator when \alpha R \neq 0, the following theorem shows that, when there is noise, there is an \alpha \neq 0 minimizing the m.s.e. of w_\alpha: \nTheorem 3: If \sigma_e^2 > 0 and tr\left(\left(\frac{XX^T}{N}\right)^{-1} (R + R^T)\right) \neq 0, then there is an \alpha \neq 0 that minimizes the m.s.e. of w_\alpha. \n\nProof: is similar to the proof of Theorem 1 and is skipped. \Box \nAs N and M get large, R = I - \left(\frac{XX^T}{N}\right)^{-1} \frac{YY^T}{M} \to 0 and w_\alpha = (I - \alpha R)^{-1} w_0 \to w_0. Hence, for large N and M, the bias and variance of w_\alpha approach 0, making w_\alpha an unbiased and consistent estimator of w^*. \n\n5 A Method to Find the Best Augmentation Parameter \n\n[Figure 2 appears here: two panels (liver data; bond data, d = 11, M = 50) plotting the test error of the augmented solution with estimated alpha and of least squares against N, the number of training examples.] \n\nFigure 2: Using the augmented error results in smaller test error, especially when the number of training examples is small. \n\nGiven only the training and test inputs X and Y, and the training outputs f, in this section we propose a method to find the best \alpha minimizing the test error of w_\alpha. \nEquation (3) gives a formula for the expected test error, which we want to minimize. However, we do not know the target w^* or the noise variance \sigma_e^2. In equation (3), we replace w^* by w_\alpha and \sigma_e^2 by the residual estimate \frac{(X^T w_\alpha - f)^T (X^T w_\alpha - f)}{N - d}, where w_\alpha is given by equation (2). Then we find the \alpha minimizing the resulting approximation to the expected test error. \nWe experimented with this method of finding the best \alpha on artificial and real data. The results of the experiments for the liver data* and the bond data** are shown in figure 2. In the liver database the inputs are different blood test results and the output is the number of drinks per day. The bond data consists of financial ratios as inputs and the rating of the bond, from AAA to B- or lower, as the output. \n\n* ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders/bupa.data \n** We thank Dr. John Moody for providing the bond data. \n\nWe also compared our method to least squares (w_0) and to early stopping using different validation set sizes, on linear and noisy problems. The table below shows the results.
\n\nSNR   | mean E(w_\alpha)/E(w_0) | mean E(w_early stop)/E(w_0), N_v = N/3 | mean E(w_early stop)/E(w_0), N_v = N/6 \n0.01  | 0.650 \u00b1 0.006 | 0.126 \u00b1 0.003 | 0.192 \u00b1 0.004 \n1     | 0.830 \u00b1 0.007 | 1.113 \u00b1 0.021 | 1.075 \u00b1 0.020 \n100   | 1.001 \u00b1 0.002 | 2.373 \u00b1 0.040 | 2.073 \u00b1 0.042 \n\nTable 1: The augmented solution is consistently better than least squares, whereas early stopping gives worse results as the signal-to-noise ratio (SNR) increases. Even averaging early stopping solutions did not help when SNR = 100 (E(w_early stop)/E(w_0) = 1.245 \u00b1 0.018 when N_v = N/3 and 1.307 \u00b1 0.021 for N_v = N/6). For the results shown, d = 11 and N = 30 training examples were used; N_v is the number of validation examples. \n\n6 Extensions \n\nWhen the input probability distribution, or just the covariance matrix of the inputs, is known instead of the test inputs, \frac{YY^T}{M} can be replaced by \langle xx^T \rangle = \Sigma and our methods are still applicable. \n\nIf the inputs available are not test inputs but just some extra inputs, they can still be incorporated into learning. Let us denote the K extra inputs \{z_1, \ldots, z_K\} by the matrix Z_{d \times K}. Then the augmented error becomes: \n\nE_\alpha(v) = E_0(v) + \alpha \frac{K}{K+N} v^T \left( \frac{ZZ^T}{K} - \frac{XX^T}{N} \right) v \n\nThe new augmented solution and its expected test error are the same as in equations (2) and (3), except that R_z = I - \left(\frac{XX^T}{N}\right)^{-1} \frac{ZZ^T}{K} replaces R. \n\nNote that for the linear hypothesis case, the augmented error is not necessarily a regularized version of the training error, because the matrix \frac{YY^T}{M} - \frac{XX^T}{N} is not necessarily positive definite. \n\n7 Conclusions and Future Work \n\nWe have demonstrated a method of incorporating test inputs into learning when the target and hypothesis functions are linear and the target is noisy. We are currently working on extending our method to nonlinear target and hypothesis functions.
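Putting the pieces together, the procedure of section 5 can be sketched end to end. In this minimal illustration on synthetic linear data, the grid search over alpha and the residual-based noise estimate divided by N - d are our implementation choices, not specifics from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, M = 5, 30, 100
X = rng.standard_normal((d, N))                 # training inputs
Y = rng.standard_normal((d, M))                 # test inputs
w_star = rng.standard_normal(d)
f = X.T @ w_star + 1.0 * rng.standard_normal(N) # noisy training targets

Sx, Sy = X @ X.T / N, Y @ Y.T / M
I = np.eye(d)
w0 = np.linalg.solve(Sx, X @ f / N)             # least squares solution
R = I - np.linalg.solve(Sx, Sy)
Sx_inv = np.linalg.inv(Sx)

def estimated_test_error(alpha):
    # Equation (3) with w* replaced by w_alpha and sigma_e^2 replaced
    # by a residual estimate (our normalization: N - d).
    A = np.linalg.inv(I - alpha * R)            # (I - alpha R)^{-1}
    wa = A @ w0                                 # augmented solution, eq. (2)
    sig2 = np.sum((X.T @ wa - f) ** 2) / (N - d)
    bias = (A - I) @ wa
    return bias @ Sy @ bias + sig2 / N * np.trace(A.T @ Sy @ A @ Sx_inv)

# Grid search for the alpha minimizing the estimated expected test error.
alphas = np.linspace(-0.5, 1.0, 151)
best_alpha = alphas[np.argmin([estimated_test_error(a) for a in alphas])]
w_best = np.linalg.solve(I - best_alpha * R, w0)
```

The resulting w_best plays the role of the "augmented error with estimated alpha" solution compared against least squares in figure 2; no test outputs are used anywhere in the selection of alpha.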
\n\nAppendix \n\nProof of Theorem 2: When the spectral radius of \alpha R is less than 1 (\alpha is small and/or N and M are large), we can approximate (I - \alpha R)^{-1} \approx I + \alpha R [4] and, similarly, (I - \alpha R^T)^{-1} \approx I + \alpha R^T. Discarding any terms with powers of \alpha greater than 1, and solving for \alpha in \frac{d \langle E(w_\alpha) \rangle_{e,x,y}}{d\alpha} = \left\langle \frac{d \langle E(w_\alpha) \rangle_e}{d\alpha} \right\rangle_{x,y} = 0: \n\n\alpha^* = \frac{\frac{\sigma_e^2}{N} \left\langle tr(R^2 - R) \right\rangle_{x,y}}{\sigma_x^2 \, w^{*T} \langle R^2 \rangle_{x,y} w^*} \n\nThe last step follows since we can write \frac{YY^T}{M} = \sigma_x^2 \left(I + \frac{V_y}{\sqrt{M}}\right), \frac{XX^T}{N} = \sigma_x^2 \left(I + \frac{V_x}{\sqrt{N}}\right) and \left(\frac{XX^T}{N}\right)^{-1} = \frac{1}{\sigma_x^2} \left(I - \frac{V_x}{\sqrt{N}} + \frac{V_x^2}{N}\right) + O\left(\frac{1}{N^{1.5}}\right) for matrices V_x and V_y such that \langle V_x \rangle_x = \langle V_y \rangle_y = 0 and \langle V_x^2 \rangle_x and \langle V_y^2 \rangle_y are constant with respect to N and M. For large M we can approximate \left\langle \frac{YY^T}{M} R^2 \right\rangle_{x,y} = \sigma_x^2 \langle R^2 \rangle_{x,y}. \n\nIgnoring terms of O\left(\frac{1}{N^{1.5}}\right), \langle R^2 - R \rangle_{x,y} = \lambda \left(\frac{2}{N} + \frac{1}{M}\right) I and \langle R^2 \rangle_{x,y} = \lambda \left(\frac{1}{N} + \frac{1}{M}\right) I, where it can be shown that \langle V_x^2 \rangle_x = \lambda I for a constant \lambda depending on the input distribution, and similarly \langle V_y^2 \rangle_y = \lambda I. Therefore: \n\n\alpha^* = \frac{\sigma_e^2 \, d}{N \sigma_x^2 \, w^{*T} w^*} \cdot \frac{2/N + 1/M}{1/N + 1/M} \n\n\Box \n\nAcknowledgments \n\nWe would like to thank the Caltech Learning Systems Group: Prof. Yaser Abu-Mostafa, Dr. Amir Atiya, Alexander Nicholson, Joseph Sill and Xubo Song for many useful discussions. \n\nReferences \n\n[1] Bishop, C. (1995) Neural Networks for Pattern Recognition. Clarendon Press, Oxford. \n\n[2] Castelli, V. & Cover, T. (1995) On the Exponential Value of Labeled Samples. Pattern Recognition Letters, Vol. 16, Jan. 1995, pp. 105-111. \n\n[3] Devijver, P. A. & Kittler, J. (1982) Pattern Recognition: A Statistical Approach, p. 434. Prentice-Hall International, London. \n\n[4] Golub, G. H. & Van Loan, C. F. (1993) Matrix Computations. The Johns Hopkins University Press, Baltimore, MD. \n\n[5] Hocking, R. R. (1996) Methods and Applications of Linear Models. John Wiley & Sons, NY. \n\n[6] Miller, D. J. & Uyar, S. (1996) A Mixture of Experts Classifier with Learning Based on Both Labeled and Unlabeled Data. In G. Tesauro, D. S. Touretzky and T. K.
Leen (eds.), Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press. \n\n[7] Shahshahani, B. M. & Landgrebe, D. A. (1994) The Effect of Unlabeled Samples in Reducing the Small Sample Size Problem and Mitigating the Hughes Phenomenon. IEEE Transactions on Geoscience and Remote Sensing, Vol. 32, No. 5, Sept. 1994, pp. 1087-1095. \n", "award": [], "sourceid": 1347, "authors": [{"given_name": "Zehra", "family_name": "Cataltepe", "institution": null}, {"given_name": "Malik", "family_name": "Magdon-Ismail", "institution": null}]}