{"title": "Kernel Regression and Backpropagation Training With Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1039, "abstract": null, "full_text": "Kernel Regression and \n\nBackpropagation Training with Noise \n\nPetri Koistinen and Lasse Holmstrom \n\nRolf Nevanlinna Institute, University of Helsinki \nTeollisuuskatu 23, SF-0051O Helsinki, Finland \n\nAbstract \n\nOne method proposed for improving the generalization capability of a feed(cid:173)\nforward network trained with the backpropagation algorithm is to use \nartificial training vectors which are obtained by adding noise to the orig(cid:173)\ninal training vectors. We discuss the connection of such backpropagation \ntraining with noise to kernel density and kernel regression estimation. We \ncompare by simulated examples (1) backpropagation, (2) backpropagation \nwith noise, and (3) kernel regression in mapping estimation and pattern \nclassification contexts. \n\n1 \n\nINTRODUCTION \n\nLet X and Y be random vectors taking values in R d and RP, respectively. Suppose \nthat we want to estimate Y in terms of X using a feedforward network whose \ninput-output mapping we denote by y = g(x, w). Here the vector w includes all the \nweights and biases of the network. Backpropagation training using the quadratic \nloss (or error) function can be interpreted as an attempt to minimize the expected \nloss \n\n'\\(w) = ElIg(X, w) _ Y1I2. \n\nSuppose that EIIYW < 00. Then the regression function \n\n(2) \nminimizes the loss Ellb(X) - YI1 2 over all Borel measurable mappings b. Therefore, \nbackpropagation training can also be viewed as an attempt to estimate m with the \nnetwork g. \n\nm(x) = E[YIX = x]. \n\n(1) \n\n1033 \n\n\f1034 \n\nKoistinen and Holmstrom \n\nIn practice, one cannot minimize -' directly because one does not know enough \nabout the distribution of (X, Y). 
Instead one minimizes a sample estimate \n\nλ_n(w) = (1/n) Σ_{i=1}^n ||g(x_i, w) - y_i||^2,  (3) \n\nwhere (x_1, y_1), ..., (x_n, y_n) is a sample from the distribution of (X, Y), in the hope that weight vectors w that are near optimal for λ_n are also near optimal for λ. In fact, under rather mild conditions the minimizer of λ_n actually converges towards the minimizing set of weights for λ as n → ∞, with probability one (White, 1989). However, if n is small compared to the dimension of w, minimization of λ_n can easily lead to overfitting and poor generalization, i.e., weights that render λ_n small may produce a large expected error λ. \n\nMany cures for overfitting have been suggested. One can divide the available samples into a training set and a validation set, perform iterative minimization using the training set, and stop minimization when network performance over the validation set begins to deteriorate (Holmström et al., 1990, Weigend et al., 1990). In another approach, the minimization objective function is modified to include a term which tries to discourage the network from becoming too complex (Weigend et al., 1990). Network pruning (see, e.g., Sietsma and Dow, 1991) has similar motivation. Here we consider the approach of generating artificial training vectors by adding noise to the original samples. We have recently analyzed such an approach and proved its asymptotic consistency under certain technical conditions (Holmström and Koistinen, 1990). \n\n2 ADDITIVE NOISE AND KERNEL REGRESSION \n\nSuppose that we have n original training vectors (x_i, y_i) and want to generate artificial training vectors using additive noise. If the distributions of both X and Y are continuous, it is natural to add noise to both the X and Y components of the sample. However, if the distribution of X is continuous and that of Y is discrete (e.g., in pattern classification), it is more natural to add noise to the X components only. In Figure 1 we present sampling procedures for both cases. 
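Both sampling procedures — add noise to x only, or to x and y — can be sketched as below. The Gaussian choice for the noise distribution is an illustrative assumption (the paper allows general kernel densities), and the helper name draw_artificial is ours, not the paper's:

```python
import random

def draw_artificial(data, h, noise_in_y=False):
    """Generate one artificial training vector: pick an original pair
    (x_i, y_i) uniformly at random, then add noise of magnitude h."""
    x, y = random.choice(data)             # step 1: uniform index i
    x = x + h * random.gauss(0.0, 1.0)     # step 2: perturb x by h * S_x
    if noise_in_y:                         # x-and-y case only
        y = y + h * random.gauss(0.0, 1.0)
    return x, y

random.seed(1)
data = [(0.0, 0), (1.0, 1), (2.0, 0)]      # y holds discrete class labels
artificial = [draw_artificial(data, h=0.1) for _ in range(5)]
```

In the pattern-classification (x-only) case the labels of the artificial vectors stay in the original label set, which is exactly why adding noise only to x is the natural choice there.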
In the x-only case the additive noise is generated from a random vector S_x with density K_x, whereas in the x-and-y case the noise is generated from a random vector S_xy with density K_xy. Notice that we control the magnitude of the noise with a scalar smoothing parameter h > 0. \n\nIn both cases the sampling procedures can be thought of as generating random samples from new random vectors X_h^(n) and Y_h^(n). Using the same argument as in the Introduction, we see that a network trained with the artificial samples tends to approximate the regression function E[Y_h^(n) | X_h^(n)]. Generate I uniformly on {1, ..., n} and denote by f and f(· | I = i) the density and conditional density of X_h^(n). Then in the x-only case we get \n\nm_h^(n)(X_h^(n)) := E[Y_h^(n) | X_h^(n)] = Σ_{i=1}^n y_i P(I = i | X_h^(n)). \n\nProcedure 1. (Add noise to x only) \nProcedure 2. (Add noise to both x and y) \n\n1. Select i ∈ {1, ..., n} with equal probability for each index. \n2. Draw a sample s_x from density K_x
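In the x-only case, writing out P(I = i | X_h^(n) = x) via Bayes' rule turns the conditional mean above into the classical Nadaraya-Watson kernel regression estimate, m_h(x) = Σ_i y_i K_x((x - x_i)/h) / Σ_j K_x((x - x_j)/h). A one-dimensional sketch, with a Gaussian kernel and illustrative data that are our assumptions rather than the paper's examples:

```python
import math

def nadaraya_watson(x, data, h):
    """Kernel regression estimate m_h(x): a weighted average of the y_i
    with weights K((x - x_i)/h), here using a Gaussian kernel K."""
    weights = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi, _ in data]
    return sum(w * yi for w, (_, yi) in zip(weights, data)) / sum(weights)

data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]
estimate = nadaraya_watson(1.0, data, h=0.3)  # stays close to y_2 = 1
```

As h → 0 the estimate follows the nearest training point, and as h grows it flattens toward the overall mean of the y_i, so h trades variance against bias in the same way the noise magnitude does in Procedures 1 and 2.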