{"title": "Training Neural Networks with Deficient Data", "book": "Advances in Neural Information Processing Systems", "page_first": 128, "page_last": 135, "abstract": null, "full_text": "Training Neural Networks with Deficient Data

Volker Tresp
Siemens AG, Central Research
81730 München, Germany
tresp@zfe.siemens.de

Subutai Ahmad
Interval Research Corporation
1801-C Page Mill Rd.
Palo Alto, CA 94304
ahmad@interval.com

Ralph Neuneier
Siemens AG, Central Research
81730 München, Germany
ralph@zfe.siemens.de

Abstract

We analyze how data with uncertain or missing input features can be incorporated into the training of a neural network. The general solution requires a weighted integration over the unknown or uncertain input, although computationally cheaper closed-form solutions can be found for certain Gaussian Basis Function (GBF) networks. We also discuss cases in which heuristic solutions such as substituting the mean of an unknown input can be harmful.

1 INTRODUCTION

The ability to learn from data with uncertain and missing information is a fundamental requirement for learning systems. In the \"real world\", features are missing due to unrecorded information or due to occlusion in vision, and measurements are affected by noise. In some cases the experimenter might want to assign varying degrees of reliability to the data.

In regression, uncertainty is typically attributed to the dependent variable, which is assumed to be disturbed by additive noise. But there is no reason to assume that input features might not be uncertain as well or even missing completely.

In some cases, we can ignore the problem: instead of trying to model the relationship between the true input and the output, we are satisfied with modeling the relationship between the uncertain input and the output.
But there are at least two reasons why we might want to explicitly deal with uncertain inputs. First, we might be interested in the underlying relationship between the true input and the output (e.g. the relationship has some physical meaning). Second, the problem might be non-stationary in the sense that for different samples different inputs are uncertain or missing, or the levels of uncertainty vary. The naive strategy of training networks for all possible input combinations explodes in complexity and would require sufficient data for all relevant cases. It makes more sense to define one underlying true model and relate all data to this one model. Ahmad and Tresp (1993) have shown how to include uncertainty during recall under the assumption that the network approximates the \"true\" underlying function. In this paper, we first show how input uncertainty can be taken into account in the training of a feedforward neural network. Then we show that for networks of Gaussian basis functions it is possible to obtain closed-form solutions. We validate the solutions on two applications.

2 THE CONSEQUENCES OF INPUT UNCERTAINTY

Consider the task of predicting the dependent variable¹ y ∈ R from the input vector x ∈ R^M consisting of M random variables. We assume that the input data {x^k | k = 1, 2, ..., K} are selected independently and that P(x) is the joint probability distribution of x. Outputs {y^k | k = 1, 2, ..., K} are generated following the standard signal-plus-noise model

    y^k = f(x^k) + \epsilon^k,

where {\epsilon^k | k = 1, 2, ..., K} denote zero-mean random variables with probability density P_\epsilon(\epsilon). The best predictor (in the mean-squared sense) of y given the input x is the regressor defined by E(y|x) = \int y P(y|x) dy = f(x), where E denotes the expectation. Unbiased neural networks asymptotically (K → ∞) converge to the regressor.
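As a numeric aside (our own illustration, not part of the original experiments): the signal-plus-noise model, and the fact that the regressor E(y|x) = f(x) is recovered by locally averaging the targets, can be sketched as follows. The choice f(x) = sin(2πx) and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Illustrative "true" regression function (our choice).
    return np.sin(2 * np.pi * x)

K = 50_000
x = rng.uniform(-0.5, 0.5, size=K)       # inputs x^k drawn from P(x)
y = f(x) + rng.normal(0.0, 0.2, size=K)  # y^k = f(x^k) + eps^k

# The regressor E(y|x) equals f(x): averaging the targets in a
# narrow bin around x0 recovers f(x0) up to sampling noise.
x0 = 0.1
local_mean = y[np.abs(x - x0) < 0.01].mean()
```

With enough samples the binned average approaches f(x0), which is what an unbiased network converges to as K → ∞.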
To account for uncertainty in the independent variable we assume that we do not have access to x but can only obtain samples from another random vector z ∈ R^M with

    z^k = x^k + \delta^k,

where {\delta^k | k = 1, 2, ..., K} denote independent random vectors containing M random variables with joint density P_\delta(\delta).²

A neural network trained with data {(z^k, y^k) | k = 1, 2, ..., K} approximates

    E(y|z) = \frac{1}{P(z)} \int y P(y|x) P(z|x) P(x) dy dx = \frac{1}{P(z)} \int f(x) P_\delta(z - x) P(x) dx.    (1)

Thus, in general E(y|z) ≠ f(z) and we obtain a biased solution. Consider the case that the noise processes can be described by Gaussians, P_\epsilon(\epsilon) = G(\epsilon; 0, \sigma^y) and P_\delta(\delta) = G(\delta; 0, \sigma), where, in our notation, G(x; m, s) stands for

    G(x; m, s) = \frac{1}{(2\pi)^{M/2} \prod_{j=1}^{M} s_j} \exp\Big[ -\frac{1}{2} \sum_{j=1}^{M} \frac{(x_j - m_j)^2}{s_j^2} \Big],

where m, s are vectors with the same dimensionality as x (here M).

¹Our notation does not distinguish between a random variable and its realization.
²At this point, we assume that P_\delta is independent of x.

Figure 1: The top half of the figure shows the probabilistic model. In an example, the bottom half shows E(y|x) = f(x) (continuous), the input noise distribution (dotted) and E(y|z) (dashed).

Let us take a closer look at four special cases.

Certain input. If \sigma = 0 (no input noise), the integral collapses and E(y|z) = f(z).

Uncertain input. If P(x) varies much more slowly than P(z|x), Equation 1 describes the convolution of f(x) with the noise process P_\delta(z - x). Typical noise processes will therefore blur or smooth the original mapping (Figure 1). It is somewhat surprising that the error on the input results in a (linear) convolution integral.
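To make the blurring concrete, here is a small Monte-Carlo sketch (our own example, with an illustrative step function f and a flat P(x)): conditioning on the noisy input z yields the convolution of f with the input-noise kernel, so a sharp feature is smoothed out.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Illustrative step function with sharp edges at |x| = 0.2.
    return np.where(np.abs(x) < 0.2, 1.0, 0.0)

K = 400_000
x = rng.uniform(-1.0, 1.0, size=K)    # slowly varying (flat) P(x)
z = x + rng.normal(0.0, 0.1, size=K)  # z^k = x^k + delta^k
y = f(x)                              # no output noise, for clarity

# Empirical E(y|z): inside the plateau the value stays near 1 but is
# blurred; at the step edge z = 0.2 it is pulled toward one half.
e_y_at_center = y[np.abs(z - 0.0) < 0.01].mean()
e_y_at_edge = y[np.abs(z - 0.2) < 0.01].mean()
```

The estimated E(y|z) matches the convolution of the step with a Gaussian of width 0.1: roughly 0.95 at the center and roughly 0.5 at the edge, rather than the original values 1 and 0/1.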
In some special cases we might be able to recover f(x) from a network trained on deficient data by deconvolution, although one should use caution since deconvolution is very error sensitive.

Unknown input. If \sigma_j → ∞ then the knowledge of z_j does not give us any information about x_j and we can consider the jth input to be unknown. Our formalism therefore includes the case of missing inputs as a special case. Equation 1 becomes an integral over the unknown dimensions weighted by P(x) (Figure 2).

Linear approximation. If the approximation

    f(x) ≈ f(z) + (x - z)^T \nabla f(z)    (2)

is valid, the input noise can be transformed into output noise and E(y|z) = f(z). This result can also be derived using Equation 1 if we consider that a convolution of a linear function with a symmetrical kernel does not change the function. This result tells us that if f(x) is approximately linear over the range where P_\delta(\delta) has significant amplitude, we can substitute the noisy input and the network will still approximate f(x). Similarly, the mean of an unknown input x_j can be substituted if f(x) is linear and x_j is independent of the remaining input variables. But in all those cases, one should be aware of the potentially large additional variance (Equation 2).

Figure 2: Left: samples y^k = f(x_1^k, x_2^k) are shown (no output noise). Right: with one input missing, P(y|x_1) appears noisy.

3 MAXIMUM LIKELIHOOD LEARNING

In this section, we demonstrate how deficient data can be incorporated into the training of feedforward networks. In a typical setting, we might have a number of complete data, a number of incomplete data and a number of data with uncertain features.
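A two-line numeric illustration of why mean substitution needs the linearity assumption (our example; the quadratic f is chosen only because it is strongly nonlinear): for a completely unknown input, the correct treatment integrates f over P(x), while mean substitution evaluates f at a single point.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Nonlinear in the unknown input (illustrative choice).
    return x ** 2

x = rng.normal(0.0, 1.0, size=200_000)  # the unknown input, zero mean

integrated = f(x).mean()   # E(f(x)) = Var(x) = 1: weighting by P(x)
substituted = f(x.mean())  # f(mean(x)) ~ f(0) = 0: badly biased here
```

For a linear f the two quantities coincide; here they differ by the full variance of the unknown input.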
Assuming independent samples and Gaussian noise, the log-likelihood l for a neural network NN_w with weight vector w becomes

    l = \sum_{k=1}^{K} \log P(z^k, y^k) = \sum_{k=1}^{K} \log \int G(y^k; NN_w(x), \sigma^y) G(z^k; x, \sigma^k) P(x) dx.

Note that now the input noise variance is allowed to depend on the sample k. The gradient of the log-likelihood with respect to an arbitrary weight w_i becomes³

    \frac{\partial l}{\partial w_i} = \sum_{k=1}^{K} \frac{\partial \log P(z^k, y^k)}{\partial w_i} = \frac{1}{(\sigma^y)^2} \sum_{k=1}^{K} \frac{1}{P(z^k, y^k)} \int (y^k - NN_w(x)) \frac{\partial NN_w(x)}{\partial w_i} G(y^k; NN_w(x), \sigma^y) G(z^k; x, \sigma^k) P(x) dx.    (3)

First, realize that for a certain sample k (\sigma^k → 0): \partial \log P(z^k, y^k)/\partial w_i = (y^k - NN_w(z^k))/(\sigma^y)^2 \cdot \partial NN_w(z^k)/\partial w_i, which is the gradient used in normal backpropagation. For uncertain data, this gradient is replaced by an averaged gradient. The integral averages the gradient over possible true inputs x weighted by the probability P(x|z^k, y^k) = P(z^k|x) P(y^k|x) P(x)/P(z^k, y^k). The term P(y^k|x) = G(y^k; NN_w(x), \sigma^y) is of special importance since it weights the gradient higher when the network prediction NN_w(x) agrees with the target y^k. This term is also the main reason why heuristics such as substituting the mean value for a missing variable can be harmful: if, at the substituted input, the difference between network prediction and target is large, the error is also large and the data point contributes significantly to the gradient, although it is very unlikely that the substituted value was the true input.

³This equation can also be obtained via the EM formalism. A similar equation was obtained by Buntine and Weigend (1991) for binary inputs.

In an implementation, the integral needs to be approximated by a finite sum (i.e. Monte-Carlo integration, finite-difference approximation etc.).
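A minimal sketch of such a finite-sum approximation (our own toy setting, not the paper's experiment: a one-parameter model NN_w(x) = w·x, a single sample whose input is completely missing, and a uniform P(x) on a grid). Each candidate input x is weighted by G(y^k; NN_w(x), σ^y) P(x), as in Equation 3.

```python
import numpy as np

# Toy setting (all values illustrative): one-parameter model
# NN_w(x) = w * x and one sample whose input is completely missing.
w = 0.8
sigma_y = 0.3
y_k = 1.2                                  # observed target

def nn(x):
    return w * x

def dnn_dw(x):
    return x

# Grid approximation of the averaged gradient (Equation 3): each
# candidate x is weighted by G(y_k; NN_w(x), sigma_y) * P(x), i.e.
# by how well the prediction at x explains the observed target.
xs = np.linspace(-2.0, 2.0, 2001)          # support of P(x)
p_x = np.full_like(xs, 1.0 / len(xs))      # uniform P(x), an assumption
lik = np.exp(-0.5 * ((y_k - nn(xs)) / sigma_y) ** 2)
weights = lik * p_x
weights /= weights.sum()                   # normalizes by P(z^k, y^k)

grad = np.sum(weights * (y_k - nn(xs)) * dnn_dw(xs)) / sigma_y ** 2

# The same weights give the posterior mean of the missing input,
# which concentrates near y_k / w = 1.5.
x_post_mean = np.sum(weights * xs)
```

Note how the likelihood term suppresses candidate inputs whose prediction disagrees with the target, which is exactly what naive mean substitution fails to do.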
In the experiment described in Figure 3, we had a 2-D input vector and the data set consisted of both complete data and data with one missing input. We used the following procedure:

1. Train the network using the complete data. Estimate (\sigma^y)^2. We used (\sigma^y)^2 ≈ E_c/(K - H), where E_c is the training error after the network was trained with only the complete data, and H is the number of hidden units in the network.
2. Estimate the input density P(x) using Gaussian mixtures (see next section).
3. Include the incomplete training patterns in the training.
4. For every incomplete training pattern:
   - Let z_c^k be the certain input and let z_u^k be the missing input, and z^k = (z_c^k, z_u^k).
   - Approximate (assuming -1/2 < x_j < 1/2; the hat stands for estimate)

    \frac{\partial \log P(z_c^k, y^k)}{\partial w_i} ≈ \frac{1}{J (\sigma^y)^2 \hat{P}(z_c^k, y^k)} \sum_{j=-J/2}^{J/2} (y^k - NN_w(z_c^k, j/J)) \frac{\partial NN_w(z_c^k, j/J)}{\partial w_i} G(y^k; NN_w(z_c^k, j/J), \sigma^y) \hat{P}(z_c^k, j/J),

where \hat{P} denotes the corresponding finite-sum estimates.

4 GAUSSIAN BASIS FUNCTIONS

The required integration in Equation 1 is computationally expensive and one would prefer closed-form solutions. Closed-form solutions can be found for networks which are based on Gaussian mixture densities.⁴ Let's assume that the joint density is given by

    P(x) = \sum_{i=1}^{N} G(x; c_i, s_i) P(\omega_i),

where c_i is the location of the center of the ith Gaussian, s_{ij} corresponds to the width of the ith Gaussian in the jth dimension, and P(\omega_i) is the prior probability of \omega_i. Based on this model we can calculate the expected value of any unknown

⁴Gaussian mixture learning with missing inputs is also addressed by Ghahramani and Jordan (1993). See also their contribution in this volume.

[Figure 3 bar charts (vertical axis: generalization error): feedforward network trained with 28c; 28c+225m; 125c; 125c+128m patterns (left), and GBF network trained with 28c; 225m; 28c+225m patterns and with mean substitution (right).]

Figure 3: Regression.
Left: We trained a feedforward neural network to predict the housing price from two inputs (average number of rooms, percent of lower-status population; Tresp, Hollatz and Ahmad, 1993). The training data set contained varying numbers of complete data points (c) and data points with one input missing (m). For training, we used the method outlined in Section 3. The test set consisted of 253 complete data. The graph (vertical axis: generalization error) shows that by including the incomplete patterns in the training, the performance is significantly improved. Right: We approximated the joint density by a mixture of Gaussians. The incomplete patterns were included by using the procedure outlined in Section 4. The regression was calculated using Equation 4. As before, including the incomplete patterns in training improved the performance. Substituting the mean for the missing input (column on the right), on the other hand, resulted in worse performance than training the network with only complete data.

[Figure 4 plots: classification accuracy versus the number of incomplete training patterns (left) and versus the number of missing features (right).]

Figure 4: Left: Classification performance as a function of the number of incomplete training patterns on the task of 3D hand gesture recognition using a Gaussian mixtures classifier (Equation 5). The network had 10 input units, 20 basis functions and 7 output units. The test set contained 3500 patterns. (For a complete description of the task see Ahmad and Tresp, 1993.) Class-specific training with only 175 complete patterns is compared to the performance when the network is trained with an additional 350, 1400, and 3325 incomplete patterns. Either 1 input (continuous) or an equal number of 1-3 (dashed) or 1-5 (dotted) inputs were missing.
The figure shows clearly that adding incomplete patterns to a data set consisting of only complete patterns improves performance. Right: the plot shows performance when the network is trained only with 175 incomplete patterns. The performance is relatively stable as the number of missing features increases.

variable x^u from any set of known variables x^n using (Tresp, Hollatz and Ahmad, 1993)

    E(x^u | x^n) = \frac{\sum_{i=1}^{N} c_i^u G(x^n; c_i^n, s_i^n) P(\omega_i)}{\sum_{i=1}^{N} G(x^n; c_i^n, s_i^n) P(\omega_i)}.    (4)

Note that the Gaussians are projected onto the known dimensions. The last equation describes the normalized basis function network introduced by Moody and Darken (1989).

Classifiers can be built by approximating the class-specific data distributions P(x|class_i) by mixtures of Gaussians. Using Bayes' formula, the posterior class probability then becomes

    P(class_i | x) = \frac{P(class_i) P(x | class_i)}{\sum_j P(class_j) P(x | class_j)}.    (5)

We now assume that we do not have access to x but to z where, again, P(z|x) = G(z; x, \sigma). The log-likelihood of the data now becomes

    l = \sum_{k=1}^{K} \log \int \sum_{i=1}^{N} G(x; c_i, s_i) P(\omega_i) G(z^k; x, \sigma^k) dx = \sum_{k=1}^{K} \log \sum_{i=1}^{N} G(z^k; c_i, S_i^k) P(\omega_i),

where (S_{ij}^k)^2 = s_{ij}^2 + (\sigma_j^k)^2. We can use the EM approach (Dempster, Laird and Rubin, 1977) to obtain the following update equations. Let c_{ij}, s_{ij} and P(\omega_i) denote current parameter estimates and let O_{ij}^k = (c_{ij} (\sigma_j^k)^2 + z_j^k s_{ij}^2)/(S_{ij}^k)^2 and D_{ij}^k = ((\sigma_j^k)^2 s_{ij}^2)/(S_{ij}^k)^2. The new estimates (indicated by a hat) can be obtained using

    P(\omega_i | z^k) = \frac{G(z^k; c_i, S_i^k) P(\omega_i)}{\sum_{j=1}^{N} G(z^k; c_j, S_j^k) P(\omega_j)}    (6)

    \hat{P}(\omega_i) = \frac{1}{K} \sum_{k=1}^{K} P(\omega_i | z^k)    (7)

    \hat{c}_{ij} = \frac{\sum_{k=1}^{K} O_{ij}^k P(\omega_i | z^k)}{\sum_{k=1}^{K} P(\omega_i | z^k)}    (8)

    \hat{s}_{ij}^2 = \frac{\sum_{k=1}^{K} [D_{ij}^k + (O_{ij}^k - \hat{c}_{ij})^2] P(\omega_i | z^k)}{\sum_{k=1}^{K} P(\omega_i | z^k)}    (9)

These equations can be solved by alternately using Equation 6 to estimate P(\omega_i | z^k) and Equations 7 to 9 to update the parameter estimates. If \sigma^k = 0 for all k (only certain data) we obtain the well-known EM equations for Gaussian mixtures (Duda and Hart (1973), page 200). Setting \sigma_j^k = ∞ represents the fact that the jth input is missing in the kth data point, and then O_{ij}^k = c_{ij} and D_{ij}^k = s_{ij}^2. Figure 3 and Figure 4 show experimental results for a regression and a classification problem.

5 EXTENSIONS AND CONCLUSIONS

We can only briefly address two more aspects. In Section 3 we only discussed regression. We can obtain similar results for classification problems if the cost function is a log-likelihood function (e.g. the cross-entropy; the signal-plus-noise model is not appropriate). Also, so far we considered the true input to be unobserved data. Alternatively, the true inputs can be considered unknown parameters. In this case, the goal is to substitute the most likely input for the unknown or noisy input. We obtain as log-likelihood function

    l \propto \sum_{k=1}^{K} \Big[ -\frac{1}{2} \frac{(y^k - NN_w(x^k))^2}{(\sigma^y)^2} - \frac{1}{2} \sum_{j=1}^{M} \frac{(x_j^k - z_j^k)^2}{(\sigma_j^k)^2} + \log P(x^k) \Big].

The learning procedure consists of finding optimal values for network weights w and true inputs x^k.

6 CONCLUSIONS

Our paper has shown how deficient data can be included in network training. Equation 3 describes the solution for feedforward networks, which includes a computationally expensive integral. Depending on the application, relatively cheap approximations might be feasible.
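As a closing numerical check, the EM updates of Section 4 (Equations 6-9) can be sketched in one dimension. All settings here are our own illustrative choices (two components, 30% of inputs corrupted by known noise); the missing-input case corresponds to the limit σ^k → ∞ and is not exercised.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 1-D data from two Gaussians; 30% of the inputs are observed
# through additive noise with known standard deviation 0.8.
K = 2000
true_x = np.concatenate([rng.normal(-2.0, 0.5, K // 2),
                         rng.normal(2.0, 0.5, K // 2)])
sig = np.where(rng.random(K) < 0.3, 0.8, 0.0)   # per-sample sigma^k
z = true_x + rng.normal(0.0, 1.0, K) * sig

c = np.array([-1.0, 1.0])    # centers c_i
s2 = np.array([1.0, 1.0])    # variances s_i^2
pw = np.array([0.5, 0.5])    # priors P(w_i)

for _ in range(30):
    # E-step (Equation 6) with noise-inflated variances
    # (S_i^k)^2 = s_i^2 + (sigma^k)^2; the 2*pi factor cancels.
    S2 = s2[None, :] + (sig ** 2)[:, None]
    g = np.exp(-0.5 * (z[:, None] - c[None, :]) ** 2 / S2) / np.sqrt(S2)
    r = g * pw[None, :]
    r /= r.sum(axis=1, keepdims=True)            # P(w_i | z^k)
    # Per-sample posterior mean O and variance D of the true input.
    O = (c[None, :] * (sig ** 2)[:, None] + z[:, None] * s2[None, :]) / S2
    D = (sig ** 2)[:, None] * s2[None, :] / S2
    # M-step (Equations 7-9).
    Nk = r.sum(axis=0)
    pw = Nk / K
    c = (r * O).sum(axis=0) / Nk
    s2 = (r * (D + (O - c[None, :]) ** 2)).sum(axis=0) / Nk
```

With σ^k = 0 for every sample the loop reduces to the standard EM updates for Gaussian mixtures; here it recovers the centers ±2 and deconvolves the input noise from the estimated widths.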
Our paper hinted at possible pitfalls of simple heuristics. Particularly attractive are our results for Gaussian basis functions, which allow closed-form solutions.

References

Ahmad, S. and Tresp, V. (1993). Some solutions to the missing feature problem in vision. In S. J. Hanson, J. D. Cowan and C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann.

Buntine, W. L. and Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, Vol. 5, pp. 605-643.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39, pp. 1-38.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons, New York.

Ghahramani, Z. and Jordan, M. I. (1993). Function approximation via density estimation using an EM approach. MIT Computational Cognitive Sciences, TR 9304.

Moody, J. E. and Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, Vol. 1, pp. 281-294.

Tresp, V., Hollatz, J. and Ahmad, S. (1993). Network structuring and training using rule-based knowledge. In S. J. Hanson, J. D. Cowan and C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann.

Tresp, V., Ahmad, S. and Neuneier, R. (1993). Uncertainty in the inputs of neural networks. Presented at Neural Networks for Computing 1993.
", "award": [], "sourceid": 808, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "Subutai", "family_name": "Ahmad", "institution": null}, {"given_name": "Ralph", "family_name": "Neuneier", "institution": null}]}