{"title": "Efficient Methods for Dealing with Missing Data in Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 696, "abstract": null, "full_text": "Efficient Methods for Dealing with \n\nMissing Data in Supervised Learning \n\nVolker '!'resp\u00b7 \nSiemens AG \n\nCentral Research \nOtto-Hahn-Ring 6 \n\n81730 Miinchen \n\nGermany \n\nRalph Neuneier \n\nSiemens AG \n\nCentral Research \nOtto-Hahn-Ring 6 \n81730 Miinchen \n\nGermany \n\nSubutai Ahmad \n\nInterval Research Corporation \n\n1801-C Page Mill R<;l. \nPalo Alto, CA 94304 \n\nAbstract \n\nWe present efficient algorithms for dealing with the problem of mis(cid:173)\nsing inputs (incomplete feature vectors) during training and recall. \nOur approach is based on the approximation of the input data dis(cid:173)\ntribution using Parzen windows. For recall, we obtain closed form \nsolutions for arbitrary feedforward networks. For training, we show \nhow the backpropagation step for an incomplete pattern can be \napproximated by a weighted averaged backpropagation step. The \ncomplexity of the solutions for training and recall is independent \nof the number of missing features. We verify our theoretical results \nusing one classification and one regression problem. \n\n1 \n\nIntroduction \n\nThe problem of missing data (incomplete feature vectors) is of great practical and \ntheoretical interest. In many applications it is important to know how to react if the \navailable information is incomplete, if sensors fail or if sources of information become \n\nA.t the time of the research for this paper, a visiting researcher at the Center for \n\nBiological and Computational Learning, MIT. E-mail: Volker.Tresp@zfe.siemens.de \n\n\f690 \n\nVoLker Tresp. RaLph Neuneier. Subutai Ahmad \n\nunavailable. 
As an example, when a sensor fails in a production process, it might not be necessary to stop everything if sufficient information is implicitly contained in the remaining sensor data. Furthermore, in economic forecasting, one might want to continue to use a predictor even when an input variable becomes meaningless (for example, due to political changes in a country). As we have elaborated in earlier papers, heuristics such as the substitution of the mean for an unknown feature can lead to solutions that are far from optimal (Ahmad and Tresp, 1993; Tresp, Ahmad, and Neuneier, 1994). Biological systems must deal continuously with the problem of unknown or uncertain features, and they are certainly extremely good at it. From a biological point of view it is therefore interesting to ask which solutions to this problem can be derived from theory and whether these solutions are in any way related to the way that biology deals with this problem (compare Brunelli and Poggio, 1991). Finally, having efficient methods for dealing with missing features allows a novel pruning strategy: if the quality of the prediction is not affected when an input is pruned, we can remove it and use our solutions for prediction with missing inputs, or retrain the model without that input (Tresp, Hollatz and Ahmad, 1995). \n\nIn Ahmad and Tresp (1993) and in Tresp, Ahmad and Neuneier (1994), equations for training and recall were derived using a probabilistic setting (compare also Buntine and Weigend, 1991; Ghahramani and Jordan, 1994). For general feedforward neural networks the solution was in the form of an integral which has to be approximated using numerical integration techniques. The computational complexity of these solutions grows exponentially with the number of missing features. In these two publications, we could only obtain efficient algorithms for networks of normalized Gaussian basis functions. 
It is of great practical interest to find efficient ways of dealing with missing inputs for general feedforward neural networks, which are more commonly used in applications. In this paper we describe an efficient approximation for the problem of missing information that is applicable to a large class of learning algorithms, including feedforward networks. The main results are Equation 2 (recall) and Equation 3 (training). One major advantage of the proposed solution is that the complexity does not increase with an increasing number of missing inputs. The solutions can easily be generalized to the problem of uncertain (noisy) inputs. \n\n2 Missing Information During Recall \n\n2.1 Theory \n\nWe assume that a neural network NN(x) has been trained to predict E(y|x), the expectation of y ∈ ℝ given x ∈ ℝ^D. During recall we would like to know the network's prediction based on an incomplete input vector x = (x^c, x^u), where x^c denotes the known inputs and x^u the unknown inputs. The optimal prediction given the known features can be written as (Ahmad and Tresp, 1993) \n\nE(y | x^c) = ∫ NN(x^c, x^u) P(x^u | x^c) dx^u. \n\nFigure 1: The circles indicate 10 Gaussians approximating the input density distribution. x^c = x_1 indicates the known input; x_2 = x^u is unknown. \n\nSimilarly, for a network trained to estimate class probabilities, NN_i(x) ≈ P(class_i | x), simply substitute P(class_i | x^c) for E(y | x^c) and NN_i(x^c, x^u) for NN(x^c, x^u) in the last equation. \n\nThe integrals in the last equations can be problematic. In the worst case they have to be approximated numerically (Tresp, Ahmad and Neuneier, 1994), which is costly, since the computation is exponential in the number of missing inputs. For networks of normalized Gaussians, there exist closed form solutions to the integrals (Ahmad and Tresp, 1993). 
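The optimal prediction above, and why mean substitution falls short of it, can be made concrete with a small numerical sketch. Everything here is our own toy illustration (the quadratic stand-in "network" `nn`, the assumed standard-normal input density, and all variable names are not from the paper); it marginalizes the missing input on a grid, which is exactly the brute-force numerical integration the approximation in the next section avoids.

```python
import numpy as np

# Toy stand-in for a trained network NN(x1, x2); purely illustrative.
def nn(x1, x2):
    return np.sin(x1) + 0.5 * x2**2

xc = 0.3  # known input x1; x2 is missing

# Assume the input density is standard normal and independent across
# dimensions, so P(x2 | x1) = N(0, 1). Approximate the integral
# E[y | x1] = ∫ NN(x1, x2) P(x2 | x1) dx2 on a grid.
u = np.linspace(-6.0, 6.0, 4001)
du = u[1] - u[0]
p = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # N(0, 1) density values
y_marginal = np.sum(nn(xc, u) * p) * du          # grid approximation of the integral

# Mean substitution plugs in E[x2] = 0 instead:
y_mean = nn(xc, 0.0)

# Because nn is nonlinear in x2, the two differ: E[0.5 * x2**2] = 0.5,
# so y_marginal ≈ sin(0.3) + 0.5, while y_mean = sin(0.3).
```

For a network that is linear in the missing input the two answers would coincide; the gap here is the price of the mean-substitution heuristic criticized in the introduction.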
The following section shows how to efficiently approximate the integral for a large class of algorithms. \n\n2.2 An Efficient Approximation \n\nParzen windows are commonly used to approximate densities. Given N training data {(x^k, y^k) | k = 1, ..., N}, we can approximate \n\nP(x) ≈ (1/N) Σ_{k=1}^N G(x; x^k, σ)   (1) \n\nwhere \n\nG(x; x^k, σ) = (2πσ²)^{−D/2} exp(−‖x − x^k‖² / (2σ²)) \n\nis a multidimensional, properly normalized Gaussian centered at the data point x^k with variance σ². It has been shown (Duda and Hart, 1973) that Parzen windows approximate densities arbitrarily well for N → ∞ if σ is appropriately scaled. \n\nUsing Parzen windows we may write \n\nP(x^c) ≈ (1/N) Σ_{k=1}^N G(x^c; x^{c,k}, σ), \n\nwhere we have used the fact that \n\n∫ G(x^c, x^u; x^k, σ) dx^u = G(x^c; x^{c,k}, σ), \n\nand where G(x^c; x^{c,k}, σ) is a Gaussian projected onto the known input dimensions (by simply leaving out the unknown dimensions in the exponent and in the normalization; see Ahmad and Tresp, 1993). x^{c,k} are the components of the training data corresponding to the known inputs (compare Figure 1). \n\nNow, if we assume that the network prediction is approximately constant over the \"width\" σ of the Gaussians, we can approximate \n\n∫ NN(x^c, x^u) G(x^c, x^u; x^k, σ) dx^u ≈ NN(x^c, x^{u,k}) G(x^c; x^{c,k}, σ), \n\nwhere NN(x^c, x^{u,k}) is the network prediction which we obtain if we substitute the corresponding components of the training data for the unknown inputs. \n\nWith this approximation, \n\nE(y | x^c) ≈ Σ_{k=1}^N NN(x^c, x^{u,k}) G(x^c; x^{c,k}, σ) / Σ_{k=1}^N G(x^c; x^{c,k}, σ).   (2) \n\nInterestingly, we have obtained a network of normalized Gaussians which are centered at the known components of the data points. The \"output weights\" NN(x^c, x^{u,k}) consist of the neural network predictions where, for the unknown inputs, the corresponding components of the training data points have been substituted. 
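This recall rule can be sketched in a few lines of NumPy. The sketch assumes a generic stand-in predictor `nn` for the trained network, and all names (`recall_missing`, `X_train`, the toy data) are our own illustration: weight each training point by the Gaussian projected onto the known dimensions, substitute that point's values for the unknown dimensions, and average the network predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 2
X_train = rng.normal(size=(N, D))  # training inputs used for the Parzen estimate

def nn(X):
    # Stand-in for an arbitrary trained feedforward network NN(x).
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]**2

def recall_missing(x_known, known_dims, X_train, sigma=0.5):
    """Equation-2-style recall: a normalized-Gaussian-weighted average of
    network predictions, substituting each training point's values for
    the unknown input dimensions."""
    D = X_train.shape[1]
    unknown_dims = [d for d in range(D) if d not in known_dims]
    # Gaussian projected onto the known dimensions: G(x^c; x^{c,k}, sigma)
    d2 = ((X_train[:, known_dims] - x_known)**2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * sigma**2))
    w /= w.sum()                       # normalized Gaussian basis functions
    # Complete the query with the training data's unknown components
    X_sub = np.empty_like(X_train)
    X_sub[:, known_dims] = x_known
    X_sub[:, unknown_dims] = X_train[:, unknown_dims]
    return float(np.dot(w, nn(X_sub)))  # weighted "output weights" NN(x^c, x^{u,k})

y_hat = recall_missing(np.array([0.3]), [0], X_train)  # x1 known, x2 missing
```

Note that the per-query cost is a single batched forward pass over the N substituted patterns, regardless of how many inputs are missing; the K-nearest-data-points speedup discussed later would simply restrict `X_train` to the K closest points in the known dimensions.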
\n\nNote that we have obtained an approximation which has the same structure as the solution for normalized Gaussian basis functions (Ahmad and Tresp, 1993). \n\nIn many applications it might be easy to select a reasonable value for σ using prior knowledge, but there are also two simple ways to obtain a good estimate for σ using leave-one-out methods. The first method consists of removing the k-th pattern from the training data and calculating P(x^k) ≈ 1/(N−1) Σ_{l=1, l≠k}^N G(x^k; x^l, σ); then select the σ for which the log-likelihood Σ_k log P(x^k) is maximal. The second method consists of treating an input of the k-th training pattern as missing and then testing how well our algorithm (Equation 2) can predict the target; select the σ which gives the best performance. In this way it would even be possible to select input-dimension-specific widths σ_i, leading to \"elliptical\", axis-parallel Gaussians (Ahmad and Tresp, 1993). \n\nNote that the complexity of the solution is independent of the number of missing inputs! In contrast, the complexity of the solution for feedforward networks suggested in Tresp, Ahmad and Neuneier (1994) grows exponentially with the number of missing inputs. Although similar in character to the solution for normalized RBFs, here we have no restrictions on the network architecture, which allows us to choose the network most appropriate for the application. \n\nIf the amount of training data is large, one can use the following approximations: \n\n\u2022 Select only the K nearest data points, where the distance is determined based on the known inputs. K can probably be reasonably small (< 10). In the extreme case, K = 1, we obtain a nearest-neighbor solution. Efficient tree-based algorithms exist for computing the K nearest neighbors. 
\n\n\u2022 Use Gaussian mixtures instead of Parzen windows to estimate the input data distribution, and use the centers and variances of the mixture components in Equation 2. \n\n\u2022 Use a clustering algorithm and use the cluster centers instead of the data points in Equation 2. \n\nNote that the solution which substitutes the components of the training data closest to the input seems biologically plausible. \n\n2.3 Experimental Results \n\nWe tested our algorithm using the same data as in Ahmad and Tresp (1993). The task was to recognize a hand gesture based on its 2D projection. As input, the classifier is given the 2D polar coordinates of the five fingertip positions relative to the 2D center of mass of the hand (the input space is therefore 10-D). A multi-layer perceptron was trained on 4368 examples (624 poses for each gesture) and tested on a similar independent test set. The inputs were normalized to a variance of one, and σ was set to 0.1. (For a complete description of the task see Ahmad and Tresp, 1993.) As in Ahmad and Tresp (1993), we defined a correct classification as one in which the correct class was either classified as the most probable or the second most probable. Figure 2 shows experimental results. On the horizontal axis, the number of randomly chosen missing inputs is shown. The continuous line shows the performance using Equation 2, where we used only the 10 nearest neighbors in the approximation. Even with 5 missing inputs we obtain a score of over 90%, which is slightly better than the solution we obtained in Ahmad and Tresp (1993) for normalized RBFs. We expect our new solution to perform very well in general since we can always choose the best network for prediction and are not restricted in the architecture. As a benchmark we also included the case where the mean of the missing input was substituted. With 5 missing inputs, the performance is less than 60%. 
\n\n[Figure: 3D hand gesture recognition, 10 inputs; classification performance versus number of missing inputs.] \n\nFigure 2: Experimental results using a generalization data set. The continuous line indicates the performance using our proposed method. The dotted lines indicate the performance if the mean of the missing input variable is substituted. As a comparison, we included the results obtained in Ahmad and Tresp (1993) using the closed-form solution for RBF networks (dashed). \n\n3 Training (Backpropagation) \n\nFor a complete pattern (x^k, y^k), the weight update of a backpropagation step for weight w_j is \n\nΔw_j ∝ (y^k − NN_w(x^k)) ∂NN_w(x^k)/∂w_j. \n\nUsing the approximation of Equation 1, we obtain for an incomplete data point (compare Tresp, Ahmad and Neuneier, 1994) \n\nΔw_j ∝ Σ_{l∈compl} [G(y^k; NN_w(x^{c,k}, x^{u,l}), σ_y) G(x^{c,k}; x^{c,l}, σ) / Σ_{m∈compl} G(y^k; NN_w(x^{c,k}, x^{u,m}), σ_y) G(x^{c,k}; x^{c,m}, σ)] (y^k − NN_w(x^{c,k}, x^{u,l})) ∂NN_w(x^{c,k}, x^{u,l})/∂w_j.   (3) \n\nHere, l ∈ compl indicates the sum over complete patterns in the training set, and σ_y is the standard deviation of the output noise. Note that the gradient is a network of normalized Gaussian basis functions where the \"output weight\" is now the backpropagation step with the corresponding components of the complete training patterns substituted for the missing inputs. \n\nThe derivation of the last equation can be found in the Appendix. Figure 3 shows experimental results. \n\n[Figure: Boston housing data, 13 inputs; generalization error versus number of missing inputs. Curves: our approach, substitute mean, 28 complete patterns.] \n\nFigure 3: In the experiment, we used the Boston housing data set, which consists of 506 samples. 
The task is to predict the housing price from 13 variables which were thought to influence the housing price in a neighborhood. The network (a multi-layer perceptron) was trained with 28 complete patterns plus an additional 225 incomplete samples. The horizontal axis indicates how many inputs were missing in these 225 samples. The vertical axis shows the generalization performance. The continuous line indicates the performance of our approach, and the dash-dotted line indicates the performance if the mean is substituted for a missing variable. The dashed line indicates the performance of a network trained with only the 28 complete patterns. \n\n4 Conclusions \n\nWe have obtained efficient and robust solutions for the problem of recall and training with missing data. Experimental results verified our method. All of our results can easily be generalized to the case of noisy inputs. \n\nAcknowledgements \n\nValuable discussions with Hans-Georg Zimmermann, Tomaso Poggio, Michael Jordan and Zoubin Ghahramani are gratefully acknowledged. The first author would like to thank the Center for Biological and Computational Learning (MIT) for providing an excellent research environment during the summer of 1994. \n\n5 Appendix \n\nAssuming the standard signal-plus-Gaussian-noise model, we obtain for a complete sample \n\nP(x^k, y^k | {w_i}) = G(y^k; NN_w(x^k), σ_y) P(x^k), \n\nwhere {w_i} is the set of weights in the network. For an incomplete sample, \n\nP(x^{c,k}, y^k | {w_i}) = ∫ G(y^k; NN_w(x^{c,k}, x^u), σ_y) P(x^{c,k}, x^u) dx^u. \n\nUsing the same approximation as in Section 2.2, \n\nP(x^{c,k}, y^k | {w_i}) ≈ Σ_{l∈compl} G(y^k; NN_w(x^{c,k}, x^{u,l}), σ_y) G(x^{c,k}; x^{c,l}, σ), \n\nwhere l sums over all complete samples. As before, we substitute for the missing components the ones from the complete training data. 
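The resulting weighted-average backpropagation step can be sketched for a toy linear "network" NN_w(x) = w · x, whose gradient is simply dNN/dw = x. This is our own illustration under one reading of the Appendix: the mixture weights combine the output term G(y; NN, σ_y) with the input Gaussian projected onto the known dimension, and the per-term gradient is the usual backprop step with a complete sample substituted for the missing input. All names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy complete training data: y = w_true . x + noise, with a linear
# "network" NN_w(x) = w . x, so dNN/dw = x.
w_true = np.array([1.0, -2.0])
X_compl = rng.normal(size=(50, 2))
y_compl = X_compl @ w_true + 0.1 * rng.normal(size=50)

def grad_incomplete(w, x1_known, y, X_compl, sigma=0.5, sigma_y=0.5):
    """Gradient contribution of a pattern whose second input is missing:
    a normalized-Gaussian-weighted average of complete-data backprop
    steps, substituting complete samples for the missing component."""
    X_sub = X_compl.copy()
    X_sub[:, 0] = x1_known            # keep the known input, borrow x2 from data
    resid = y - X_sub @ w             # y^k - NN_w(x^{c,k}, x^{u,l})
    # log of G(y; NN, sigma_y) * G(x^c; x^{c,l}, sigma), up to constants
    logw = (-(X_compl[:, 0] - x1_known)**2 / (2.0 * sigma**2)
            - resid**2 / (2.0 * sigma_y**2))
    a = np.exp(logw - logw.max())
    a /= a.sum()                      # normalized Gaussian "basis functions"
    # weighted average of the backprop steps (y - NN) / sigma_y^2 * dNN/dw
    return (a[:, None] * (resid[:, None] / sigma_y**2 * X_sub)).sum(axis=0)

g = grad_incomplete(np.zeros(2), 0.5, 1.0, X_compl)  # one incomplete pattern
```

Gradients of incomplete patterns computed this way can simply be accumulated alongside ordinary backpropagation steps for the complete patterns when maximizing the log-likelihood of the next paragraph.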
The log-likelihood L (a function of the network weights {w_i}) can be calculated as (x^k can be either complete or incomplete) \n\nL = Σ_{k=1}^N log P(x^k, y^k | {w_i}). \n\nThe maximum likelihood solution consists of finding weights {w_i} which maximize the log-likelihood. Using the approximation of Equation 1, we obtain for an incomplete sample the gradient given in Equation 3 (compare Tresp, Ahmad and Neuneier, 1994). \n\nReferences \n\nAhmad, S. and Tresp, V. (1993). Some Solutions to the Missing Feature Problem in Vision. In S. J. Hanson, J. D. Cowan and C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann. \n\nBrunelli, R. and Poggio, T. (1991). HyperBF Networks for Real Object Recognition. IJCAI. \n\nBuntine, W. L. and Weigend, A. S. (1991). Bayesian Back-Propagation. Complex Systems, Vol. 5, pp. 605-643. \n\nDuda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons, New York. \n\nGhahramani, Z. and Jordan, M. I. (1994). Supervised Learning from Incomplete Data via an EM Approach. In J. D. Cowan, G. Tesauro and J. Alspector (Eds.), Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann. \n\nTresp, V., Ahmad, S. and Neuneier, R. (1994). Training Neural Networks with Deficient Data. In J. D. Cowan, G. Tesauro and J. Alspector (Eds.), Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann. \n\nTresp, V., Hollatz, J. and Ahmad, S. (1995). Representing Probabilistic Rules with Networks of Gaussian Basis Functions. Accepted for publication in Machine Learning. \n", "award": [], "sourceid": 985, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "Ralph", "family_name": "Neuneier", "institution": null}, {"given_name": "Subutai", "family_name": "Ahmad", "institution": null}]}