{"title": "Robust Neural Network Regression for Offline and Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 407, "page_last": 413, "abstract": null, "full_text": "Robust Neural Network Regression for Offline \n\nand Online Learning \n\nThomas Briegel* \n\nVolker Tresp \n\nSiemens AG, Corporate Technology \n\nSiemens AG, Corporate Technology \n\nD-81730 Munich, Germany \n\nthomas.briegel@mchp.siemens.de \n\nD-81730 Munich, Germany \nvolker.tresp@mchp.siemens.de \n\nAbstract \n\nWe replace the commonly used Gaussian noise model in nonlinear \nregression by a more flexible noise model based on the Student-t(cid:173)\ndistribution. The degrees of freedom of the t-distribution can be chosen \nsuch that as special cases either the Gaussian distribution or the Cauchy \ndistribution are realized. The latter is commonly used in robust regres(cid:173)\nsion. Since the t-distribution can be interpreted as being an infinite mix(cid:173)\nture of Gaussians, parameters and hyperparameters such as the degrees \nof freedom of the t-distribution can be learned from the data based on an \nEM-learning algorithm. We show that modeling using the t-distribution \nleads to improved predictors on real world data sets. In particular, if \noutliers are present, the t-distribution is superior to the Gaussian noise \nmodel. In effect, by adapting the degrees of freedom, the system can \n\"learn\" to distinguish between outliers and non-outliers. Especially for \nonline learning tasks, one is interested in avoiding inappropriate weight \nchanges due to measurement outliers to maintain stable online learn(cid:173)\ning capability. We show experimentally that using the t-distribution as \na noise model leads to stable online learning algorithms and outperforms \nstate-of-the art online learning methods like the extended Kalman filter \nalgorithm. 
\n\n1 INTRODUCTION \nA commonly used assumption in nonlinear regression is that targets are disturbed by inde(cid:173)\npendent additive Gaussian noise. Although one can derive the Gaussian noise assumption \nbased on a maximum entropy approach, the main reason for this assumption is practica(cid:173)\nbility: under the Gaussian noise assumption the maximum likelihood parameter estimate \ncan simply be found by minimization of the squared error. Despite its common use it is far \nfrom clear that the Gaussian noise assumption is a good choice for many practical prob(cid:173)\nlems. A reasonable approach therefore would be a noise distribution which contains the \nGaussian as a special case but which has a tunable parameter that allows for more flexible \ndistributions. In this paper we use the Student-t-distribution as a noise model which con(cid:173)\ntains two free parameters - the degrees of freedom 1/ and a width parameter (72. A nice \nfeature of the t-distribution is that if the degrees of freedom 1/ approach infinity, we recover \nthe Gaussian noise model. If 1/ < 00 we obtain distributions which are more heavy-tailed \nthan the Gaussian distribution including the Cauchy noise model with 1/ = 1. The latter \n\n*Now with McKinsey & Company, Inc. \n\n\f408 \n\nT. Briegel and V. Tresp \n\nis commonly used for robust regression. The first goal of this paper is to investigate if the \nadditional free parameters, e.g. v, lead to better generalization performance for real world \ndata sets if compared to the Gaussian noise assumption with v = 00. The most common \nreason why researchers depart from the Gaussian noise assumption is the presence of out(cid:173)\nliers. Outliers are errors which occur with low probability and which are not generated by \nthe data-generation process that is subject to identification. 
The general problem is that a few (maybe even one) outliers of high leverage are sufficient to throw the standard Gaussian error estimators completely off-track (Rousseeuw & Leroy, 1987). In the second set of experiments we therefore compare how the generalization performance is affected by outliers, both for the Gaussian noise assumption and for the t-distribution assumption. Dealing with outliers is often of critical importance for online learning tasks. Online learning is of great interest in many applications exhibiting non-stationary behavior like tracking, signal and image processing, or navigation and fault detection (see, for instance, the NIPS*98 Sequential Learning Workshop). Here one is interested in avoiding inappropriate weight changes due to measurement outliers to maintain a stable online learning capability. Outliers might result in highly fluctuating weights and possibly even instability when estimating the neural network weight vector online using a Gaussian error assumption. State-of-the-art online algorithms like the extended Kalman filter, for instance, are known to be nonrobust against such outliers (Meinhold & Singpurwalla, 1989) since they are based on a Gaussian output error assumption.

The paper is organized as follows. In Section 2 we adopt a probabilistic view of outlier detection by taking as a heavy-tailed observation error density the Student-t-distribution, which can be derived from an infinite mixture of Gaussians approach. In our work we use the multi-layer perceptron (MLP) as nonlinear model. In Section 3 we derive an EM algorithm for estimating the MLP weight vector and the hyperparameters offline. Employing a state-space representation to model the MLP's weight evolution in time, we extend the batch algorithm of Section 3 to the online learning case (Section 4).
The application of the computationally efficient Fisher scoring algorithm leads to posterior mode weight updates and an online EM-type algorithm for approximate maximum likelihood (ML) estimation of the hyperparameters. In the last two sections (Section 5 and Section 6) we present experiments and conclusions, respectively.

2 THE t-DENSITY AS A ROBUST ERROR DENSITY

We assume a nonlinear regression model where for the t-th data point the noisy target y_t ∈ R is generated as

    y_t = g(x_t; w_t) + v_t    (1)

and x_t ∈ R^k is a k-dimensional known input vector. g(.; w_t) denotes a neural network model characterized by the weight vector w_t ∈ R^n, in our case a multi-layer perceptron (MLP). In the offline case the weight vector w_t is assumed to be a fixed unknown constant vector, i.e. w_t ≡ w. Furthermore, we assume that v_t is uncorrelated noise with density p_{v_t}(.). In the offline case, we assume p_{v_t}(.) to be independent of t, i.e. p_{v_t}(.) ≡ p_v(.). In the following we assume that p_v(.) is a Student-t-density with ν degrees of freedom,

    p_v(z) = T(z|σ², ν) = Γ((ν+1)/2) / (√(πνσ²) Γ(ν/2)) (1 + z²/(σ²ν))^(−(ν+1)/2),  ν, σ > 0.    (2)

It is immediately apparent that for ν = 1 we recover the heavy-tailed Cauchy density. What is not so obvious is that for ν → ∞ we obtain a Gaussian density. For the derivation of the EM learning rules in the next section it is important to note that the t-density can be thought of as being an infinite mixture of Gaussians of the form

    T(z|σ², ν) = ∫ N(z|0, σ²/u) p(u) du,  u ~ χ²_ν/ν,    (3)
\n\n1.2r-~-~--\"-.--~----...\" \n\n1.1 \n\nliJO.9 \n\n0 .8 \n\n0 .7 \n\n\u00b0 z \n\n2 \n\n4 \n\n\" \n\n0.5'----:---5~-7;;IO:-----:15~---;20~--::;!25 \n\nnu_ Of 0UIIIen ('K.] \n\nFigure 1: Left: \u00a2(.)-functions for the Gaussian density (dashed) and t-densities with II = \n1,4,15 degrees of freedom. Right: MSE on Boston Housing data test set for additive \noutliers. The dashed line shows results using a Gaussian error measure and the continuous \nline shows the results using the Student-t-distribution as error measure. \n\nwhere T(zI0'2, II) is the Student-t-density with II degrees of freedom and width parameter \n0'2, N(zIO, 0'2/U) is a Gaussian density with center 0 and variance 0'2/U and U '\" X~/II \nwhere X~ is a Chi-square distribution with II degrees of freedom evaluated at U > O. \nTo compare different noise models it is useful to evaluate the \"\u00a2-function\" defined as (Hu(cid:173)\nber,1964) \n\n\u00a2(z) = -ologpv(z)/oz \n\n(4) \ni.e. the negative score-function of the noise density. In the case of i.i.d. samples the \u00a2(cid:173)\nfunction reflects the influence of a single measurement on the resulting estimator. Assum(cid:173)\ning Gaussian measurement errors Pv(z) = N(zIO,0'2) we derive \u00a2(z) = z/0'2 which \nmeans that for Izl -+ 00 a single outlier z can have an infinite leverage on the estimator. In \ncontrast, for constructing robust estimators West (1981) states that large outliers should not \nhave any influence on the estimator, i.e. \u00a2(z) -+ 0 for Izl -+ 00. Figure 1 (left) shows \u00a2(z) \nfor different II for the Student-t-distribution. It can be seen that the degrees of freedom II \ndetermine how much weight outliers obtain in influencing the regression. In particular, for \nfinite II, the influence of outliers with Izl -+ 00 approaches zero. \n\n3 ROBUST OFFLINE REGRESSION \nAs stated in Equation (3), the t-density can be thought of as being generated as an infinite \nmixture of Gaussians. 
Maximum likelihood adaptation of parameters and hyperparameters can therefore be performed using an EM algorithm (Lange et al., 1989). For the t-th sample, a complete data point would consist of the triple (x_t, y_t, u_t), of which only the first two are known and u_t is missing. In the E-step we estimate for every data point indexed by t

    α_t = (ν^old + 1) / (ν^old + δ_t),    (5)

where α_t = E[u_t | y_t, x_t] is the expected value of the unknown u_t given the available data (x_t, y_t) and where δ_t = (y_t − g(x_t; w^old))² / σ²_old.

In the M-step the weights w and the hyperparameters σ² and ν are optimized using

    w^new = argmin_w { Σ_{t=1}^T α_t (y_t − g(x_t; w))² },    (6)

    σ²_new = (1/T) Σ_{t=1}^T α_t (y_t − g(x_t; w^new))²,    (7)

    ν^new = argmax_ν { (Tν/2) log(ν/2) − T log Γ(ν/2) + (ν/2 − 1) Σ_{t=1}^T β_t − (ν/2) Σ_{t=1}^T α_t },    (8)

where

    β_t = DG((ν^old + 1)/2) − log((ν^old + δ_t)/2)    (9)

with the Digamma function DG(z) = ∂ log Γ(z)/∂z. Note that the M-step for ν is a one-dimensional nonlinear optimization problem. Also note that the M-step for the weights of the MLP reduces to a weighted least squares regression problem in which outliers tend to be weighted down. The exception of course is the Gaussian case with ν → ∞, in which all terms obtain equal weight.

4 ROBUST ONLINE REGRESSION

For robust online regression, we assume that the model Equation (1) is still valid but that w can change over time, i.e. w = w_t. In particular we assume that w_t follows a first-order random walk with normally distributed increments, i.e.

    w_t = w_{t−1} + ξ_t,  ξ_t ~ N(0, Q),

and where w_0 is normally distributed with center a_0 and covariance Q_0. Clearly, due to the nonlinear nature of g and due to the fact that the noise process is non-Gaussian, a fully Bayesian online algorithm, which for the linear case with Gaussian noise can be realized using the Kalman filter, is infeasible.
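In contrast to the intractable fully Bayesian treatment, the offline EM updates of Section 3 are straightforward to implement. The following toy sketch (an illustration of ours, not the paper's MLP implementation) applies Equations (5)-(7) with g(x; w) reduced to a constant location parameter w, so the weighted least-squares M-step collapses to a weighted mean; ν is held fixed rather than optimized via Equation (8):

```python
import math

def em_robust_location(y, nu=4.0, n_iter=50):
    # Toy instance of the EM updates (5)-(7): g(x; w) is a constant w,
    # so the weighted least-squares M-step (6) is a weighted mean.
    w = sum(y) / len(y)                                   # start at the sample mean
    sigma2 = sum((v - w) ** 2 for v in y) / len(y)
    for _ in range(n_iter):
        # E-step (Equation 5): alpha_t = (nu + 1) / (nu + delta_t)
        alpha = [(nu + 1.0) / (nu + (v - w) ** 2 / sigma2) for v in y]
        # M-steps (Equations 6 and 7): outliers receive small alpha_t
        w = sum(a * v for a, v in zip(alpha, y)) / sum(alpha)
        sigma2 = sum(a * (v - w) ** 2 for a, v in zip(alpha, y)) / len(y)
    return w, sigma2

if __name__ == "__main__":
    # 50 well-behaved points around 1.0 plus two gross outliers
    data = [1.0 + 0.5 * math.sin(i) for i in range(50)] + [25.0, 30.0]
    w_robust, _ = em_robust_location(data)
    w_gauss = sum(data) / len(data)  # Gaussian ML estimate = plain mean
    print(round(w_robust, 3), round(w_gauss, 3))
```

On such data the plain mean is pulled toward the outliers, while the EM estimate stays near the bulk of the data because the E-step weights the outliers down.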
\n\nOn the other hand, if we consider data 'D = {Xt, yt}f=l' the negative log-posterior \n-logp(WTI'D) of the parameter sequence WT = (wJ, ... , w~) T is up to a normaliz(cid:173)\ning constant \n\n(10) \n\n-logp(WTI'D) \n\nex: \n\nand can be used as the appropriate cost function to derive the posterior mode estimate \nW\u00a5AP for the weight sequence. The two differences to the presentation in the last section \nare that first, Wt is allowed to change over time and that second, penalty terms, stemming \nfrom the prior and the transition density, are included. The penalty terms are penalizing \nroughness of the weight sequence leading to smooth weight estimates. \n\nA suitable way to determine a stationary point of -logp(WTI'D), the posterior mode es(cid:173)\ntimate of W T , is to apply Fisher scoring. With the current estimate WT1d we get a better \nestimate wTew = wTld +171' for the unknown weight sequence WT where 'Y is the solution \nof \n\nwith the negative score function S(WT) = -8logp(WT1'D)/8WT and the expected infor(cid:173)\nmation matrix S(WT) = E[82 10gp(WTI'D)/8WT8WT ]. By applying the ideas given in \nFahrmeir & Kaufmann (1991) to robust neural network regression it turns out that solving \n(12), i.e. to compute the inverse of the expected information matrix, can be performed by \n\n(12) \n\n\fRobust Neural Network Regression/or Offline and Online Learning \n\n411 \n\nCholesky decomposition in one forward and backward pass through the set of data 'D. Note \nthat the expected information matrix is a positive definite block-tridiagonal matrix. The \nforward-backward steps have to be iterated to obtain the posterior mode estimate W.pAP \nfor WT. \n\nFor online posterior mode smoothing, it is of interest to smooth backwards after each filter \nstep t. If Fisher scoring steps are applied sequentially for t = 1,2, ... 
, then the posterior mode smoother at time step t − 1, W_{t−1}^MAP = (w_{1|t−1}, ..., w_{t−1|t−1})^T, together with the step-one predictor w_{t|t−1} = w_{t−1|t−1}, is a reasonable starting value for obtaining the posterior mode smoother W_t^MAP at time t. One can reduce the computational load by limiting the backward pass to a sliding time window, e.g. the last T_t time steps, which is reasonable in non-stationary environments for online purposes. Furthermore, if we use the underlying assumption that in most cases a new measurement y_t should not change estimates too drastically, then a single Fisher scoring step often suffices to obtain the new posterior mode estimate at time t. The resulting single Fisher scoring step algorithm with lookback parameter T_t has in fact just one additional line of code involving simple matrix manipulations compared to online Kalman smoothing and is given here in pseudo-code. Details about the algorithm and a full description can be found in Briegel & Tresp (1999).

Online single Fisher scoring step algorithm (pseudo-code)

for t = 1, 2, ... repeat the following four steps:
  - Evaluate the step-one predictor w_{t|t−1}.
  - Perform the forward recursions for s = t − T_t, ..., t.
  - New data point (x_t, y_t) arrives: evaluate the corrector step w_{t|t}.
  - Perform the backward smoothing recursions w_{s−1|t} for s = t, ..., t − T_t.

For the adaptation of the parameters in the t-distribution, we apply results from Fahrmeir & Künstler (1999) to our nonlinear assumptions and use an online EM-type algorithm for approximate maximum likelihood estimation of the hyperparameters ν_t and σ_t². We assume the scale factors σ_t² and the degrees of freedom ν_t to be fixed quantities in a certain time window of length T̃_t, e.g. σ_s² = σ², ν_s = ν for s ∈ {t − T̃_t, ..., t}. For deriving online EM update equations we treat the weight sequence w_t together with the mixing variables u_t as missing.
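The four steps above can be caricatured in a scalar setting. In the sketch below (our own simplification, not the paper's algorithm) the lookback window is T_t = 0, so the backward smoothing recursions vanish, and the MLP is replaced by a single random-walk level; the corrector is one Fisher scoring step on the t-likelihood, using the location Fisher information (ν+1)/((ν+3)σ²) of the t noise. All constants are illustrative assumptions:

```python
def robust_scalar_filter(ys, nu=4.0, sigma2=0.2, q=0.01, w0=0.0, p0=1.0):
    # Scalar caricature of the online algorithm: lookback T_t = 0 (no backward
    # smoothing) and the MLP replaced by a single random-walk level w_t.
    # Each step: step-one predictor, then ONE Fisher scoring corrector step.
    w, p = w0, p0
    info_obs = (nu + 1.0) / ((nu + 3.0) * sigma2)   # Fisher information of t noise
    estimates = []
    for y in ys:
        p = p + q                                    # predictor: w unchanged, variance grows
        r = y - w
        psi = (nu + 1.0) * r / (nu * sigma2 + r * r)  # bounded influence of the residual
        info = 1.0 / p + info_obs                    # expected information: prior + observation
        w = w + psi / info                           # single Fisher scoring corrector step
        p = 1.0 / info                               # curvature-based variance update
        estimates.append(w)
    return estimates

if __name__ == "__main__":
    ys = [1.0] * 30 + [50.0] + [1.0] * 30            # constant level with one gross outlier
    est = robust_scalar_filter(ys)
    print(round(est[29], 3), round(est[30], 3), round(est[-1], 3))
```

Run on a constant signal with one gross outlier, the estimate is essentially unmoved by the outlier, which is exactly the stability property this section is after; a Kalman-style update with linear influence would jump.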
\nBy linear Taylor series expansion of g(.; w s ) about the Fisher scoring solutions Wslt and by \napproximating posterior expectations E[ws I'D] with posterior modes Wslt, S E {t - ft, t} \nand posterior covariances cov[ws I'D] with curvatures :Eslt = E[( Ws - Wslt) (ws - Wslt) T I'D] \nin the E-step, a somewhat lengthy derivation results in approximate maximum likelihood \nupdate rules for (72 and 11 similar to those given in Section 3. Details about the online \nEM-type algorithm can be found in Briegel & Tresp (1999). \n\n5 EXPERIMENTS \n1. Experiment: Real World Data Sets. In the first experiment we tested if the Student(cid:173)\nt-distribution is a useful error measure for real-world data sets. In training, the Student(cid:173)\nt-distribution was used and both, the degrees of freedom 11 and the width parameter (72 \nwere adapted using the EM update rules from Section 3. Each experiment was repeated \n50 times with different divisions into training and test data. As a comparison we trained \nthe neural networks to minimize the squared error cost function (including an optimized \nweight decay term). On the test data set we evaluated the performance using a squared \nerror cost function. Table 1 provides some experimental parameters and gives the test \nset performance based on the 50 repetitions of the experiments. The additional explained \nvariance is defined as [in percent] 100 x (1 - MSPE, IMSPEN) where MSPE, is the \nmean squared prediction error using the t-distribution and MSPEN is the mean squared \nprediction error using the Gaussian error measure. Furthermore we supply the standard \n\n\f412 \n\nT. Briegel and V. Tresp \n\nTable I: Experimental parameters and test set performance on real world data sets. \n\nData Set \n\nBoston Housing \nSunspot \nFraser River \n\nI # Inputs/Hidden I Training I Test I Add.Exp.Var. [%] I Std. 
[%] I \n\n(13/6) \n(1217) \n(1217) \n\n400 \n221 \n600 \n\n106 \n47 \n334 \n\n4.2 \n5.3 \n5.4 \n\n0.93 \n0.67 \n0.75 \n\nerror based on the 50 experiments. In all three experiments the networks optimized with \nthe t-distribution as noise model were 4-5% better than the networks optimized using the \nGaussian as noise model and in all experiments the improvements were significant based on \nthe paired t-test with a significance level of 1 %. The results show clearly that the additional \nfree parameter in the Student-t-distribution does not lead to overfitting but is used in a \nsensible way by the system to value down the influence of extreme target values. Figure 2 \nshows the normal probability plots. Clearly visible is the derivation from the Gaussian \ndistribution for extreme target values. We also like to remark that we did not apply any \npreselection process in choosing the particular data sets which indicates that non-Gaussian \nnoise seems to be the rule rather than the exception for real world data sets. \n\n:: \n\n0.99 \n0.\" \n\n~095 \n1090 \n1075 \n\ni oso \ni0 25 \n1010 \n\n; 005 \n002 \n00\\ \n0003 \n000' ''-. _0-, ~--=\"-<>5o-----O-0 -\n\n---,07' -\n\n-',---' \n\n,e~\"\"\"MWlgWllllh~ .. -8fTttdenllly \n\nFigure 2: Normal probability plots of the three training data sets after learning with the \nGaussian error measure. The dashed line show the expected normal probabilities. The \nplots show clearly that the residuals follow a more heavy-tailed distribution than the normal \ndistribution. \n2. Experiment: Outliers. In the second experiment we wanted to test how our approach \ndeals with outliers which are artificially added to the data set. We started with the Boston \nhousing data set and divided it into training and test data. We then randomly selected a \nsubset of the training data set (between 0.5% and 25%) and added to the targets a uniformly \ngenerated real number in the interval [-5,5]. 
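This corruption step can be sketched as follows (function name, default spread, and seed are hypothetical choices of ours, not taken from the paper):

```python
import random

def add_outliers(targets, fraction, spread=5.0, seed=0):
    # Corrupt a random subset of regression targets as in the second experiment:
    # each selected target gets a uniform draw from [-spread, +spread] added.
    rng = random.Random(seed)
    ys = list(targets)
    n_corrupt = int(round(fraction * len(ys)))
    for i in rng.sample(range(len(ys)), n_corrupt):
        ys[i] += rng.uniform(-spread, spread)
    return ys

if __name__ == "__main__":
    clean = [0.0] * 400
    noisy = add_outliers(clean, fraction=0.05)
    print(sum(1 for v in noisy if v != 0.0))  # a 5% subset of the targets is perturbed
```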
Figure 1 (right) shows the mean squared error on the test set for different percentages of added outliers. The error bars are derived from 20 repetitions of the experiment with different divisions into training and test set. It is apparent that the approach using the t-distribution is consistently better than the network which was trained based on a Gaussian noise assumption.

3. Experiment: Online Learning. In the third experiment we examined the use of the t-distribution in online learning. Data were generated from a nonlinear map y = 0.6x² + b sin(6x) − 1 where b = −0.75, −0.4, −0.1, 0.25 for the first, second, third and fourth set of 150 data points, respectively. Gaussian noise with variance 0.2 was added and for training, an MLP with 4 hidden units was used. In the first experiment we compare the performance of the EKF algorithm with our single Fisher scoring step algorithm. Figure 3 (left) shows that our algorithm converges faster to the correct map and also handles the transition in the model (parameter b) much better than the EKF. In the second experiment, outliers uniformly drawn from the interval [-5, 5] were added to the targets with a probability of 10%. Figure 3 (middle) shows that the single Fisher scoring step algorithm using the
t-distribution is consistently better than the same algorithm using a Gaussian noise model and the EKF. The two plots on the right in Figure 3 compare the nonlinear maps learned after 150 and 600 time steps, respectively.

Figure 3: Left & Middle: Online MSE over each of the 4 sets of training data. On the left we compare extended Kalman filtering (EKF) (dashed) with the single Fisher scoring step algorithm with T_t = 10 (GFS-10) (continuous) for additive Gaussian noise. The second figure shows EKF (dashed-dotted), Fisher scoring with Gaussian error noise (GFS-10) (dashed) and t-distributed error noise (TFS-10) (continuous), respectively, for data with additive outliers. Right: True map (continuous), EKF learned map (dashed-dotted) and TFS-10 map (dashed) after T = 150 and T = 600 (data sets with additive outliers).

6 CONCLUSIONS

We have introduced the Student-t-distribution to replace the standard Gaussian noise assumption in nonlinear regression. Learning is based on an EM algorithm which estimates both the scaling parameters and the degrees of freedom of the t-distribution. Our results show that using the Student-t-distribution as noise model leads to 4-5% better test errors than using the Gaussian noise assumption on real-world data sets. This result seems to indicate that non-Gaussian noise is the rule rather than the exception and that extreme target values should in general be weighted down. Dealing with outliers is particularly important for online tasks in which outliers can lead to instability in the adaptation process. We introduced a new online learning algorithm using the t-distribution which leads to better and more stable results when compared to the extended Kalman filter.

References

Briegel, T. and Tresp, V. (1999) Dynamic Neural Regression Models, Discussion Paper, Seminar für Statistik, Ludwig-Maximilians-Universität München.
de Freitas, N., Doucet, A. and Niranjan, M. (1998) Sequential Inference and Learning, NIPS*98 Workshop, Breckenridge, CO.
Fahrmeir, L. and Kaufmann, H. (1991) On Kalman Filtering, Posterior Mode Estimation and Fisher Scoring in Dynamic Exponential Family Regression, Metrika 38, pp. 37-60.
Fahrmeir, L. and Künstler, R.
(1999) Penalized Likelihood Smoothing in Robust State Space Models, Metrika 49, pp. 173-191.
Huber, P. J. (1964) Robust Estimation of a Location Parameter, Annals of Mathematical Statistics 35, pp. 73-101.
Lange, K., Little, R. and Taylor, J. (1989) Robust Statistical Modeling Using the t-Distribution, JASA 84, pp. 881-896.
Meinhold, R. and Singpurwalla, N. (1989) Robustification of Kalman Filter Models, JASA 84, pp. 470-496.
Rousseeuw, P. and Leroy, A. (1987) Robust Regression and Outlier Detection, John Wiley & Sons.
West, M. (1981) Robust Sequential Approximate Bayesian Estimation, JRSS B 43, pp. 157-166.
", "award": [], "sourceid": 1768, "authors": [{"given_name": "Thomas", "family_name": "Briegel", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}