{"title": "A Solution for Missing Data in Recurrent Neural Networks with an Application to Blood Glucose Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 971, "page_last": 977, "abstract": "", "full_text": "A Solution for Missing Data in Recurrent Neural Networks With an Application to Blood Glucose Prediction \n\nVolker Tresp and Thomas Briegel * \n\nSiemens AG \nCorporate Technology \nOtto-Hahn-Ring 6 \n81730 M\u00fcnchen, Germany \n\nAbstract \n\nWe consider neural network models for stochastic nonlinear dynamical systems where measurements of the variable of interest are only available at irregular intervals, i.e. most realizations are missing. Difficulties arise since the solutions for prediction and maximum likelihood learning with missing data lead to complex integrals, which even for simple cases cannot be solved analytically. In this paper we propose a specific combination of a nonlinear recurrent neural predictive model and a linear error model which leads to tractable prediction and maximum likelihood adaptation rules. In particular, the recurrent neural network can be trained using the real-time recurrent learning rule and the linear error model can be trained by an EM adaptation rule, implemented using forward-backward Kalman filter equations. The model is applied to predict the glucose/insulin metabolism of a diabetic patient where blood glucose measurements are only available a few times a day at irregular intervals. The new model shows considerable improvement with respect to both recurrent neural networks trained with teacher forcing or in a free running mode and various linear models. \n\n1 INTRODUCTION \n\nIn many physiological dynamical systems measurements are acquired at irregular intervals. Consider the case of blood glucose measurements of a diabetic who only measures blood glucose levels a few times a day. 
At the same time physiological systems are typically highly nonlinear and stochastic, such that recurrent neural networks are suitable models. Typically, such networks are either used purely free running, in which the network's predictions are iterated, or in a teacher forcing mode, in which actual measurements are substituted if available. \n\n* {volker.tresp, thomas.briegel}@mchp.siemens.de \n\n\f972 \n\nV. Tresp and T. Briegel \n\nIn Section 2 we show that both approaches are problematic for highly stochastic systems and if many realizations of the variable of interest are unknown. The traditional solution is to use a stochastic model such as a nonlinear state space model. The problem here is that prediction and training with missing data lead to integrals which are usually considered intractable (Lewis, 1986). Alternatively, state dependent linearizations are used for prediction and training, the most popular example being the extended Kalman filter. In this paper we introduce a combination of a nonlinear recurrent neural predictive model and a linear error model which leads to tractable prediction and maximum likelihood adaptation rules. The recurrent neural network can be used in all generality to model the nonlinear dynamics of the system. The only limitation is that the error model is linear, which is not a major constraint in many applications. The first advantage of the proposed model is that for single or multiple step prediction we obtain simple iteration rules which are a combination of the output of the iterated neural network and a linear Kalman filter which is used for updating the linear error model. 
The second advantage is that for maximum likelihood learning the recurrent neural network can be trained using the real-time recurrent learning rule (RTRL) and the linear error model can be trained by an EM adaptation rule, implemented using forward-backward Kalman filter equations. We apply our model to develop a model of the glucose/insulin metabolism of a diabetic patient in which blood glucose measurements are only available a few times a day at irregular intervals and compare results from our proposed model to recurrent neural networks trained and used in the free running mode or in the teacher forcing mode as well as to various linear models. \n\n2 RECURRENT SYSTEMS WITH MISSING DATA \n\n[Figure 1 shows a time-series from t = 1 to t = 13 with iterated predictions, a measurement at time t = 7, and annotations for teacher forcing, free running, and a reasonable estimate at t = 6; the plot itself is not recoverable.] \n\nFigure 1: A neural network predicts the next value of a time-series based on the latest two previous measurements (left). As long as no measurements are available (t = 1 to t = 6), the neural network is iterated (unfilled circles). In a free-running mode, the neural network would ignore the measurement at time t = 7 to predict the time-series at time t = 8. In a teacher forcing mode, it would substitute the measured value for one of the inputs and use the iterated value for the other (unknown) input. This appears to be suboptimal since our knowledge about the time-series at time t = 7 also provides us with information about the time-series at time t = 6. For example the dotted circle might be a reasonable estimate. By using the iterated value for the unknown input, the prediction of the teacher forced system is not well defined and will in general lead to unsatisfactory results. 
A sensible response is shown on the right where the first few predictions after the measurement are close to the measurement. This can be achieved by including a proper error model (see text). \n\nConsider a deterministic nonlinear dynamical model of the form \n\ny_t = f_w(y_{t-1}, ..., y_{t-N}, u_t) \n\nof order N, with input u_t and where f_w(.) is a neural network model with parameter vector w. Such a recurrent model is either used in a free running mode, in which network predictions are used in the input of the neural network, or in a teacher forcing mode, where measurements are substituted in the input of the neural network whenever these are available. \n\n\fMissing Data in RNNs with an Application to Blood Glucose Prediction \n\n973 \n\nFigure 2: Left: The proposed architecture. Right: Linear impulse response. \n\nBoth can lead to undesirable results when many realizations are missing and when the system is highly stochastic. Figure 1 (left) shows that a free running model basically ignores the measurement for prediction and that the teacher forced model substitutes the measured value but leaves the unknown states at their predicted values, which also might lead to undesirable responses. The traditional solution is to include a model of the error, which leads to nonlinear stochastic models, the simplest being \n\ny_t = f_w(y_{t-1}, ..., y_{t-N}, u_t) + \u03b5_t \n\nwhere \u03b5_t is assumed to be additive uncorrelated zero-mean noise with probability density P_\u03b5(\u03b5) and represents unmodeled system dynamics. For prediction and learning with missing values we have to integrate over the unknowns, which leads to complex integrals which, for nonlinear models, have to be approximated. 
For example, Monte Carlo integration can be used.\u00b9 In general, those integrals are computationally too expensive to solve and, in practice, one relies on locally linearized approximations of the nonlinearities, typically in form of the extended Kalman filter. The extended Kalman filter is suboptimal and summarizes past data by an estimate of the means and the covariances of the variables involved (Lewis, 1986). \n\nIn this paper we pursue an alternative approach. Consider the model with state updates \n\ny*_t = f_w(y*_{t-1}, ..., y*_{t-N}, u_t)    (1) \n\nx_t = sum_{i=1}^{K} \u03b8_i x_{t-i} + \u03b5_t    (2) \n\ny_t = y*_t + x_t = f_w(y*_{t-1}, ..., y*_{t-N}, u_t) + sum_{i=1}^{K} \u03b8_i x_{t-i} + \u03b5_t    (3) \n\nand with measurement equation \n\nz_t = y_t + \u03b4_t    (4) \n\nwhere \u03b5_t and \u03b4_t denote additive noise. The variable of interest y_t is now the sum of the deterministic response of the recurrent neural network y*_t and a linear system error model x_t (Figure 2). z_t is a noisy measurement of y_t. In particular we are interested in the special cases that y_t can be measured with certainty (variance of \u03b4_t is zero) or that a measurement is missing (variance of \u03b4_t is infinity). The nice feature is now that y*_t can be considered a deterministic input to the state space model consisting of the equations (2)-(3). This means that for optimal one-step or multiple-step prediction, we can use the linear Kalman filter for equations (2)-(3) and measurement equation (4) by treating y*_t as deterministic input. Similarly, to train the parameters in the linear part of the system (i.e. {\u03b8_i}_{i=1}^{K}) we can use an EM adaptation rule, implemented using forward-backward Kalman filter equations (see the Appendix). The deterministic recurrent neural network is adapted with the residual error which cannot be explained by the linear model, i.e. 
target_t^nn = y_t^m - \u0177_t^linear \n\nwhere y_t^m is a measurement of y_t at time t and where \u0177_t^linear is the estimate of the linear model. After the recurrent neural network is adapted, the linear model can be retrained using the residual error which cannot be explained by the neural network. Then again the neural network is retrained, and so on, until no further improvement can be achieved. \n\n\u00b9 For maximum likelihood learning of linear models we obtain EM equations which can be solved using forward-backward Kalman equations (see Appendix). \n\nThe advantage of this approach is that all of the nonlinear interactions are modeled by a recurrent neural network which can be trained deterministically. The linear model is responsible for the noise model, which can be trained using powerful learning algorithms for linear systems. The constraint is that the error model cannot be nonlinear, which often might not be a major limitation. \n\n3 BLOOD GLUCOSE PREDICTION OF A DIABETIC \n\nThe goal of this work is to develop a predictive model of the blood glucose of a person with type 1 Diabetes mellitus. Such a model can have several useful applications in therapy: it can be used to warn a person of dangerous metabolic states, it can be used to make recommendations to optimize the person's therapy and, finally, it can be used in the design of a stabilizing control system for blood glucose regulation, a so-called \"artificial beta cell\" (Tresp, Moody and Delong, 1994). We want the model to be able to adapt using patient data collected under normal everyday conditions rather than the controlled conditions typical of a clinic. In a non-clinical setting, only a few blood glucose measurements per day are available. \n\nOur data set consists of the protocol of a diabetic over a period of almost six months. 
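The alternating adaptation described above can be illustrated on synthetic data. In the sketch below a polynomial least-squares fit stands in for the recurrent neural network and an AR(1) process stands in for the linear error model; the signal, the polynomial degree and the value theta_true = 0.8 are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# toy data: smooth deterministic signal plus AR(1) correlated noise
rng = np.random.default_rng(0)
n = 2000
t = np.linspace(0.0, 1.0, n)
signal = np.sin(2.0 * np.pi * t)          # stand-in for the true dynamics
theta_true = 0.8
x = np.zeros(n)
for i in range(1, n):                     # x_t = theta * x_{t-1} + eps_t
    x[i] = theta_true * x[i - 1] + 0.1 * rng.standard_normal()
y = signal + x                            # observed series

A = np.vander(t, 6)                       # degree-5 polynomial 'network'
x_hat = np.zeros(n)                       # current estimate of the error states
for _ in range(5):
    # adapt the deterministic model on the residual target y - x_hat
    coef, *_ = np.linalg.lstsq(A, y - x_hat, rcond=None)
    y_star = A @ coef
    # retrain the linear error model on the residual y - y_star
    r = y - y_star
    theta_hat = float(r[1:] @ r[:-1] / (r[:-1] @ r[:-1]))
    # one-step prediction of the error states for the next round
    x_hat = np.concatenate(([0.0], theta_hat * r[:-1]))
```

After a few alternations theta_hat settles close to the AR coefficient of the simulated error process while the polynomial absorbs the smooth component, mirroring the division of labor between the network and the linear error model.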
During that time period, times and dosages of insulin injections (basal insulin u_t^1 and normal insulin u_t^2), the times and amounts of food intake (fast u_t^3, intermediate u_t^4 and slow u_t^5 carbohydrates), the times and durations of exercise (regular u_t^6 or intense u_t^7) and the blood glucose level y_t (measured a few times a day) were recorded. The u_t^j, j = 1, ..., 7 are equal to zero except if there is an event, such as food intake, insulin injection or exercise. For our data set, inputs u_t^j were recorded with 15 minute time resolution. We used the first 43 days for training the model (containing 312 measurements of the blood glucose) and the following 21 days for testing (containing 151 measurements of the blood glucose). This means that we have to deal with approximately 93% of missing data during training. \n\nThe effects of insulin, food and exercise on the blood glucose are delayed and are approximated by linear response functions. v_t^j describes the effect of input u_t^j on glucose. As an example, the response v_t^2 of normal insulin u_t^2 after injection is determined by the diffusion of the subcutaneously injected insulin into the blood stream and can be modeled by three first order compartments in series or, as we have done, by a response function of the form v_t^2 = sum_\u03c4 g_2(t - \u03c4) u_\u03c4^2 with g_2(t) = a_2 t^2 e^{-b_2 t} (see Figure 2 for a typical impulse response). The functional mappings g_j(.) for the digestive tract and for exercise are less well known. In our experiments we followed other authors and used response functions of the above form. \n\nThe response functions g_j(.) describe the delayed effect of the inputs on the blood glucose. We assume that the functional form of g_j(.) is sufficient to capture the various delays of the inputs and can be tuned to the physiology of the patient by varying the parameters a_j, b_j. 
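A response function of this form is easy to evaluate as a causal discrete convolution at the 15 minute resolution of the data. A minimal sketch; the parameter values a = 1.0, b = 1.5 and the injected dose are illustrative, not the patient-tuned values of the paper:

```python
import numpy as np

def impulse_response(a, b, length, dt=0.25):
    # g(t) = a * t^2 * exp(-b * t), sampled every dt hours (15 minutes)
    t = np.arange(length) * dt
    return a * t ** 2 * np.exp(-b * t)

def response(u, a, b, dt=0.25):
    # v_t = sum_tau g(t - tau) * u_tau (causal discrete convolution)
    g = impulse_response(a, b, len(u), dt)
    return np.convolve(u, g)[:len(u)]

u = np.zeros(96)      # one day of inputs at 15-minute resolution
u[4] = 8.0            # a single injection of 8 units at t = 1 hour
v = response(u, a=1.0, b=1.5)
```

The response is zero before the event, rises to a peak (at roughly t = 2/b hours after the event for this family of curves) and decays back to zero, matching the shape of the impulse response in Figure 2.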
\nTo be able to capture the highly nonlinear physiological interactions between the response functions v_t^j and the blood glucose level y_t, which is measured only a few times a day, we employ a neural network in combination with a linear error model as described in Section 2. In our experiments f_w(.) is a feedforward multi-layer perceptron with three hidden units. The five inputs to the network were insulin (in_t^1 = v_t^1 + v_t^2), food (in_t^2 = v_t^3 + v_t^4 + v_t^5), exercise (in_t^3 = v_t^6 + v_t^7) and the current and previous estimate of the blood glucose. To be specific, the second order nonlinear neural network model is \n\ny*_t = y*_{t-1} + f_w(y*_{t-1}, y*_{t-2}, in_t^1, in_t^2, in_t^3)    (5) \n\nFor the linear error model we also use a model of order 2 \n\nx_t = \u03b8_1 x_{t-1} + \u03b8_2 x_{t-2} + \u03b5_t    (6) \n\nTable 1 shows the explained variance of the test set for different predictive models.\u00b2 In the first experiment (RNN-FR) we estimate the blood glucose at time t as the output of the neural network y_t = y*_t. The neural network is used in the free running mode for training and prediction. We use RTRL to both adapt the weights in the neural network as well as all parameters in the response functions g_j(.). The RNN-FR model explains 14.1 percent of the variance. The RNN-TF model is identical to the previous experiment except that measurements are substituted whenever available. RNN-TF could explain more of the variance (18.8%). The reason for the better performance is, of course, that information about measurements of the blood glucose can be exploited. \n\nThe model RNN-LEM2 (error model with order 2) corresponds to the combination of the recurrent neural network and the linear error model as introduced in Section 2. 
Here, y_t = x_t + y*_t models the blood glucose and z_t = y_t + \u03b4_t is the measurement equation, where we set the variance of \u03b4_t = 0 for a measurement of the blood glucose at time t and the variance of \u03b4_t = \u221e for missing values. For \u03b5_t we assume Gaussian independent noise. For prediction, equation (5) is iterated in the free running mode. The blood glucose at time t is estimated using a linear Kalman filter, treating y*_t as deterministic input in the state space model y_t = x_t + y*_t, z_t = y_t + \u03b4_t. We adapt the parameters in the linear error model (i.e. \u03b8_1, \u03b8_2, the variance of \u03b5_t) using an EM adaptation rule, implemented using forward-backward Kalman filter equations (see Appendix). The parameters in the neural network are adapted using RTRL exactly the same way as in the RNN-FR model, except that the target is now target_t^nn = y_t^m - \u0177_t^linear, where y_t^m is a measurement of y_t at time t and where \u0177_t^linear is the estimate of the linear error model (based on the linear Kalman filter). The adaptation of the linear error model and the neural network are performed alternately until no significant further improvement in performance can be achieved. \n\nAs indicated in Table 1, the RNN-LEM2 model achieves the best prediction performance with an explained variance of 44.9% (first order error model RNN-LEM1: 43.7%). As a comparison, we show the performance of just the linear error model LEM (this model ignores all inputs), a linear model (LM-FR) without an error model trained with RTRL and a linear model with an error model (LM-LEM). Interestingly, the linear error model which does not see any of the inputs can explain more variance (12.9%) than the LM-FR model (8.9%). The LM-LEM model, which can be considered a combination of both, can explain more than the sum of the individual explained variances (31.5%), which indicates that the combined training gives better performance than training both submodels individually. 
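In the scalar first order case the prediction scheme just described reduces to a few lines: iterate the error state, and on a noise-free measurement (variance of delta_t equal to zero) the Kalman gain becomes one, so the error state snaps to the measured residual, while a missing value (infinite variance) simply skips the update. A sketch with invented numbers; theta, q and the short series are illustrative only:

```python
import numpy as np

def filter_with_missing(y_star, z, theta, q, p0=1.0):
    # y_star: deterministic network output; z: measurements (np.nan = missing)
    # error model: x_t = theta * x_{t-1} + eps_t with var(eps_t) = q
    n = len(z)
    x, p = 0.0, p0
    y_hat = np.empty(n)
    for t in range(n):
        x, p = theta * x, theta * theta * p + q   # time update of the error state
        if not np.isnan(z[t]):
            # observed exactly: variance of delta_t is zero, the Kalman
            # gain is one and the state snaps to the measured residual
            x, p = z[t] - y_star[t], 0.0
        y_hat[t] = y_star[t] + x                  # network output plus error state
    return y_hat

y_star = np.array([100.0, 110.0, 120.0, 125.0, 120.0])
z = np.array([np.nan, 130.0, np.nan, np.nan, np.nan])
y_hat = filter_with_missing(y_star, z, theta=0.5, q=1.0)
```

After the measurement at t = 1 the predictions stay close to the measured value and then decay back toward the free running network output, which is exactly the sensible behavior sketched in Figure 1 (right).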
\nNote also that the nonlinear models (RNN-FR, RNN-TF, RNN-LEM) give considerably better results than their linear counterparts, confirming that the system is highly nonlinear. \n\nFigure 3 (left) shows an example of the responses of some of the models. We see that the free running neural network (dotted line) has relatively small amplitudes and cannot predict the three measurements very well. The RNN-TF model (dashed line) shows a better response to the measurements than the free running network. The best prediction of all measurements is indeed achieved by the RNN-LEM model (continuous line). \n\nBased on the linear iterated Kalman filter we can calculate the variance of the prediction. As shown in Figure 3 (right) the standard deviation is small right after a measurement is available and then converges to a constant value. Based on the prediction and the estimated variance, it will be possible to do a risk analysis for the diabetic (i.e. a warning of dangerous metabolic states). \n\n\u00b2 MSPE(model) is the mean squared prediction error on the test set of the model and MSPE(mean) is the mean squared prediction error of predicting the mean. \n\n[Figure 3 plots: vertical axis approximately 150 to 240, horizontal axis time in hours (2.5 to 7.5); the plotted curves are not recoverable.] \n\nFigure 3: Left: Responses of some models to three measurements. Note that the prediction of the first measurement is bad for all models but that the RNN-LEM model (continuous line) predicts the following measurements much better than both the RNN-FR (dotted) and the RNN-TF (dashed) model. 
Right: Standard deviation of prediction error of RNN-LEM. \n\nTable 1: Explained variance on test set [in percent]: 100 \u00b7 (1 - MSPE(model)/MSPE(mean)) \n\nMODEL % \nmean 0 \nLM 8.9 \nLEM 12.9 \nRNN-FR 14.1 \nRNN-TF 18.8 \nLM-LEM 31.4 \nRNN-LEM1 43.7 \nRNN-LEM2 44.9 \n\n4 CONCLUSIONS \n\nWe introduced a combination of a nonlinear recurrent neural network and a linear error model. Applied to blood glucose prediction it gave significantly better results than both recurrent neural networks alone and various linear models. Further work might lead to a predictive model which can be used by a diabetic on a daily basis. We believe that our results are very encouraging. We also expect that our specific model can find applications in other stochastic nonlinear systems in which measurements are only available at irregular intervals, such as in wastewater treatment, chemical process control and various physiological systems. Further work will include error models for the input measurements (for example, the number of food calories is typically estimated with great uncertainty). \n\nAppendix: EM Adaptation Rules for Training the Linear Error Model \n\nModel and observation equations of a general model are\u00b3 \n\nx_t = \u0398 x_{t-1} + \u03b5_t,  z_t = M_t x_t + \u03b4_t    (7) \n\nwhere \u0398 is the K x K transition matrix of the K-order linear error model. The K x 1 noise terms \u03b5_t are zero-mean uncorrelated normal vectors with common covariance matrix Q. \u03b4_t is an m-dimensional\u2074 zero-mean uncorrelated normal noise vector with covariance matrix R_t. Recall that we consider certain measurements and missing values as special cases of noisy measurements. \n\n\u00b3 Note that any linear system of order K can be transformed into a first order linear system of dimension K. \n\n\u2074 m indicates the dimension of the output of the time-series. 
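The reduction in footnote 3 — a K-order linear model rewritten as a first order system of dimension K — amounts to stacking delayed states and using a companion matrix for the transition. A small sketch with illustrative coefficients:

```python
import numpy as np

def companion(theta):
    # first order form: stacked state s_t = (x_t, x_{t-1}, ..., x_{t-K+1})
    K = len(theta)
    Theta = np.zeros((K, K))
    Theta[0, :] = theta              # first row holds theta_1 ... theta_K
    Theta[1:, :-1] = np.eye(K - 1)   # remaining rows shift the state down
    return Theta

theta = np.array([0.5, 0.3])         # illustrative order-2 coefficients
Theta = companion(theta)

xs = [0.2, 1.0]                      # scalar history: x_{-1} = 0.2, x_0 = 1.0
s = np.array([1.0, 0.2])             # stacked state s_0 = (x_0, x_{-1})
for _ in range(10):
    xs.append(theta[0] * xs[-1] + theta[1] * xs[-2])
    s = Theta @ s
```

Both recursions produce the same trajectory, so the first order Kalman machinery of the Appendix directly covers the order-2 error model used in Section 3.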
The initial state of the system is assumed to be a normal vector with mean \u03bc and covariance \u03a3. \n\nWe describe the EM equations for maximizing the likelihood of the model. Define the estimated parameters at the (r+1)st iterate of EM as the values \u03bc, \u03a3, \u0398, Q which maximize \n\nG(\u03bc, \u03a3, \u0398, Q) = E_r(log L | z_1, ..., z_n)    (8) \n\nwhere log L is the log-likelihood of the complete data x_0, x_1, ..., x_n, z_1, ..., z_n and E_r denotes the conditional expectation relative to a density containing the r-th iterate values \u03bc(r), \u03a3(r), \u0398(r) and Q(r). Recall that missing targets are modeled implicitly by the definition of M_t and R_t. \n\nFor calculating the conditional expectation defined in (8) the following set of recursions is used (using standard Kalman filtering results, see (Jazwinski, 1970)). First, we use the forward recursion \n\nx_t^{t-1} = \u0398 x_{t-1}^{t-1} \nP_t^{t-1} = \u0398 P_{t-1}^{t-1} \u0398^T + Q \nK_t = P_t^{t-1} M_t^T (M_t P_t^{t-1} M_t^T + R_t)^{-1} \nx_t^t = x_t^{t-1} + K_t (z_t - M_t x_t^{t-1}) \nP_t^t = P_t^{t-1} - K_t M_t P_t^{t-1}    (9) \n\nwhere we take x_0^0 = \u03bc and P_0^0 = \u03a3. Next, we use the backward recursion \n\nJ_{t-1} = P_{t-1}^{t-1} \u0398^T (P_t^{t-1})^{-1} \nx_{t-1}^n = x_{t-1}^{t-1} + J_{t-1} (x_t^n - \u0398 x_{t-1}^{t-1}) \nP_{t-1}^n = P_{t-1}^{t-1} + J_{t-1} (P_t^n - P_t^{t-1}) J_{t-1}^T \nP_{t-1,t-2}^n = P_{t-1}^{t-1} J_{t-2}^T + J_{t-1} (P_{t,t-1}^n - \u0398 P_{t-1}^{t-1}) J_{t-2}^T    (10) \n\nwith initialization P_{n,n-1}^n = (I - K_n M_n) \u0398 P_{n-1}^{n-1}. One forward and one backward recursion completes the E-step of the EM algorithm. \n\nTo derive the M-step first realize that the conditional expectations in (8) lead to the following equation: \n\nG = -1/2 log|\u03a3| - 1/2 tr{\u03a3^{-1} (P_0^n + (x_0^n - \u03bc)(x_0^n - \u03bc)^T)} \n    - n/2 log|Q| - 1/2 tr{Q^{-1} (C - B \u0398^T - \u0398 B^T + \u0398 A \u0398^T)} \n    - 1/2 sum_{t=1}^{n} log|R_t| - 1/2 sum_{t=1}^{n} tr{R_t^{-1} [(z_t - M_t x_t^n)(z_t - M_t x_t^n)^T + M_t P_t^n M_t^T]}    (11) \n\nwhere tr{.} denotes the trace, A = sum_{t=1}^{n} (P_{t-1}^n + x_{t-1}^n (x_{t-1}^n)^T), B = sum_{t=1}^{n} (P_{t,t-1}^n + x_t^n (x_{t-1}^n)^T) and C = sum_{t=1}^{n} (P_t^n + x_t^n (x_t^n)^T). \u0398(r+1) = B A^{-1} and Q(r+1) = n^{-1} (C - B A^{-1} B^T) maximize the log-likelihood equation (11). 
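In the scalar case (K = 1, m = 1, constant measurement variance) the E-step recursions (9)-(10) and the M-step updates collapse to a short program. A hedged sketch on simulated data: variable names are mine, the measurement variance is held fixed at its true value, and only the transition coefficient and Q are re-estimated.

```python
import numpy as np

# simulate x_t = theta * x_{t-1} + eps_t, observed as z_t = x_t + delta_t
rng = np.random.default_rng(1)
n = 500
theta_true, q_true, r_var = 0.9, 0.5, 0.25
x = np.zeros(n + 1)
for t in range(1, n + 1):
    x[t] = theta_true * x[t - 1] + np.sqrt(q_true) * rng.standard_normal()
z = x[1:] + np.sqrt(r_var) * rng.standard_normal(n)

theta, q = 0.1, 1.0                       # deliberately poor initial values
mu, sigma = 0.0, 1.0                      # held fixed in this sketch
for _ in range(50):
    # E-step, forward recursion (9): predicted and filtered moments
    xf = np.zeros(n + 1); pf = np.zeros(n + 1)
    xp = np.zeros(n + 1); pp = np.zeros(n + 1)
    xf[0], pf[0] = mu, sigma
    for t in range(1, n + 1):
        xp[t] = theta * xf[t - 1]
        pp[t] = theta * theta * pf[t - 1] + q
        k = pp[t] / (pp[t] + r_var)
        xf[t] = xp[t] + k * (z[t - 1] - xp[t])
        pf[t] = pp[t] - k * pp[t]
    # E-step, backward recursion (10): smoothed moments and lag-one covariance
    xs = xf.copy(); ps = pf.copy()
    pcs = np.zeros(n + 1)
    pcs[n] = (1 - k) * theta * pf[n - 1]
    for t in range(n, 0, -1):
        j = pf[t - 1] * theta / pp[t]
        xs[t - 1] = xf[t - 1] + j * (xs[t] - theta * xf[t - 1])
        ps[t - 1] = pf[t - 1] + j * j * (ps[t] - pp[t])
        if t > 1:
            jm = pf[t - 2] * theta / pp[t - 1]
            pcs[t - 1] = pf[t - 1] * jm + j * (pcs[t] - theta * pf[t - 1]) * jm
    # M-step: A, B, C and the updates Theta = B/A, Q = (C - B^2/A)/n
    A = np.sum(ps[:-1] + xs[:-1] ** 2)
    B = np.sum(pcs[1:] + xs[1:] * xs[:-1])
    C = np.sum(ps[1:] + xs[1:] ** 2)
    theta = B / A
    q = (C - B * B / A) / n
```

Starting from poor initial values, the alternation of E- and M-steps drives theta and q back toward the values used to generate the data, illustrating the monotone likelihood improvement of EM.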
\u03bc(r+1) is set to x_0^n and \u03a3 may be fixed at some reasonable baseline level. The derivation of these equations can be found in (Shumway & Stoffer, 1981). \n\nThe E-step (the forward and backward Kalman filter equations) and the M-step are alternated repeatedly until convergence to obtain the EM solution. \n\nReferences \n\nJazwinski, A. H. (1970) Stochastic Processes and Filtering Theory, Academic Press, N.Y. \nLewis, F. L. (1986) Optimal Estimation, John Wiley, N.Y. \nShumway, R. H. and Stoffer, D. S. (1981) Time Series Smoothing and Forecasting Using the EM Algorithm, Technical Report No. 27, Division of Statistics, UC Davis. \nTresp, V., Moody, J. and Delong, W.-R. (1994) Neural Modeling of Physiological Processes, in Computational Learning Theory and Natural Learning Systems 2, S. Hanson et al., eds., MIT Press. \n", "award": [], "sourceid": 1348, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "Thomas", "family_name": "Briegel", "institution": null}]}