{"title": "GRU-ODE-Bayes: Continuous Modeling of Sporadically-Observed Time Series", "book": "Advances in Neural Information Processing Systems", "page_first": 7379, "page_last": 7390, "abstract": "Modeling real-world multidimensional time series can be particularly challenging when these are sporadically observed (i.e., sampling is irregular both in time and across dimensions)\u2014such as in the case of clinical patient data. To address these challenges, we propose (1) a continuous-time version of the Gated Recurrent Unit, building upon the recent Neural Ordinary Differential Equations (Chen et al., 2018), and (2) a Bayesian update network that processes the sporadic observations. We bring these two ideas together in our GRU-ODE-Bayes method. We then demonstrate that the proposed method encodes a continuity prior for the latent process and that it can exactly represent the Fokker-Planck dynamics of complex processes driven by a multidimensional stochastic differential equation. Additionally, empirical evaluation shows that our method outperforms the state of the art on both synthetic data and real-world data with applications in healthcare and climate forecast. 
What is more, the continuity prior is shown to be well suited for low number of samples settings.", "full_text": "GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series\n\nEdward De Brouwer*†\nESAT-STADIUS\nKU LEUVEN\nLeuven, 3001, Belgium\nedward.debrouwer@esat.kuleuven.be\n\nJaak Simm*\nESAT-STADIUS\nKU LEUVEN\nLeuven, 3001, Belgium\njaak.simm@esat.kuleuven.be\n\nAdam Arany\nESAT-STADIUS\nKU LEUVEN\nLeuven, 3001, Belgium\nadam.arany@esat.kuleuven.be\n\nYves Moreau\nESAT-STADIUS\nKU LEUVEN\nLeuven, 3001, Belgium\nmoreau@esat.kuleuven.be\n\nAbstract\n\nModeling real-world multidimensional time series can be particularly challenging when these are sporadically observed (i.e., sampling is irregular both in time and across dimensions)—such as in the case of clinical patient data. To address these challenges, we propose (1) a continuous-time version of the Gated Recurrent Unit, building upon the recent Neural Ordinary Differential Equations (Chen et al., 2018), and (2) a Bayesian update network that processes the sporadic observations. We bring these two ideas together in our GRU-ODE-Bayes method. We then demonstrate that the proposed method encodes a continuity prior for the latent process and that it can exactly represent the Fokker-Planck dynamics of complex processes driven by a multidimensional stochastic differential equation. Additionally, empirical evaluation shows that our method outperforms the state of the art on both synthetic data and real-world data with applications in healthcare and climate forecast. What is more, the continuity prior is shown to be well suited for low number of samples settings.\n\n1 Introduction\n\nMultivariate time series are ubiquitous in various domains of science, such as healthcare (Jensen et al., 2014), astronomy (Scargle, 1982), or climate science (Schneider, 2001). 
Much of the methodology for time-series analysis assumes that signals are measured systematically at fixed time intervals. However, much real-world data can be sporadic (i.e., the signals are sampled irregularly and not all signals are measured each time). A typical example is patient measurements, which are taken when the patient comes for a visit (e.g., sometimes skipping an appointment) and where not every measurement is taken at every visit. Modeling then becomes challenging, as such data violates the main assumptions underlying traditional machine learning methods (such as recurrent neural networks).\n\n*Both authors contributed equally\n†Corresponding author\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nRecently, the Neural Ordinary Differential Equation (ODE) model (Chen et al., 2018) opened the way for a novel, continuous representation of neural networks. As time is intrinsically continuous, this framework is particularly attractive for time-series analysis. It opens the perspective of tackling the issue of irregular sampling in a natural fashion, by integrating the dynamics over whatever time interval is needed. Up to now, however, such ODE dynamics have been limited to the continuous generation of observations (e.g., decoders in variational auto-encoders (VAEs) (Kingma & Welling, 2013) or normalizing flows (Rezende et al., 2014)).\nInstead of the encoder-decoder architecture where the ODE part is decoupled from the input processing, we introduce a tight integration by interleaving the ODE and the input processing steps. Conceptually, this allows us to drive the dynamics of the ODE directly by the incoming sporadic inputs. To this end, we propose (1) a continuous-time version of the Gated Recurrent Unit and (2) a Bayesian update network that processes the sporadic observations. 
We combine these two ideas to form the GRU-ODE-Bayes method.\n\nFigure 1: Comparison of GRU-ODE-Bayes and NeuralODE-VAE on a 2D Ornstein-Uhlenbeck process with highly correlated Wiener processes (ρ = 0.99). Dots are the values of the actual underlying process (dotted lines) from which the sporadic observations are obtained. Solid lines and shaded areas are the inferred means and 95% confidence intervals. Note the smaller errors and smaller variance of GRU-ODE-Bayes vs. NeuralODE-VAE. Note also that GRU-ODE-Bayes can infer that a jump in one variable also implies a jump in the other, unobserved one (red arrows). Similarly, it also learns the reduction of variance resulting from a new incoming observation.\n\nThe tight coupling between observation processing and ODE dynamics allows the proposed method to model fine-grained nonlinear dynamical interactions between the variables. As illustrated in Figure 1, GRU-ODE-Bayes can (1) quickly infer the unknown parameters of the underlying stochastic process and (2) learn the correlation between its variables (red arrows in Figure 1). In contrast, the encoder-decoder based method NeuralODE-VAE proposed by Chen et al. (2018) captures the general structure of the process without being able to recover the detailed interactions between the variables (see Section 4 for a detailed comparison).\nOur model enjoys important theoretical properties. We frame our analysis in a general way by considering that observations follow dynamics driven by a stochastic differential equation (SDE). In Section 4 and Appendix I, we show that GRU-ODE-Bayes can exactly represent the corresponding Fokker-Planck dynamics in the special case of the Ornstein-Uhlenbeck process, as well as in generalized versions of it. 
We further perform an empirical evaluation and show that our method outperforms the state of the art on healthcare and climate data (Section 5).\n\n1.1 Problem statement\n\nWe consider the general problem of forecasting N sporadically observed D-dimensional time series. For example, consider data from N patients where D clinical longitudinal variables can potentially be measured. Each time series i ∈ {1, . . . , N} is measured at Ki time points specified by a vector of observation times ti ∈ R^Ki. The values of these observations are specified by a matrix of observations yi ∈ R^(Ki×D) and an observation mask mi ∈ {0, 1}^(Ki×D) (to indicate which of the variables are measured at each time point).\nWe assume that observations yi are sampled from the realizations of a D-dimensional stochastic process Y(t) whose dynamics is driven by an unknown SDE:\n\ndY(t) = µ(Y(t))dt + σ(Y(t))dW(t),    (1)\n\nwhere W(t) is a Wiener process. The distribution of Y(t) then evolves according to the celebrated Fokker-Planck equation (Risken, 1996). We refer to the mean and covariance parameters of its probability density function (PDF) as µY(t) and ΣY(t).\nOur goal will be to model the unknown temporal functions µY(t) and ΣY(t) from the sporadic measurements yi. These are obtained by sampling the random vectors Y(t) at times ti with some observation noise ε. Not all dimensions are sampled each time, resulting in missing values in yi. In contrast to classical SDE inference (Särkkä & Solin, 2019), we consider that the functions µY(t) and ΣY(t) are parametrized by neural networks.\nThis SDE formulation is general. It embodies the natural assumption that seemingly identical processes can evolve differently because of unobserved information. 
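To make the data-generating assumption of Eq. 1 concrete, the following minimal sketch (not part of our pipeline; all parameter values are arbitrary stand-ins) simulates a 2-D Ornstein-Uhlenbeck SDE with an Euler-Maruyama scheme and then samples it sporadically with an observation mask, mimicking the setting of Figure 1:

```python
import numpy as np

# Hypothetical illustration of the data-generating process of Eq. 1:
# a 2-D Ornstein-Uhlenbeck SDE, dY = theta*(mu - Y)dt + sigma*dW,
# with correlated Wiener increments (rho), observed sporadically.
rng = np.random.default_rng(0)
D, T, dt = 2, 1000, 0.01
theta, mu, sigma, rho = 1.0, np.zeros(D), 0.3, 0.99

# Cholesky factor of the covariance of the correlated increments.
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]) * dt)

Y = np.zeros((T, D))
for t in range(1, T):
    dW = L @ rng.standard_normal(D)                      # correlated noise
    Y[t] = Y[t - 1] + theta * (mu - Y[t - 1]) * dt + sigma * dW

# Sporadic sampling: each dimension observed independently and rarely,
# with additive observation noise; unobserved entries are left as NaN.
mask = rng.random((T, D)) < 0.02                          # m in {0, 1}^{K x D}
obs = np.where(mask, Y + 0.05 * rng.standard_normal((T, D)), np.nan)
```

Each row of mask plays the role of mi, marking which of the D dimensions are measured at that time point, and the model only ever sees the finite entries of obs. The SDE formulation above is what GRU-ODE-Bayes has to recover from such masked samples.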
In the case of intensive care, as developed in Section 5, it reflects the evolving uncertainty regarding the patient’s future condition.\n\n2 Proposed method\n\nAt a high level, we propose a dual mode system consisting of (1) a GRU-inspired continuous-time state evolution (GRU-ODE) that propagates in time the hidden state h of the system between observations and (2) a network that updates the current hidden state to incorporate the incoming observations (GRU-Bayes). The system switches from propagation to update and back whenever a new observation becomes available.\nWe also introduce an observation model fobs(h(t)) mapping h to the estimated parameters of the observations distribution, µY(t) and ΣY(t) (details in Appendix E). GRU-ODE then explicitly learns the Fokker-Planck dynamics of Eq. 1. This procedure allows end-to-end training of the system to minimize the loss with respect to the sporadically sampled observations y.\n\n2.1 GRU-ODE derivation\n\nTo derive the GRU-based ODE, we first show that the GRU proposed by Cho et al. (2014) can be written as a difference equation. First, let r_t, z_t, and g_t be the reset gate, update gate, and update vector of the GRU:\n\nr_t = σ(W_r x_t + U_r h_{t−1} + b_r)\nz_t = σ(W_z x_t + U_z h_{t−1} + b_z)\ng_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h),    (2)\n\nwhere ⊙ is the elementwise product. Then the standard update for the hidden state h of the GRU is\n\nh_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ g_t.\n\nWe can also write this as h_t = GRU(h_{t−1}, x_t). By subtracting h_{t−1} from this state update equation and factoring out (1 − z_t), we obtain a difference equation\n\nΔh_t = h_t − h_{t−1} = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ g_t − h_{t−1} = (1 − z_t) ⊙ (g_t − h_{t−1}).\n\nThis difference equation naturally leads to the following ODE for h(t):\n\ndh(t)/dt = (1 − z(t)) ⊙ (g(t) − h(t)),    (3)\n\nwhere z, g, r, and x are the continuous counterparts of Eq. 2. See Appendix A for the explicit form. We name the resulting system GRU-ODE. 
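As an illustration, the right-hand side of Eq. 3 can be written down directly. The sketch below is hypothetical (random stand-in weights, autonomous case with no input x) and only mirrors the gate structure of Eq. 2:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_ode_cell(h, Ur, Uz, Uh, br, bz, bh):
    """Right-hand side of Eq. 3: dh/dt = (1 - z) * (g - h), autonomous case."""
    r = sigmoid(Ur @ h + br)           # reset gate
    z = sigmoid(Uz @ h + bz)           # update gate
    g = np.tanh(Uh @ (r * h) + bh)     # update vector (candidate state)
    return (1.0 - z) * (g - h)

rng = np.random.default_rng(0)
H = 4
Ur, Uz, Uh = (rng.standard_normal((H, H)) * 0.1 for _ in range(3))
br = bz = bh = np.zeros(H)
h = np.tanh(rng.standard_normal(H))    # a state inside [-1, 1]
dh = gru_ode_cell(h, Ur, Uz, Uh, br, bz, bh)
```

Since z lies in (0, 1) and g in (-1, 1), a small explicit step h + dt·dh is a convex move of h towards g, which is the mechanism behind the boundedness property discussed in Section 2.2.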
Similarly, we derive the minimal GRU-ODE, a variant based on the minimal GRU (Zhou et al., 2016), described in Appendix G.\n\nIn case continuous observations or control signals are available, they can be naturally fed to the GRU-ODE input x(t). For example, in the case of clinical trials, the administered daily doses of the drug under study can be used to define a continuous input signal. If no continuous input is available, then nothing is fed as x(t) and the resulting ODE in Eq. 3 is autonomous, with g(t) and z(t) only depending on h(t).\n\n2.2 General properties of GRU-ODE\n\nGRU-ODE enjoys several useful properties:\nBoundedness. First, the hidden state h(t) stays within the [−1, 1] range.3 This restriction is crucial for the compatibility with the GRU-Bayes model and comes from the negative feedback term in Eq. 3, which stabilizes the resulting system. In detail, if the j-th dimension of the starting state h(0) is within [−1, 1], then h(t)_j will always stay within [−1, 1] because\n\ndh(t)_j/dt ≤ 0 at h(t)_j = 1    and    dh(t)_j/dt ≥ 0 at h(t)_j = −1.\n\nThis can be derived from the ranges of z and g in Eq. 2. Moreover, were h(0) to start outside of the [−1, 1] region, the negative feedback would quickly push h(t) into this region, making the system also robust to numerical errors.\nContinuity. Second, GRU-ODE is Lipschitz continuous with constant K = 2. Importantly, this means that GRU-ODE encodes a continuity prior for the latent process h(t). This is in line with the assumption of a continuous hidden process generating observations (Eq. 1). In Section 5.5, we demonstrate empirically the importance of this prior in the small-sample regime.\nGeneral numerical integration. As a parametrized ODE, GRU-ODE can be integrated with any numerical solver. In particular, adaptive step size solvers can be used. 
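For instance, the autonomous version of Eq. 3 can be integrated with the simplest fixed-step Euler scheme. The sketch below (arbitrary stand-in weights, simplified reset-free gates) also checks the boundedness property numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 8
Uz = rng.standard_normal((H, H)) * 0.2
Uh = rng.standard_normal((H, H)) * 0.2

def rhs(h):
    """dh/dt of Eq. 3 with simplified (reset-free) stand-in gates."""
    z = 1.0 / (1.0 + np.exp(-(Uz @ h)))   # update gate, in (0, 1)
    g = np.tanh(Uh @ h)                    # candidate state, in (-1, 1)
    return (1.0 - z) * (g - h)

h = np.tanh(rng.standard_normal(H))        # start inside [-1, 1]
for _ in range(1000):
    h = h + 0.05 * rhs(h)                  # fixed-step Euler, dt = 0.05
# h remains inside [-1, 1]: the derivative is <= 0 at h_j = 1 and
# >= 0 at h_j = -1, and small Euler steps keep the iterate inside.
```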
Our model can then afford large time steps when the internal dynamics is slow, taking advantage of the continuous-time formulation of Eq. 3. It can also be made faster with sophisticated ODE integration methods. We implemented the following methods: Euler, explicit midpoint, and Dormand-Prince (an adaptive step size method). Appendix C illustrates that the Dormand-Prince method requires fewer time steps.\n\n2.3 GRU-Bayes\n\nGRU-Bayes is the module that processes the sporadically incoming observations to update the hidden vectors, and hence the estimated PDF of Y(t). This module is based on a standard GRU and thus operates in the region [−1, 1] that is required by GRU-ODE. In particular, GRU-Bayes is able to update h(t) to any point in this region. Any adaptation is then within reach with a single observation.\nTo feed the GRU unit inside GRU-Bayes with a non-fully-observed vector, we first preprocess it with an observation mask using fprep, as described in Appendix D. For a given time series, the resulting update for its k-th observation y[k] at time t = t[k], with mask m[k] and hidden vector h(t−), is\n\nh(t+) = GRU(h(t−), fprep(y[k], m[k], h(t−))),    (4)\n\nwhere h(t−) and h(t+) denote the hidden representation before and after the jump from the GRU-Bayes update. We also investigate an alternative option where h(t) is updated by each observed dimension sequentially. We call this variant GRU-ODE-Bayes-seq (see Appendix F for more details). In Appendix H, we run an ablation study of the proposed GRU-Bayes architecture by replacing it with an MLP and show that the aforementioned properties are crucial for good performance.\n\n2.4 GRU-ODE-Bayes\n\nThe proposed GRU-ODE-Bayes combines GRU-ODE and GRU-Bayes. GRU-ODE is used to evolve the hidden state h(t) in continuous time between the observations, and GRU-Bayes transforms the hidden state, based on the observation y, from h(t−) to h(t+). 
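The propagate/update alternation can be sketched as follows. This is a structural illustration only: the gates, the crude fprep (zero-filling plus mask concatenation), and the stand-in update cell are hypothetical simplifications of the actual learned components:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 8, 2
Uz = rng.standard_normal((H, H)) * 0.1
Uh = rng.standard_normal((H, H)) * 0.1
Wu = rng.standard_normal((H, 2 * D)) * 0.1   # stand-in update-cell weights

def ode_step(h, dt):
    """One Euler step of GRU-ODE (Eq. 3), autonomous case."""
    z = 1.0 / (1.0 + np.exp(-(Uz @ h)))
    g = np.tanh(Uh @ h)
    return h + dt * (1.0 - z) * (g - h)

def bayes_update(h, y, m):
    """Stand-in for Eq. 4: jump to a new state in [-1, 1] from y and mask m."""
    x = np.concatenate([np.where(m, y, 0.0), m.astype(float)])  # crude fprep
    return np.tanh(Uh @ h + Wu @ x)

t, h = 0.0, np.zeros(H)
for tk in [0.3, 0.7, 1.1]:                  # observation times t[k]
    while t < tk:                            # propagate up to t[k]
        dt = min(0.05, tk - t)
        h = ode_step(h, dt)
        t += dt
    y = rng.standard_normal(D)
    m = np.array([True, False])              # only the first dim observed
    h = bayes_update(h, y, m)                # discrete jump at t = t[k]
```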
As best illustrated in Figure 2, the alternation between GRU-ODE and GRU-Bayes results in an ODE with jumps, where the jumps are at the locations of the observations.\n\n3We use the notation [−1, 1] to also mean the multi-dimensional range (i.e., all elements are within [−1, 1]).\n\nGRU-ODE-Bayes is best understood as a filtering approach. Based on previous observations (until time tk), it can estimate the probability of future observations. Like the celebrated Kalman filter, it alternates between a prediction (GRU-ODE) and a filtering (GRU-Bayes) phase. Future values of the time series are predicted by integrating the hidden process h(t) in time, as shown by the green solid line in Figure 2. The update step discretely updates the hidden state when a new measurement becomes available (dotted blue line). Note that, unlike the Kalman filter, our approach is able to learn complex dynamics for the hidden process.\n\nFigure 2: GRU-ODE-Bayes uses GRU-ODE to evolve the hidden state between two observation times t[k] and t[k + 1]. GRU-Bayes processes the observations and updates the hidden vector h in a discrete fashion, reflecting the additional information brought in by the observed data.\n\nObjective function\n\nTo train the model using sporadically-observed samples, we introduce two losses. The first loss, Losspre, is computed before the observation update and is the negative log-likelihood (NegLL) of the observations. For the observation of a single sample, we have (for readability, we drop the time indexing):\n\nLosspre = −Σ_{j=1}^{D} m_j log p(y_j | θ = fobs(h)_j),\n\nwhere m_j is the observation mask and fobs(h)_j are the parameters of the distribution before the update, for dimension j.
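For Gaussian observations this masked negative log-likelihood is straightforward to compute. The sketch below is illustrative only, with mu and var standing in for the output of fobs(h):

```python
import numpy as np

def loss_pre(y, m, mu, var):
    """Masked Gaussian NegLL: only observed dimensions (m = 1) contribute."""
    nll = 0.5 * (np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var)
    return np.sum(np.where(m, nll, 0.0))

y   = np.array([0.5, -1.0, 2.0])     # hypothetical observation
m   = np.array([True, False, True])  # second variable is unobserved
mu  = np.array([0.4, 0.0, 1.5])      # stand-ins for f_obs(h)
var = np.array([0.1, 1.0, 0.2])
loss = loss_pre(y, m, mu, var)       # unchanged whatever value y[1] takes
```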
Thus, the error is only computed on the observed values of y.\nFor the second loss, let ppre denote the predicted distributions (from h−) before GRU-Bayes. With pobs, the PDF of Y(t) given the noisy observation (with noise vector ε), we first compute the analogue of the Bayesian update:\n\npBayes,j ∝ ppre,j · pobs,j.\n\nLet ppost denote the predicted distribution (from h+) after applying GRU-Bayes. We then define the post-jump loss as the KL-divergence between pBayes and ppost:\n\nLosspost = Σ_{j=1}^{D} m_j DKL(pBayes,j || ppost,j).\n\nIn this way, we force our model to learn to mimic a Bayesian update. Similarly to the pre-jump loss, Losspost is computed only for the observed dimensions. The total loss is then obtained by adding both losses with a weighting parameter λ.\n\nAlgorithm 1 GRU-ODE-Bayes\nInput: Initial state h0, observations y, mask m, observation times t, final time T.\nInitialize time = 0, loss = 0, h = h0.\nfor k = 1 to K do\n  {ODE evolution to t[k]}\n  h = GRU-ODE(h, time, t[k])\n  time = t[k]\n  {Pre-jump loss}\n  loss += Losspre(y[k], m[k], h)\n  {Update}\n  h = GRU-Bayes(y[k], m[k], h)\n  {Post-jump loss}\n  loss += λ · Losspost(y[k], m[k], h)\nend for\n{ODE evolution to T}\nh = GRU-ODE(h, t[K], T)\nreturn (h, loss)\n\nFor binomial and Gaussian distributions, computing Losspost can be done analytically. In the case of a Gaussian distribution, we can compute the Bayesian updated mean µBayes and variance σ²Bayes as\n\nµBayes = (σ²obs µpre + σ²pre µobs) / (σ²pre + σ²obs),\nσ²Bayes = (σ²pre σ²obs) / (σ²pre + σ²obs),\n\nwhere for readability we dropped the dimension sub-index.
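As a worked example (with arbitrary numbers), the update above can be checked numerically, together with the closed-form KL-divergence between two univariate Gaussians used in Losspost:

```python
import numpy as np

def bayes_update(mu_pre, var_pre, mu_obs, var_obs):
    """Product of two Gaussians: precision-weighted mean, harmonic variance."""
    var_b = var_pre * var_obs / (var_pre + var_obs)
    mu_b = (var_obs * mu_pre + var_pre * mu_obs) / (var_pre + var_obs)
    return mu_b, var_b

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for univariate Gaussians."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# Equal variances: the Bayesian mean is halfway and the variance is halved.
mu_b, var_b = bayes_update(mu_pre=0.0, var_pre=1.0, mu_obs=2.0, var_obs=1.0)
# mu_b == 1.0, var_b == 0.5, and the KL of a distribution with itself is 0.
```

In the limit where the observation noise is much smaller than the predicted variance, the update collapses onto the observation distribution.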
In many real-world cases, the observation noise σ²obs ≪ σ²pre, in which case pBayes is just the observation distribution: µBayes = µobs and σ²Bayes = σ²obs.\n\n2.5 Implementation\n\nThe pseudocode of GRU-ODE-Bayes is depicted in Algorithm 1, where a forward pass is shown for a single time series.4 For mini-batching several time series, we sort the observation times across all time series and, for each unique time point t[k], we create a list of the time series that have observations. The main loop of the algorithm iterates over this set of unique time points. In the GRU-ODE step, we propagate all hidden states jointly. The GRU-Bayes update and the loss calculation are only executed on the time series that have an observation at that particular time point. The complexity of our approach then scales linearly with the number of observations and quadratically with the dimension of the observations. When memory cost is a bottleneck, the gradient can be computed using the adjoint method, without backpropagating through the solver operations (Chen et al., 2018).\n\n3 Related research\n\nMachine learning has a long history in time series modelling (Mitchell, 1999; Gers et al., 2000; Wang et al., 2006; Chung et al., 2014). However, recent massive real-world data collection, such as electronic health records (EHR), increases the need for models capable of handling such complex data (Lee et al., 2017). As stated in the introduction, the sporadic nature of these data is the main difficulty.\nTo address the nonconstant sampling, a popular approach is to recast observations into fixed-duration time bins. However, this representation results in missing observations both in time and across feature dimensions. This makes the direct usage of neural network architectures tricky.
To overcome this issue, the main approach consists of some form of data imputation combined with jointly feeding the observation mask and times of observations to the recurrent network (Che et al., 2018; Choi et al., 2016a; Lipton et al., 2016; Du et al., 2016; Choi et al., 2016b; Cao et al., 2018). This approach strongly relies on the assumption that the network will learn to process true and imputed samples differently. Despite some promising experimental results, there is no guarantee that it will do so. Some researchers have tried to alleviate this limitation by introducing more meaningful data representations for sporadic time series (Rajkomar et al., 2018; Razavian & Sontag, 2015; Ghassemi et al., 2015), such as tensors (De Brouwer et al., 2018; Simm et al., 2017).\nOthers have addressed the missing data problem with generative probabilistic models. Among those, (multitask) Gaussian processes (GPs) are by far the most popular (Bonilla et al., 2008). They have been used for smart imputation before an RNN or CNN architecture (Futoma et al., 2017; Moor et al., 2019), for modelling a hidden process in joint models (Soleimani et al., 2018), and to derive informative representations of time series (Ghassemi et al., 2015). GPs have also been used for direct forecasting (Cheng et al., 2017). However, they usually suffer from high uncertainty outside the observation support, are computationally intensive (Quiñonero-Candela & Rasmussen, 2005), and learning the optimal kernel is tricky. Neural Processes, a neural version of GPs, have also been introduced by Garnelo et al. (2018). In contrast with our work, which focuses on continuous-time real-valued time series, continuous-time modelling of time-to-events has been addressed with point processes (Mei & Eisner, 2017) and continuous-time Bayesian networks (Nodelman et al., 2002).
Yet, our continuous modelling of the latent process would allow us to straightforwardly model a continuous intensity function and thus handle both real-valued and event types of data. This extension is left for future work.\n\n4Code is available in the following anonymous repository: https://github.com/edebrouwer/gru_ode_bayes\n\nMost recently, the seminal work of Chen et al. (2018) suggested a continuous version of neural networks that overcomes the limits imposed by discrete-time recurrent neural networks. Coupled with a variational auto-encoder architecture (Kingma & Welling, 2013), it proposed a natural way of generating irregularly sampled data. However, it transferred the difficult task of processing sporadic data to the encoder, which is a discrete-time RNN. In a work submitted concomitantly to ours (Rubanova et al., 2019), the authors proposed a convincing new VAE architecture that uses a Neural-ODE architecture for both encoding and decoding the data.\nRelated auto-encoder approaches with sequential latents operating in discrete time have also been proposed (Krishnan et al., 2015, 2017). These models rely on classical RNN architectures in their inference networks, hence not addressing the sporadic nature of the data. What is more, while they have been shown useful for smoothing and counterfactual inference, their formulation is less suited for forecasting. Our method also has connections to the Extended Kalman Filter (EKF), which models the dynamics of the distribution of processes in continuous time. However, the practical applicability of the EKF is limited because of the linearization of the state update and the difficulties involved in identifying its parameters. Importantly, the ability of the GRU to learn long-term dependencies is a significant advantage.\nFinally, other works have investigated the relationship between deep neural networks and partial differential equations.
An interesting line of research has focused on deriving better deep architectures motivated by the stability of the corresponding partial differential equations (PDEs) (Haber & Ruthotto, 2017; Chang et al., 2019). Despite their PDE motivation, those approaches eventually designed new discrete architectures and did not explore the application to continuous inputs and time.\n\n4 Application to synthetic SDEs\n\nFigure 1 illustrates the capabilities of our approach compared to NeuralODE-VAE on data generated from a process driven by a multivariate Ornstein-Uhlenbeck (OU) SDE with random parameters. Compared to NeuralODE-VAE, which retrieves the average dynamics of the samples, our approach detects the correlation between both features and updates its predictions more finely as new observations arrive. In particular, note that GRU-ODE-Bayes updates its prediction and confidence on a feature even when only the other one is observed, taking advantage of the fact that they are correlated. This can be seen in the left pane of Figure 1 where, at time t = 3, Dimension 1 (blue) is updated because of the observation of Dimension 2 (green).\nBy directly feeding sporadic inputs into the ODE, GRU-ODE-Bayes sequentially filters the hidden state and thus estimates the PDF of the future observations. This is the core strength of the proposed method, allowing it to perform long-term predictions.\nIn Appendix I, we further show that our model can exactly represent the dynamics of a multivariate OU process with random parameters. Our model can also handle nonlinear SDEs, as shown in Appendix J, where we present an example inspired by the Brusselator (Prigogine, 1982), a chaotic ODE.\n\n5 Empirical evaluation\n\nWe evaluated our model on two data sets from different application areas: healthcare and climate forecasting. In both applications, we assume the data consists of noisy observations from an underlying unobserved latent process, as in Eq. 1.
We focused on the general task of forecasting the time series at future time points. Models are trained to minimize the negative log-likelihood.\n\n5.1 Baselines\n\nWe used a comprehensive set of state-of-the-art baselines to compare the performance of our method. All models use the same hidden representation size and a comparable number of parameters.\nNeuralODE-VAE (Chen et al., 2018). We model the time derivative of the hidden representation as a 2-layer MLP. To take missingness across features into account, we add a mechanism to feed an observation mask.\nImputation Methods. We implemented two imputation methods as described in Che et al. (2018): GRU-Simple and GRU-D.\nSequential VAEs (Krishnan et al., 2015, 2017). We extended the deep Kalman filter architecture by feeding an observation mask and updating the loss function accordingly.\nT-LSTM (Baytas et al., 2017). We reused the proposed time-aware LSTM cell to design a forecasting RNN with an observation mask.\n\n5.2 Electronic health records\n\nElectronic Health Records (EHR) analysis is crucial to achieve data-driven personalized medicine (Lee et al., 2017; Goldstein et al., 2017; Esteva et al., 2019). However, efficient modeling of this type of data remains challenging. Indeed, it consists of sporadically observed longitudinal data, with the extra hurdle that there is no standard way to align patient trajectories (e.g., at hospital admission, patients might be in very different states of progression of their condition). Those difficulties make EHR analysis well suited for GRU-ODE-Bayes.\nWe use the publicly available MIMIC-III clinical database (Johnson et al., 2016), which contains EHR for more than 60,000 critical care patients. We select a subset of 21,250 patients with sufficient observations and extract 96 different longitudinal real-valued measurements over a period of 48 hours after patient admission.
We refer the reader to Appendix K for further details on the cohort selection. We focus on the prediction of the next 3 measurements after a 36-hour observation window.\n\n5.3 Climate forecast\n\nFrom short-term weather forecasts to long-range predictions or assessment of systemic changes, such as global warming, climatic data has always been a popular application for time-series analysis. This data is often considered to be regularly sampled over long periods of time, which facilitates its statistical analysis. Yet, this assumption does not usually hold in practice. Missing data are a problem that is repeatedly encountered in climate research because of, among others, measurement errors, sensor failure, or faulty data acquisition. The actual data is then sporadic, and researchers usually resort to imputation before statistical analysis (Junninen et al., 2004; Schneider, 2001).\nWe use the publicly available United States Historical Climatology Network (USHCN) daily data set (Menne et al.), which contains measurements of 5 climate variables (daily temperatures, precipitation, and snow) over 150 years for 1,218 meteorological stations scattered over the United States. We selected a subset of 1,114 stations and an observation window of 4 years (between 1996 and 2000). To make the time series sporadic, we subsample the data such that each station has an average of around 60 observations over those 4 years. Appendix L contains additional details regarding this procedure. The task is then to predict the next 3 measurements after the first 3 years of observation.\n\n5.4 Results\n\nWe report the performance using 5-fold cross-validation. Hyperparameters (dropout and weight decay) are chosen using an inner holdout validation set (20%) and performance is assessed on a left-out test set (10%). Those folds are reused for each model we evaluated for the sake of reproducibility and fair comparison (more details in Appendix O).
Performance metrics for both tasks (NegLL and MSE) are reported in Table 1. GRU-ODE-Bayes handles the sporadic data more naturally and can more finely model the dynamics and correlations between the observed features, which results in higher performance: it unequivocally outperforms all other methods on both data sets.\n\nTable 1: Forecasting results.\n\nMODEL               | USHCN-DAILY MSE | USHCN-DAILY NEGLL | MIMIC-III MSE | MIMIC-III NEGLL\nNEURALODE-VAE       | 0.96 ± 0.11     | 1.46 ± 0.10       | 0.89 ± 0.01   | 1.35 ± 0.01\nNEURALODE-VAE-MASK  | 0.83 ± 0.10     | 1.36 ± 0.05       | 0.89 ± 0.01   | 1.36 ± 0.01\nSEQUENTIAL VAE      | 0.83 ± 0.07     | 1.37 ± 0.06       | 0.92 ± 0.09   | 1.39 ± 0.07\nGRU-SIMPLE          | 0.75 ± 0.12     | 1.23 ± 0.10       | 0.82 ± 0.05   | 1.21 ± 0.04\nGRU-D               | 0.53 ± 0.06     | 0.99 ± 0.07       | 0.79 ± 0.06   | 1.16 ± 0.05\nT-LSTM              | 0.59 ± 0.11     | 1.67 ± 0.50       | 0.62 ± 0.05   | 1.02 ± 0.02\nGRU-ODE-BAYES       | 0.43 ± 0.07     | 0.84 ± 0.11       | 0.48 ± 0.01   | 0.83 ± 0.04\n\n5.5 Impact of continuity prior\n\nTo illustrate the capabilities of the derived GRU-ODE cell presented in Section 2.1, we consider the case of time-series forecasting with low sample size. In the realm of EHR prediction, this could be framed as a rare-disease setup, where data is available for few patients only. In this scarce-sample setting, the continuity prior embedded in GRU-ODE is crucial, as it provides important prior information about the underlying process.\nTo highlight the importance of the GRU-ODE cell, we compare two versions of our model: the classical GRU-ODE-Bayes and one where the GRU-ODE cell is replaced by a discretized autonomous GRU. We call the latter GRU-Discretized-Bayes. Table 2 shows the results for MIMIC-III with a varying number of patients in the training set. 
While our discretized version matches the continuous one on the full data set, the GRU-ODE cell achieves higher accuracy when the number of samples is low, highlighting the importance of the continuity prior. Log-likelihood results are given in Appendix M.\n\nTable 2: Comparison between GRU-ODE and the discretized version in the small-sample regime (MSE).\n\nMODEL                 | 1,000 PATIENTS | 2,000 PATIENTS | FULL\nNEURALODE-VAE-MASK    | 0.94 ± 0.01    | 0.94 ± 0.01    | 0.89 ± 0.01\nGRU-DISCRETIZED-BAYES | 0.87 ± 0.02    | 0.77 ± 0.02    | 0.46 ± 0.05\nGRU-ODE-BAYES         | 0.77 ± 0.01    | 0.72 ± 0.01    | 0.48† ± 0.01\n\n6 Conclusion and future work\n\nWe proposed a model combining two novel techniques, GRU-ODE and GRU-Bayes, which allows feeding sporadic observations into a continuous ODE dynamics describing the evolution of the probability distribution of the data. Additionally, we showed that this filtering approach enjoys attractive representation capabilities. Finally, we demonstrated the value of GRU-ODE-Bayes on both synthetic and real-world data. Moreover, while a discretized version of our model performed well on the full MIMIC-III data set, the continuity prior of our ODE formulation proves particularly important in the small-sample regime, which is especially relevant for real-world clinical data, where many data sets remain relatively modest in size.\nIn this work, we focused on time-series data with Gaussian observations. However, GRU-ODE-Bayes can also be extended to binomial and multinomial observations, since the respective NegLL and KL-divergence are analytically tractable. This allows the modeling of sporadic observations of both discrete and continuous variables.\n\nAcknowledgements\n\nEdward De Brouwer is funded by an FWO-SB grant. 
Yves Moreau is funded by (1) Research Council KU Leuven: C14/18/092 SymBioSys3, CELSA-HIDUCTION; (2) Innovative Medicines Initiative: MELLODY; (3) Flemish Government (ELIXIR Belgium, IWT, FWO 06260); and (4) Impulsfonds AI: VR 2019 2203 DOC.0318/1QUATER Kenniscentrum Data en Maatschappij. Computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government, department EWI. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

†Statistically not different from best (p-value > 0.6 with paired t-test).

References

Baytas, I. M., Xiao, C., Zhang, X., Wang, F., Jain, A. K., and Zhou, J. Patient subtyping via time-aware LSTM networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 65–74. ACM, 2017.

Bonilla, E. V., Chai, K. M., and Williams, C. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pp. 153–160, 2008.

Cao, W., Wang, D., Li, J., Zhou, H., Li, L., and Li, Y. BRITS: Bidirectional recurrent imputation for time series. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6776–6786. Curran Associates, Inc., 2018.

Chang, B., Chen, M., Haber, E., and Chi, E. H. AntisymmetricRNN: A dynamical system view on recurrent neural networks. arXiv preprint arXiv:1902.09689, 2019.

Che, Z., Purushotham, S., Cho, K., Sontag, D., and Liu, Y. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018.

Cheng, L.-F., Darnell, G., Chivers, C., Draugelis, M. E., Li, K., and Engelhardt, B. E. Sparse multi-output Gaussian processes for medical time series prediction. arXiv preprint arXiv:1703.09112, 2017.

Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.

Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F., and Sun, J. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pp. 301–318, 2016a.

Choi, E., Bahadori, M. T., Sun, J., Kulas, J., Schuetz, A., and Stewart, W. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pp. 3504–3512, 2016b.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

De Brouwer, E., Simm, J., Arany, A., and Moreau, Y. Deep ensemble tensor factorization for longitudinal patient trajectories classification. arXiv preprint arXiv:1811.10501, 2018.

Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., and Song, L. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1555–1564. ACM, 2016.

Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., Cui, C., Corrado, G., Thrun, S., and Dean, J. A guide to deep learning in healthcare. Nature Medicine, 25(1):24, 2019.

Futoma, J., Hariharan, S., and Heller, K. Learning to detect sepsis with a multitask Gaussian process RNN classifier. arXiv preprint arXiv:1706.04152, 2017.

Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S., and Teh, Y. W. Neural processes. arXiv preprint arXiv:1807.01622, 2018.

Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

Ghassemi, M., Pimentel, M. A., Naumann, T., Brennan, T., Clifton, D. A., Szolovits, P., and Feng, M. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data. In AAAI, pp. 446–453, 2015.

Goldstein, B. A., Navar, A. M., Pencina, M. J., and Ioannidis, J. Opportunities and challenges in developing risk prediction models with electronic health records data: A systematic review. Journal of the American Medical Informatics Association, 24(1):198–208, 2017.

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

Jensen, A. B., Moseley, P. L., Oprea, T. I., Ellesøe, S. G., Eriksson, R., Schmock, H., Jensen, P. B., Jensen, L. J., and Brunak, S. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications, 5:4022, 2014.

Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., and Kolehmainen, M. Methods for imputation of missing values in air quality data sets. Atmospheric Environment, 38(18):2895–2907, 2004.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Krishnan, R. G., Shalit, U., and Sontag, D. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.

Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Lee, C., Luo, Z., Ngiam, K. Y., Zhang, M., Zheng, K., Chen, G., Ooi, B. C., and Yip, W. L. J. Big healthcare data analytics: Challenges and applications. In Handbook of Large-Scale Distributed Computing in Smart Healthcare, pp. 11–41. Springer, 2017.

Lipton, Z. C., Kale, D., and Wetzel, R. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Machine Learning for Healthcare Conference, pp. 253–270, 2016.

Mei, H. and Eisner, J. M. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pp. 6754–6764, 2017.

Menne, M., Williams Jr, C., and Vose, R. Long-term daily climate records from stations across the contiguous United States.

Mitchell, T. M. Machine learning and data mining. Communications of the ACM, 42(11):30–36, 1999.

Moor, M., Horn, M., Rieck, B., Roqueiro, D., and Borgwardt, K. Temporal convolutional networks and dynamic time warping can drastically improve the early prediction of sepsis. arXiv preprint arXiv:1902.01659, 2019.

Nodelman, U., Shelton, C. R., and Koller, D. Continuous time Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 378–387. Morgan Kaufmann Publishers Inc., 2002.

Prigogine, I. From Being to Becoming. Freeman, 1982.

Quiñonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Rajkomar, A., Oren, E., Chen, K., Dai, A. M., Hajaj, N., Hardt, M., Liu, P. J., Liu, X., Marcus, J., Sun, M., et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1):18, 2018.

Razavian, N. and Sontag, D. Temporal convolutional neural networks for diagnosis from lab tests. arXiv preprint arXiv:1511.07938, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.

Risken, H. Fokker-Planck equation. In The Fokker-Planck Equation, pp. 63–95. Springer, 1996.

Rubanova, Y., Chen, R. T., and Duvenaud, D. Latent ODEs for irregularly-sampled time series. arXiv preprint arXiv:1907.03907, 2019.

Särkkä, S. and Solin, A. Applied Stochastic Differential Equations, volume 10. Cambridge University Press, 2019.

Scargle, J. D. Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal, 263:835–853, 1982.

Schneider, T. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853–871, 2001.

Simm, J., Arany, A., Zakeri, P., Haber, T., Wegner, J. K., Chupakhin, V., Ceulemans, H., and Moreau, Y. Macau: Scalable Bayesian factorization with high-dimensional side information using MCMC. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE, 2017.

Soleimani, H., Hensman, J., and Saria, S. Scalable joint models for reliable uncertainty-aware event prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8):1948–1963, 2018.

Wang, J., Hertzmann, A., and Fleet, D. J. Gaussian process dynamical models. In Advances in Neural Information Processing Systems, pp. 1441–1448, 2006.

Zhou, G.-B., Wu, J., Zhang, C.-L., and Zhou, Z.-H. Minimal gated unit for recurrent neural networks. International Journal of Automation and Computing, 13(3):226–234, 2016.