{"title": "BRITS: Bidirectional Recurrent Imputation for Time Series", "book": "Advances in Neural Information Processing Systems", "page_first": 6775, "page_last": 6785, "abstract": "Time series are widely used as signals in many classification/regression tasks, and missing values in time series are ubiquitous. Given multiple correlated time series, how can we fill in the missing values and predict their class labels? Existing imputation methods often impose strong assumptions on the underlying data-generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. BRITS has three advantages: (a) it handles multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure that applies to general settings with missing data. We evaluate our model on three real-world datasets: an air quality dataset, a health-care dataset, and a human-activity localization dataset. Experiments show that our model outperforms state-of-the-art methods in both imputation and classification/regression accuracy.", "full_text": "BRITS: Bidirectional Recurrent Imputation for Time Series

Wei Cao* (Tsinghua University; Bytedance AI Lab) cao-13@tsinghua.org.cn
Hao Zhou (Bytedance AI Lab) haozhou0806@gmail.com
Dong Wang (Duke University) dong.wang363@duke.edu
Yitan Li (Bytedance AI Lab) liyitan@bytedance.com
Jian Li (Tsinghua University) lijian83@mail.tsinghua.edu.cn
Lei Li (Bytedance AI Lab) lileilab@bytedance.com

Abstract

Time
series are ubiquitous in many classification/regression applications. However, time series data in real applications may contain many missing values. Hence, given multiple (possibly correlated) time series, it is important to fill in the missing values and, at the same time, to predict their class labels. Existing imputation methods often impose strong assumptions on the underlying data-generating process, such as linear dynamics in the state space. In this paper, we propose a novel method, called BRITS, based on recurrent neural networks for missing value imputation in time series data. Our method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. BRITS has three advantages: (a) it handles multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure that applies to general settings with missing data. We evaluate our model on three real-world datasets: an air quality dataset, a health-care dataset, and a human-activity localization dataset. Experiments show that our model outperforms state-of-the-art methods in both imputation and classification/regression.

1 Introduction

Multivariate time series data are abundant in many application areas, such as financial marketing [5, 4], health-care [10, 22], meteorology [31, 26], and traffic engineering [29, 35]. In these areas, time series are widely used as signals for classification and regression.
However, missing values in time series are very common, due to unexpected accidents such as equipment damage or communication errors, and may significantly harm the performance of downstream applications.

Much prior work proposed to fix the missing data problem with statistical and machine learning approaches. However, most of them require fairly strong assumptions on the missing values. We can fill the missing values using classical statistical time series models such as ARMA or ARIMA (e.g., [1]), but these models are essentially linear (after differencing). Kreindler et al. [19] assume that the data are smoothable, i.e., there is no sudden wave in the periods of missing values; hence imputing missing values can be done by smoothing over nearby values.

*Work done while Wei Cao was a research intern at Bytedance AI Lab.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Matrix completion has also been used to address missing value problems (e.g., [30, 34]), but it typically applies only to static data and requires strong assumptions such as low-rankness. We can also predict missing values by fitting a parametric data-generating model to the observations [14, 2], which assumes that the time series data follow the distribution of the hypothetical model. These assumptions make the corresponding imputation algorithms less general, and their performance less desirable when the assumptions do not hold.

In this paper, we propose BRITS, a novel method for filling in the missing values of multiple correlated time series. Internally, BRITS adapts recurrent neural networks (RNN) [16, 11] for imputing missing values, without any specific assumption over the data. Much prior work uses nonlinear dynamical systems for time series prediction [9, 24, 3]. In our method, we instantiate the dynamical system as a bidirectional RNN, i.e., we impute missing values with bidirectional recurrent dynamics.
In particular, we make the following technical contributions:

• We design a bidirectional RNN model for imputing missing values. We directly use the RNN to predict missing values, instead of tuning weights for smoothing as in [10]. Our method does not impose specific assumptions, hence it works more generally than previous methods.
• We regard missing values as variables of the bidirectional RNN graph, which are involved in the backpropagation process. In this case, missing values get delayed gradients in both the forward and backward directions with consistency constraints, which makes the estimation of missing values more accurate.
• We perform missing value imputation and classification/regression jointly in one neural graph. This alleviates the error propagation from imputation to classification/regression and makes the classification/regression more accurate.
• We evaluate our model on three real-world datasets, including an air quality dataset, a health-care dataset and a localization dataset of human activities. Experimental results show that our model outperforms the state-of-the-art models in both imputation and classification/regression accuracy.

2 Related Work

There is a large body of literature on the imputation of missing values in time series; we only mention a few closely related works. Interpolation methods try to fit a "smooth curve" to the observations and thus reconstruct the missing values by local interpolation [19, 14]. Such methods discard any relationships between the variables over time. Autoregressive methods, including ARIMA, SARIMA, etc., eliminate the non-stationary parts of the time series data and fit a parameterized stationary model. The state space model further combines ARIMA with the Kalman filter [14, 15], which provides more accurate results.
Multivariate Imputation by Chained Equations (MICE) [2] first initializes the missing values arbitrarily and then estimates each missing variable based on the chained equations. The graphical model [20] introduces a latent variable for each missing value, and finds the latent variables by learning their transition matrix. There are also data-driven methods for missing value imputation. Yi et al. [32] imputed the missing values in air quality data with geographical features. Wang et al. [30] imputed missing values in recommender systems with collaborative filtering. Yu et al. [34] utilized matrix factorization with temporal regularization to impute the missing values in regularly sampled time series data.

Recently, some researchers have attempted to impute missing values with recurrent neural networks [7, 10, 21, 12, 33]. The recurrent components are trained together with the classification/regression component, which significantly boosts accuracy. Che et al. [10] proposed GRU-D, which imputes missing values in health-care data in a smooth fashion. It assumes that a missing variable can be represented as the combination of its last observed value and the global mean. GRU-D achieves remarkable performance on health-care data; however, it has many limitations on general datasets [10]. A closely related work is M-RNN, proposed by Yoon et al. [33]. M-RNN also utilizes a bidirectional RNN to impute missing values. Differing from our model, it drops the relationships between missing variables: the imputed values in M-RNN are treated as constants and cannot be sufficiently updated.

3 Preliminary

We first present the problem formulation and some necessary preliminaries.

Definition 1 (Multivariate Time Series) We denote a multivariate time series X = {x_1, x_2, . . . , x_T} as a sequence of T observations.
The t-th observation x_t ∈ R^D consists of D features {x_t^1, x_t^2, . . . , x_t^D}, and was observed at timestamp s_t (the time gaps between different timestamps may not be the same). In reality, due to unexpected accidents such as equipment damage or communication errors, x_t may have missing values (e.g., in Fig. 1, x_1^3 in x_1 is missing). To represent the missing values in x_t, we introduce a masking vector m_t where

m_t^d = 0 if x_t^d is not observed, and m_t^d = 1 otherwise.

In many cases, some features can be missing for consecutive timestamps (e.g., the blue blocks in Fig. 1). We define δ_t^d as the time gap from the last observation of the d-th feature to the current timestamp s_t, i.e.,

δ_t^d = s_t − s_{t−1} + δ_{t−1}^d,   if t > 1 and m_{t−1}^d = 0;
δ_t^d = s_t − s_{t−1},               if t > 1 and m_{t−1}^d = 1;
δ_t^d = 0,                           if t = 1.

See Fig. 1 for an illustration.

Figure 1: An example of a multivariate time series with missing values. x_1 to x_6 are observed at s_{1...6} = 0, 2, 7, 9, 14, 15, respectively. Considering the 2nd feature in x_6, the last observation of the 2nd feature took place at s_2 = 2, so δ_6^2 = s_6 − s_2 = 13.

In this paper, we study a general setting for time series classification/regression problems with missing values. We use y to denote the label of the corresponding classification/regression task; in general, y can be either a scalar or a vector. Our goal is to predict y based on the given time series X, and in the meantime to impute the missing values in X as accurately as possible. In other words, we aim to design an effective multi-task learning algorithm for both classification/regression and imputation.

4 BRITS

In this section, we describe BRITS.
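As a concrete aside, the masking vector m_t and time gap δ_t defined in Section 3 can be computed with a short sketch. The helper name and the array layout (one row per timestamp, np.nan marking missing entries) are our own assumptions, not code from the paper:

```python
import numpy as np

def mask_and_delta(x, s):
    """Compute masking vectors m and time gaps delta for a multivariate
    series x (T x D array, np.nan = missing) observed at timestamps s
    (length T), following the definitions of m_t^d and delta_t^d."""
    T, D = x.shape
    m = (~np.isnan(x)).astype(float)      # m_t^d = 1 iff x_t^d is observed
    delta = np.zeros((T, D))
    for t in range(1, T):
        gap = s[t] - s[t - 1]
        # if the previous value was missing, accumulate the gap since the
        # last observation; otherwise restart the gap from s_{t-1}
        delta[t] = np.where(m[t - 1] == 0, gap + delta[t - 1], gap)
    return m, delta
```

On the Fig. 1 example (timestamps 0, 2, 7, 9, 14, 15, with a feature missing at steps 3 to 5), this recurrence yields δ_6 = 13 for that feature, matching the figure.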
Differing from prior work which uses an RNN to impute missing values in a smooth fashion [10], we learn the missing values directly in a recurrent dynamical system [25, 28] based on the observed data. The missing values are thus imputed according to the recurrent dynamics, which significantly boosts both the imputation accuracy and the final classification/regression accuracy. We start the description with the simplest case: the variables observed at the same time are mutually uncorrelated. For this case, we propose algorithms for imputation with unidirectional recurrent dynamics and with bidirectional recurrent dynamics, respectively. We further propose an algorithm for correlated multivariate time series in Section 4.3.

4.1 Unidirectional Uncorrelated Recurrent Imputation

For the simplest case, we assume that at the t-th step, x_t^i and x_t^j are uncorrelated if i ≠ j (but x_t^i may be correlated with some x_{t'}^j, t' ≠ t). We first propose an imputation algorithm based on unidirectional recurrent dynamics, denoted RITS-I.

In a unidirectional recurrent dynamical system, each value in the time series can be derived from its predecessors by a fixed arbitrary function [9, 24, 3]. Thus, we iteratively impute all the variables in

Figure 2: Imputation with unidirectional dynamics.

the time series according to the recurrent dynamics. For the t-th step, if x_t is actually observed, we use it to validate our imputation and pass x_t to the next recurrent steps. Otherwise, since the future observations are correlated with the current value, we replace x_t with the obtained imputation and validate it by the future observations. To be more concrete, let us consider an example.

Example 1 Suppose we are given a time series X = {x_1, x_2, . . . , x_10}, where x_5, x_6 and x_7 are missing².
According to the recurrent dynamics, at each time step t, we can obtain an estimation x̂_t based on the previous t − 1 steps. In the first 4 steps, the estimation error can be obtained immediately by calculating the estimation loss function L_e(x̂_t, x_t) for t = 1, . . . , 4. However, when t = 5, 6, 7, we cannot get the error immediately since the true values are missing. Nevertheless, note that at the 8-th step, x̂_8 depends on x̂_5 to x̂_7. We thus obtain a "delayed" error for x̂_{t=5,6,7} at the 8-th step.

4.1.1 Algorithm

We introduce a recurrent component and a regression component for imputation. The recurrent component is realized by a recurrent neural network and the regression component by a fully-connected network. A standard recurrent network [17] can be represented as

h_t = σ(W_h h_{t−1} + U_h x_t + b_h),

where σ is the sigmoid function, W_h, U_h and b_h are parameters, and h_t is the hidden state of the previous time steps.

In our case, since x_t may have missing values, we cannot use x_t directly as the input, as in the above equation. Instead, we use a "complement" input x_t^c derived by our algorithm when x_t has missing values. Formally, we initialize the hidden state h_0 as an all-zero vector and then update the model by:

x̂_t = W_x h_{t−1} + b_x,                                  (1)
x_t^c = m_t ⊙ x_t + (1 − m_t) ⊙ x̂_t,                      (2)
γ_t = exp{−max(0, W_γ δ_t + b_γ)},                         (3)
h_t = σ(W_h [h_{t−1} ⊙ γ_t] + U_h [x_t^c ∘ m_t] + b_h),    (4)
ℓ_t = ⟨m_t, L_e(x_t, x̂_t)⟩.                                (5)

Eq. (1) is the regression component, which transfers the hidden state h_{t−1} to the estimated vector x̂_t. In Eq. (2), we replace the missing values in x_t with the corresponding values in x̂_t, obtaining the complement vector x_t^c. Besides, since the time series may be irregularly sampled, in Eq. (3) we further introduce a temporal decay factor γ_t.
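The update equations (1)-(5) above can be sketched in NumPy. The parameter names in the dictionary P, the plain-array shapes, and the use of a vanilla sigmoid cell instead of the paper's LSTM are our own simplifying assumptions:

```python
import numpy as np

def rits_step(x_t, m_t, delta_t, h_prev, P):
    """One RITS-I update step (Eqs. (1)-(5)) as a NumPy sketch.
    P = {Wx, bx, Wg, bg, Wh, Uh, bh} holds the parameters; ⊙ is
    elementwise product, ∘ is concatenation, L_e is absolute error."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    x_hat = P["Wx"] @ h_prev + P["bx"]                               # Eq. (1)
    x_c = m_t * np.nan_to_num(x_t) + (1 - m_t) * x_hat               # Eq. (2)
    gamma = np.exp(-np.maximum(0.0, P["Wg"] @ delta_t + P["bg"]))    # Eq. (3)
    inp = np.concatenate([x_c, m_t])                                 # x_c ∘ m_t
    h_t = sigmoid(P["Wh"] @ (h_prev * gamma) + P["Uh"] @ inp + P["bh"])  # Eq. (4)
    loss_t = np.dot(m_t, np.abs(np.nan_to_num(x_t) - x_hat))         # Eq. (5)
    return x_hat, x_c, h_t, loss_t
```

Note that the mask m_t in Eq. (5) ensures missing entries contribute nothing to the immediate loss; their errors arrive later as the "delayed" errors of Example 1.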
Such a factor represents the missing patterns in the time series, which are critical for imputation [10]. In Eq. (4), based on the decayed hidden state, we predict the next state h_t. Here, ∘ indicates the concatenation operation. In the meantime, we calculate the estimation error by the estimation loss function L_e in Eq. (5); in our experiments, we use the mean absolute error for L_e. Finally, we predict the task label y as

ŷ = f_out(Σ_{i=1}^T α_i h_i),

where f_out can be either a fully-connected layer or a softmax layer depending on the specific task, and α_i is the weight for the i-th hidden state, which can be derived by an attention mechanism or by mean pooling³, i.e., α_i = 1/T. We calculate the output loss by L_out(y, ŷ). Our model is then updated by minimizing the accumulated loss (1/T) Σ_{t=1}^T ℓ_t + L_out(y, ŷ).

²Without loss of generality, we assume all D features are missing at those steps for the sake of clarity.

4.1.2 Practical Issues

In practice, we use an LSTM as the recurrent component in Eq. (4) to prevent the gradient vanishing/exploding problems of vanilla RNNs [17]. Standard RNN models, including LSTM, treat x̂_t as a constant: during backpropagation, gradients are cut at x̂_t, so the estimation errors backpropagate insufficiently. For example, in Example 1, the estimation errors of x̂_5 to x̂_7 are obtained at the 8-th step as delayed errors; if we treat x̂_5 to x̂_7 as constants, these delayed errors cannot be fully backpropagated. To overcome this issue, we treat x̂_t as a variable of the RNN graph and let the estimation error pass through x̂_t during backpropagation. Fig. 2 shows how RITS-I works in Example 1: the gradients are backpropagated in the opposite direction of the solid lines, so the delayed error ℓ_8 is passed to steps 5, 6 and 7.
In our experiments, we find that our models are unstable during training if we treat x̂_t as a constant; see the appendix for details.

4.2 Bidirectional Uncorrelated Recurrent Imputation

In RITS-I, the errors of estimated missing values are delayed until the next observation appears. For example, in Example 1, the error of x̂_5 is delayed until x_8 is observed. Such error delay makes the model converge slowly and in turn leads to inefficient training. It also leads to the bias exploding problem [6], i.e., mistakes made early in sequential prediction are fed back as input to the model and may be quickly amplified. In this section, we propose an improved version called BRITS-I. The algorithm alleviates the above issues by utilizing bidirectional recurrent dynamics on the given time series, i.e., besides the forward direction, each value in the time series can also be derived from the backward direction by another fixed arbitrary function.

To illustrate the intuition behind BRITS-I, again consider Example 1, this time in the backward direction of the time series. In bidirectional recurrent dynamics, the estimation x̂_4 reversely depends on x̂_5 to x̂_7. Thus, the error at the 5-th step is back-propagated not only from the 8-th step in the forward direction (which is far from the current position), but also from the 4-th step in the backward direction (which is closer). Formally, BRITS-I performs RITS-I, as shown in Eq. (1) to Eq. (5), in the forward and backward directions, respectively. In the forward direction, we obtain the estimation sequence {x̂_1, x̂_2, . . . , x̂_T} and the loss sequence {ℓ_1, ℓ_2, . . . , ℓ_T}. Similarly, in the backward direction, we obtain another estimation sequence {x̂'_1, x̂'_2, . . . , x̂'_T} and another loss sequence {ℓ'_1, ℓ'_2, . . . , ℓ'_T}.
We enforce the predictions at each step to be consistent in both directions by introducing a "consistency loss":

ℓ_t^cons = Discrepancy(x̂_t, x̂'_t),          (6)

where we also use the mean absolute error as the discrepancy in our experiments. The final estimation loss is obtained by accumulating the forward loss ℓ_t, the backward loss ℓ'_t, and the consistency loss ℓ_t^cons. The final estimation at the t-th step is the mean of x̂_t and x̂'_t.

4.3 Correlated Recurrent Imputation

The previously proposed algorithms RITS-I and BRITS-I assume that features observed at the same time are mutually uncorrelated. This may not be true in some scenarios. For example, in the air quality data [32], each feature represents the measurement of one monitoring station. Obviously, the observed measurements are spatially correlated: in general, the measurement of one monitoring station is close to those observed at its neighboring stations. In this case, we can estimate a missing measurement from both its historical data and the measurements of its neighbors.

In this section, we propose an algorithm that utilizes the feature correlations in the unidirectional recurrent dynamical system. We refer to this algorithm as RITS. The feature-correlated algorithm for the bidirectional case (i.e., BRITS) can be derived in the same way. Note that in Section 4.1, the estimation x̂_t is only correlated with the historical observations, but is irrelevant to the current observation; we refer to x̂_t as a "history-based estimation". In this section, we derive another, "feature-based" estimation for each x_t^d, based on the other features at time s_t.

³In this paper, we simply adopt mean pooling. The design of attention mechanisms is out of this paper's scope.
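The consistency mechanism of BRITS-I (Eq. (6)) and the final averaged estimate can be sketched as follows; the function name and the use of whole-sequence arrays are our own assumptions:

```python
import numpy as np

def consistency_loss_and_estimate(x_fwd, x_bwd):
    """Given the forward estimates x_fwd and backward estimates x_bwd
    (both T x D arrays), compute the mean absolute discrepancy used as
    the consistency loss (Eq. (6)) and the final per-step estimate
    (the mean of the two directions)."""
    loss_cons = np.mean(np.abs(x_fwd - x_bwd))  # MAE discrepancy
    x_final = 0.5 * (x_fwd + x_bwd)             # final estimation
    return loss_cons, x_final
```

In training, this consistency term is simply added to the forward and backward estimation losses.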
Specifically, at the t-th step, we first obtain the complement observation x_t^c by Eq. (1) and Eq. (2). Then we define our feature-based estimation ẑ_t as

ẑ_t = W_z x_t^c + b_z,                        (7)

where W_z and b_z are the corresponding parameters. We restrict the diagonal of the parameter matrix W_z to be all zeros; thus, the d-th element of ẑ_t is exactly the estimation of x_t^d based on the other features. We further combine the history-based estimation x̂_t and the feature-based estimation ẑ_t into a vector ĉ_t:

β_t = σ(W_β [γ_t ∘ m_t] + b_β),               (8)
ĉ_t = β_t ⊙ ẑ_t + (1 − β_t) ⊙ x̂_t.           (9)

Here β_t ∈ [0, 1]^D is the weight for combining the history-based estimation x̂_t and the feature-based estimation ẑ_t. Note that ẑ_t is derived from x_t^c by Eq. (7), and the elements of x_t^c can be history-based estimations or truly observed values, depending on the presence of the observations. Thus, we learn the weight β_t by considering both the temporal decay γ_t and the masking vector m_t, as shown in Eq. (8). The rest is similar to the feature-uncorrelated case. We first replace the missing values in x_t with the corresponding values in ĉ_t; the obtained vector is then fed to the next recurrent step to predict the memory h_t:

c_t^c = m_t ⊙ x_t + (1 − m_t) ⊙ ĉ_t,                        (10)
h_t = σ(W_h [h_{t−1} ⊙ γ_t] + U_h [c_t^c ∘ m_t] + b_h).     (11)

However, the estimation loss is slightly different from the feature-uncorrelated case. We find that simply using ℓ_t = L_e(x_t, ĉ_t) leads to a very slow convergence speed.
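Equations (7)-(9) above can be sketched as follows. The parameter names (Wz, bz, Wb, bb) and plain-array interface are our own assumptions; the key detail is the zero diagonal of W_z, which forces each feature's estimate to depend only on the other features:

```python
import numpy as np

def feature_based_combine(x_c, x_hat, gamma, m, Wz, bz, Wb, bb):
    """Compute the feature-based estimate z_hat (Eq. (7)), the combining
    weight beta (Eq. (8)) and the combined estimate c_hat (Eq. (9))."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    Wz = Wz - np.diag(np.diag(Wz))           # enforce zero diagonal
    z_hat = Wz @ x_c + bz                    # Eq. (7): only other features
    beta = sigmoid(Wb @ np.concatenate([gamma, m]) + bb)   # Eq. (8)
    c_hat = beta * z_hat + (1 - beta) * x_hat              # Eq. (9)
    return z_hat, beta, c_hat
```

Because beta is conditioned on both the decay γ_t and the mask m_t, the model can lean toward the feature-based estimate when the history is stale, and toward the history-based estimate otherwise.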
Instead, we accumulate the estimation errors of x̂_t, ẑ_t and ĉ_t:

ℓ_t = L_e(x_t, x̂_t) + L_e(x_t, ẑ_t) + L_e(x_t, ĉ_t).

5 Experiment

Our proposed methods are applicable to a wide variety of applications. We evaluate them on three different real-world datasets. The download links of the datasets, as well as the implementation code, can be found on the GitHub page⁴.

5.1 Dataset Description

5.1.1 Air Quality Data

We evaluate our models on the air quality dataset, which consists of PM2.5 measurements from 36 monitoring stations in Beijing. The measurements are collected hourly from 2014/05/01 to 2015/04/30. Overall, 13.3% of the values are missing. For this dataset, we perform a pure imputation task. We use exactly the same train/test setting as in prior work [32], i.e., the 3rd, 6th, 9th and 12th months are used as test data and the other months as training data; see the appendix for details. To train our model, we randomly select every 36 consecutive steps as one time series.

5.1.2 Health-care Data

We evaluate our models on the health-care data of PhysioNet Challenge 2012 [27], which consists of 4000 multivariate clinical time series from intensive care units (ICU). Each time series contains 35 measurements, such as Albumin, heart rate, etc., which are irregularly sampled during the first 48 hours after the patient's admission to the ICU. We stress that this dataset is extremely sparse: up to 78% of the values are missing in total. For this dataset, we perform both the imputation task and the classification task. To evaluate the imputation performance, we randomly eliminate 10% of the observed measurements from the data and use them as the ground truth. At the same time, we predict the in-hospital death of each patient as a binary classification task.
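The evaluation protocol of randomly eliminating 10% of the observed measurements can be sketched as follows; the helper name, RNG seed and interface are our own assumptions:

```python
import numpy as np

def hold_out_observed(x, frac=0.1, seed=0):
    """Randomly eliminate `frac` of the observed entries of x
    (T x D array, np.nan = missing) and return the corrupted series
    plus the indices of the held-out ground-truth entries."""
    rng = np.random.default_rng(seed)
    obs_idx = np.argwhere(~np.isnan(x))                 # observed positions
    k = int(round(frac * len(obs_idx)))
    held = obs_idx[rng.choice(len(obs_idx), size=k, replace=False)]
    x_corrupt = x.copy()
    x_corrupt[held[:, 0], held[:, 1]] = np.nan          # hide ground truth
    return x_corrupt, held
```

The model only ever sees x_corrupt; the held-out entries are used purely to score the imputations.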
Note that the eliminated measurements are used only for validating the imputation, and are never visible to the model.

⁴https://github.com/caow13/BRITS

5.1.3 Localization for Human Activity Data

The UCI localization data for human activity [18] contains records of five people performing different activities, such as walking, falling, sitting down, etc. (there are 11 activities in total). Each person wore four sensors on her/his left/right ankle, chest, and belt. Each sensor recorded 3-dimensional coordinates about every 20 to 40 milliseconds. We randomly select 40 consecutive steps as one time series, and there are 30,917 time series in total. For this dataset, we perform both the imputation and classification tasks. Similarly, we randomly eliminate 10% of the observed data as the imputation ground truth. We further predict the corresponding activity of each observed time series (i.e., walking, sitting, etc.).

5.2 Experiment Setting

5.2.1 Model Implementations

To make a fair comparison, we control the number of parameters of all models to be around 80,000. We train our models with the Adam optimizer, learning rate 0.001 and batch size 64. For all tasks, we normalize the numerical values to have zero mean and unit variance for stable training.

We use different early-stopping strategies for the pure imputation task and the classification tasks. For the imputation tasks, we randomly select 10% of the non-missing values as validation data, and perform early stopping based on the validation error. For the classification tasks, we first pre-train the model as a pure imputation task and report its imputation accuracy. Then we use 5-fold cross validation to further optimize both the imputation and classification losses simultaneously.

We evaluate the imputation performance in terms of mean absolute error (MAE) and mean relative error (MRE).
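These two metrics have a direct implementation (a minimal sketch; function names are our own):

```python
import numpy as np

def mae(pred, label):
    """Mean absolute error over the held-out items."""
    return np.mean(np.abs(pred - label))

def mre(pred, label):
    """Mean relative error: total absolute error divided by the
    total absolute magnitude of the ground truth."""
    return np.sum(np.abs(pred - label)) / np.sum(np.abs(label))
```

MRE normalizes by the overall magnitude of the ground truth, which makes scores comparable across features of different scales.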
Suppose that label_i is the ground truth of the i-th item, pred_i is the output for the i-th item, and there are N items in total. Then MAE and MRE are defined as

MAE = (Σ_i |pred_i − label_i|) / N,    MRE = (Σ_i |pred_i − label_i|) / (Σ_i |label_i|).

For the air quality data, the evaluation is performed on the original data. For the health-care data and the activity data, since the numerical values are not on the same scale, we evaluate the performance on the normalized data. To further evaluate the classification performance, we use the area under the ROC curve (AUC) [8] for the health-care data, since this dataset is highly unbalanced (10% of the patients died in hospital). We use standard accuracy for the activity data, since the different activities are relatively balanced.

5.2.2 Baseline Methods

We compare our model with both RNN-based and non-RNN-based methods. The non-RNN-based imputation methods include:

• Mean: We simply replace the missing values with the corresponding global mean.
• KNN: We use k-nearest neighbors [13] (with normalized Euclidean distance) to find similar samples, and impute the missing values with the weighted average of the neighbors.
• Matrix Factorization (MF): We factorize the data matrix into two low-rank matrices, and fill in the missing values by matrix completion [13].
• MICE: We use Multiple Imputation by Chained Equations (MICE), a widely used imputation method, which creates multiple imputations with chained equations [2].
• ImputeTS: We use the ImputeTS package in R, a widely used package for missing value imputation, which utilizes the state space model and Kalman smoothing [23].
• STMVL: Specifically for the air quality data, we use STMVL, the state-of-the-art method for air quality data imputation.
It further utilizes the geo-locations when imputing missing values [32].

Table 1: Performance Comparison for Imputation Tasks (in MAE (MRE%))

            Method     Air Quality        Health-care        Human Activity
Non-RNN     Mean       55.51 (77.97%)     0.461 (65.61%)     0.767 (96.43%)
            KNN        29.79 (41.85%)     0.367 (52.15%)     0.479 (58.54%)
            MF         27.94 (39.25%)     0.468 (67.97%)     0.879 (110.44%)
            MICE       27.42 (38.52%)     0.510 (72.5%)      0.477 (57.94%)
            ImputeTS   19.58 (27.51%)     0.390 (54.2%)      0.363 (45.65%)
            STMVL      12.12 (17.40%)     /                  /
RNN         GRU-D      /                  0.559 (77.58%)     0.558 (70.05%)
            M-RNN      14.05 (20.16%)     0.445 (61.87%)     0.248 (31.19%)
Ours        RITS-I     12.45 (17.93%)     0.385 (53.41%)     0.240 (30.10%)
            BRITS-I    11.58 (16.66%)     0.361 (50.01%)     0.220 (27.61%)
            RITS       12.19 (17.54%)     0.292 (40.82%)     0.248 (31.21%)
            BRITS      11.56 (16.65%)     0.278 (38.72%)     0.219 (27.59%)

We implement KNN, MF and MICE based on the Python package fancyimpute⁵. In recent studies, RNN-based models achieve remarkable performance in missing value imputation [10, 21, 12, 33]. We also compare our model with existing RNN-based imputation methods:

• GRU-D: GRU-D is proposed for handling missing values in health-care data. It imputes each missing value by the weighted combination of its last observation and the global mean, together with a recurrent component [10].
• M-RNN: M-RNN also uses a bidirectional RNN. It imputes the missing values according to the hidden states in both directions. M-RNN treats the imputed values as constants and does not consider the correlations among different missing values [33].

We compare the baseline methods with our four models: RITS-I (Section 4.1), BRITS-I (Section 4.2), RITS (Section 4.3) and BRITS (Section 4.3). We implement all the RNN-based models with PyTorch, a widely used package for deep learning. All models are trained on a GTX 1080 GPU.

5.3 Experiment Results

Table 1 shows the imputation results.
As we can see, simply applying naïve mean imputation is very inaccurate. KNN, MF, and MICE perform much better than mean imputation, but these methods show unstable performance across tasks; for example, the MF algorithm performs well on the health-care data but poorly on the human activity data. ImputeTS achieves the best performance among all the non-RNN methods, especially on the health-care data (which is smooth and contains few sudden waves). STMVL performs well on the air quality data; however, it is specifically designed for geographical data and cannot be applied to the other datasets. Most RNN-based methods, except GRU-D, demonstrate significantly better performance in the imputation tasks. We stress that GRU-D imputes missing values implicitly, and it actually performs very well in terms of classification accuracy. M-RNN uses an explicit imputation procedure and achieves remarkable imputation results. Our model BRITS outperforms all the baseline models. From the performance of our four models, we also find that both the bidirectional recurrent dynamics and the feature correlations help enhance the model performance.

We further compare the classification accuracies, as shown in Table 2. Similar to the imputation tasks, our model BRITS outperforms all the other RNN-based models in the classification tasks as well. Note that although GRU-D does not produce accurate imputations, it actually performs very well in the classification tasks: its AUC score is only slightly worse than that of our RITS model. To further examine the correlation between imputation accuracy and classification accuracy, we perform the health-care classification based on the values imputed by different models, using the classical random forest algorithm. The results are shown in Fig. 3.
Surprisingly, we \ufb01nd that the random forest\n\n5https://github.com/iskandr/fancyimpute\n\n8\n\n\factually works well with simple mean imputation. The AUC score based on the mean imputation\nis even better than that on GRU-D and M-RNN. We guess that since GRU-D and M-RNN does not\nfocus on the imputation accuracy, the imputed values might be harmful to downstream classi\ufb01cations\nwith the other models. Alternatively, our model BRITS uses a multi-task learning mechanism which\neffectively enhances the classi\ufb01cation accuracy.\n\nTable 2: Performance Comparison for Classi\ufb01cation Tasks\n\nMethod Health-care (AUC) Human Activity (Accuracy)\nGRU-D\nM-RNN\nRITS-I\nBRITS-I\n\n0.834 \u00b1 0.002\n0.817 \u00b1 0.003\n0.821 \u00b1 0.007\n0.831 \u00b1 0.003\n0.840 \u00b1 0.004\n0.850 \u00b1 0.002\n\n0.940 \u00b1 0.010\n0.938 \u00b1 0.010\n0.934 \u00b1 0.008\n0.940 \u00b1 0.012\n0.968 \u00b1 0.010\n0.969 \u00b1 0.008\n\nRITS\nBRITS\n\nFigure 3: Health-care Classi\ufb01cation Based on Different Imputations with Random Forest\n\n6 Conclusion\n\nIn this paper, we proposed BRITS, a novel method to use recurrent dynamics to effectively impute\nthe missing values in multivariate time series. Instead of imposing assumptions over the data-\ngenerating process, our model directly learns the missing values in a bidirectional recurrent dynamical\nsystem, without any speci\ufb01c assumption. Our model treats missing values as variables of the\nbidirectional RNN graph. Thus, we get the delayed gradients for missing values in both forward and\nbackward directions, which makes the imputation of missing values more accurate. 
We performed missing value imputation and classification/regression simultaneously within a joint neural network. Experiment results show that our model is more accurate than state-of-the-art methods in both imputation and classification/regression.

7 Acknowledgements

Wei Cao and Jian Li are supported in part by the National Basic Research Program of China Grant 2015CB358700, the National Natural Science Foundation of China Grants 61822203, 61772297, 61632016 and 61761146003, and a grant from Microsoft Research Asia.

References

[1] C. F. Ansley and R. Kohn. On the estimation of ARIMA models with missing values. In Time Series Analysis of Irregularly Observed Data, pages 9–37. Springer, 1984.

[2] M. J. Azur, E. A. Stuart, C. Frangakis, and P. J. Leaf. Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1):40–49, 2011.

[3] A. Basharat and M. Shah. Time series prediction by chaotic modeling of nonlinear dynamical systems. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1941–1948. IEEE, 2009.

[4] B. Batres-Estrada. Deep learning for multivariate financial time series, 2015.

[5] S. Bauer, B. Schölkopf, and J. Peters. The arrow of time in multivariate time series. In International Conference on Machine Learning, pages 2043–2051, 2016.

[6] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.

[7] M. Berglund, T. Raiko, M. Honkala, L. Kärkkäinen, A. Vetek, and J. T. Karhunen. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pages 856–864, 2015.

[8] A. P. Bradley. 
The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.

[9] P. Brakel, D. Stroobandt, and B. Schrauwen. Training energy-based models for time-series imputation. The Journal of Machine Learning Research, 14(1):2771–2797, 2013.

[10] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.

[11] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[12] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318, 2016.

[13] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.

[14] D. S. Fung. Methods for the estimation of missing values in time series. 2006.

[15] A. C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1990.

[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[17] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[18] B. Kaluža, V. Mirchevska, E. Dovgan, M. Luštrek, and M. Gams. An agent-based approach to care in independent living. In International Joint Conference on Ambient Intelligence, pages 177–186. Springer, 2010.

[19] D. M. Kreindler and C. J. Lumsden. The effects of the irregular sample and missing data in time series analysis. 
Nonlinear Dynamical Systems Analysis for the Behavioral Sciences Using Real Data, page 135, 2012.

[20] L. Li, J. McCann, N. S. Pollard, and C. Faloutsos. DynaMMo: Mining and summarization of coevolving sequences with missing values. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 507–516. ACM, 2009.

[21] Z. C. Lipton, D. Kale, and R. Wetzel. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Machine Learning for Healthcare Conference, pages 253–270, 2016.

[22] Z. Liu and M. Hauskrecht. Learning linear dynamical systems from multivariate time series: A matrix factorization based framework. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 810–818. SIAM, 2016.

[23] S. Moritz and T. Bartz-Beielstein. imputeTS: Time series missing value imputation in R. The R Journal, 9(1):207–218, 2017.

[24] T. Ozaki. Non-linear time series models and dynamical systems. Handbook of Statistics, 5:25–83, 1985.

[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.

[26] S. Rani and G. Sikka. Recent techniques of clustering of time series data: a survey. International Journal of Computer Applications, 52(15), 2012.

[27] I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. Mark. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In Computing in Cardiology (CinC), 2012, pages 245–248. IEEE, 2012.

[28] D. Sussillo and O. Barak. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25(3):626–649, 2013.

[29] D. Wang, W. Cao, J. Li, and J. Ye. 
DeepSD: Supply-demand prediction for online car-hailing services using deep neural networks. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, pages 243–254. IEEE, 2017.

[30] J. Wang, A. P. De Vries, and M. J. Reinders. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 501–508. ACM, 2006.

[31] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.

[32] X. Yi, Y. Zheng, J. Zhang, and T. Li. ST-MVL: Filling missing values in geo-sensory time series data. 2016.

[33] J. Yoon, W. R. Zame, and M. van der Schaar. Multi-directional recurrent neural networks: A novel method for estimating missing data. 2017.

[34] H.-F. Yu, N. Rao, and I. S. Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems, pages 847–855, 2016.

[35] J. Zhang, Y. Zheng, and D. Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. AAAI, November 2017.