{"title": "Multivariate Time Series Imputation with Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1596, "page_last": 1607, "abstract": "Multivariate time series usually contain a large number of missing values, which hinders the application of advanced analysis methods on multivariate time series data. Conventional approaches to addressing the challenge of missing values, including mean/zero imputation, case deletion, and matrix factorization-based imputation, are all incapable of modeling the temporal dependencies and the nature of complex distribution in multivariate time series. In this paper, we treat the problem of missing value imputation as data generation. Inspired by the success of Generative Adversarial Networks (GAN) in image generation, we propose to learn the overall distribution of a multivariate time series dataset with GAN, which is further used to generate the missing values for each sample. Different from the image data, the time series data are usually incomplete due to the nature of data recording process. A modified Gate Recurrent Unit is employed in GAN to model the temporal irregularity of the incomplete time series. Experiments on two multivariate time series datasets show that the proposed model outperformed the baselines in terms of accuracy of imputation. 
Experimental results also showed that a simple model on the imputed data can achieve state-of-the-art results on the prediction tasks, demonstrating the benefits of our model in downstream applications.", "full_text": "Multivariate Time Series Imputation with Generative Adversarial Networks\n\nYonghong Luo\nCollege of Computer Science, Nankai University, Tianjin, China\nluoyonghong@dbis.nankai.edu.cn\n\nXiangrui Cai\nCollege of Computer Science, Nankai University, Tianjin, China\ncaixiangrui@dbis.nankai.edu.cn\n\nYing Zhang *\nCollege of Computer Science, Nankai University, Tianjin, China\nyingzhang@nankai.edu.cn\n\nJun Xu\nSchool of Information, Renmin University of China, Beijing, China\njunxu@ruc.edu.cn\n\nXiaojie Yuan\nCollege of Computer Science, Nankai University, Tianjin, China\nyuanxj@nankai.edu.cn\n\nAbstract\n\nMultivariate time series usually contain a large number of missing values, which hinders the application of advanced analysis methods on multivariate time series data. Conventional approaches to addressing the challenge of missing values, including mean/zero imputation, case deletion, and matrix factorization-based imputation, are all incapable of modeling the temporal dependencies and the nature of complex distribution in multivariate time series. In this paper, we treat the problem of missing value imputation as data generation. Inspired by the success of Generative Adversarial Networks (GAN) in image generation, we propose to learn the overall distribution of a multivariate time series dataset with GAN, which is further used to generate the missing values for each sample. Different from the image data, the time series data are usually incomplete due to the nature of data recording process. A modified Gate Recurrent Unit is employed in GAN to model the temporal irregularity of the incomplete time series. 
Experiments on two multivariate time series datasets show that the proposed model outperformed the baselines in terms of accuracy of imputation. Experimental results also showed that a simple model on the imputed data can achieve state-of-the-art results on the prediction tasks, demonstrating the benefits of our model in downstream applications.\n\n1 Introduction\n\nThe real world is filled with multivariate time series data such as network records, medical logs and meteorological observations. Time series analysis is useful in many situations, such as forecasting stock prices [22] and indicating the fitness and diagnosis category of patients [7]. However, some of these time series are incomplete due to broken collection devices, collection errors and willful damage [15]. Besides, the time intervals between observations in a time series are not always fixed. Figure 1 and Figure 2 demonstrate the high missing rate of the Physionet [42] dataset. As time goes by, the maximum missing rate at each timestamp is always higher than 95%. We can also observe that most variables' missing rates are above 80% and the mean missing rate is 80.67%. The missing values in time series data make it hard to analyze and mine [14]. Therefore, the processing of missing values in time series has become a very important problem.\n\n* Corresponding author.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 2: The lines stand for the maximum, minimum and average missing rates at each hour. The global missing rate is 80.67%.\n\nFigure 1: Missing rates in the Physionet dataset. The X-axis is the time. The Y-axis is the selected 7 variables. The redder the color, the higher the missing rate.\n\nUsually there are two ways to handle the missing values of a dataset. Some studies try to directly model the dataset with missing values [48]. 
However, such models must be built separately for every dataset. The second way is to impute the missing values to get a complete dataset and then use conventional methods to analyze it. Existing missing value processing methods can be categorized into three classes. The first one is case deletion methods [26, 43], whose main idea is to discard the incomplete observations. However, these case deletion methods ignore potentially important information; additionally, the higher the missing rate, the worse the result [18]. The second kind of algorithms is simple imputation methods such as mean imputation, median imputation, and most-common-value imputation. The main drawback of these statistical imputation methods is that they do not exploit the temporal information. The last kind of methods is machine learning based imputation algorithms [4, 19, 34], including maximum likelihood Expectation-Maximization (EM) based imputation, KNN based imputation and Matrix Factorization based imputation. However, all of these methods rarely take into account the temporal relations between two observations.\n\nIn recent years, Goodfellow et al. [17] introduced generative adversarial networks (GAN), which learn the latent distribution of a dataset and are able to generate \u201creal\u201d samples from random \u201cnoise\u201d. GAN has been successfully applied to face completion and sentence generation [5, 30, 33, 31, 13, 47]. However, before completing faces or generating sentences, these methods require a complete training dataset, which is unavailable in our scenario. There also exist a few works that use GAN to impute missing values [46]. 
However, these works focus on non-sequential datasets and do not adopt pertinent measures to handle temporal relations. Hence, these algorithms cannot be applied well to the imputation of time series data.\n\nInspired by the success of GAN in image imputation, we take advantage of the adversarial model to generate and impute the original incomplete time series data. In order to learn the latent relationships between observations with non-fixed time lags, we propose a novel RNN cell called GRUI, which can take into account the non-fixed time lags and fade the influence of the past observations determined by the time lags. In the first phase, by adopting the GRUI in the discriminator and generator of the GAN, the well trained adversarial model can learn the distribution of the whole dataset, the implicit relationships between observations and the temporal information of the dataset. In the second phase, we train the input \u201cnoise\u201d of the generator of the GAN so that the generated time series is as close as possible to the original incomplete time series and the generated data's probability of being real is maximized. To the best of our knowledge, this is the first work that uses adversarial networks to impute time series datasets. We evaluate our method on a real-world medical dataset and a real-world meteorological dataset. The results show the superiority of our approach compared to the baselines in terms of imputation accuracy. Our model is also superior to the baselines in prediction and regression tasks using the imputed datasets.\n\n2 Method\n\nGiven a collection of multivariate time series with d dimensions, one time series X observed at timestamps T = (t_0, ..., t_{n-1}) is denoted by X = (x_{t_0}, ..., x_{t_i}, ..., x_{t_{n-1}})^T \u2208 R^{n\u00d7d}, where x_{t_i} is the i-th observation of X and x^j_{t_i} is the j-th variable of x_{t_i}. 
In the following example, d = 4, n = 3 and \u201cnone\u201d denotes a missing value:\n\nX = [ 1     6    none   9\n      7    none   7    none\n      9    none  none   79 ],   T = [ 0, 5, 13 ].\n\nThe time series X is incomplete, so we introduce the mask matrix M \u2208 R^{n\u00d7d} to indicate whether the values of X exist or not, i.e., M^j_{t_i} = 1 if x^j_{t_i} exists, otherwise M^j_{t_i} = 0.\n\nIn order to replace missing values in time series data with reasonable values, we first train a GAN based model to learn the distribution of the original time series dataset. In this custom GAN model, the generator, which generates fake time series from a random input vector, and the discriminator, which distinguishes between fake data and real data, will reach an equilibrium that not only increases the representative ability of the generator but also upgrades the discernment ability of the discriminator (see Figure 3). Next, we fix the network structure and optimize the input random vector of the generator so that the generated fake time series can best replace the missing values. In Subsection 2.1, we show the details of the GAN architecture. Subsection 2.2 demonstrates the method to impute the missing values.\n\nFigure 3: The structure of the proposed model.\n\n2.1 GAN Architecture\n\nA GAN is made up of a generator (G) and a discriminator (D). The G learns a mapping G(z) that tries to map a random noise vector z to a realistic time series. The D tries to learn a mapping D(.) that gives the probability that the input data is real. It is noteworthy that the input of the D contains both the real but incomplete samples and the fake but complete samples generated by G. Because of the mode collapse problem [3], the traditional GAN is hard to train [20, 32, 37]. 
WGAN [3] is an alternative way of training a GAN that uses the Wasserstein distance and is easier to train than the original GAN. WGAN can improve the stability of the learning stage, avoid the mode collapse problem and ease the optimization of the GAN model. In our method, we prefer WGAN [3] to the traditional GAN. The loss functions of WGAN are:\n\nL_G = E_{z~P_g}[-D(G(z))],   (1)\n\nL_D = E_{z~P_g}[D(G(z))] - E_{x~P_r}[D(x)].   (2)\n\nWhen we design the detailed structure of the GAN, we adopt the Gated Recurrent Unit (GRU) [10], a state-of-the-art RNN cell, as the basic network of G and D. It is worth noting that other RNN variants, such as the Long Short-Term Memory (LSTM) [21] cell, can also be used in this work. However, the time lags between two consecutive valid observations vary a lot due to data incompleteness, which makes the traditional GRU cell or LSTM cell not applicable to our scenario. In order to effectively handle the irregular time lags and to learn the implicit information from the time intervals, we propose the GRUI cell based on GRU.\n\nGRUI. When learning the distribution and characteristics of the original incomplete time series dataset, we find that the time lag between two consecutive valid observations is always changing because of the \u201cnone\u201d values. The time lags between observations are very important since they follow an unknown nonuniform distribution. These changeable time lags remind us that the influence of a past observation should decay with time if the variable has been missing for a while. 
In order to fit this decayed influence of the past observations, we propose the Gated Recurrent Unit for data Imputation (GRUI) cell to model the temporal irregularity of the incomplete time series.\n\nIn order to record the time lag between two adjacent existing values of X, we introduce the time lag matrix \u03b4 \u2208 R^{n\u00d7d}, which records the time lag between the current value and the last valid value. The following shows how \u03b4 is calculated, together with its value for the sample dataset:\n\n\u03b4^j_{t_i} = { t_i - t_{i-1},                   if M^j_{t_{i-1}} == 1;\n             \u03b4^j_{t_{i-1}} + t_i - t_{i-1},   if M^j_{t_{i-1}} == 0 and i > 0;\n             0,                                if i == 0;\n\n\u03b4 = [ 0   0   0   0\n      5   5   5   5\n      8  13   8  13 ].\n\nWe introduce a time decay vector \u03b2 to control the influence of the past observations. Each value of \u03b2 should be bigger than zero and no larger than one, and the larger the \u03b4, the smaller the decay vector. So we model the time decay vector \u03b2 as a function of \u03b4:\n\n\u03b2_{t_i} = 1 / e^{max(0, W_\u03b2 \u03b4_{t_i} + b_\u03b2)},   (3)\n\nwhere W_\u03b2 and b_\u03b2 are parameters that need to be learned. We use the negative exponential formulation to make sure that \u03b2_{t_i} \u2208 (0, 1]. Besides, in order to capture the interactions among the variables of \u03b4, we prefer a full weight matrix to a diagonal matrix for W_\u03b2. After we have obtained the decay vector, we update the GRU hidden state h_{t_{i-1}} by element-wise multiplying it with the decay factor \u03b2. Since we use the batch normalization [24] technique, the hidden state h is smaller than 1 with high probability. We choose this multiplicative decay rather than other decay schemes such as h^\u03b2. 
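As a concrete check, the \u03b4 matrix of the running example and the decay vector \u03b2 of Eq. (3) can be computed directly. The following is a minimal NumPy sketch, not the authors' code: the function and variable names are ours, and W_beta, b_beta are random stand-ins for learned parameters.

```python
import numpy as np

# Toy sample from the text: n = 3 observations of d = 4 variables,
# with np.nan playing the role of "none".
X = np.array([[1.0, 6.0, np.nan, 9.0],
              [7.0, np.nan, 7.0, np.nan],
              [9.0, np.nan, np.nan, 79.0]])
T = np.array([0.0, 5.0, 13.0])

M = (~np.isnan(X)).astype(float)      # mask matrix: 1 where a value is observed
n, d = X.shape

def time_lag_matrix(M, T):
    """delta[i, j] = time since variable j was last observed, per the definition above."""
    delta = np.zeros(M.shape)
    for i in range(1, len(T)):
        gap = T[i] - T[i - 1]
        # restart the lag if variable j was observed at t_{i-1},
        # otherwise accumulate it on top of the previous lag
        delta[i] = np.where(M[i - 1] == 1, gap, delta[i - 1] + gap)
    return delta

delta = time_lag_matrix(M, T)
# delta == [[0, 0, 0, 0], [5, 5, 5, 5], [8, 13, 8, 13]], matching the text

# Time decay of Eq. (3): beta = 1 / exp(max(0, W_beta @ delta_i + b_beta)).
# W_beta and b_beta are learned in the model; random stand-ins here.
rng = np.random.default_rng(0)
W_beta, b_beta = rng.normal(size=(d, d)), rng.normal(size=d)
beta = 1.0 / np.exp(np.maximum(0.0, delta @ W_beta.T + b_beta))
```

Note that \u03b2 lies in (0, 1] by construction, since the exponent is clamped at zero.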
The update functions of GRUI are:\n\nh'_{t_{i-1}} = \u03b2_{t_i} \u2299 h_{t_{i-1}},   (4)\n\n\u03bc_{t_i} = \u03c3(W_\u03bc [h'_{t_{i-1}}, x_{t_i}] + b_\u03bc),   r_{t_i} = \u03c3(W_r [h'_{t_{i-1}}, x_{t_i}] + b_r),   (5)\n\n\u02dch_{t_i} = tanh(W_{\u02dch} [r_{t_i} \u2299 h'_{t_{i-1}}, x_{t_i}] + b_{\u02dch}),   h_{t_i} = (1 - \u03bc_{t_i}) \u2299 h'_{t_{i-1}} + \u03bc_{t_i} \u2299 \u02dch_{t_i},   (6)\n\nwhere \u03bc is the update gate, r is the reset gate, \u02dch is the candidate hidden state, \u03c3 is the sigmoid activation function, W_{\u02dch}, W_r, W_\u03bc, b_\u03bc, b_r and b_{\u02dch} are training parameters, and \u2299 denotes element-wise multiplication.\n\nFigure 4: GRU cell.\n\nFigure 5: GRUI cell.\n\nD and G structure. The D is first composed of a GRUI layer that processes the incomplete or complete time series. Then a fully connected layer is stacked on top of the last hidden state of the GRUI layer. To prevent overfitting, we apply dropout [44] to the fully connected layer. When we feed an original incomplete real time series into D, the values in one row of \u03b4 are not all the same. When we feed a fake time series generated by G, the values in each row of \u03b4 are the same (because there is no missing value). We want to make sure that the time lags of the generated samples are the same as those of the original samples, so the G is also made up of a GRUI layer and a fully connected layer. The G is a self-feed network, which means that the current output of G is fed into the same cell at the next iteration. 
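For concreteness, a single GRUI step (Eqs. 4-6) can be sketched as follows. This is a minimal NumPy illustration under our own naming, with random weights standing in for trained parameters; it is not the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grui_step(x_t, h_prev, beta_t, p):
    """One GRUI step: a GRU update applied to a decayed hidden state.

    p holds matrices W_mu, W_r, W_h acting on the concatenation [h', x]
    and biases b_mu, b_r, b_h (our names, one possible parameterization).
    """
    h_dec = beta_t * h_prev                           # Eq. (4): h' = beta_t * h_{t-1}
    hx = np.concatenate([h_dec, x_t])
    mu = sigmoid(p["W_mu"] @ hx + p["b_mu"])          # update gate
    r = sigmoid(p["W_r"] @ hx + p["b_r"])             # reset gate
    rhx = np.concatenate([r * h_dec, x_t])
    h_tilde = np.tanh(p["W_h"] @ rhx + p["b_h"])      # candidate state
    return (1.0 - mu) * h_dec + mu * h_tilde          # Eq. (6): new hidden state

# Tiny usage example: 2 inputs, 3 hidden units, random stand-in weights.
rng = np.random.default_rng(1)
d_in, d_h = 2, 3
p = {k: rng.normal(size=(d_h, d_h + d_in)) for k in ("W_mu", "W_r", "W_h")}
p.update({k: rng.normal(size=d_h) for k in ("b_mu", "b_r", "b_h")})
h = grui_step(np.ones(d_in), np.zeros(d_h), np.full(d_h, 0.5), p)
```

With \u03b2 fixed to all-ones this reduces to a standard GRU step, which is why the cell degrades gracefully on fully observed data.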
The very first input of G is the random noise vector z, and every row of the \u03b4 of a fake sample is a constant value. Note that batch normalization [24] is applied to both G and D.\n\n2.2 Missing Values Imputation by GAN\n\nFrom the GAN architecture, we know that the generator G learns a mapping G(z): z \u21a6 x that maps the random noise vector z to a complete time series without missing values. However, the problem is that the random noise vector z is randomly sampled from a latent space, e.g., a Gaussian distribution. This means that the generated samples may change a lot with the changing of the input random noise z. Although the generated samples obey the distribution of the original samples, the distance between the generated samples and the original samples may still be large. In other words, the degree of similarity between x and G(z) may not be large enough. For example, suppose the original incomplete time series contains two classes, and the G learns a distribution that fits these two classes very well. Given an incomplete sample x and a random input vector z, G(z) may belong to the opposite class of x, which is not what we want. Even if G(z) belongs to the true class, it may still be far from x, since samples within a class can also differ considerably.\n\nFor any incomplete time series x, we try to find the best vector z in the latent input space so that the generated sample G(z) is most similar to x. How do we replace the missing values with the most reasonable values? Inspired by [41], we introduce a way to measure the degree of imputation fitness. We define a two-part loss function to evaluate the fitness of imputation. The first part of this loss function is the masked reconstruction loss, which means that the generated sample G(z) should be close enough to the original incomplete time series x. The other part of this loss function is the discriminative loss. 
This part forces the generated sample G(z) to be as real as possible. The following paragraphs describe the masked reconstruction loss and the discriminative loss in detail.\n\nMasked Reconstruction Loss. The masked reconstruction loss is defined as the masked squared error between the original sample x and the generated sample G(z). It is noteworthy that we only calculate it on the non-missing part of the data:\n\nL_r(z) = ||X \u2299 M - G(z) \u2299 M||^2.   (7)\n\nDiscriminative Loss. The discriminative loss stands for the degree of authenticity of the generated sample G(z). It is based on the output of the discriminator D, which represents the confidence level of the input sample G(z) being real. We feed the noise vector z into G to get the generated sample G(z), then feed G(z) into D to get the discriminative loss:\n\nL_d(z) = -D(G(z)).   (8)\n\nImputation Loss. We define the imputation loss to optimize the random noise vector z. The imputation loss is a combination of the masked reconstruction loss and the discriminative loss:\n\nL_imputation(z) = L_r(z) + \u03bbL_d(z),   (9)\n\nwhere \u03bb is a hyper-parameter that controls the proportion between the masked reconstruction loss and the discriminative loss.\n\nFor each original time series x, we randomly sample a z from the Gaussian distribution with zero mean and unit variance and feed it into the well trained generator G to get G(z). Then we train the noise z with the loss function L_imputation(z) by back propagation. After the imputation loss converges, we replace the missing values of x with the generated G(z) as the following equation shows:\n\nx_imputed = x \u2299 M + (1 - M) \u2299 G(z).   (10)\n\n3 Experiments\n\nWe evaluate the proposed method on two real-world datasets: a medical dataset and an air quality dataset. 
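The imputation stage of Section 2.2 (optimizing z under Eqs. 7-10) can be sketched end-to-end. In the sketch below we stand in a toy linear generator for the trained GRUI generator G and set \u03bb = 0 (the value used later for the KDD experiments), so the gradient of the masked reconstruction loss with respect to z has a closed form; the real method instead backpropagates through G and D. All names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 3, 4, 5                    # series length, variables, noise dimension

# Toy *linear* stand-in for the trained generator: G(z) = (A z + c) reshaped
# to an n x d series. The real G is a GRUI network.
A = rng.normal(size=(n * d, k))
c = rng.normal(size=n * d)
G = lambda z: (A @ z + c).reshape(n, d)

x = rng.normal(size=(n, d))                      # "observed" series
M = (rng.random((n, d)) > 0.5).astype(float)     # mask: 1 = observed, 0 = missing

masked_mse = lambda z: np.sum(((G(z) - x) * M) ** 2)   # L_r(z), Eq. (7)

# Optimize z by gradient descent on L_r (lambda = 0, so L_imputation = L_r).
z = rng.normal(size=k)
loss_before = masked_mse(z)
for _ in range(500):
    resid = (G(z) - x) * M                       # masked residual
    grad = 2.0 * A.T @ resid.reshape(-1)         # exact dL_r/dz for the linear toy G
    z -= 0.01 * grad

# Eq. (10): keep observed values, fill the holes with the generated ones.
x_imputed = x * M + (1.0 - M) * G(z)
```

Only the missing entries are replaced; Eq. (10) guarantees the observed values pass through unchanged.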
To demonstrate the imputation quality of the proposed method, we compare our algorithm with simple imputation methods, a matrix factorization based imputation method and a KNN based imputation method. We also compare our GAN based imputation method against other baselines on prediction and regression tasks.\n\n3.1 Datasets and Tasks\n\nPhysionet Challenge 2012 dataset (PhysioNet). The Physionet dataset is a public electronic medical record dataset from the PhysioNet Challenge 2012 [42]. This dataset consists of records from 4,000 intensive care unit (ICU) stays. Every ICU stay is a roughly 48-hour time series with 41 variables such as age, weight, albumin, heart rate, glucose, etc. One task of the PhysioNet Challenge 2012 is the mortality prediction task, i.e., predicting whether the patient dies in the hospital. There are 554 (13.85%) patients with a positive mortality label. This task is a binary classification problem on an imbalanced dataset, so the AUC score is used to judge the performance of the classifier. Because no complete version of the dataset exists, directly evaluating the accuracy of the filled-in missing values is impossible. Therefore, we use the mortality prediction results produced by the same classifier trained on differently imputed datasets to assess the performance of the imputation methods. Machine learning methods must have enough training data to learn the potential relations between samples, so we do not train the classifier on a dataset processed by case deletion methods for PhysioNet, because of its high missing rate (80.67%).\n\nKDD CUP 2018 Dataset (KDD). The KDD CUP 2018 dataset is a public air quality dataset from the KDD CUP Challenge 2018 [11]. The KDD dataset contains historical air quality data of Beijing. We select 11 common air quality and weather observatories for our experiments. 
Each observatory has records observed every hour from January 1, 2017 to December 30, 2017. The records have 12 variables in total, including PM2.5 (ug/m3), PM10 (ug/m3), CO (mg/m3), weather, temperature and so on. We split this dataset into 48-hour windows, just like the PhysioNet dataset, obtaining about 182 time series. On the split dataset, we conduct the following two tasks. 1) Time series imputation task: for every 48-hour time series, we randomly discard p percent of the data, where p \u2208 {20, 30, ..., 90}, then fill the missing values and calculate the imputation accuracy, defined as the mean squared error (MSE) between the original values and the imputed values. 2) Air quality prediction task: for every 48-hour time series, we randomly discard 50 percent of the data, then predict the mean air quality of the next 6 hours. As before, we use the air quality prediction results produced by the same regression model trained on differently imputed datasets to assess the performance of the imputation methods.\n\nDataset   | # of Features | # of Samples | Missing Rate\nPhysionet | 41            | 4000         | 80.67%\nKDD       | 132           | 182          | 1%\n\nTable 1: Dataset statistics.\n\n3.2 Training Settings\n\nNetwork details and training strategies. The discriminator consists of a GRUI layer and a fully connected layer. We feed the real incomplete time series x, the fake but complete time series G(z) and their corresponding \u03b4 into the GRUI layer. Then the last hidden state of the GRUI layer is fed into the fully connected layer with dropout to get the discriminator's output. The generator is a self-feed network that also consists of a GRUI layer and a fully connected layer. The current hidden state of the GRUI layer is fed into the fully connected layer with dropout, and the output of the fully connected layer is treated as the input of the next iteration. 
All the outputs of the fully connected layer are concatenated and batch normalized to form G(z). The very first input of the generator is the random noise z. Before training the GAN, the generator is pretrained for some epochs with a squared error loss for predicting the next value in the training time series. For the PhysioNet dataset, the input dimension is 41 (we use all the variables of the PhysioNet dataset), the batch size is 128, the number of hidden units in the GRUI of G and D is 64, and the dimension of the random noise is also 64. For the KDD dataset, the input dimension is 132 (11 observatories \u00d7 12 variables), the batch size is 16, the number of hidden units in the GRUI of G and D is 64, and the dimension of z is 256.\n\nComparative Methods. When it is feasible to directly evaluate the imputation accuracy (task 1 of the KDD dataset), we compare the proposed method with simple imputation methods, the matrix factorization imputation method and the KNN imputation method. If it is impracticable to get the complete dataset, we use two tasks to indirectly measure the imputation accuracy. 1) Classification task (mortality prediction task): we use datasets imputed by the proposed method and other methods to train a logistic regression classifier, SVM classifiers, a random forest classifier and an RNN classifier, and then indirectly compare the filling accuracy of these methods. 2) Regression task (air quality prediction task): we use datasets imputed by different imputation methods to train a linear regression model, a decision tree regression model, a random forest regression model and an RNN based regression model, and then indirectly compare the filling accuracy of these methods.\n\n3.3 Results\n\nExperimental results on the Physionet dataset. For the PhysioNet dataset, we cannot access complete samples. Therefore, we measure the filling accuracy of our proposed method and the other imputation methods indirectly. 
The hyper-parameters of our method are: 30 training epochs, 5 pretraining epochs, a learning rate of 0.001, \u03bb = 0.15 and 400 optimization iterations of the imputation loss. Figure 6 shows the comparison results of the classification task (mortality prediction task). We first complete the dataset by filling with the last value, zero value, mean value and GAN generated value. The input dataset is standardized when we impute the missing values with the mean value, last value and GAN generated value. If we also standardized the input for zero value imputation, it would become identical to mean imputation, so we do not standardize the input dataset when imputing with zero values. We train the logistic regression classifier, SVM classifiers (with RBF, Linear, Poly and Sigmoid kernels), random forest classifier and RNN classifier on the imputed complete datasets to indirectly compare the filling accuracy of these filling methods. The RNN classifier is composed of a GRUI layer that processes the complete time series and a fully connected layer that outputs the classification result. We can see that, except for the SVM classifier with the RBF kernel, the classifiers trained on the dataset imputed by the proposed method always gain the best AUC score. Given the lack of a complete dataset, these results indirectly demonstrate the success of the GAN based imputation method. It is worth noting that we achieve a new state-of-the-art mortality prediction result with an AUC score of 0.8603 by using the dataset imputed by the GAN based imputation method, while the previous state-of-the-art AUC score is 0.848 [25]. 
Table 2 gives a detailed comparison of the mortality prediction results produced by different methods.\n\nModel                                      | Result\nNeural network model called GRUD [7]       | 0.8424\nHazard Markov Chain model [29]             | 0.8381\nRegularized Logistic Regression model [25] | 0.848\nGAN based imputation & RNN model           | 0.8603\n\nTable 2: The AUC score of the mortality prediction task on the Physionet dataset. The RNN model that uses the dataset imputed by our method achieves the highest AUC score.\n\nExperimental results on the KDD dataset. Table 3 shows the comparison between the proposed GAN based method and other imputation methods, including imputation with the last observed value (last imputation), imputation with the mean value (mean imputation), the KNN based method and the matrix factorization based method. Before the start of the experiments, we standardized the input dataset; therefore, filling with zero is the same as filling with the mean. The hyper-parameters of our method are: 25 training epochs, 20 pretraining epochs, a learning rate of 0.002, \u03bb = 0.0 and 800 optimization iterations of the imputation loss.\n\nFigure 6: The AUC score of mortality prediction by different classification models trained on different imputed datasets.\n\nFigure 7: The MSE of air quality prediction by different regression models trained on different imputed datasets.\n\nThe first column of Table 3 is the missing rate, which indicates the percentage of missing values in the dataset. The remaining columns are the mean squared errors (MSE) of the corresponding imputation methods. We can see that, for most missing rates, the proposed method achieves the best filling accuracy. 
This is because the proposed GAN based method can automatically learn the temporal relationships within a sample, the similarities between similar samples, the association rules between variables and the distribution of the dataset. In this way, the proposed GAN based method can fill the missing entries with the most reasonable values.\n\nFigure 7 shows the experimental results of the regression task. We use the KDD dataset with 50% missing values. Just like in the classification task, we first fill the missing values, then train several regression models, including a decision tree model, a linear regression model, a random forest model and an RNN regression model. The RNN regression model is also made up of a GRUI layer and a fully connected layer. The hyper-parameters are the same as in the direct comparison. Because we have standardized the input dataset, zero filling is the same as mean filling. Figure 7 shows that the regression model trained on the dataset imputed by the proposed method always gains the minimum MSE value. 
These results demonstrate the success of the GAN based imputation method.\n\nMissing rate | Last filling | Mean filling | KNN filling | MF filling | GAN filling\n90%          | 2.870        | 1.002        | 1.243       | 1.196      | 1.018\n80%          | 1.689        | 0.937        | 0.873       | 0.860      | 0.837\n70%          | 1.236        | 0.935        | 0.852       | 0.805      | 0.780\n60%          | 1.040        | 0.973        | 0.856       | 0.834      | 0.803\n50%          | 0.990        | 0.923        | 0.798       | 0.772      | 0.743\n40%          | 0.901        | 0.914        | 0.776       | 0.787      | 0.753\n30%          | 0.894        | 0.907        | 0.803       | 0.785      | 0.780\n20%          | 1.073        | 0.916        | 0.892       | 0.850      | 0.844\n\nTable 3: The MSE results of the proposed method and other imputation methods on the KDD dataset. In most cases, the proposed method achieves the best imputation accuracy.\n\nComparison with GAN using a non-modified GRU. We have also compared the proposed method with a GAN that uses a non-modified GRU. In this setting, we do not take advantage of the time interval information and treat the time series as fixed-interval data, so we do not model the time decay vector \u03b2 that controls the influence of past observations. We find that, with a non-modified GRU, the final AUC score on the Physionet dataset is 0.8029, while with the GRUI it is 0.8603. Meanwhile, Table 4 shows the advantages of the GRUI cell on the KDD dataset. We can see that with the damping of the hidden state, the final imputation performance increases considerably in all situations. 
The reason is that our model can learn and make use of the flexible time lags in the dataset, and thus produces better results than a non-modified GRU cell.

Missing-rate   90%    80%    70%    60%    50%    40%    30%    20%
GRU           1.049  0.893  0.841  0.823  0.794  0.767  0.820  0.849
GRUI          1.018  0.837  0.780  0.803  0.743  0.753  0.780  0.844

Table 4: The MSE comparison of a GAN with GRU and a GAN with GRUI on the KDD dataset.

3.4 Discussions

The proportion between discriminative loss and masked reconstruction loss. In this part, we investigate the most influential hyper-parameter, λ. Figures 8 and 9 show the impact of λ, that is, the impact of the proportion between the discriminative loss and the masked reconstruction loss. We sample 13 values of λ from 0.0 to 16.0 and compare the experimental results across them. For the regression task on the KDD dataset, the MSE grows near-linearly with λ. This indicates that the masked reconstruction loss dominates the imputation loss and that the discriminative loss helps little for the regression task on the KDD dataset. For the classification task on the PhysioNet dataset, the AUC score is small when λ is 0.0, reaches its maximum at λ = 0.15, and then decreases as λ grows. This shows that the discriminative loss helps considerably for the classification task on the PhysioNet dataset, especially at λ = 0.15.

Figure 8: The influence of λ in the classification task. The AUC score reaches its maximum at λ = 0.15.

Figure 9: The influence of λ in the regression task. The MSE reaches its minimum at λ = 0.0.

4 Related Work

This section introduces related work on missing value processing methods and generative adversarial networks.

4.1 Missing Value Processing Methods

The presence of missing values in datasets significantly degrades data analysis results [8]. To handle missing values, researchers have proposed many methods in recent years. These methods can be classified into deletion-based methods, simple imputation methods, and machine learning based imputation methods.
Deletion-based methods erase all observations/records with missing values, including listwise deletion [45] and pairwise deletion [35]. Their common drawback is the loss of statistical power when the missing rate is large (i.e., greater than 5%) [18].
Simple imputation algorithms impute missing values with statistical attributes, such as the mean value [27], the median value [1], the most common value [12], or the last observed valid value [2].
Machine learning based imputation methods include maximum likelihood Expectation-Maximization (EM) based imputation [38], K-Nearest Neighbor (KNN) based imputation [40], Matrix Factorization (MF) based imputation, and Neural Network (NN) based imputation. The EM imputation algorithm consists of an "expectation" step and a "maximization" step, which iteratively update the model parameters and the imputed values so that the model best fits the dataset. The KNN based imputation method uses the mean value of the k nearest samples to impute missing values.
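As a concrete illustration of the KNN scheme just described, scikit-learn's `KNNImputer` implements the same idea; the toy matrix below is made up for demonstration.

```python
# Minimal illustration of KNN-based imputation: each missing entry is filled
# using its k nearest neighbors (by distance over the observed features).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)  # no NaNs remain after imputation
print(X_filled)
```

Note that, like mean filling, this treats rows as exchangeable and ignores any temporal ordering, which is the limitation the GAN-based approach targets.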
The MF based imputation factorizes the incomplete matrix into low-rank matrices U and V, solved by gradient descent with an L1 sparsity penalty on the elements of U and an L2 penalty on the elements of V. Neural network based imputation [16] uses the numerous parameters of a neural network to learn the distribution of the training dataset and then fills in the missing values.

4.2 Generative Adversarial Networks

In 2014, Goodfellow et al. [17] introduced generative adversarial networks (GAN), a framework for estimating generative models via an adversarial process. A GAN consists of two components: a generator and a discriminator. The generator tries to fool the discriminator by generating fake samples from a random "noise" vector. The discriminator tries to distinguish fake samples from real ones, i.e., to produce the probability that a sample comes from the real dataset rather than from the generator. However, the traditional GAN is hard to train; WGAN [3] is an alternative training procedure that improves the stability of learning and mitigates the mode collapse problem.
Many works have shown that a well-trained GAN can produce realistic images in the computer vision field [9, 23, 28, 36]. GAN has also been successfully used to complete faces [5, 30, 33, 31]. Only a few works have introduced GAN into the sequence generation field [6, 13, 39, 47], such as SeqGAN [47] and MaskGAN [13]. However, these works are not suitable for missing value imputation: they require a complete training dataset before generating sequences, which is unavailable in our scenario, whereas our model does not need complete training data. Besides, most GAN-based sequence generation methods produce new samples from a random "noise" vector.
As the random "noise" vector changes, the generated samples change substantially. However, the data imputation task requires the imputed values to be as close as possible to the original incomplete data. A few works also use GAN to impute missing values, such as GAIN [46]; the drawback of GAIN is that it does not consider the temporal structure of time series.

5 Conclusion

In this paper, we propose a novel generative adversarial network for data imputation. In order to learn the unfixed time lags between two observed values, a modified GRU cell (called GRUI) is proposed for processing incomplete time series. After training the GAN model with the GRUI cell, the "noise" input vector of the generator is optimized so that the generator produces reasonable values for imputation. In this way, the temporal relationships, the inner-class similarities, and the distribution of the dataset can be automatically learned under the adversarial architecture. Experimental results show that our method outperforms the baselines in terms of missing value imputation accuracy and benefits downstream applications.

6 Acknowledgements

We thank the reviewers for their constructive comments. We also thank Zhicheng Dou for his helpful suggestions. This research is supported by the National Natural Science Foundation of China (No. 61772289 and No. 61872338).

References
[1] Edgar Acuna and Caroline Rodriguez. The treatment of missing values and its effect on classifier accuracy. In Classification, clustering, and data mining applications, pages 639–647. Springer, 2004.
[2] Mehran Amiri and Richard Jensen. Missing data imputation using fuzzy-rough methods. Neurocomputing, 205:152–164, 2016.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks.
In International Conference on Machine Learning, pages 214\u2013223, 2017.\n\n[4] Gustavo EAPA Batista and Maria Carolina Monard. An analysis of four missing data treatment\n\nmethods for supervised learning. Applied arti\ufb01cial intelligence, 17(5-6):519\u2013533, 2003.\n\n[5] Ashish Bora, Eric Price, and Alexandros G Dimakis. Ambientgan: Generative models from\nlossy measurements. In International Conference on Learning Representations (ICLR), 2018.\n\n[6] Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua\nBengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv\npreprint arXiv:1702.07983, 2017.\n\n[7] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent\nneural networks for multivariate time series with missing values. Scienti\ufb01c reports, 8(1):6085,\n2018.\n\n[8] Jehanzeb R Cheema. A review of missing data handling methods in education research. Review\n\nof Educational Research, 84(4):487\u2013508, 2014.\n\n[9] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In\nAdvances in Neural Information Processing Systems, pages 2172\u20132180, 2016.\n\n[10] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,\nHolger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-\ndecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.\n\n[11] KDD Cup. Available on: http://www.kdd.org/kdd2018/, 2018.\n\n[12] A Rogier T Donders, Geert JMG Van Der Heijden, Theo Stijnen, and Karel GM Moons. A gentle\nintroduction to imputation of missing values. Journal of clinical epidemiology, 59(10):1087\u2013\n1091, 2006.\n\n[13] William Fedus, Ian Goodfellow, and Andrew M Dai. Maskgan: Better text generation via \ufb01lling\n\nin the _. 
arXiv preprint arXiv:1801.07736, 2018.
[14] Pedro J García-Laencina, Pedro Henriques Abreu, Miguel Henriques Abreu, and Noémia Afonoso. Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values. Computers in biology and medicine, 59:125–133, 2015.
[15] Pedro J García-Laencina, José-Luis Sancho-Gómez, and Aníbal R Figueiras-Vidal. Pattern classification with missing data: a review. Neural Computing and Applications, 19(2):263–282, 2010.
[16] Iffat A Gheyas and Leslie S Smith. A neural network-based framework for the reconstruction of incomplete data sets. Neurocomputing, 73(16-18):3039–3065, 2010.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[18] John W Graham. Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60:549–576, 2009.
[19] Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16:3367–3402, 2015.
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
[21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[22] Tsung Jung Hsieh, Hsiao Fen Hsiao, and Wei Chang Yeh. Forecasting stock markets using wavelet transforms and recurrent neural networks: An integrated system based on artificial bee colony algorithm.
Applied Soft Computing Journal, 11(2):2510\u20132525, 2011.\n\n[23] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. Stacked generative\nadversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\nvolume 2, page 4, 2017.\n\n[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training\n\nby reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[25] Alistair EW Johnson, Andrew A Kramer, and Gari D Clifford. Data preprocessing and mor-\ntality prediction: The physionet/cinc 2012 challenge revisited. In Computing in Cardiology\nConference (CinC), 2014, pages 157\u2013160. IEEE, 2014.\n\n[26] Jiri Kaiser. Dealing with missing values in data. Journal of systems integration, 5(1):42, 2014.\n\n[27] Mehmed Kantardzic. Data mining: concepts, models, methods, and algorithms. John Wiley &\n\nSons, 2011.\n\n[28] Christian Ledig, Lucas Theis, Ferenc Husz\u00e1r, Jose Caballero, Andrew Cunningham, Alejandro\nAcosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic\nsingle image super-resolution using a generative adversarial network. arXiv preprint, 2016.\n\n[29] Dae Hyun Lee and Eric Horvitz. Predicting mortality of intensive care patients via learning\n\nabout hazard. In AAAI, pages 4953\u20134954, 2017.\n\n[30] Yijun Li, Sifei Liu, Jimei Yang, and Ming-Hsuan Yang. Generative face completion.\n\nIn\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1,\npage 6, 2017.\n\n[31] Pengpeng Liu, Xiaojuan Qi, Pinjia He, Yikang Li, Michael R Lyu, and Irwin King. Semantically\nconsistent image completion with \ufb01ne-grained details. arXiv preprint arXiv:1711.09345, 2017.\n\n[32] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence\nproperties of generative adversarial learning. 
In Advances in Neural Information Processing Systems, pages 5551–5559, 2017.
[33] Zhihe Lu, Zhihang Li, Jie Cao, Ran He, and Zhenan Sun. Recent progress of face image synthesis. arXiv preprint arXiv:1706.04717, 2017.
[34] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.
[35] Patrick E McKnight, Katherine M McKnight, Souraya Sidani, and Aurelio Jose Figueredo. Missing data: A gentle introduction. Guilford Press, 2007.
[36] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[37] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.
[38] Fulufhelo V Nelwamondo, Shakir Mohamed, and Tshilidzi Marwala. Missing data: A comparison of neural network and expectation maximization techniques. Current Science, pages 1514–1521, 2007.
[39] Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. Adversarial generation of natural language. arXiv preprint arXiv:1705.10929, 2017.
[40] MATLAB Release. The MathWorks, Inc., Natick, Massachusetts, United States, 488, 2013.
[41] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
[42] Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In Computing in Cardiology (CinC), 2012, pages 245–248.
IEEE, 2012.
[43] Luciana O Silva and Luis E Zárate. A brief review of the main approaches for treatment of missing data. Intelligent Data Analysis, 18(6):1177–1198, 2014.
[44] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[45] Werner Wothke. Longitudinal and multigroup modeling with missing data. 2000.
[46] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GAIN: Missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920, 2018.
[47] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
[48] Kaiping Zheng, Jinyang Gao, Kee Yuan Ngiam, Beng Chin Ooi, and Wei Luen James Yip. Resolving the bias in electronic medical records. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2171–2180. ACM, 2017.