{"title": "NAOMI: Non-Autoregressive Multiresolution Sequence Imputation", "book": "Advances in Neural Information Processing Systems", "page_first": 11238, "page_last": 11248, "abstract": "Missing value imputation is a fundamental problem in spatiotemporal modeling, from motion tracking to the dynamics of physical systems. Deep autoregressive models suffer from error propagation which becomes catastrophic for imputing long-range sequences. In this paper, we take a non-autoregressive approach and propose a novel deep generative model: Non-AutOregressive Multiresolution Imputation (NAOMI) to impute long-range sequences given arbitrary missing patterns. NAOMI exploits the multiresolution structure of spatiotemporal data and decodes recursively from coarse to fine-grained resolutions using a divide-and-conquer strategy. We further enhance our model with adversarial training. When evaluated extensively on benchmark datasets from systems of both deterministic and stochastic dynamics, NAOMI demonstrates significant improvement in imputation accuracy (reducing average prediction error by 60% compared to autoregressive counterparts) and generalization for long-range sequences.", "full_text": "NAOMI: Non-Autoregressive Multiresolution Sequence Imputation

Yukai Liu (Caltech, yukai@caltech.edu), Rose Yu (Northeastern University, roseyu@northeastern.edu), Stephan Zheng* (Caltech, Salesforce, stephan.zheng@salesforce.com), Eric Zhan (Caltech, ezhan@caltech.edu), Yisong Yue (Caltech, yyue@caltech.edu)

Abstract

Missing value imputation is a fundamental problem in spatiotemporal modeling, from motion tracking to the dynamics of physical systems. Deep autoregressive models suffer from error propagation which becomes catastrophic for imputing long-range sequences. 
In this paper, we take a non-autoregressive approach and propose a novel deep generative model: Non-AutOregressive Multiresolution Imputation (NAOMI) to impute long-range sequences given arbitrary missing patterns. NAOMI exploits the multiresolution structure of spatiotemporal data and decodes recursively from coarse to fine-grained resolutions using a divide-and-conquer strategy. We further enhance our model with adversarial training. When evaluated extensively on benchmark datasets from systems of both deterministic and stochastic dynamics, NAOMI demonstrates significant improvement in imputation accuracy (reducing average error by 60% compared to autoregressive counterparts) and generalization for long-range sequences.

1 Introduction

The problem of missing values often arises in real-life sequential data. For example, in motion tracking, trajectories often contain missing data due to object occlusion, trajectories crossing, and the instability of camera motion [1]. Missing values can introduce observational bias into training data, making the learning unstable. Hence, imputing missing values is of critical importance to downstream sequence learning tasks. Sequence imputation has been studied for decades in the statistics literature [2, 3, 4, 5]. Most statistical techniques rely on strong assumptions about the missing patterns, such as missing at random, and do not generalize well to unseen data. Moreover, existing methods do not work well when the proportion of missing data is high and the sequence is long.

Recent studies [6, 7, 8, 9] have proposed to use deep generative models for learning flexible missing patterns from sequence data. However, all existing deep generative imputation methods are autoregressive: they model the value at the current timestamp using the values from previous time-steps and impute missing data in a sequential manner. Hence, autoregressive models are highly susceptible to compounding error, which can become catastrophic for long-range sequence modeling. We observe in our experiments that existing autoregressive approaches struggle on sequence imputation tasks with long-range dynamics.

Figure 1: Imputation process of NAOMI in a basketball play given two players (purple and blue) and 5 known observations (black dots). Missing values are imputed recursively from coarse resolution to fine-grained resolution (left to right).

*This work was done while the author was at Caltech.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we introduce a novel non-autoregressive approach for long-range sequence imputation. Instead of conditioning only on the previous values, we model the conditional distribution on both the history and the (predicted) future. We exploit the multiresolution nature of spatiotemporal sequences, and decompose the complex dependency into simpler ones at multiple resolutions. Our model, Non-autoregressive Multiresolution Imputation (NAOMI), employs a divide-and-conquer strategy to fill in the missing values recursively. Our method is general and can work with various learning objectives. We release an implementation of our model as an open source project.2

In summary, our contributions are as follows:

• We propose a novel non-autoregressive decoding procedure for deep generative models that can impute missing values for spatiotemporal sequences with long-range dependencies.
• We introduce adversarial training using the generative adversarial imitation learning objective with a fully differentiable generator to reduce variance.
• We conduct exhaustive experiments on benchmark sequence datasets including traffic time series, billiards and basketball trajectories. 
Our method demonstrates 60% improvement in accuracy and generates realistic sequences given arbitrary missing patterns.

2 Related Work

Missing Value Imputation Existing missing value imputation approaches roughly fall into two categories: statistical methods and deep generative models. Statistical methods often impose strong assumptions on the missing patterns. For example, mean/median averaging [4], linear regression [2], MICE [10], and k-nearest neighbours [11] can only handle data missing at random. Latent variable models with the EM algorithm [12] can impute data missing not at random, but are restricted to certain parametric models. Deep generative models offer a flexible framework for missing data imputation. For instance, [13, 6, 14] develop variants of recurrent neural networks to impute time series. [8, 9, 7] leverage generative adversarial training (GAN) [15] to learn complex missing patterns. However, all the existing imputation models are autoregressive.

Non-Autoregressive Modeling Non-autoregressive models have gained competitive advantages over autoregressive models in natural language processing [16, 17, 18] and speech [19]. For instance, [19] uses a normalizing flow model [20] to train a parallel feed-forward network for speech synthesis. For neural machine translation, [16] introduce a latent fertility model with a sequence of discrete latent variables. Similarly, [17, 18] propose a fully deterministic model to reduce the amount of supervision. All these works highlight the strength of non-autoregressive models in decoding sequence data in a scalable fashion. Our work is the first non-autoregressive model for sequence imputation tasks with a novel recursive decoding algorithm.

Generative Adversarial Training Generative adversarial networks (GAN) [15] introduce a discriminator to replace the maximum likelihood objective, which has sparked a new paradigm of generative modeling. 
For sequence data, using a discriminator for the entire sequence ignores the sequential dependency and can suffer from mode collapse. [21, 22] develop imitation and reinforcement learning to train GANs in the sequential setting. [21] propose generative adversarial imitation learning to combine GAN and inverse reinforcement learning. [22] develop GAN for discrete sequences using reinforcement learning. We use an imitation learning formulation with a differentiable policy.

Multiresolution Generation Our method bears affinity with multiresolution generative models for images such as Progressive GAN [23] and multiscale autoregressive density estimation [24]. The key difference is that [23, 24] only capture spatial multiresolution structures and assume additive models for different resolutions. We deal with multiresolution spatiotemporal structures and generate predictions recursively. Our method is fundamentally different from hierarchical sequence models [25, 26, 27], as it only keeps track of the most relevant hidden states and updates them on-the-fly, which is memory efficient and much faster to train.

2 https://github.com/felixykliu/NAOMI

3 Non-Autoregressive Multiresolution Sequence Imputation

Let X = (x1, x2, ..., xT) be a sequence of T observations, where each time step xt ∈ R^D. X has missing data, indicated by a masking sequence M = (m1, m2, ..., mT). The masking mt is zero whenever xt is missing. Our goal is to replace the missing data with reasonable values for a collection of sequences. Let I denote the incomplete sequence obtained by masking X with M. A common practice for missing value imputation is to directly model the conditional distribution

p(X | I) = ∏_{t=1}^{T} p(xt | x<t, I≥t) ≈ ∏_{t=1}^{T} p(xt | hf_t, hb_t),   (1)

where hf_t and hb_t are the hidden states of the history and the future respectively. We parameterize the above distributions with a forward RNN ff and a backward RNN fb:

hf_t = ff(hf_{t−1}, It),   hb_t = fb(hb_{t+1}, It).   (2)

3.1 NAOMI Architecture

Multiresolution decoder. 
Given the joint hidden states H := [Hf, Hb], the decoder learns the distribution of complete sequences p(X|H). We adopt a divide-and-conquer strategy and decode recursively from coarse to fine-grained resolutions. As shown in Figure 2, at each iteration, the decoder first identifies two known time steps as pivots (x1 and x5 in this example), and imputes close to their midpoint (x3). One pivot is then replaced by the newly imputed step and the process repeats at a finer resolution for x2 and x4.

Algorithm 1 Non-AutOregressive Multiresolution Imputation
1: Initialize generator Gθ and discriminator Dω
2: repeat
3:   Sample complete sequences X* ∼ C and mask M
4:   Compute incomplete sequences X = X* ⊙ M
5:   Initialize hf_t, hb_t using Eqn 2 for 0 ≤ t ≤ T
6:   while X contains missing values do
7:     Find the smallest i and the smallest j > i s.t. mi = mj = 1 and ∃t, i < t < j, s.t. mt = 0
8:     Find the smallest r s.t. nr = 2^(R−r) ≤ (j − i)/2, thus the imputation point t* = i + nr
9:     Decode xt* using p(xt*|H) = g(r)(hf_i, hb_j), update X, M
10:    Update the hidden states using hf_t = ff(hf_{t−1}, It), hb_t = fb(hb_{t+1}, It)
11:  end while
12:  Update generator Gθ by backpropagation
13:  Train discriminator Dω with complete sequences X* and imputed sequences X̂
14: until converged

Formally speaking, a decoder with R resolutions consists of a series of decoding functions g(1), ..., g(R), each of which predicts every nr = 2^(R−r) steps. The decoder first finds two known steps i and j as pivots, and then selects the missing step t* that is closest to their midpoint [(i + j)/2]. Let r be the smallest resolution that satisfies nr ≤ (j − i)/2. The decoder updates the hidden states at time t* using the forward states hf_i and the backward states hb_j. A decoding function g(r) then maps the hidden states to the distribution over the outputs: p(xt*|H) = g(r)(hf_i, hb_j).

If the dynamics are deterministic, g(r) directly outputs the imputed value. For stochastic dynamics, g(r) outputs the mean and the standard deviation of an isotropic Gaussian distribution, and the predictions are sampled from this Gaussian using the reparameterization trick [28]. The mask mt* is updated to 1 after imputation and the process proceeds to the next resolution. The details of this decoding process are described in Algorithm 1. We encourage the reader to watch our demo video for a detailed visualization and imputed examples.3

Efficient hidden states update. NAOMI efficiently updates the hidden states by reusing previous computation, and has the same time complexity as autoregressive models. Figure 3 shows an example for a sequence of length nine. Grey blocks are the known time steps. Orange blocks are the target time steps to be imputed. Hollow arrows denote forward hidden state updates, and black arrows represent backward hidden state updates. Grey arrows are the outdated hidden state updates. The dashed arrows represent the decoding steps. Earlier hidden states are stored in the imputed time steps and are reused. Therefore, forward hidden states hf only need to be updated once and backward hidden states hb are updated at most twice.

Figure 3: NAOMI hidden state updating rule for a sequence of length nine. Note that the backward hidden states hb_9 through hb_7 are updated twice when predicting x̂6.

Complexity. The total run-time of NAOMI is O(T). 
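To make the decoding schedule concrete, the pivot-and-midpoint recursion of Algorithm 1 can be sketched as follows. This is an illustrative helper (not the released implementation) that reproduces only the order in which missing steps are visited, using the resolution rule nr = 2^(R − r):

```python
def decode_order(mask, R):
    """Order in which Algorithm 1 visits missing steps (illustrative only).

    mask: list of 0/1 flags (1 = observed); assumes the first and last
    steps are observed, so a pair of pivots always exists.
    R: number of resolutions; resolution r imputes every n_r = 2**(R - r) steps.
    """
    m = list(mask)
    order = []
    while 0 in m:
        # Smallest pivot i: an observed step immediately followed by a gap.
        i = min(t for t in range(len(m) - 1) if m[t] == 1 and m[t + 1] == 0)
        # Smallest pivot j > i: the next observed step after the gap.
        j = min(t for t in range(i + 1, len(m)) if m[t] == 1)
        # Smallest r (the coarsest applicable resolution) with n_r <= (j - i) / 2.
        r = min(r for r in range(1, R + 1) if 2 ** (R - r) <= (j - i) / 2)
        t_star = i + 2 ** (R - r)  # imputation point
        order.append(t_star)
        m[t_star] = 1  # the imputed step becomes a new pivot
    return order
```

For a length-nine sequence with only the endpoints observed and R = 3, this visits the midpoint first and then refines each half, mirroring the coarse-to-fine pattern of Figure 3.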
The memory usage is similar to that of a bi-directional RNN (O(T)), except that we only need to save the latest hidden states for the forward encoder. The decoder hyperparameter R is picked such that 2^R is close to the most common missing interval size, and the run time scales logarithmically with the length of the sequence.

3 https://youtu.be/eoiK42w02w0

3.2 Learning Objective

Let C = {X*} be the collection of complete sequences, Gθ(X, M) denote our generative model NAOMI parametrized by θ, and p(M) denote the prior over the masking. The imputation model can be trained by optimizing the following objective:

min_θ E_{X*∼C, M∼p(M), X̂∼Gθ(X,M)} [ ∑_{t=1}^{T} L(x̂t, xt) ],   (3)

where L is some loss function. For deterministic dynamics, we use the mean squared error as our loss, L(x̂t, xt) = ||x̂t − xt||^2. For stochastic dynamics, we can replace L with a discriminator, which leads to the adversarial training objective. We use a similar formulation as generative adversarial imitation learning (GAIL) [21], which quantifies the distributional difference between generated and training data at the sequence level.

Adversarial training. Given the generator Gθ in NAOMI and a discriminator Dω parameterized by ω, the adversarial training objective function is:

min_θ max_ω E_{X*∼C} [ ∑_{t=1}^{T} log Dω(xt) ] + E_{X*∼C, M∼p(M), X̂∼Gθ} [ ∑_{t=1}^{T} log(1 − Dω(x̂t)) ].   (4)

GAIL samples the sequences directly from the generator and optimizes the parameters using policy gradient. 
This approach can suffer from high variance and requires a large number of samples [29]. Instead of sampling, we take a model-based approach and make our generator fully differentiable. We apply the reparameterization trick [28] at every time step by mapping the hidden states to the mean and variance of a Gaussian distribution.

4 Experiments

We evaluate NAOMI in environments with diverse dynamics: real-world traffic time series, billiard ball trajectories from a physics engine, and team movements from professional basketball gameplay. We compare with the following baselines:

• Linear: linear interpolation; missing values are imputed using interpolated predictions from the two closest known observations.
• KNN [11]: k nearest neighbours; missing values are imputed as the average of the k nearest neighboring sequences.
• GRUI [9]: autoregressive model with GAN for time series imputation, modified to handle complete training sequences. The discriminator is applied once to the entire time series.
• MaskGAN [7]: autoregressive model with actor-critic GAN, trained using adversarial imitation learning with the discriminator applied to every time step; uses a forward encoder only, and decodes at a single resolution.
• SingleRes: autoregressive counterpart of our model, trained using adversarial imitation learning; uses a forward-backward encoder, but decodes at a single resolution. Without adversarial training, it reduces to BRITS [14].

We randomly choose the number of steps to be masked, and then randomly sample the specific steps to mask in the sequence. Hence the model learns various missing patterns during training. We used the same masking scheme for all methods, including MaskGAN and GRUI. 
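This masking scheme, and the L2 loss on missing entries reported in the experiments, can be sketched as follows. The code is an illustrative reconstruction, not the released training code; keeping the first step always observed follows the basketball setup:

```python
import random

def sample_mask(T, low, high, always_keep=(0,)):
    """Sample a mask: 1 = observed, 0 = missing.

    The number of missing steps is drawn uniformly from [low, high]
    (e.g. 40 to 49 for basketball); indices in `always_keep` stay observed.
    """
    candidates = [t for t in range(T) if t not in always_keep]
    n_missing = min(random.randint(low, high), len(candidates))
    missing = set(random.sample(candidates, n_missing))
    return [0 if t in missing else 1 for t in range(T)]

def masked_l2(x_hat, x, mask):
    """Mean squared error over the missing (mask == 0) steps only."""
    errs = [(a - b) ** 2 for a, b, m in zip(x_hat, x, mask) if m == 0]
    return sum(errs) / max(len(errs), 1)
```

Because the count and the positions of masked steps are both resampled per sequence, the model sees gaps of many different sizes during training.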
See Appendix for implementation and training details.

4.1 Imputing Traffic Time Series

The PEMS-SF traffic time series [30] data contains 267 training and 173 testing sequences of length 144 (sampled every 10 mins throughout the day). It is multivariate with 963 dimensions, representing the freeway occupancy rate collected from 963 different sensors. We generate a masking sequence for each sequence with 122 to 140 missing values.

Imputation accuracy. L2 loss between imputed missing values and their ground-truth most accurately measures the quality of the generated sequences. As clearly shown in Table 1, NAOMI outperforms the others by a large margin, reducing L2 loss by 23% compared to the autoregressive baselines. KNN performs reasonably well, mostly because of the repeated daily traffic patterns in the training data: simply finding a similar sequence in the training data is sufficient for imputation.

Table 1: Traffic data L2 loss comparison. NAOMI outperforms others, reducing L2 loss by 23% from the autoregressive counterpart.

Models            NAOMI  SingleRes  MaskGAN  KNN    GRUI   Linear
L2 Loss (10^-4)   3.54   4.51       6.02     4.58   15.24  15.59

Figure 4: Traffic time series imputation visualization. NAOMI successfully captures the multiresolution patterns of the data from observed steps, while SingleRes only learns a smoothed version of the original sequence and frequently deviates from ground truth.

Generated Sequences. Figure 4 visualizes the predictions from the two best performing models: NAOMI (blue) and SingleRes (red). Black dots are observed time steps and black curves are the ground truth. NAOMI successfully captures the pattern of the ground truth time series, while SingleRes fails. NAOMI learns the multiscale fluctuation rooted in the ground truth, whereas SingleRes only learns some averaged behavior. 
This demonstrates the clear advantage of using multiresolution modeling.

4.2 Imputing Billiards Trajectories

We generate 4000 training and 1000 test sequences of billiards ball trajectories in a rectangular world using the simulator from [31]. Each ball is initialized with a random position and random velocity and rolled out for 200 timesteps. All balls have a fixed size and uniform density, and there is no friction. We generate a masking sequence for each trajectory with 180 to 195 missing values.

Imputation accuracy. Three defining characteristics of the physics in this setting are: (1) moving in straight lines; (2) maintaining unchanging speed; and (3) reflecting upon hitting a wall. Hence, we adopt four metrics to quantify the learned physics: (1) L2 loss between imputed values and ground-truth; (2) Sinuosity, to measure the straightness of the generated trajectories; (3) Average step size change, to measure the speed change of the ball; and (4) Distance between reflection point and the wall, to check whether the model has learned the physics underlying collision and reflection.

A comparison of all models w.r.t. these metrics is shown in Table 2. Expert represents the ground truth trajectories from the simulator. Statistics closer to the expert are better. We observe that NAOMI has the best overall performance across almost all the metrics, followed by the SingleRes baseline. Linear is expected to perform the best w.r.t. step change: by design, linear interpolation maintains a constant step size, which is closest to the ground-truth.

Table 2: Metrics for billiards imputation accuracy. Statistics closer to the expert indicate better model performance. NAOMI has the best overall performance, reducing deviation from ground truth by 30% to 70% across all metrics compared to autoregressive baselines.

Models                Linear  KNN    GRUI   MaskGAN  SingleRes  NAOMI  Expert
Sinuosity             1.121   1.469  1.859  1.095    1.019      1.006  1.000
Step change (10^-3)   0.961   24.59  28.19  15.35    9.290      1.588  7.239
Reflection to wall    0.247   0.189  0.225  0.100    0.038      0.023  0.018
L2 loss (10^-2)       19.00   5.381  20.57  1.830    0.233      0.067  0.000

Generated trajectories. We visualize the imputed trajectories in Figure 5. There are 8 known timesteps (black dots), including the starting position. NAOMI can successfully recover the original trajectory whereas SingleRes deviates significantly. Notably, SingleRes mistakenly predicts the ball to bounce off the upper wall instead of the left wall. As such, SingleRes has to correct its behavior to match future observations, leading to curved and unrealistic trajectories. Another deviation can be seen near the bottom-left corner, where NAOMI produces trajectory paths that are truly parallel after bouncing off the wall twice, but SingleRes does not.

Figure 5: Comparison of imputed billiards trajectories. Blue and red trajectories/curves represent NAOMI and the single-resolution baseline model respectively. White trajectories represent the ground-truth. There are 8 known observations (black dots). NAOMI almost perfectly recovers the ground-truth and achieves lower stepwise L2 loss of missing values than the baseline model (third row). The trajectory from the baseline first incorrectly bounces off the upper wall, which results in curved paths that deviate from the ground-truth.

Figure 6: Billiards model performance with increasing percentage of missing values. The median and 25, 75 percentile values are displayed at each number of missing steps. Statistics closer to the expert indicate better performance. NAOMI performs better than SingleRes for all metrics.

Robustness to missing proportion. Figure 6 compares the performance of NAOMI and SingleRes as we increase the proportion of missing values. 
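Two of the trajectory metrics above can be made concrete with a short sketch. These are plausible readings of sinuosity (arc length over chord length) and average step-size change; the paper's exact definitions may differ:

```python
import math

def sinuosity_and_step_change(traj):
    """Sinuosity and mean step-size change for a 2-D trajectory.

    traj: list of (x, y) points. Sinuosity is taken as total path length
    divided by the straight-line distance between the endpoints.
    """
    steps = [math.dist(p, q) for p, q in zip(traj, traj[1:])]
    arc = sum(steps)
    chord = math.dist(traj[0], traj[-1])
    sinuosity = arc / chord if chord > 0 else float("inf")
    # Average absolute change in step size between consecutive steps.
    step_change = sum(abs(a - b) for a, b in zip(steps, steps[1:])) / max(len(steps) - 1, 1)
    return sinuosity, step_change
```

A straight, constant-speed roll gives sinuosity 1 and step change 0; curved or speed-varying imputations inflate both, which is why values closer to the expert row indicate more physical trajectories.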
The median value and the 25, 75 percentile values are displayed for each metric. As the dynamics are deterministic, a higher missing proportion usually means bigger gaps, making it harder to find the correct solutions. We can see that both models' performance degrades drastically as we increase the percentage of missing values, but NAOMI still outperforms SingleRes in all metrics.

4.3 Imputing Basketball Players Movement

The basketball tracking dataset contains the trajectories of professional basketball players on offense, with 107,146 training and 13,845 test sequences. Each sequence contains the (x, y)-coordinates of 5 players for 50 timesteps at 6.25Hz and takes place in the left half-court. We generate a masking sequence for each trajectory with 40 to 49 missing values.

Table 3: Metrics for basketball imputation accuracy. Statistics closer to the expert indicate better model performance. NAOMI has the best overall performance, reducing deviation from ground truth by more than 70% compared to autoregressive baselines.

Models                 Linear  KNN    GRUI   MaskGAN  SingleRes  NAOMI  Expert
Path Length            0.482   0.921  1.141  0.793    0.702      0.573  0.556
OOB Rate (10^-3)       2.997   0.128  4.703  4.592    3.874      0.861  1.733
Step Change (10^-3)    0.522   13.24  14.95  9.622    4.811      2.565  1.982
Path Difference        0.519   0.746  0.690  0.680    0.571      0.581  0.580
Player Distance        0.422   0.403  0.398  0.427    0.417      0.423  0.425

Figure 7: Comparison of imputed basketball trajectories. Black dots represent known observations (10 in the first row, 5 in the second). Overall, NAOMI produces trajectories that are the most consistent and have the most realistic player velocities and speeds.

Imputation accuracy. Since the environment is stochastic (basketball players on offense aim to be unpredictable), measuring L2 loss between our model output and the ground-truth is not necessarily a good indicator of realistic trajectories [32, 33]. 
Hence, we follow previous work and compute domain-specific metrics to compare trajectory quality: (1) Average trajectory length, to measure the typical player movement in 8 seconds; (2) Average out-of-bound rate, to measure the odds of trajectories going out of the court boundaries; (3) Average step size change, to quantify the player movement variance; (4) Max-Min path diff; and (5) Average player distance, to characterize the team coordination. Table 3 compares model performance using these metrics. Expert represents real human play, and the closer to the expert data, the better. NAOMI outperforms the baselines in almost all the metrics.

Generated trajectories. We visualize imputed trajectories from all models in Figure 7. NAOMI produces trajectories that are the most consistent with known observations and have the most realistic player velocities and speeds. In contrast, the other baseline models often fail in these regards. KNN generates trajectories with unnatural jumps, as finding nearest neighbors becomes difficult with dense known observations. Linear fails to generate curvy trajectories when few observations are known. GRUI generates trajectories that are inconsistent with known observations; this is largely due to mode collapse caused by applying a discriminator to the entire sequence. MaskGAN, which relies on seq2seq and a single encoder, fails to condition on the future observations and predicts straight lines.

Robustness to missing proportion. Figure 8 compares the performance of NAOMI and SingleRes as we increase the proportion of missing values. The median value and the 25, 75 percentile values are displayed for each metric. Note that we always observe the first step. Generally speaking, more missing values make the imputation harder and also bring more uncertainty to model predictions. We can see that the performance (average performance and imputation variance) of both models degrades with more missing values. 
However, at a certain percentage of missing values, the performance of imputation starts to improve for both models. This shows an interesting trade-off between available information and the number of constraints for generative models in imputation. More observations provide more information regarding the data distribution, but can also constrain the learned model output. As we reduce the number of observations, the model can learn more flexible generative distributions, without conforming to the constraints imposed by the observed time steps.

Figure 8: Basketball model performance with increasing percentage of missing values. The median and 25, 75 percentile values are displayed. Statistics closer to the expert indicate better performance. NAOMI performs better than SingleRes for all metrics.

Learned conditional distribution. Our model is fully generative and learns the conditional distribution of the complete sequences given observations, as shown in Figure 9. For a given set of known observations, we use NAOMI to impute missing values with 50 different random seeds and overlay the generated trajectories. We can see that as the number of known observations increases, the variance of the learned conditional distribution decreases. However, we also observe some mode collapse in our model: the trajectory of the purple player in the ground truth is not captured in the conditional distribution in the first image.

Figure 9: The generated conditional distribution of basketball trajectories given known observations (black dots) with sampled trajectories. As the number of known observations increases, the variance of the predictions, and hence the model uncertainty, decreases.

4.4 Forward Prediction

Forward prediction is a special case of imputation in which all observations, except for a leading sequence, are missing. We show that NAOMI can also be trained to perform forward prediction without modifying the model structure. We take a trained imputation model as initialization, and continue training for forward prediction by using the masking sequence mi = 0, ∀i ≥ 5 (the first 5 steps are known). We evaluate forward prediction performance using the same metrics.

Figure 10 compares forward prediction performance in Billiards. Without any known observations in the future, autoregressive models like SingleRes are effective in learning consistent step changes, but NAOMI generates straighter lines and learns the reflection dynamics better than the other baselines.

Figure 10: Billiards forward prediction comparison. Top: metrics for billiards prediction accuracy. Statistics closer to the expert indicate better model performance. Bottom: predicted billiards trajectories. Black dots represent known observations. NAOMI perfectly recovers the ground-truth.

Models                RNN    SingleRes  NAOMI  Expert
Sinuosity             1.054  1.038      1.020  1.00
Step Change (10^-3)   11.6   9.69       10.8   1.59
Reflection to wall    0.074  0.068      0.036  0.018
L2 Loss (10^-3)       4.698  4.753      1.682  0.0

5 Conclusion

We propose a deep generative model, NAOMI, for imputing missing data in long-range spatiotemporal sequences. NAOMI recursively finds and predicts missing values from coarse to fine-grained resolutions using a non-autoregressive approach. Leveraging multiresolution modeling and adversarial training, NAOMI is able to learn the conditional distribution given very few known observations. Future work will investigate how to infer the underlying distribution when complete training sequences are not available. The trade-off between partial observations and external constraints is another direction for deep generative imputation models.

Acknowledgments. 
This work was supported in part by NSF #1564330, NSF #1850349, and DARPA PAI: HR00111890035.
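The coarse-to-fine, divide-and-conquer decoding summarized in the conclusion can be illustrated with a minimal sketch. This is not the authors' implementation: `predict_midpoint` is a hypothetical stand-in (plain linear interpolation) for NAOMI's learned multiresolution decoder, and the recursion only demonstrates the order in which missing steps are filled between known observations.

```python
# Illustrative sketch of NAOMI's non-autoregressive decoding order:
# given two known steps, impute the midpoint first, then recurse into
# each half until every missing step is filled. The learned decoder is
# replaced here by a hypothetical linear-interpolation stand-in.

def predict_midpoint(left_val, right_val):
    """Stand-in for the learned multiresolution decoder."""
    return 0.5 * (left_val + right_val)

def impute(seq, known):
    """Fill missing entries of `seq` between known indices, coarse to fine.

    seq   -- list of floats (values at unknown indices are ignored)
    known -- sorted list of observed indices bracketing the gaps
    Returns the completed sequence and the order of imputed indices.
    """
    out = list(seq)
    order = []  # records the coarse-to-fine fill order

    def fill(lo, hi):
        if hi - lo <= 1:  # no missing step between lo and hi
            return
        mid = (lo + hi) // 2
        out[mid] = predict_midpoint(out[lo], out[hi])
        order.append(mid)
        fill(lo, mid)   # recurse into the left half
        fill(mid, hi)   # recurse into the right half

    for lo, hi in zip(known, known[1:]):
        fill(lo, hi)
    return out, order
```

Between each pair of adjacent known steps, the gap is halved repeatedly, so a gap of length L is filled in O(log L) sequential stages rather than L autoregressive steps; in NAOMI each stage would invoke the decoder at the matching resolution, which is what limits error propagation over long ranges.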