__ Summary and Contributions__: This paper proposes using continuous-time neural ODEs to transform a latent Wiener process into a richer process over the observation space. The idea is similar to the usual formulation of normalizing flows, except that now the (usually fixed) base distribution is a stochastic process. More precisely, the flow’s transformation warps the conditional Normal variables of the Brownian bridge. This conditional formulation is used for exact likelihood calculation and sampling. Sampling the latent process via independent increments means that the observation process’ increments are independent as well. This is eventually relaxed by the addition of another latent variable. Experiments are first performed showing the ability to model synthetic data simulated from three different stochastic processes. Secondly, performance is demonstrated on continuous control, ECG modeling, and air quality prediction.

__ Strengths__: Irregular sampling: The primary benefit of this model is its ability to handle irregularly sampled time series data. This paper’s transformation of just the increments allows for complete flexibility in the time steps and no RNN-based decoding, as previous work has required. (Yet note that, unlike for the Latent ODE, this work’s use of the Neural ODE has *nothing* to do with handling irregular sampling---more on this below.)

__ Weaknesses__: No Explanation of Transformations of Stochastic Processes: I was under the impression that transforming / reparameterizing a stochastic process is non-trivial. I expected to see Ito’s lemma, which is the analogue of the change of variables for SDEs: for Y_t = f(X_t), its differential is dY_t = f’(X_t) dX_t + ½ f’’(X_t) \sigma^2_t dt. Thus, I was expecting Equation 7 to include a second-derivative term. Why doesn’t it? I’m not saying that Equation 7 is wrong, per se---transforming just the increments agrees with intuition. However, the problem is that the paper provides no explanation or mathematical references for stochastic processes and their transformations. There are *zero* citations in both Section 2.2 and Section 3.1. More work needs to be done to assure the reader that Equations 5-7 are correct, since they are the foundation of the work.
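For the record, the second-derivative term I refer to comes from the quadratic variation of Brownian motion; the standard derivation (my notation, not the paper's) is:

```latex
% Second-order Taylor expansion of f, keeping terms up to O(dt):
dY_t = f'(X_t)\,dX_t + \tfrac{1}{2} f''(X_t)\,(dX_t)^2 + \cdots
% With dX_t = \mu_t\,dt + \sigma_t\,dW_t and the Ito rule (dW_t)^2 = dt:
(dX_t)^2 = \sigma_t^2\,dt + o(dt)
% which yields the \tfrac{1}{2} f''(X_t)\,\sigma_t^2\,dt correction.
```

It is exactly this correction whose absence from Equation 7 the paper should explain (or explain why it does not apply to the increment-wise construction).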
Use of Neural ODEs: The paper says that Neural ODEs are used “because it has free-form Jacobian and efficient trace estimator” (line 158). It makes no sense to me why one would choose to encumber the model with such a slow, (relatively-)difficult-to-implement flow. Why not at least use a Residual Flow [Behrmann et al., 2019; Chen et al., 2019]? These are essentially the same except that now there’s no need to backprop through an ODE solver. Furthermore, the one ablation experiment done with an RNVP flow is not informative. As the paper mentions, the checkerboard masking of RNVP is very restrictive. Moreover, it is designed for images---a domain the paper does not consider. I suspect using Glow or an autoregressive flow (inversion cost notwithstanding) would be competitive-to-superior to the Neural ODE-variant while being much more practical.
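To make the suggested alternative concrete, here is a minimal sketch of an invertible residual block in the spirit of Behrmann et al. (my own toy code, not the authors'; real Residual Flows use proper spectral normalization, whereas here the Lipschitz bound is enforced by a crude one-off weight rescaling):

```python
import numpy as np

def g(x, W, b):
    # Residual branch; because Lip(g) < 1 (see the rescaling of W below),
    # the map x -> x + g(x) is invertible (Behrmann et al., 2019).
    return np.tanh(x @ W + b) @ W.T * 0.5

def forward(x, W, b):
    return x + g(x, W, b)

def inverse(y, W, b, n_iters=50):
    # Banach fixed-point iteration x = y - g(x); converges since g is a
    # contraction, so no ODE solver is needed for inversion.
    x = y.copy()
    for _ in range(n_iters):
        x = y - g(x, W, b)
    return x

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
W /= np.linalg.norm(W, 2) + 1.0       # crude "spectral normalization": sigma(W) < 1
b = rng.standard_normal(3)

x = rng.standard_normal((4, 3))
y = forward(x, W, b)
x_rec = inverse(y, W, b)
assert np.allclose(x, x_rec, atol=1e-6)  # exact invertibility, cheaply
```

The point is that forward evaluation and inversion are plain fixed-point arithmetic, with no adjoint ODE solve anywhere.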
Ablation study of sampling intervals: I would really have liked to see an ablation study of the sampling intervals. For the real-world data, the time stamps are sampled once from a homogeneous Poisson process with an intensity of 0.5 (line 36 of supp mats). Varying the sampling would have allowed us to see which of the models is truly best under irregular sampling. Perhaps Latent ODE and CTFP excel in different regimes and such a study would tell the reader which to use for their data.
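The ablation I have in mind is cheap to set up: a homogeneous Poisson process is just accumulated exponential inter-arrival times, so one can sweep the intensity around the supplement's 0.5 (the sweep values below are illustrative, not from the paper):

```python
import numpy as np

def poisson_timestamps(intensity, t_max, rng):
    """Event times of a homogeneous Poisson process on [0, t_max],
    built from Exponential(1/intensity) inter-arrival times."""
    times = []
    t = rng.exponential(1.0 / intensity)
    while t < t_max:
        times.append(t)
        t += rng.exponential(1.0 / intensity)
    return np.array(times)

rng = np.random.default_rng(0)
counts = {}
for intensity in [0.1, 0.5, 2.0]:     # sparse -> dense sampling regimes
    ts = poisson_timestamps(intensity, t_max=100.0, rng=rng)
    assert np.all(np.diff(ts) > 0)    # strictly increasing time stamps
    counts[intensity] = len(ts)
```

Re-running each model on grids drawn at several intensities would show where each method's advantage kicks in.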
Repetitive Introduction: I find the Introduction to be quite repetitive. It could be shortened and the extra space used for the stochastic process background I mentioned above.
Behrmann, Jens, et al. "Invertible residual networks." International Conference on Machine Learning. 2019.
Chen, Ricky TQ, et al. "Residual flows for invertible generative modeling." Advances in Neural Information Processing Systems. 2019.

__ Correctness__: The theoretical correctness is hard to verify because no exposition or even references are given for the transformation of stochastic processes.

__ Clarity__: No. There are places for clear improvement. The Introduction is quite repetitive, and the stochastic process background material that underlies the method should be expanded.

__ Relation to Prior Work__: I believe so but am not well-read in time series modeling.

__ Reproducibility__: Yes

__ Additional Feedback__: POST REBUTTAL UPDATE: Thank you, authors, for addressing my concerns. I have raised my score to a 6. While I agree that the paper's primary contribution is the flow's general formulation and the exact transformation is just an implementation choice, I still think it is important to report results for a wider range of transformation classes and to discuss trade-offs.

__ Summary and Contributions__: In this work, the authors use a normalizing flow to transform a Wiener process into a more complex stochastic process.

__ Strengths__: Sound. Novel in the strict sense (I do not know of work proposing exactly this scheme).

__ Weaknesses__: The key issue is in Eq. 6, which appears to indicate that the normalising flow transformation is applied at each timepoint independently. This renders the model somewhat trivial: a Kalman filter with nonlinear outputs, which has been extensively studied (e.g. in the EKF literature). It is well known that linear dynamics in a Kalman filter is a discretisation of an underlying continuous dynamical system, and you could use unevenly spaced observations if desired.
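Concretely, the "trivial" baseline I am comparing Eq. 6 to is a discretised latent Wiener process pushed pointwise through a static nonlinear output map, sampled on an arbitrary uneven grid (my own sketch, not code from the paper):

```python
import numpy as np

def sample_paths(times, f, sigma, rng, n_paths=5):
    """Latent Brownian motion on an uneven grid, mapped pointwise by f.
    Because f acts at each timepoint independently, this is the
    nonlinear-output state-space model familiar from the EKF literature."""
    dts = np.diff(times, prepend=0.0)            # uneven step sizes
    increments = rng.standard_normal((n_paths, len(times))) * np.sqrt(dts) * sigma
    w = np.cumsum(increments, axis=1)            # latent Wiener paths W_t
    return f(w)                                  # observations X_t = f(W_t)

rng = np.random.default_rng(0)
times = np.array([0.3, 0.35, 1.2, 2.0, 5.7])     # unevenly spaced, as desired
x = sample_paths(times, f=np.tanh, sigma=1.0, rng=rng)
assert x.shape == (5, 5)
```

If Eq. 6 reduces to something of this form, the burden is on the paper to articulate what the flow adds beyond it.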
Here are some possible claims that would render the work more interesting:
1.) Inclusion of the Jacobian in the output transformation renders the ML solution a better characterisation of the Bayesian solution.
2.) Having a normalising flow that depends on the value of the Wiener process at all past time steps (which allows much richer temporal dependencies).
3.) Arguing that the presently described process is surprisingly effective, by looking at more empirical examples, and comparing to SOTA performance from referenced papers.
Eq 12 might do some of this, but it is extremely unclear.

__ Correctness__: Technically reasonable.

__ Clarity__: Well written.

__ Relation to Prior Work__: The paper confuses "Background" and "Related Work". The Background should contain basic information e.g. on stochastic processes and normalising flows. The Related Work section should discuss a small number of the most closely related approaches (probably those that combine normalising flows with dynamical models), and tell us how the present work differs.
Missing reference: Filtering Normalizing Flows (Razaghi 2019).

__ Reproducibility__: Yes

__ Additional Feedback__: While I cannot recommend acceptance at this stage, I hope that my suggestions will prove helpful for the authors.

__ Summary and Contributions__: The authors propose a new normalizing flow using a Wiener process, the continuous time flow process (CTFP). CTFP has many nice properties, such as the ability to be evaluated on irregular grids and having continuous sample paths. Although any normalizing flow model can be used under the authors’ framework, they use the ANODE model.

__ Strengths__: The paper is very well written and easy to understand. The mathematical notation is also clear and precise.
The related work section has the correct level of detail and explanation of previous work.
The figures are extremely clear and useful. Figure 2 specifically was very helpful in visualizing the CTFP.
The logic flow is excellent (no pun intended).

__ Weaknesses__: Line 55 - The authors point out that continuity is important but do not explain why. Obviously this is a desirable property in general, but examples showing the limitations of non-continuous processes should be shown (or at least mentioned) to bolster the authors’ work. The authors claim that “the stochastic process generated by CTFP is guaranteed to have continuous sample paths, making it a natural fit for data with continuously-changing dynamics”; while this makes sense, it doesn’t quantify how important this is.

__ Correctness__: Everything seems correct.

__ Clarity__: The paper is extremely clear and well written. A few suggestions for improvements:
Figure 1 is not referenced in the text, therefore it is easy to skip over.
Equation 13 - The subscript of z in E_{\mathbf{z_k}∼q} is confusing. I am guessing that you are saying that each z_k is distributed according to q and that the expectation is over z_1, ..., z_K. However, it is confusing because k is also used to index the sum inside of the expectation.
Figure 3 should have labeled axes with a larger font size.
Line 19 - The authors propose what a good time series model needs but don’t explain how this relates to their work or the work of others. Even adding something like “Our proposed model captures all of these properties while being computationally tractable” at Line 28 would help the reader understand that these are the properties that CTFP captures.

__ Relation to Prior Work__: Yes previous contributions and the authors’ work are presented clearly.

__ Reproducibility__: Yes

__ Additional Feedback__: Overall the paper was very good. I would be happy to see it at NeurIPS.