Reviews: Learning low-dimensional state embeddings and metastable clusters from time series data

This appears to be a detailed, carefully written, and theoretically/empirically supported piece of work. The results are based on RKHS theory and appear to extend the state-of-the-art in a number of ways, including learning KMEs while dealing with dependent samples (e.g. from a Markov chain), low dimensional/rank approximations with state-of-the-art error bounds, and a variational approach for clustering, as well as a robust set of experiments. On the whole, I am satisfied that this clearly meets the acceptance criteria for NeurIPS and I am recommending it for acceptance. I have a few comments below on aspects of the paper, where things were either unclear, or might benefit from a revisit. It is unclear how strong assumption 2 is. It seems to me that if the marginal p(x) is in the Hilbert space H, and the joint p(x,y) is in the joint Hilbert space Hx\tilde{H} then it is natural to assume that the conditional is too, but it isn't clear whether this is a strong or weak assumption. When could it be violated? The equations under [lines 147 and 158] seem to be inconsistent (one is the theoretical and the other the approximate conditional probability). I think the first equation should have half powers on the C-matrices. Theorem 4 appears to assume that the underlying process has a correct clustering of states, but this isn't explicit in the theorem. The stochastic diffusion process could be a bit better described. And a better interpretation/intuition over the results. The colours in the #clusters=15 plot should be distinct (I count 8 colours and 17 contiguous regions). There appears to be a bit of repetition in the description of the DQN experiment, and the pairs of states could be more formally described (how did you identify that these states should be close to one another before you conducted the experiment, or was it a post-hoc analysis). ## typos [line 48] metastablet sets -> metastable sets [line 130-1] shift-invariance kernel -> shift-invariant kernel [line 158] Assumptions 2 -> Assumption 2 [line 168] states within the same set shares -> states within the same set share Equation under [line 179] The symbol i on the top line should be an x I think. Also, it should be stated that this misclassification rate is with respect to data where the ground truth is that the states belong to m clusters. [line 220] with the number of clusters vary -> with a varying number of clusters ## After reading the reviews I can see some valid points made by the other reviewers in their comments particularly with respect to the clarity of what contributions are novel. I still consider this to be a good paper though, and I will keep my score as it stands.

Reviewer 2

Aside from some minor grammatical issues, I found this to be an interesting paper introducing new methods that leveraging low-rank structure to embed transition distributions and time-varying states. I think the idea of the diffusion distance can inspire further work in distance metrics for sequential models as well as new learning paradigms. (i) Where do the authors see their work fitting in relative to existing work on Koopman operators and Dynamic Mode Decomposition. For example in https://papers.nips.cc/paper/6713-learning-koopman-invariant-subspaces-for-dynamic-mode-decomposition.pdf, the authors use the level sets of the Koopman eigenfunction to find basins of attraction of the dynamics (Figure 5 -- right in the pdf) in a manner similar to what is done here via clustering. (ii) Have you experimented with clustering the raw time-series (i.e. without the state embedding). What do those results look like and how do they qualitatively differ from what is presented in Figure 2? (iii) Could you comment on the assumption in line 171 that the clustering admits a unique optimal solution. What properties of the time-series do you think might be necessary for that assumption to be satisfied? (iv) How did the quality of the state embeddings (as visualized in Figure 2) vary as a function of the number of Fourier features used. Presumably there is some tradeoff between the number of features needed to capture the properties of the dynamics to a sufficient degree and the complexity of the underlying dynamics (for example, the number of basins of attraction and how close they are). Have you experimented with this?

Reviewer 3

I think the theory proposed in section 2 is quite interesting and a novel way of computing a Markov transition matrix given sampled transitions. From my understanding it basically requires sampling transitions and projecting them on to a random set of features, and then using the projection to approximate a transition matrix and its SVD. I think that the paper is not written very well because it does not emphasize the intuition behind projecting to random features or other work on random features. Further this could be modified to be a streaming algorithm which the authors don't do. The next section is more confusing. First, the authors don't specify in what sense their markov transition matrix is irreversible. I believe they are referring to the fact that the normal diffusion operator is reversible after multiplication by a degree conjugate. However, it seems like the same general type of symmetrization trick is working here. At least this needs to be written clear for a non-familiar audience.

Reviewer 4

After the rebuttal, I have increased my score accordingly. ################################ The authors propose a kernel mean embedding nonparametric estimation for low-rank Markov chains, and provide corresponding statistical rate. Majors: 1. Nonparametric estimation for Markov chain probability kernel seems to have been considered in [1]. And [1] assume a more general model with an approximately low-rank structure with exponential decay of the eigenvalue. Minimax estimate rate is also derived. Can the authors clarify the novelty beyond [1]? 2. I feel assumption 2 is strong since the authors assume the kernel function for Hilbert space can be exactly represented by a finite sum of basis functions. Does this mean that there is no approximation error any more? Typically in RHKS literature, people represent the kernel function by an infinite sum spectral expansion. Different kernel functions then are categorized by different eigenvalue decay rates, e.g. polynomial decay kernel or exponential decay kernel. 3. The rate in Theorem 1 is just a parametric rate that roughly the same as the one in [2][3]. So I am wondering how hard to generalize the technique from [2][3] to this paper. 4. I appreciate the authors could implement their method on Atari game. But I am not convinced that why you take the last hidden layer of DQN? What is the benefit and will it improve the score of the game? The explanation for some particular actions of a game is quire artificial since we know vanilla DQN itself is not stable. The author should test some datasets to show that if the transition kernel is indeed low-rank. Minor: 1. The comparison with plain KME is not fair since reshaped KME uses additional information, saying the knowledge of low-rankness. [1]. Matthias Löffler, Antoine Picard. Spectral thresholding for the estimation of Markov chain transition operators. [2]. Anru Zhang, Mengdi Wang. Spectral State Compression of Markov Processes. [3]. Xudong Li, Mengdi Wang, Anru Zhang. Estimation of Markov Chain via Rank-Constrained Likelihood.

Paper ID:	2572
Title:	Learning low-dimensional state embeddings and metastable clusters from time series data

Reviewer 1

Reviewer 2

Reviewer 3

Reviewer 4