{"title": "Learning low-dimensional state embeddings and metastable clusters from time series data", "book": "Advances in Neural Information Processing Systems", "page_first": 4561, "page_last": 4570, "abstract": "This paper studies how to find compact state embeddings from high-dimensional Markov state trajectories, where the transition kernel has a small intrinsic rank. In the spirit of diffusion map, we propose an efficient method for learning a low-dimensional state embedding and capturing the process's dynamics. This idea also leads to a kernel reshaping method for more accurate nonparametric estimation of the transition function. State embedding can be used to cluster states into metastable sets, thereby identifying the slow dynamics. Sharp statistical error bounds and misclassification rate are proved. Experiment on a simulated dynamical system shows that the state clustering method indeed reveals metastable structures. We also experiment with time series generated by layers of a Deep-Q-Network when playing an Atari game. The embedding method identifies game states to be similar if they share similar future events, even though their raw data are far different.", "full_text": "Learning low-dimensional state embeddings and\n\nmetastable clusters from time series data\n\nYifan Sun\n\nCarnegie Mellon University\nyifans@andrew.cmu.edu\n\nHao Gong\n\nPrinceton University\n\nhgong@princeton.edu\n\nYaqi Duan\n\nPrinceton University\n\nyaqid@princeton.edu\n\nMengdi Wang\n\nPrinceton University\n\nmengdiw@princeton.edu\n\nAbstract\n\nThis paper studies how to \ufb01nd compact state embeddings from high-dimensional\nMarkov state trajectories, where the transition kernel has a small intrinsic rank.\nIn the spirit of diffusion map, we propose an ef\ufb01cient method for learning a low-\ndimensional state embedding and capturing the process\u2019s dynamics. 
This idea also leads to a kernel reshaping method for more accurate nonparametric estimation of the transition function. The state embedding can be used to cluster states into metastable sets, thereby identifying the slow dynamics. Sharp statistical error bounds and misclassification rates are proved. An experiment on a simulated dynamical system shows that the state clustering method indeed reveals metastable structures. We also experiment with time series generated by the layers of a Deep-Q-Network when playing an Atari game. The embedding method identifies game states as similar if they share similar future events, even though their raw data are far different.

1 Introduction

High-dimensional time series are ubiquitous in scientific studies and machine learning. Finding a compact representation from state-transition trajectories is often a prerequisite for uncovering the underlying physics and making accurate predictions. Suppose that we are given a Markov process $\{X_t\}$ taking values in $\Omega \subset \mathbb{R}^d$. Let $p(y|x)$ be the one-step transition density function (transition kernel) of the Markov process. In practice, state-transition trajectories may appear high-dimensional, but they are often generated by a system with fewer internal parameters and small intrinsic dimension.

In this paper, we focus on problems where the transition kernel $p(y|x)$ admits a low-rank decomposition structure. The low-rank or nearly low-rank nature of the transition kernel has been widely identified in scientific and engineering applications, e.g., molecular dynamics [RZMC11, SS13], periodized diffusion processes [Går54], traffic transition data [ZW18, DKW19], and Markov decision processes and reinforcement learning [KAL16]. For reversible dynamical systems, the leading eigenfunctions of $p$ are related to metastable sets and slow dynamics [SS13].
Low-rank latent structures also help state representation learning and dimension reduction in robotics and control [BSB+15].

Our goal is to estimate the transition kernel $p(\cdot|\cdot)$ from a finite time series and to find state representations in lower dimensions. For nonparametric estimation of probability distributions, one natural approach is the kernel mean embedding (KME). Our approach starts with a kernel space, but we "open up" the kernel function into a set of features. We leverage the low-rankness of $p$ in the spirit of the diffusion map for dimension reduction. By using samples of transition pairs $\{(X_t, X_{t+1})\}$, we can estimate the "projection" of $p$ onto the product feature space and find its leading singular functions. This allows us to learn state embeddings that preserve information about the transition dynamics. Our approach can be thought of as a generalization of the diffusion map to nonreversible processes and Hilbert spaces. We show that, when the features can fully express the true $p$, the estimated state embeddings preserve the diffusion distances and can be further used to cluster states that share similar future paths, thereby finding metastable sets and the long-term dynamics of the process.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The contributions of this paper are:

1. KME reshaping for more accurate estimation of $p$. The method of KME reshaping is proposed to estimate $p$ from dependent time series data. The method takes advantage of the low-rank structure of $p$ and can be implemented efficiently in compact space. Theorem 1 gives a finite-sample error bound and shows that KME reshaping achieves a significantly smaller error than the plain KME.

2. State embedding learning with a statistical distortion guarantee. In light of the diffusion map, we study state embedding by estimating the leading spectrum of the transition kernel.
Theorems 2 and 3 show that the state embedding largely preserves the diffusion distance.

3. State clustering with a misclassification error bound. Based on the state embeddings, we can further aggregate states to preserve the transition dynamics and find metastable sets. Theorem 4 establishes a statistical misclassification guarantee for continuous-state Markov processes.

4. Experiments with a diffusion process and an Atari game. The first experiment studies a simulated stochastic diffusion process, where the results validate the theoretical bounds and reveal metastable structures of the process. The second experiment studies the time series generated by a deep Q-network (DQN) trained on an Atari game. The raw time series is read from the last hidden layer as the DQN is run. The state embedding results demonstrate distinctive and interpretable clusters of game states. Remarkably, we observe that game states that are close in the embedding space share similar future moves, even if their raw data are far different.

To the best of our knowledge, our theoretical results on estimating $p$ and state embedding are the first of their kind for continuous-state nonreversible Markov time series. Our methods and analyses leverage spectral properties of the transition kernel. We also provide the first statistical guarantee for partitioning a continuous state space according to diffusion distance.

Related Work. Spectral dimension reduction methods find wide use in data analysis and scientific computing. The diffusion map is a prominent dimension reduction tool with applications in data analysis, graph partitioning, and dynamical systems [LL06, CKL+08]. For molecular dynamics, [SNL+11] showed that the leading spectrum of the transition operator contains information on the slow dynamics of the system, and that it can be used to identify coresets upon which a coarse-grained Markov state model can be built.
[KSM18] extended the transfer operator theory to reproducing kernel Hilbert spaces and pointed out that these operators are related to the conditional mean embeddings of the transition distributions. See [KKS16, KNK+18] for surveys on data-driven dimension reduction methods for dynamical systems. These works did not study the statistical properties of the methods, which motivated our research.

Nonparametric estimation of the Markov transition operator has been thoroughly studied; see [Yak79, Lac07, Sar14]. Among nonparametric methods, kernel mean embeddings are prominent for representing probability distributions [BTA04, SGSS07]. [SHSF09] extended kernel embedding methods to conditional distributions. [GLB+12] proposed to use conditional mean embeddings to model Markov decision processes. See [MFSS17] for a survey on kernel mean embeddings. To the best of our knowledge, none of these works considered low-rank estimation of the Markov transition kernel.

Estimation of a low-rank transition kernel was first considered by [ZW18] in the special case of finite-state Markov chains. [ZW18] used a singular-value thresholding method to estimate the transition matrix and proved near-optimal upper and lower error bounds. They also proved a misclassification rate for state clustering when the chain is lumpable or aggregatable. [LWZ18] studied a rank-constrained maximum likelihood estimator of the transition matrix. [DKW19] proposed a novel approach for finding state aggregations by spectrally decomposing the transition matrix and transforming its singular vectors. For continuous-state reversible Markov chains, [LP18] studied the nonparametric estimation of the transition kernel via Galerkin projection with spectral thresholding. They proved recovery error bounds when the eigenvalues decay exponentially.

Notations. For a function $f : \Omega \to \mathbb{R}$, we define $\|f\|_{L^2}^2 := \int_\Omega f(x)^2 dx$ and $\|f\|_{L^2(\pi)}^2 := \int_\Omega \pi(x) f(x)^2 dx$, respectively. For $g(\cdot,\cdot) \to \mathbb{R}$, we define $\|g(\cdot,\cdot)\|_{L^2(\pi)\times L^2} := (\int \pi(x) g(x,y)^2\,dy\,dx)^{1/2}$. We use $\|\cdot\|$ to denote the Euclidean norm of a vector. We let $t_{\mathrm{mix}}$ denote the mixing time of the Markov process [LP17], i.e., $t_{\mathrm{mix}} = \min\{t \mid \mathrm{TV}(P^t(\cdot\,|\,x), \pi(\cdot)) \le \frac{1}{4},\ \forall x \in \Omega\}$, where TV is the total variation distance between two distributions. Let $\pi(x)$ be the density function of the invariant measure of the Markov chain. Let $p(x, y)$ be the density of the invariant measure of the bivariate chain $\{(X_t, X_{t+1})\}_{t=0}^\infty$, i.e., $p(X_t, X_{t+1}) = \pi(X_t)\, p(X_{t+1}|X_t)$. We use $\mathbb{P}(\cdot)$ to denote the probability of an event.

2 KME Reshaping for Estimating p

In this section we study the estimation of the transition function $p$ from a finite trajectory $\{X_t\}_{t=1}^n \subset \mathbb{R}^d$. We make the following low-rank assumption on the transition kernel $p$, which is the key to more accurate estimation.

Assumption 1. There exist real-valued functions $\{u_k\}_{k=1}^r$, $\{v_k\}_{k=1}^r$ on $\Omega$ such that $p(y|x) := \sum_{k=1}^r \sigma_k u_k(x) v_k(y)$, where $r$ is the rank.

Due to the asymmetry of $p(\cdot,\cdot)$ and the lack of reversibility, we use two reproducing kernel Hilbert spaces $\mathcal{H}$ and $\tilde{\mathcal{H}}$ to embed the left and right sides of $p$. Let $K$ and $\tilde{K}$ be the kernel functions of $\mathcal{H}$ and $\tilde{\mathcal{H}}$, respectively. The kernel mean embedding (KME) $\mu_p(x,y)$ of the joint distribution $p(x,y)$ into the product space $\mathcal{H} \times \tilde{\mathcal{H}}$ is defined by

$\mu_p(x, y) := \int K(x, u)\, \tilde{K}(y, v)\, p(u, v)\, du\, dv.$

Given sample transition pairs $\{(X_i, X_i')\}_{i=1}^n$, the natural empirical KME estimator is $\tilde\mu_p(x, y) = \frac{1}{n}\sum_{i=1}^n K(X_i, x)\tilde{K}(X_i', y)$. If the data pairs are independent, one can show that the embedding error $\|\mu_p - \tilde\mu_p\|_{\mathcal{H}\times\tilde{\mathcal{H}}}$ is approximately $\sqrt{\mathbb{E}_{(X,Y)\sim p}[K(X,X)\tilde K(Y,Y)]/n}$ (Lemma 1 in Appendix). Next we propose a sharper KME estimator.

Suppose that the kernel functions $K$ and $\tilde K$ are continuous and symmetric positive semi-definite. Let $\{\Phi_j(x)\}_{j\in J}$ and $\{\tilde\Phi_j(x)\}_{j\in J}$ be real-valued feature functions on $\Omega$ such that $K(x,y) = \sum_{j\in J}\Phi_j(x)\Phi_j(y)$ and $\tilde K(x,y) = \sum_{j\in J}\tilde\Phi_j(x)\tilde\Phi_j(y)$. In practice, if one is given a shift-invariant symmetric kernel function, one can generate finitely many random Fourier features to approximate the kernel [RR08]. In what follows we assume without loss of generality that $J$ is finite of size $N$. Let $\Phi(x) = [\Phi_1(x), \ldots, \Phi_N(x)]^T \in \mathbb{R}^N$. We define the "projection" of $p$ onto the feature space by

$P = \int p(x, y)\,\Phi(x)\tilde\Phi(y)^T dx\, dy.$   (1)

Assumption 1 implies that $\mathrm{rank}(P) \le r$ (Lemma 2 in Appendix). Note that the KME of $p(x,y)$ is equivalent to $\mu_p(x,y) = \Phi(x)^T P \tilde\Phi(y)$ (Lemma 3 in Appendix). The matrix $P$ has finite dimensions, therefore we can estimate it tractably from the trajectory $\{X_t\}$ by

$\hat P := \frac{1}{n}\sum_{t=1}^n \Phi(X_t)\tilde\Phi(X_{t+1})^T.$   (2)

Since the unknown $P$ is low-rank, we propose to apply singular value truncation to $\hat P$ to obtain a better KME estimator. The algorithm is given below:

Algorithm 1: Reshaping the Kernel Mean Embedding.
Input: $\{X_1, \ldots, X_n\}$, $r$;
Get $\hat P$ by (2) and compute its SVD: $\hat P = \hat U \hat\Sigma \hat V^T$;
Let $\tilde P := \hat U \hat\Sigma_{[1\cdots r]} \hat V^T$ be the best rank-$r$ approximation of $\hat P$;
Let $\hat\mu_p(x, y) := \Phi(x)^T \tilde P \tilde\Phi(y)$;
Output: $\hat\mu_p(x, y)$

We analyze the convergence rate of $\hat\mu_p$ to $\mu_p$. Let $K_{\max} := \max\{\sup_{x\in\Omega} K(x,x),\ \sup_{x\in\Omega}\tilde K(x,x)\}$. We define the following kernel covariance matrices:

$V_1 = \mathbb{E}_{(X,Y)\sim p}[\Phi(X)\Phi(X)^T \tilde K(Y,Y)], \qquad V_2 = \mathbb{E}_{(X,Y)\sim p}[K(X,X)\tilde\Phi(Y)\tilde\Phi(Y)^T].$

Let $\bar\lambda := \max\{\lambda_{\max}(V_1), \lambda_{\max}(V_2)\}$. We show the following finite-sample error bound.

Theorem 1 (KME reshaping). Let Assumption 1 hold. For any $\delta \in (0,1)$, we have

$\|\mu_p - \hat\mu_p\|_{\mathcal{H}\times\tilde{\mathcal{H}}} = \|P - \tilde P\|_F \le C\sqrt{r}\left(\sqrt{\frac{t_{\mathrm{mix}}\bar\lambda\log(2t_{\mathrm{mix}}N/\delta)}{n}} + \frac{t_{\mathrm{mix}}K_{\max}\log(2t_{\mathrm{mix}}N/\delta)}{3n}\right)$

with probability at least $1-\delta$, where $C$ is a universal constant.

The KME reshaping method and Theorem 1 enjoy the following advantages:

1. Improved accuracy compared to the plain KME. The plain KME $\tilde\mu_p$'s estimation error is approximately $\sqrt{\mathbb{E}_{(X,Y)\sim p}[K(X,X)\tilde K(Y,Y)]/n}$ (Appendix Lemma 1). Note that $\mathrm{Tr}(V_1) = \mathrm{Tr}(V_2) = \mathbb{E}_{(X,Y)\sim p}[K(X,X)\tilde K(Y,Y)]$. When $r \ll N$, we typically have $r\bar\lambda \ll \mathrm{Tr}(V_1) = \mathrm{Tr}(V_2)$; therefore the reshaped KME has a significantly smaller estimation error.

2. Ability to handle dependent data. Algorithm 1 applies to time series consisting of highly dependent data. The proof of Theorem 1 handles the dependency by constructing a special matrix martingale and using the mixing properties of the Markov process to analyze its concentration.

3. Tractable implementation. Kernel-based methods usually require memorizing all the data and may be intractable in practice. Our approach is based on a finite number of features and only needs low-dimensional computation. As pointed out by [RR08], one can approximate any shift-invariant kernel function using $N$ features, where $N$ is linear in the input dimension $d$. Therefore Algorithm 1 can be approximately implemented in $O(nd^2)$ time and $O(d^2)$ space.

3 Embedding States into Euclidean Space

In this section we want to learn low-dimensional representations of the state space $\Omega$ that capture the transition dynamics. We need the following extra assumption, namely that $p$ can be fully represented in the kernel space.

Assumption 2. The transition kernel belongs to the product Hilbert space, i.e., $p(\cdot\,|\,\cdot) \in \mathcal{H}\times\tilde{\mathcal{H}}$.

For two arbitrary states $x, y \in \Omega$, we consider their distance given by

$\mathrm{dist}(x, y) := \|p(\cdot|x) - p(\cdot|y)\|_{L^2} = \left(\int \left(p(z|x) - p(z|y)\right)^2 dz\right)^{1/2}.$   (3)

Eq. (3) is known as the diffusion distance [NLCK06]. It measures the similarity between the future paths of two states. We are motivated by the diffusion map approach to dimension reduction [LL06, CKL+08, KSM18]. The diffusion map refers to the leading eigenfunctions of the transfer operator of a reversible dynamical system. We will generalize it to nonreversible processes and feature spaces. For simplicity of presentation, we assume without loss of generality that $\{\Phi_i\}_{i=1}^N$ and $\{\tilde\Phi_i\}_{i=1}^N$ are $L^2(\pi)$- and $L^2$-orthogonal bases of $\mathcal{H}$ and $\tilde{\mathcal{H}}$, respectively, with squared norms $\rho_1 \ge \cdots \ge \rho_N$ and $\tilde\rho_1 \ge \cdots \ge \tilde\rho_N$. Any given features can be orthogonalized to satisfy this condition.
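As a concrete illustration of this feature construction, the following is a minimal NumPy sketch of random Fourier features for a Gaussian kernel followed by empirical orthogonalization (whitening). The bandwidth, feature count, helper names, and the truncation threshold are illustrative choices rather than the paper's exact procedure; note that dropping numerically degenerate directions can shrink the basis, mirroring the reduction from 2000 random features to far fewer basis functions reported in Section 5.1.

```python
import numpy as np

def random_fourier_features(X, N=100, sigma=1.0, seed=0):
    """Random Fourier features h(x) approximating the Gaussian kernel
    K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), in the spirit of [RR08]."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, N))   # frequencies from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=N)        # random phases
    return np.sqrt(2.0 / N) * np.cos(X @ W + b)      # shape (n, N)

def orthogonalize(H):
    """Whiten features so they are orthonormal in the empirical L2 sense:
    (1/n) Phi^T Phi = I.  Any fixed feature set can be preprocessed this way."""
    n = H.shape[0]
    G = (H.T @ H) / n                      # empirical Gram (covariance) matrix
    vals, vecs = np.linalg.eigh(G)
    keep = vals > 1e-8                     # drop numerically degenerate directions
    T = vecs[:, keep] / np.sqrt(vals[keep])
    return H @ T, T                        # Phi = H T satisfies (1/n) Phi^T Phi = I

# toy usage on a one-dimensional trajectory
X = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
H = random_fourier_features(X, N=50, sigma=0.5)
Phi, T = orthogonalize(H)
empirical_gram = Phi.T @ Phi / Phi.shape[0]
```

The whitening matrix `T` here plays the role of converting the raw random features $h$ into the orthogonalized features $\Phi$ used throughout this section.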
In particular, letting $C := \mathrm{diag}[\rho_1,\cdots,\rho_N]$ and $\tilde C := \mathrm{diag}[\tilde\rho_1,\cdots,\tilde\rho_N]$, it is easy to verify that $p(y|x) = \Phi(x)^T C^{-1} P \tilde C^{-1}\tilde\Phi(y)$. Let $C^{-1/2} P \tilde C^{-1/2} = U^{(\rho)}\Sigma^{(\rho)}(V^{(\rho)})^T$ be its SVD. We define the state embedding as

$\Psi(x) := \left(\Phi(x)^T C^{-1/2}\, U^{(\rho)}\Sigma^{(\rho)}_{[1\cdots r]}\right)^T.$

It is straightforward to verify that $\mathrm{dist}(x, z) = \|\Psi(x) - \Psi(z)\|$. We propose to estimate $\Psi$ in Algorithm 2.

Algorithm 2: Learning State Embedding
Input: $\{X_1, \ldots, X_n\}$, $r$;
Get $\hat P$ from (2) and compute the SVD $\hat U^{(\rho)}\hat\Sigma^{(\rho)}(\hat V^{(\rho)})^T = C^{-1/2}\hat P\tilde C^{-1/2}$;
Compute the state embedding using the first $r$ singular pairs: $\hat\Psi(x) = \left(\Phi(x)^T C^{-1/2}\,\hat U^{(\rho)}\hat\Sigma^{(\rho)}_{[1\cdots r]}\right)^T$;
Output: $x \mapsto \hat\Psi(x)$

Let $\widehat{\mathrm{dist}}(x, z) := \|\hat\Psi(x) - \hat\Psi(z)\|$. We show that the estimated state embeddings preserve the diffusion distance up to an additive distortion.

Theorem 2 (Maximum additive distortion of state embeddings). Let Assumptions 1 and 2 hold. Let $L_{\max} := \sup_{x\in\Omega}\Phi(x)^T C^{-1}\Phi(x)$ and let $\kappa$ be the condition number of $\sqrt{\pi(x)}\,p(y|x)$. For any $0 < \delta < 1$ and for all $x, z \in \Omega$, $|\mathrm{dist}(x,z) - \widehat{\mathrm{dist}}(x,z)|$ is upper bounded by

$C\sqrt{\frac{L_{\max}}{\rho_N\tilde\rho_N}}\left(\sqrt{2}\kappa + 1\right)\left(\sqrt{\frac{t_{\mathrm{mix}}\bar\lambda\log(2t_{\mathrm{mix}}N/\delta)}{n}} + \frac{t_{\mathrm{mix}}K_{\max}\log(2t_{\mathrm{mix}}N/\delta)}{3n}\right)$

with probability at least $1-\delta$, for some constant $C$.

Under Assumption 2, we can recover the full transition kernel from data by

$\hat p(y|x) = \Phi(x)^T C^{-1/2}\,\hat U^{(\rho)}\hat\Sigma^{(\rho)}_{[1\cdots r]}(\hat V^{(\rho)})^T\,\tilde C^{-1/2}\tilde\Phi(y).$

Theorem 3 (Recovering the transition density). Let Assumptions 1 and 2 hold. For any $\delta \in (0,1)$,

$\|p(\cdot|\cdot) - \hat p(\cdot|\cdot)\|_{L^2(\pi)\times L^2} \le C\sqrt{\frac{r}{\rho_N\tilde\rho_N}}\left(\sqrt{\frac{t_{\mathrm{mix}}\bar\lambda\log(2t_{\mathrm{mix}}N/\delta)}{n}} + \frac{t_{\mathrm{mix}}K_{\max}\log(2t_{\mathrm{mix}}N/\delta)}{3n}\right)$

with probability at least $1-\delta$, for some constant $C$.

Theorems 2 and 3 provide the first statistical guarantees for learning state embeddings and recovering the transition density of continuous-state low-rank Markov processes. The state embedding learned by Algorithm 2 can be represented in $O(Nr)$ space, since $\Phi$ is known a priori. When $\Omega$ is finite and the feature map is the identity, Theorem 3 nearly matches the information-theoretic error lower bound given by [LWZ18].

4 Clustering States Using Diffusion Distances

We want to find a partition of the state space into $m$ disjoint sets $\Omega_1, \cdots, \Omega_m$. The principle is that if $x, y \in \Omega_i$ for some $i$, then $p(\cdot|x) \approx p(\cdot|y)$, meaning that states within the same set share similar future paths.
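To make the embedding machinery of Section 3 concrete, here is a minimal NumPy sketch of Algorithm 2 in the simplified case where the features are already orthonormalized, so that $C = \tilde C = I$ and the normalization matrices drop out; the synthetic data, shapes, and rank are illustrative, not the paper's tuned implementation.

```python
import numpy as np

def estimate_P(Phi, Phi_tilde):
    """Empirical projection P_hat = (1/n) sum_t Phi(X_t) Phi_tilde(X_{t+1})^T, as in Eq. (2)."""
    n = Phi.shape[0] - 1
    return Phi[:-1].T @ Phi_tilde[1:] / n

def state_embedding(P_hat, r):
    """Rank-r embedding map: with orthonormal features (C = C_tilde = I),
    Psi_hat(x) = (Phi(x)^T U_r Sigma_r)^T, from the SVD of P_hat."""
    U, s, Vt = np.linalg.svd(P_hat)
    return U[:, :r] * s[:r]               # N x r matrix; columns of U_r scaled by singular values

def embed(phi_x, M):
    return phi_x @ M                      # Psi_hat(x) in R^r

def diffusion_distance(phi_x, phi_z, M):
    """Estimated diffusion distance ||Psi_hat(x) - Psi_hat(z)||."""
    return np.linalg.norm(embed(phi_x, M) - embed(phi_z, M))

# toy usage: synthetic feature trajectory in a 6-dimensional feature space
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 6))           # rows: Phi(X_t) for t = 1..500
P_hat = estimate_P(Phi, Phi)
M = state_embedding(P_hat, r=2)           # 6 x 2 embedding map
d = diffusion_distance(Phi[0], Phi[1], M)
```

The same map `M` is what the clustering step below consumes: states are compared through $\|\hat\Psi(x) - \hat\Psi(z)\|$ rather than through their raw representations.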
This motivates us to study the following optimization problem, which has been considered in studies of dynamical systems [SS13]:

$\min_{\Omega_1,\cdots,\Omega_m}\ \min_{q_1\in\tilde{\mathcal{H}},\cdots,q_m\in\tilde{\mathcal{H}}}\ \sum_{i=1}^m\int_{\Omega_i}\pi(x)\,\|p(\cdot|x) - q_i(\cdot)\|_{L^2}^2\,dx.$   (4)

We assume without loss of generality that it admits a unique optimal solution, which we denote by $(\Omega_1^*, \ldots, \Omega_m^*)$ and $(q_1^*, \ldots, q_m^*)$. Under Assumption 2, each $q_i^*(\cdot)$ is a probability distribution and can be represented by the right singular functions $\{v_k(\cdot)\}_{k=1}^r$ of $p(\cdot|\cdot)$ (Lemma 7 in Appendix). We propose the following state clustering method:

Algorithm 3: Learning metastable state clusters
Data: $\{X_1, \ldots, X_n\}$, $r$, $m$
Use Alg. 2 to get the state embedding $\hat\Psi : \Omega \mapsto \mathbb{R}^r$;
Solve the k-means problem: $\min_{\Omega_1,\cdots,\Omega_m}\ \min_{s_1,\cdots,s_m\in\mathbb{R}^r}\ \sum_{i=1}^m\int_{\Omega_i}\pi(x)\,\|\hat\Psi(x) - s_i\|^2\,dx$;
Output: $\hat\Omega_1^*, \cdots, \hat\Omega_m^*$

The k-means method uses the invariant measure $\pi$ as a weight function. In practice, if $\pi$ is unknown, one can pick any reasonable measure, and the theoretical bound can be adapted to that measure.

We analyze the performance of the state clustering method on finite data. Define the misclassification rate as

$M(\hat\Omega_1^*,\cdots,\hat\Omega_m^*) := \min_\sigma\sum_{j=1}^m\frac{\pi(\{x : x\in\Omega_j^*,\ x\notin\hat\Omega_{\sigma(j)}^*\})}{\pi(\Omega_j^*)},$

where $\sigma$ is taken over all permutations of $\{1, \ldots, m\}$. The misclassification rate is always between $0$ and $m$. We let $\Delta_1^2 := \min_k\min_{l\ne k}\pi(\Omega_k^*)\|q_l^* - q_k^*\|_{L^2}^2$ and let $\Delta_2^2$ be the minimal value of (4).

Theorem 4 (Misclassification error bound for state clustering). Let Assumptions 1 and 2 hold. Let $\kappa$ be the condition number of $\sqrt{\pi(x)}\,p(y|x)$. If $\Delta_1 > 4\Delta_2$, then for any $0 < \delta < 1$ and $\epsilon > 0$, by letting

$n = \Theta\left(\frac{\kappa^2 r\bar\lambda\, t_{\mathrm{mix}}\log(2t_{\mathrm{mix}}N/\delta)}{\rho_N\tilde\rho_N}\cdot\max\left\{\frac{1}{(\Delta_1 - 4\Delta_2)^2},\ \frac{1}{\epsilon\Delta_1^2},\ \frac{\Delta_2^2}{\epsilon^2\Delta_1^4}\right\}\right),$

we have $M(\hat\Omega_1^*,\cdots,\hat\Omega_m^*) \le \frac{16\Delta_2^2}{\Delta_1^2} + \epsilon$ with probability at least $1-\delta$.

The full proof is given in the Appendix. The condition $\Delta_1 > 4\Delta_2$ is a separability condition needed for finding the correct clusters with high probability, and $16\Delta_2^2/\Delta_1^2$ is a non-vanishing misclassification error. In the case of a reversible finite-state Markov process, the clustering problem is equivalent to finding the optimal metastable $m$-full partition given by $\mathrm{argmax}_{\Omega_1,\cdots,\Omega_m}\sum_{k=1}^m p(\Omega_k|\Omega_k)$, where $p(\Omega_j|\Omega_i) := \frac{1}{\pi(\Omega_i)}\int_{x\in\Omega_i,\, y\in\Omega_j}\pi(x)p(y|x)\,dy\,dx$ [ELVE08, SS13]. The optimal partition $(\Omega_1^*,\cdots,\Omega_m^*)$ gives metastable sets that can be used to construct a reduced-order Markov state model [SS13]. In the more general case of nonreversible Markov chains, the proposed method clusters states together if they share similar future paths. It provides an unsupervised learning method for state aggregation, a widely used heuristic for dimension reduction in control and reinforcement learning [BT96, SJJ95].

5 Experiments

5.1 Stochastic Diffusion Processes

We test the proposed approach on simulated diffusion processes of the form $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t$, $X_t \in \mathbb{R}^d$, where $V(\cdot)$ is a potential function and $\{B_t\}_{t\ge 0}$ is the standard Brownian motion. For any interval $\tau > 0$, the discrete-time trajectory $\{X_{k\tau}\}_{k=1}^\infty$ is a Markov process. We apply the Euler method to generate a sample path $\{X_{k\tau}\}_{k=1}^n$ according to the stochastic differential equation. We use the Gaussian kernels $K(x, y) = \tilde K(x, y) = \frac{1}{(2\pi\sigma^2)^{d/2}}e^{-\frac{\|x-y\|^2}{2\sigma^2}}$ with $\sigma > 0$, and construct the RKHS $\mathcal{H} = \tilde{\mathcal{H}}$ from $L^2(\pi)$. To get the features $\Phi$, we generate 2000 random Fourier features $h = [h_1, h_2, \ldots, h_N]^T$ such that $K(x, y) \approx \sum_{i=1}^N h_i(x)h_i(y)$ [RR08], and then orthogonalize $h$ to get $\Phi$.

Comparison between reshaped and plain KME. We apply Algorithm 1 to find the reshaped KME $\hat\mu_p$ and compare its error with that of the plain KME $\tilde\mu_p$ given by (2). The experiment is performed on a four-well diffusion on $\mathbb{R}$, and we take rank $r = 4$. By orthogonalizing $N = 2000$ random Fourier features, we obtain $J = 82$ basis functions. Figure 1 shows that the reshaped KME consistently outperforms the plain KME across varying sample sizes.

State clustering to reveal metastable structures. We apply the state clustering method to analyze the metastable structures of a diffusion process whose potential function $V(x)$ is shown in Figure 2(a). We generate trajectories of length $n = 10^6$ and take the time interval to be $\tau = 0.1, 1, 5$ and $10$. We conduct the state embedding and clustering procedures with rank $r = 4$. Figure 2(c) shows the clustering results for $\tau = 1$ with a varying number of clusters.
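As a reference for this setup, the simulation and clustering loop can be sketched as follows; a one-dimensional double-well potential $V(x) = (x^2-1)^2$ stands in for the four-well potential used in the paper, plain unweighted Lloyd k-means stands in for the $\pi$-weighted k-means of Algorithm 3, and all parameters are illustrative.

```python
import numpy as np

def simulate_diffusion(grad_V, x0, n_steps, dt=1e-3, seed=0):
    """Euler-Maruyama discretization of dX_t = -grad V(X_t) dt + sqrt(2) dB_t."""
    rng = np.random.default_rng(seed)
    X = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x = x - grad_V(x) * dt + np.sqrt(2.0 * dt) * rng.normal()
        X[t] = x
    return X

def kmeans(Z, m, n_iter=50, seed=0):
    """Plain Lloyd's k-means on embedded states (uniform weights in place of pi)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=m, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)                      # assign to nearest center
        for k in range(m):
            if np.any(labels == k):
                centers[k] = Z[labels == k].mean(axis=0)  # recompute centroids
    return labels, centers

# double-well potential V(x) = (x^2 - 1)^2, with grad V(x) = 4 x (x^2 - 1)
traj = simulate_diffusion(lambda x: 4.0 * x * (x**2 - 1.0), x0=1.0, n_steps=20000)

# cluster the (here one-dimensional) states into m = 2 metastable sets
labels, centers = kmeans(traj.reshape(-1, 1), m=2)
```

In the paper's actual pipeline the k-means step runs on the $r$-dimensional embeddings $\hat\Psi(X_t)$ rather than on the raw states; the one-dimensional toy above only illustrates the mechanics.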
The partition results reliably reveal metastable sets, also known as invariant sets, which characterize the slow dynamics of this process. Figure 2(d) shows the four-cluster results for varying values of $\tau$, where the contours are based on diffusion distances to the centroid of each cluster. One can see that the diffusion-distance contours are dense when $\tau$ takes small values. This is because, when $\tau$ is small, the state embedding method largely captures fast local dynamics. By taking larger values of $\tau$, the state embedding method begins to capture slower dynamics, which correspond to low-frequency transitions among the leading metastable sets.

Figure 1: Reshaped KME versus plain KME. The error curve approximately satisfies a convergence rate of $n^{-1/2}$.

Figure 2: Metastable state clusters learned from a stochastic diffusion process. (a) Potential function $V(x)$ of the diffusion process. (b) Invariant measure $\pi(x)$. (c) State clusters (with 4, 5, 9, and 15 clusters) based on a state embedding $\hat\Psi : x \mapsto \mathbb{R}^4$. (d) Diffusion distance to the nearest cluster centroid (red dot), for $\tau = 0.1, 1, 5, 10$, illustrated as contour plots.

5.2 DQN for Demon Attack

We test the state embedding method on game trajectories of Demon Attack, an Atari 2600 game. In this game, demons appear in waves, move randomly, and attack from above, while the player moves to dodge the bullets and shoots back with a laser cannon. We train a Deep Q-Network using the architecture given by [MKS+15]. The DQN takes recent image frames of the game as input, processes them through three convolutional layers and two fully connected layers, and outputs a value for each action, among which the action with the maximal value is chosen.
Please refer to the Appendix for more details on DQN training. In our experiment, we take as raw data the time series generated by the last hidden layer of a trained DQN while it plays the game. The raw data is a time series of length 47936 and dimension 512, comprising 130 game trajectories. We apply the state embedding method, approximating the Gaussian kernel with 200 random Fourier features, and obtain low-dimensional embeddings of the game states in $\mathbb{R}^3$.

Before embedding vs. after embedding. Figure 3 visualizes the raw states and the state embeddings using t-SNE, a tool for visualizing multi-dimensional data in two dimensions [VdM08]. In both plots, states that are mapped to nearby points tend to have similar "values" (expected future returns) as predicted by the DQN, as illustrated by the colors of the data points.

Comparing Figure 3(a) and (b), the raw state data are more scattered, while after embedding they exhibit clearer structures and fewer outliers. The markers ◦, △, ◇ identify the same pairs of game states before and after embedding. They suggest that the embedding method maps game states that are far apart in their raw data to close neighbors. It can be viewed as a form of compression. The experiment has been repeated multiple times, and we consistently observe that state embedding leads to improved clusters and higher granularity in the t-SNE visualization.

Understanding state embedding from examples. Figure 4 illustrates the three examples marked by ◦, △, ◇ in Figure 3. In each example, we have a pair of game states that were far apart in their raw data but are close to each other after embedding. The two images are also visually dissimilar, so any representation learning method based on individual images alone would not consider them to be similar.
Let us analyze these three examples:

◦: Both streaming lasers (purple) are about to destroy a demon and generate a reward; both cannons are moving towards the left end.

△: In both images, two new demons are emerging on top of the cannon to join the battle, and there is an even closer enemy, leading to future dangers and potential rewards.

◇: Both cannons are waiting for more targets to appear, and both are moving towards the center from opposite sides.

These examples suggest that state embedding is able to identify states as similar if they share similar near-future events and values, even though they are visually dissimilar and distant from each other in their raw data.

Figure 3: Visualization of game states (a) before and (b) after embedding, shown in t-SNE plots. The raw data is a 512-dimensional time series generated by the last hidden layer of the DQN while it plays Demon Attack. State embeddings are computed from the raw time series using a Gaussian kernel with 200 random Fourier features. Game states are colored by the "value" of the state as predicted by the DQN. The markers ◦, △, ◇ identify the same pairs of game states before and after embedding. Comparing (a) and (b), state embedding improves the granularity of clusters and reveals more structure in the data.

Figure 4: Pairs of game states that are close after embedding (◦, △, ◇ in Figure 3). Within each pair, the two states share similar "V" values as predicted by the DQN (◦: V = 6.27 and V = 6.14; △: V = 6.17 and V = 6.16; ◇: V = 4.44 and V = 4.35), but they were not close in the raw data and are visually dissimilar. ◦: Both streaming lasers (purple) are about to destroy a demon and generate a reward; both cannons are moving towards the left end. △: In both images, two new demons are emerging on top of the cannon to join the battle, and there is an even closer enemy, leading to future dangers and potential rewards. ◇: Both cannons are waiting for more targets to appear, and both are moving towards the center from opposite sides. These examples suggest that state embedding is able to identify states as similar if they share similar near-future paths and values.

Summary and Future Work

The experiments validate our theory and lead to interesting discoveries: the estimated state embedding captures what would happen in the future conditioned on the current state. Thus the state embedding can be useful to decision makers for gaining insight into the underlying logic of the game, thereby helping them make better predictions and decisions.

Our methods are inspired by dimension reduction methods from scientific computing, and they further leverage the low-rankness of the transition kernel to reduce estimation error and find compact state embeddings. Our theorems provide a basic statistical theory of state embedding and clustering from finite-length dependent time series. They are the first theoretical results known for continuous-state Markov processes. We hope our results will motivate more work on this topic and lead to broader applications in scientific data analysis and machine learning. A natural next question is how to use state embedding to make control and reinforcement learning more efficient. This is a direction for future research.

References

[BSB+15] Wendelin Böhmer, Jost Tobias Springenberg, Joschka Boedecker, Martin Riedmiller, and Klaus Obermayer. Autonomous learning of state representations for control. Künstliche Intelligenz, 29(4):1-10, 2015.

[BT96] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5.
Athena Scientific, Belmont, MA, 1996.

[BTA04] Alain A Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

[CKL+08] Ronald R. Coifman, Ioannis G. Kevrekidis, Stéphane Lafon, Mauro Maggioni, and Boaz Nadler. Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems. SIAM Journal on Multiscale Modeling and Simulation, 7(2):852–864, 2008.

[DKW19] Yaqi Duan, Zheng Tracy Ke, and Mengdi Wang. State aggregation learning from Markov transition data. Conference on Neural Information Processing Systems (NeurIPS), 2019.

[ELVE08] Weinan E, Tiejun Li, and Eric Vanden-Eijnden. Optimal partition and effective dynamics of complex networks. Proceedings of the National Academy of Sciences, 105(23):7907–7912, 2008.

[Går54] Lars Gårding. On the asymptotic distribution of the eigenvalues and eigenfunctions of elliptic differential operators. Mathematica Scandinavica, pages 237–255, 1954.

[GLB+12] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Massimiliano Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. International Conference on Machine Learning, 2012.

[KAL16] Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.

[KKS16] Stefan Klus, Péter Koltai, and Christof Schütte. On the numerical approximation of the Perron-Frobenius and Koopman operator. Journal of Computational Dynamics, 3(1):51–79, 2016.

[KNK+18] Stefan Klus, Feliks Nüske, Péter Koltai, Hao Wu, Ioannis Kevrekidis, Christof Schütte, and Frank Noé. Data-driven model reduction and transfer operator approximation.
Journal of Nonlinear Science, 28(3):985–1010, 2018.

[KSM18] Stefan Klus, Ingmar Schuster, and Krikamol Muandet. Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces. arXiv preprint arXiv:1712.01572, 2018.

[Lac07] Claire Lacour. Estimation non paramétrique adaptative pour les chaînes de Markov et les chaînes de Markov cachées. PhD thesis, Université Paris Descartes, 2007.

[LL06] Stéphane Lafon and Ann Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393–1403, 2006.

[LP17] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Society, 2017.

[LP18] Matthias Löffler and Antoine Picard. Spectral thresholding for the estimation of Markov chain transition operators. arXiv preprint arXiv:1808.08153, 2018.

[LWZ18] Xudong Li, Mengdi Wang, and Anru Zhang. Estimation of Markov chain via rank-constrained likelihood. International Conference on Machine Learning, 2018.

[MFSS17] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

[MKS+15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

[NLCK06] Boaz Nadler, Stéphane Lafon, Ronald R. Coifman, and Ioannis G. Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators.
In Advances in Neural Information Processing Systems, pages 955–962, 2006.

[RR08] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[RZMC11] Mary A Rohrdanz, Wenwei Zheng, Mauro Maggioni, and Cecilia Clementi. Determination of reaction coordinates via locally scaled diffusion map. The Journal of Chemical Physics, 134(12):03B624, 2011.

[Sar14] Mathieu Sart. Estimation of the transition density of a Markov chain. In Annales de l'IHP Probabilités et Statistiques, volume 50, pages 1028–1068, 2014.

[SGSS07] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, 2007.

[SHSF09] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. International Conference on Machine Learning, 2009.

[SJJ95] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pages 361–368, 1995.

[SNL+11] Christof Schütte, Frank Noé, Jianfeng Lu, Marco Sarich, and Eric Vanden-Eijnden. Markov state models based on milestoning. The Journal of Chemical Physics, 134(20):204105, 2011.

[SS13] Christof Schütte and Marco Sarich. Metastability and Markov State Models in Molecular Dynamics: Modeling, Analysis, Algorithmic Approaches, volume 24. American Mathematical Society, 2013.

[VdM08] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[Yak79] Sidney Yakowitz. Nonparametric estimation of Markov transition functions. The Annals of Statistics, 7(3):671–679, 1979.

[ZW18] Anru Zhang and Mengdi Wang.
Spectral state compression of Markov processes. arXiv preprint arXiv:1802.02920, 2018.