{"title": "Dimensionality Reduction for Stationary Time Series via Stochastic Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3496, "page_last": 3506, "abstract": "Stochastic optimization naturally arises in machine learning. Efficient algorithms with provable guarantees, however, are still largely missing, when the objective function is nonconvex and the data points are dependent. This paper studies this fundamental challenge through a streaming PCA problem for stationary time series data. Specifically, our goal is to estimate the principal component of time series data with respect to the covariance matrix of the stationary distribution. Computationally, we propose a variant of Oja's algorithm combined with downsampling to control the bias of the stochastic gradient caused by the data dependency. Theoretically, we quantify the uncertainty of our proposed stochastic algorithm based on diffusion approximations. This allows us to prove the asymptotic rate of convergence and further implies near optimal asymptotic sample complexity. Numerical experiments are provided to support our analysis.", "full_text": "Dimensionality Reduction for Stationary Time Series via Stochastic Nonconvex Optimization\n\nMinshuo Chen1 Lin F. Yang2 Mengdi Wang2 Tuo Zhao1\n\n1Georgia Institute of Technology\n\n2Princeton University\n\n1{mchen393, tourzhao}@gatech.edu\n\n2{lin.yang, mengdiw}@princeton.edu\n\nAbstract\n\nStochastic optimization naturally arises in machine learning. Efficient algorithms with provable guarantees, however, are still largely missing when the objective function is nonconvex and the data points are dependent. This paper studies this fundamental challenge through a streaming PCA problem for stationary time series data. Specifically, our goal is to estimate the principal component of time series data with respect to the covariance matrix of the stationary distribution. 
Computationally, we propose a variant of Oja\u2019s algorithm combined with downsampling to control the bias of the stochastic gradient caused by the data dependency. Theoretically, we quantify the uncertainty of our proposed stochastic algorithm based on diffusion approximations. This allows us to prove the asymptotic rate of convergence and further implies near optimal asymptotic sample complexity. Numerical experiments are provided to support our analysis.\n\n1 Introduction\n\nMany machine learning problems can be formulated as a stochastic optimization problem in the following form,\n\n$$\\min_u \\; \\mathbb{E}_{Z \\sim \\mathcal{D}}[f(u, Z)] \\quad \\text{subject to } u \\in \\mathcal{U}, \\qquad (1.1)$$\n\nwhere $f$ is a possibly nonconvex loss function, $Z$ denotes the random sample generated from some underlying distribution $\\mathcal{D}$ (also known as statistical model), $u$ is the parameter of our interest, and $\\mathcal{U}$ is a possibly nonconvex feasible set for imposing modeling constraints on $u$. For finite sample settings, we usually consider $n$ (possibly dependent) realizations of $Z$ denoted by $\\{z_1, \\dots, z_n\\}$, and the loss function in (1.1) is further reduced to an additive form, $\\mathbb{E}[f(u, z)] = \\frac{1}{n} \\sum_{i=1}^n f(u, z_i)$. For continuously differentiable $f$, Robbins and Monro (1951) propose a simple iterative stochastic search algorithm for solving (1.1). Specifically, at the $k$-th iteration, we obtain $z_k$ sampled from $\\mathcal{D}$ and take\n\n$$u_{k+1} = \\Pi_{\\mathcal{U}}[u_k - \\eta \\nabla_u f(u_k, z_k)], \\qquad (1.2)$$\n\nwhere $\\eta$ is the step-size parameter (also known as the learning rate in machine learning literature), $\\nabla_u f(u_k, z_k)$ is an unbiased stochastic gradient for approximating $\\nabla_u \\mathbb{E}_{Z \\sim \\mathcal{D}} f(u_k, Z)$, i.e.,\n\n$$\\mathbb{E}_{z_k} \\nabla_u f(u_k, z_k) = \\nabla_u \\mathbb{E}_{Z \\sim \\mathcal{D}} f(u_k, Z),$$\n\nand $\\Pi_{\\mathcal{U}}$ is a projection operator onto the feasible set $\\mathcal{U}$. 
This seminal work is the foundation of the research on stochastic optimization, and has had a tremendous impact on the machine learning community. The theoretical properties of such a stochastic gradient descent (SGD) algorithm have been well studied for decades, when both $f$ and $\\mathcal{U}$ are convex. For example, Sacks (1958); Bottou (1998); Chung (2004); Shalev-Shwartz et al. (2011) show that under various regularity conditions, SGD converges to a global optimum as $k \\to \\infty$ at different rates. Such a line of research for convex and smooth objective function $f$ is fruitful and has been generalized to nonsmooth optimization (Duchi et al., 2012b; Shamir and Zhang, 2013; Dang and Lan, 2015; Reddi et al., 2016).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWhen $f$ is nonconvex, which appears more often in machine learning problems, however, the theoretical studies on SGD are very limited. The main reason is that the optimization landscape of nonconvex problems can be much more complicated than that of convex ones. Thus, conventional optimization research usually focuses on proving that SGD converges to first order optimal stationary solutions (Nemirovski et al., 2009). More recently, some results in machine learning literature show that SGD actually converges to second order optimal stationary solutions, when the nonconvex optimization problem satisfies the so-called \u201cstrict saddle property\u201d (Ge et al., 2015; Lee et al., 2017). More precisely, when the objective has negative curvatures at all saddle points, SGD can find a way to escape from these saddle points. 
A number of nonconvex optimization problems in machine learning and signal processing have been shown to satisfy this property, including principal component analysis (PCA), multiview learning, phase retrieval, matrix factorization, matrix sensing, matrix completion, complete dictionary learning, independent component analysis, and deep linear neural networks (Srebro and Jaakkola, 2004; Sun et al., 2015; Ge et al., 2015; Sun et al., 2016; Li et al., 2016; Ge et al., 2016; Chen et al., 2017).\n\nThese results further motivate many followup works. For example, Allen-Zhu (2017) improves the iteration complexity of SGD from $\\tilde{O}(\\epsilon^{-4})$ in Ge et al. (2015) to $\\tilde{O}(\\epsilon^{-3.25})$ for general unconstrained functions, where $\\epsilon$ is a pre-specified optimization accuracy; Jain et al. (2016); Allen-Zhu and Li (2016) show that the iteration complexity of SGD for solving the eigenvalue problem is $\\tilde{O}(\\epsilon^{-1})$. Despite this progress, we still lack systematic approaches for analyzing the algorithmic behavior of SGD. Moreover, these results focus on convergence properties and cannot precisely capture the uncertainty of SGD algorithms (e.g., how to escape from saddle points), which makes the theoretical analysis less intuitive.\n\nBesides nonconvexity, data dependency is another important challenge arising in stochastic optimization for machine learning, since the samples $z_k$\u2019s are often collected with a temporal pattern. For many applications (e.g., time series analysis), this may involve certain dependency. Taking generalized vector autoregressive (GVAR) data as an example, our observed $z_{k+1} \\in \\mathbb{R}^m$ is generated by $z^i_{k+1} | z_k \\sim p(a_i^\\top z_k)$, where $a_i$\u2019s are unknown coefficient vectors, $z^i_{k+1}$ is the $i$-th component of $z_{k+1}$, $p(\\cdot)$ denotes the density of the exponential family, and $a_i^\\top z_k$ is the natural parameter. There is only limited literature on convex stochastic optimization for dependent data. For example, Duchi et al. (2012a) investigate convex stochastic optimization algorithms for ergodic underlying data generating processes; Homem-de-Mello (2008) investigates convex stochastic optimization algorithms for dependent but identically distributed data. For nonconvex optimization problems in machine learning, however, how to address such dependency is still quite open.\n\nThis paper proposes to attack stochastic nonconvex optimization problems for dependent data by investigating a simple but fundamental problem in machine learning: streaming PCA for stationary time series. PCA is well known as a powerful tool for dimensionality reduction, and is widely applied to data visualization and representation learning. Specifically, we solve the following nonconvex problem,\n\n$$U^* \\in \\operatorname*{argmax}_{U \\in \\mathbb{R}^{m \\times r}} \\operatorname{Trace}(U^\\top \\Sigma U) \\quad \\text{subject to } U^\\top U = I_r, \\qquad (1.3)$$\n\nwhere $\\Sigma$ is the covariance matrix of our interest. This is also known as an eigenvalue problem. The column span of the optimal solution $U^*$ equals the subspace spanned by the eigenvectors corresponding to the first $r$ largest eigenvalues of $\\Sigma$. Existing literature usually assumes that at the $k$-th iteration, we observe a random vector $z_k$ independently sampled from some distribution $\\mathcal{D}$ with $\\mathbb{E}[z_k] = 0$ and $\\mathbb{E}[z_k z_k^\\top] = \\Sigma$. Our setting, however, assumes that $z_k$ is sampled from some time series with a stationary distribution $\\pi$ satisfying $\\mathbb{E}_\\pi[z_k] = 0$ and $\\mathbb{E}_\\pi[z_k z_k^\\top] = \\Sigma$. There are two key computational challenges in such a streaming PCA problem:\n\u2022 For time series, it is difficult to get unbiased estimators of the covariance matrix of the stationary distribution because of the data dependency. Taking GVAR as an example, the marginal distribution of $z_k$ is different from the stationary distribution. 
As a result, the stochastic gradient at the $k$-th iteration is biased, i.e., $\\mathbb{E}[z_k z_k^\\top U_k | U_k] \\neq \\Sigma U_k$;\n\u2022 The optimization problem in (1.3) is nonconvex, and its solution space is rotationally invariant. Given any orthogonal matrix $Q \\in \\mathbb{R}^{r \\times r}$ and any feasible solution $U$, the product $UQ$ is also a feasible solution and gives the same column span as $U$. When $r > 1$, this fact leads to degeneracy in the optimization landscape such that equivalent saddle points and optima are non-isolated. The algorithmic behavior under such degeneracy is still quite open for SGD.\n\nTo address the first challenge, we propose a variant of Oja\u2019s algorithm to handle data dependency. Specifically, inspired by Duchi et al. (2012a), we use downsampling to generate weakly dependent samples. Theoretically, we show that the downsampled data points yield a sequence of stochastic approximations of the covariance matrix of the stationary distribution with controllably small bias. Moreover, the block size for downsampling depends only logarithmically on the optimization accuracy, and is therefore nearly constant (see more details in Sections 2 and 3).\n\nTo attack nonconvexity and the degeneracy of the solution space, we establish a new convergence analysis based on the principal angle between $U_k$ and the eigenspace of $\\Sigma$. By applying diffusion approximations, we show that the solution trajectory weakly converges to the solution of a stochastic differential equation (SDE), which enables us to quantify the uncertainty of the proposed algorithm (see more details in Sections 3 and 5). Investigating the analytical solution of the SDE allows us to characterize the algorithmic behavior of SGD in three different scenarios: escaping from saddle points, traversing between stationary points, and converging to global optima. 
We prove that the stochastic algorithm asymptotically converges and achieves nearly optimal asymptotic sample complexity.\n\nThere are several closely related works. Chen et al. (2017) study the streaming PCA problem for $r = 1$, also based on diffusion approximations. However, $r = 1$ makes problem (1.3) admit an isolated optimal solution, unique up to sign change. For $r > 1$, the global optima are nonisolated due to the rotational invariance property. Thus, the analysis is more involved and challenging. Moreover, Jain et al. (2016); Allen-Zhu and Li (2016) provide nonasymptotic analysis for Oja\u2019s algorithm for streaming PCA. Their techniques are quite different from ours. Their nonasymptotic results, though more rigorous in describing discrete algorithms, lack intuition and can only be applied to Oja\u2019s algorithm with no data dependency. In contrast, our analysis handles data dependency and provides a detailed explanation of the asymptotic algorithmic behavior.\n\nNotations: Given a vector $v = (v_1, \\dots, v_m)^\\top \\in \\mathbb{R}^m$, we define the Euclidean norm $\\|v\\|_2^2 = v^\\top v$. Given a matrix $A \\in \\mathbb{R}^{m \\times n}$, we define the spectral norm $\\|A\\|_2$ as the largest singular value of $A$ and the Frobenius norm $\\|A\\|_F^2 = \\operatorname{Trace}(A A^\\top)$. We also define $\\sigma_r(A)$ as the $r$-th largest singular value of $A$. For a diagonal matrix $\\Theta \\in \\mathbb{R}^{m \\times m}$, we define $\\sin \\Theta = \\operatorname{diag}(\\sin(\\Theta_{11}), \\dots, \\sin(\\Theta_{mm}))$ and $\\cos \\Theta = \\operatorname{diag}(\\cos(\\Theta_{11}), \\dots, \\cos(\\Theta_{mm}))$. We denote the canonical basis of $\\mathbb{R}^m$ by $e_i$ for $i = 1, \\dots, m$ with the $i$-th element being 1, and the canonical basis of $\\mathbb{R}^r$ by $e'_j$ for $j = 1, \\dots, r$.\n\n2 Downsampled Oja\u2019s Algorithm\n\nWe first explain how to construct a nearly unbiased covariance estimator for the stationary distribution, which is crucial for our proposed algorithm. 
Before proceeding, we briefly review geometric ergodicity for time series, which characterizes the mixing time of a Markov chain.\n\nDefinition 2.1 (Geometric Ergodicity and Total Variation Distance). A Markov chain with state space $S$ and stationary distribution $\\pi$ on $(S, \\mathcal{F})$, with $\\mathcal{F}$ being a $\\sigma$-algebra on $S$, is geometrically ergodic if it is positive recurrent and there exists an absolute constant $\\rho \\in (0, 1)$ such that the total variation distance satisfies\n\n$$D_{TV}(p^n(x, \\cdot), \\pi(\\cdot)) = \\sup_{A \\in \\mathcal{F}} |p^n(x, A) - \\pi(A)| = O(\\rho^n) \\quad \\text{for all } x \\in S,$$\n\nwhere $p^n(\\cdot, \\cdot)$ is the $n$-step transition kernel.$^1$\n\nNote that $\\rho$ is independent of $n$ and only depends on the underlying transition kernel of the Markov chain. Geometric ergodicity is equivalent to saying that the chain is $\\beta$-mixing with an exponentially decaying coefficient (Bradley et al., 2005).\n\nAs mentioned above, one key challenge of solving the streaming PCA problem for time series is that it is difficult to get unbiased estimators of the covariance matrix $\\Sigma$ of the stationary distribution. However, when the time series is geometrically ergodic, the transition probability $p^h(z_k, z_{k+h})$ converges exponentially fast to the stationary distribution. This allows us to construct a nearly unbiased estimator of $\\Sigma$ as shown in the next lemma.\n\n$^1$ The formal definitions of positive recurrent and transition kernel can be found in Durrett (2010) Chapter 6. In short, a positive recurrent Markov chain visits each state in a finite time almost surely, and transition kernel is a generalization of transition probability to continuous state spaces.\n\nLemma 2.2. Let $\\{z_k\\}_{k=1}^\\infty$ be a geometrically ergodic Markov chain with parameter $\\rho$, and assume $z_k$ is Sub-Gaussian. 
Given a pre-specified accuracy $\\tau$, there exists $h = O(\\kappa_\\rho \\log \\frac{1}{\\tau})$ such that\n\n$$\\mathbb{E}\\left[ (z_{2h+k} - z_{h+k})(z_{2h+k} - z_{h+k})^\\top / 2 \\,\\middle|\\, z_k \\right] = \\Sigma + E\\Sigma$$\n\nwith $\\|E\\|_2 \\le \\tau$, where $\\kappa_\\rho$ is a constant depending on $\\rho$ and $\\Sigma$ is the covariance matrix of $z_k$ under the stationary distribution.\n\nLemma 2.2 shows that as $h$ increases, the bias decreases to zero. This suggests that we can use the downsampling method to reduce the bias of the stochastic gradient. Specifically, we divide the data points into blocks of length $2h$, i.e., $z_1, z_2, \\dots, z_{2h}$; $z_{2h+1}, \\dots, z_{4h}$; $\\dots$; $z_{2(b-1)h+1}, \\dots, z_{2bh}$. For the $s$-th block, we use data points $z_{(2s-1)h}$ and $z_{2sh}$ to approximate $\\Sigma$ by $X_s = \\frac{1}{2}(z_{2sh} - z_{(2s-1)h})(z_{2sh} - z_{(2s-1)h})^\\top$. Later we will show that the block size $h$ only logarithmically depends on the optimization accuracy. Thus, the downsampling is affordable. Moreover, if the stationary distribution has zero mean, we only need the block size to be $h$ and $X_s = z_{sh} z_{sh}^\\top$.\n\nMany time series models in machine learning are geometrically ergodic. We discuss a few examples.\n\nExample 2.3. The vector autoregressive (VAR) model follows the update $z_{k+1} = A z_k + \\epsilon_k$, where $\\epsilon_k$\u2019s are i.i.d. Sub-Gaussian random vectors with $\\mathbb{E}[\\epsilon_k] = 0$ and $\\mathbb{E}[\\epsilon_k \\epsilon_k^\\top] = \\Gamma$, and $A$ is the coefficient matrix. When $\\rho = \\|A\\|_2 < 1$, the model is stationary and geometrically ergodic (Tj\u00f8stheim, 1990). Moreover, the mean of its stationary distribution is 0.\n\nExample 2.4. Recall that the GVAR model follows $z^i_{k+1} | z_k \\sim p(a_i^\\top z_k)$, where $z^i_{k+1}$\u2019s are independent conditioning on $z_k$. The density function is $p(x | \\theta) = h(x) \\exp(T(x)\\theta - B(\\theta))$, where $T(x)$ is a statistic, and $B(\\theta)$ is the log partition function. 
GVAR is stationary and geometrically ergodic under certain regularity conditions (Hall et al., 2016).\n\nAs an illustrative example, we show that for Gaussian VAR with $\\rho = \\|A\\|_2 < 1$ and $\\Gamma = I$, the bias of the covariance estimator can be controlled by choosing $h = O(\\frac{1}{1-\\rho} \\log \\frac{1}{\\tau})$. The covariance matrix of the stationary distribution is $\\Sigma = \\sum_{i=0}^\\infty A^i (A^\\top)^i$. One can check\n\n$$\\mathbb{E}[z_{h+k} z_{h+k}^\\top | z_k] - \\Sigma = A^h z_k z_k^\\top (A^\\top)^h - \\sum_{i=h}^\\infty A^i (A^\\top)^i.$$\n\nHere the spectrum of $A$ acts as the geometrically decaying factor for both terms on the right hand side, since both terms are of the order $O(\\rho^{2h})$. As a result, the bias of $\\mathbb{E}[z_{h+k} z_{h+k}^\\top | z_k]$ decays to zero exponentially fast. We pick $h = O(\\frac{1}{1-\\rho} \\log \\frac{1}{\\tau})$, and obtain $\\mathbb{E}[z_{k+h} z_{k+h}^\\top | z_k] = \\Sigma + E\\Sigma$ with $\\|E\\|_2 \\le \\tau$.\n\nWe then propose a variant of Oja\u2019s algorithm combined with our downsampling technique, as summarized in Algorithm 1. For simplicity, we assume the stationary distribution has mean zero. The projection $\\Pi_{\\mathrm{Orth}}(U)$ denotes the orthogonalization operator that acts on the columns of $U$. Specifically, for $U \\in \\mathbb{R}^{m \\times r}$, $\\Pi_{\\mathrm{Orth}}(U)$ returns a matrix $U' \\in \\mathbb{R}^{m \\times r}$ that has orthonormal columns. Typical examples of such operators include the Gram-Schmidt method and the Householder transformation. The step,\n\n$$U_{s+1} = \\Pi_{\\mathrm{Orth}}(U_s + \\eta X_s U_s),$$\n\nis essentially the original Oja\u2019s update. Our variant operates on the data points by downsampling such that $X_s$ is nearly unbiased. We emphasize that $s$ denotes the number of iterations, and $k$ denotes the number of samples.\n\nAlgorithm 1 Downsampled Oja\u2019s Algorithm\nInput: data points $z_k$, block size $h$, step size $\\eta$\nInitialize $U_1$ with orthonormal columns.\nSet $s \\leftarrow 1$\nrepeat\n    Take sample $z_{sh}$, and set $X_s \\leftarrow z_{sh} z_{sh}^\\top$\n    $U_{s+1} \\leftarrow \\Pi_{\\mathrm{Orth}}(U_s + \\eta X_s U_s)$\n    $s \\leftarrow s + 1$\nuntil Convergence\nOutput: $U_s$\n\n3 Theory\n\nWe exploit diffusion approximations to characterize the convergence of downsampled SGD in 3 stages. Specifically, we use an ODE (Theorem 3.4) to analyze the global convergence and SDEs (Theorems 3.5 and 3.8) to capture the local dynamics around saddle points and global optima. By the weak convergence of the discrete algorithm trajectory to the ODE and SDE, we show that downsampled SGD achieves a nearly optimal asymptotic sample complexity (Corollary 3.10). Before proceeding, we impose some model assumptions on the problem.\n\nAssumption 3.1. There exists an eigengap in the covariance matrix $\\Sigma$ of the stationary distribution, i.e., $\\lambda_1 \\ge \\cdots \\ge \\lambda_r > \\lambda_{r+1} \\ge \\cdots \\ge \\lambda_m > 0$, where $\\lambda_i$ is the $i$-th eigenvalue of $\\Sigma$.\n\nAssumption 3.2. Data points $\\{z_k\\}_{k \\ge 1}$ are generated from a geometrically ergodic time series with parameter $\\rho$, and the stationary distribution has mean zero. Each $z_k$ is Sub-Gaussian, and the block size is chosen as $h = O(\\kappa_\\rho \\log(1/\\eta))$ for downsampling.\n\nThe eigengap in Assumption 3.1 implies that the optimal solution is identifiable. Specifically, the optimal solution $U^*$ is unique up to rotation. The positive definite assumption on $\\Sigma$ is for theoretical simplicity. Assumption 3.2 implies that each $z_k$ has bounded moments of any order.\n\nWe also briefly explain the optimization landscape of streaming PCA problems as follows. Specifically, we consider the eigenvalue decomposition $\\Sigma = R \\Lambda R^\\top$ with $\\Lambda = \\operatorname{diag}(\\lambda_1, \\lambda_2, \\dots, \\lambda_m)$. Recall that $e_i$ is the canonical basis of $\\mathbb{R}^m$. We distinguish stationary points $U$ of streaming PCA problems:\n\u2022 $U$ is a global optimum, if the column span of $R^\\top U$ equals the subspace spanned by {e1, . . . 
, er};\n\u2022 $U$ is a saddle point or a global minimum, if the column span of $R^\\top U$ equals the subspace spanned by $\\{e_{a_1}, \\dots, e_{a_r}\\}$, where $\\mathcal{A}_r = \\{a_1, \\dots, a_r\\} \\subset \\{1, \\dots, m\\}$ and $\\mathcal{A}_r \\neq \\{1, \\dots, r\\}$.\n\nFor convenience, if the column span of $R^\\top U$ coincides with the span of $\\{e_{a_1}, \\dots, e_{a_r}\\}$, we say that $U$ is a stationary point corresponding to the set $\\mathcal{A}_r = \\{a_1, \\dots, a_r\\}$.\n\nTo handle the rotational invariance of the solution space, we use the principal angle to characterize the distance between the column spans of $U^*$ and $U_s$. The notation is as follows. Given two matrices $U \\in \\mathbb{R}^{m \\times r_1}$ and $V \\in \\mathbb{R}^{m \\times r_2}$ with orthonormal columns, where $1 \\le r_1 \\le r_2 \\le m$, the principal angle between these two matrices is\n\n$$\\Theta(U, V) = \\operatorname{diag}\\left( \\arccos \\sigma_1(U^\\top V), \\dots, \\arccos \\sigma_{r_1}(U^\\top V) \\right).$$\n\nWe show the consequence of using the principal angle as follows. Specifically, any optimal solution $U^*$ satisfies $\\|\\sin \\Theta(R_r, U^*)\\|_F^2 = \\|\\cos \\Theta(\\bar{R}_r, U^*)\\|_F^2 = 0$, where $R_r$ denotes the first $r$ columns of $R$, and $\\bar{R}_r$ denotes the last $m - r$ columns of $R$. This essentially implies that the column span of $U^*$ is orthogonal to that of $\\bar{R}_r$. Thus, to prove the convergence of SGD, we only need to show $\\|\\cos \\Theta(\\bar{R}_r, U_s)\\|_F^2 \\to 0$. By the rotational invariance of the principal angle, we obtain $\\Theta(\\bar{R}_r, U_s) = \\Theta(R^\\top \\bar{R}_r, R^\\top U_s) = \\Theta(\\bar{E}_r, R^\\top U_s)$, where $\\bar{E}_r = [e_{r+1}, \\dots, e_m]$. For notational simplicity, we denote $\\bar{U}_s = R^\\top U_s$. Then the convergence of the algorithm is equivalent to $\\|\\cos \\Theta(\\bar{E}_r, \\bar{U}_s)\\|_F^2 \\to 0$. We need such an orthogonal transformation, because $\\|\\cos \\Theta(\\bar{E}_r, \\bar{U}_s)\\|_F^2$ can be expressed as\n\n$$\\|\\cos \\Theta(\\bar{E}_r, \\bar{U}_s)\\|_F^2 = \\sum_{i=r+1}^m \\|e_i^\\top \\bar{U}_s\\|_2^2 = \\sum_{i=r+1}^m \\gamma_{i,s}^2 \\quad \\text{with } \\gamma_{i,s}^2 = \\|e_i^\\top \\bar{U}_s\\|_2^2.$$\n\n3.1 Global Convergence by ODE\n\nSince the sequence $\\{z_{sh}, \\bar{U}_s\\}_{s=1}^\\infty$ forms a discrete Markov process, we can apply diffusion approximations to establish global convergence of SGD. Specifically, by a continuous time interpolation, we construct continuous time processes $U^\\eta(t)$ and $X^\\eta(t)$ such that $U^\\eta(t) = U_{\\lfloor t/\\eta \\rfloor + 1}$ and $X^\\eta(t) = X_{\\lfloor t/\\eta \\rfloor + 1}$. The subscript $\\lfloor t/\\eta \\rfloor + 1$ denotes the number of iterations, and the superscript $\\eta$ highlights the dependence on $\\eta$. We denote $\\bar{U}^\\eta(t) = R^\\top U^\\eta(t)$ and $\\bar{X}^\\eta(t) = R^\\top X^\\eta(t) R$. The continuous time version of $\\gamma_{i,s}^2$ is written as $\\gamma_{i,\\eta}^2(t) = \\|e_i^\\top \\bar{U}^\\eta(t)\\|_2^2$. It is difficult to directly characterize the global convergence of $\\gamma_{i,\\eta}^2(t)$. Thus, we introduce an upper bound of $\\gamma_{i,\\eta}^2(t)$ as follows.\n\nLemma 3.3. Let $E_r = [e_1, \\dots, e_r] \\in \\mathbb{R}^{m \\times r}$. Suppose $\\bar{U}^\\eta(t)$ has orthonormal columns and $E_r^\\top \\bar{U}^\\eta(t)$ is invertible. We have\n\n$$\\tilde{\\gamma}_{i,\\eta}^2(t) = \\left\\| e_i^\\top \\bar{U}^\\eta(t) \\left( E_r^\\top \\bar{U}^\\eta(t) \\right)^{-1} \\right\\|_2^2 \\ge \\gamma_{i,\\eta}^2(t). \\qquad (3.1)$$\n\nThe detailed proof is provided in Appendix B.1. We show that $\\tilde{\\gamma}_{i,\\eta}^2(t)$ converges in the following theorem.\n\nTheorem 3.4. As $\\eta \\to 0$, the process $\\tilde{\\gamma}_{i,\\eta}^2(t)$ weakly converges to the solution of the ODE\n\n$$d\\tilde{\\gamma}_i^2 = b_i \\tilde{\\gamma}_i^2 \\, dt \\quad \\text{with } b_i \\le 2(\\lambda_i - \\lambda_r), \\qquad (3.2)$$\n\nwhere $\\tilde{\\gamma}_i^2(0) = \\| e_i^\\top \\bar{U}(0) (E_r^\\top \\bar{U}(0))^{-1} \\|_2^2$, and $\\bar{U}(0)$ has orthonormal columns.\n\nThe detailed proof is provided in Appendix B.2. The analytical solution to (3.2) is $\\tilde{\\gamma}_i^2(t) = \\tilde{\\gamma}_i^2(0) e^{b_i t}$ with $b_i \\le 2(\\lambda_{r+1} - \\lambda_r) < 0$ for any $i \\in \\{r+1, \\dots, m\\}$. Note that we need $E_r^\\top \\bar{U}(0)$ to be invertible to derive the upper bound (3.1). Under this condition, $\\tilde{\\gamma}_i^2(t)$ converges to zero. However, when $E_r^\\top \\bar{U}(0)$ is not invertible, the algorithm starts at a saddle point, and (3.2) no longer applies. As can be seen, the ODE characterization is insufficient to capture the local dynamics (e.g., around saddle points or global optima) of the algorithm.\n\n3.2 Local Dynamics by SDE\n\nThe deterministic ODE characterizes the average behavior of the solution trajectory. To capture the uncertainty of the local algorithmic behavior, we need to rescale the influence of the noise to bring the randomness back, which leads us to a stochastic differential equation (SDE) approximation.\n\n\u2022 Stage 1: Escape from Saddle Points. Recall that $\\Lambda = \\operatorname{diag}(\\lambda_1, \\dots, \\lambda_m)$ collects all the eigenvalues of $\\Sigma$. We consider the eigenvalue decomposition $\\bar{U}^\\top(0) \\Lambda \\bar{U}(0) = Q^\\top \\tilde{\\Lambda} Q$, where $Q \\in \\mathbb{R}^{r \\times r}$ is orthogonal and $\\tilde{\\Lambda} = \\operatorname{diag}(\\tilde{\\lambda}_1, \\dots, \\tilde{\\lambda}_r)$. Again, by a continuous time interpolation, we denote $\\zeta_{ij,\\eta}(t) = \\eta^{-1/2} e_j'^\\top Q [\\bar{U}^\\eta(t)]^\\top e_i$, where $e'_j$ is the canonical basis in $\\mathbb{R}^r$. Then we decompose the principal angle $\\gamma_{i,\\eta}^2(t)$ as $\\gamma_{i,\\eta}^2(t) = \\eta \\sum_{j=1}^r \\zeta_{ij,\\eta}^2(t)$. Recall that $U(0)$ is a saddle point, if the column span of $\\bar{U}(0)$ equals the subspace spanned by $\\{e_{a_1}, \\dots, e_{a_r}\\}$ with $\\mathcal{A}_r = \\{a_1, \\dots, a_r\\} \\neq \\{1, \\dots, r\\}$. Therefore, if the algorithm starts around a saddle point, there exists some $i \\in \\{1, \\dots, r\\}$ such that $\\gamma_{i,\\eta}^2(0) \\approx 0$ and $\\gamma_{a,\\eta}^2(0) \\approx 1$ for $a \\in \\mathcal{A}_r$. The asymptotic behavior of $\\gamma_{i,\\eta}^2(t)$ around a saddle point is captured in the following theorem.\n\nTheorem 3.5. Suppose $U(0)$ is initialized around a saddle point corresponding to $\\mathcal{A}_r$. As $\\eta \\to 0$, conditioning on the event $\\gamma_{i,\\eta}^2(t) = O(\\eta)$ for some $i \\in \\{1, \\dots, r\\}$, $\\zeta_{ij,\\eta}(t)$ weakly converges to the solution of the following stochastic differential equation\n\n$$d\\zeta_{ij} = K_{ij} \\zeta_{ij} \\, dt + G_{ij} \\, dB_t \\quad \\text{with } K_{ij} \\in [\\lambda_i - \\lambda_1, \\lambda_i - \\lambda_{a_r}] \\text{ and } G_{ij}^2 < \\infty, \\qquad (3.3)$$\n\nwhere $B_t$ is a standard Brownian motion, and $a_r$ is the largest element in $\\mathcal{A}_r$.\n\nThe detailed proof is provided in Appendix B.3. We remark that the event $\\gamma_{i,\\eta}^2(t) = O(\\eta)$ is only a technical assumption. This does not cause any issue, since when $\\eta^{-1} \\gamma_{i,\\eta}^2(t)$ is large, the algorithm has escaped from the saddle point. Note that (3.3) admits the analytical solution\n\n$$\\zeta_{ij}(t) = \\zeta_{ij}(0) e^{K_{ij} t} + G_{ij} \\int_0^t e^{-K_{ij}(s - t)} \\, dB(s), \\qquad (3.4)$$\n\nwhich is known as an O-U process. We give the following implications on different values of $K_{ij}$:\n\n(a). When $K_{ij} > 0$, rewrite (3.4) as $\\zeta_{ij}(t) = \\left[ \\zeta_{ij}(0) + G_{ij} \\int_0^t e^{-K_{ij} s} \\, dB(s) \\right] e^{K_{ij} t}$. The exponential term $e^{K_{ij} t}$ is dominant and increases to positive infinity as $t \\to \\infty$. The remaining part on the right hand side, in contrast, is a process with mean $\\zeta_{ij}(0)$ and variance bounded by $G_{ij}^2 / (2K_{ij})$. Hence, $e^{K_{ij} t}$ acts as a driving force to increase $\\zeta_{ij}(t)$ exponentially fast so that $\\zeta_{ij}(t)$ quickly gets away from 0;\n(b). When $K_{ij} < 0$, the mean of $\\zeta_{ij}(t)$ is $\\zeta_{ij}(0) e^{K_{ij} t}$. The initial condition restricts $\\zeta_{ij}(0)$ to be small. Thus as $t$ increases, the mean of $\\zeta_{ij}(t)$ converges to zero, and the drift term vanishes quickly. The variance of $\\zeta_{ij}(t)$ is bounded by $G_{ij}^2 / (-2K_{ij})$. Hence, $\\zeta_{ij}(t)$ roughly oscillates around 0;\n(c). When $K_{ij} = 0$, the drift term is approximately zero, meaning that $\\zeta_{ij}(t)$ also oscillates around 0.\n\nWe provide an example showing how the algorithm escapes from a saddle point. Suppose that the algorithm starts at the saddle point corresponding to $\\mathcal{A}_r = \\{1, \\dots, q-1, q+1, \\dots, r, p\\}$. Consider the principal angle $\\gamma_{q,\\eta}^2(t)$. By implication (a), we have $K_{qr} = \\lambda_q - \\lambda_p > 0$. Hence $\\zeta_{qr,\\eta}(t)$ increases quickly away from zero. Thus, $\\gamma_{q,\\eta}^2(t) \\ge \\eta \\zeta_{qr,\\eta}^2(t)$ also increases quickly, which drives the algorithm away from the saddle point. Meanwhile, by (b) and (c), $\\gamma_{i,\\eta}^2(t)$ stays at 1 for $i < q$ because of the vanishing drift. The algorithm tends to escape from the saddle point through reducing $\\gamma_{p,\\eta}^2(t)$, since this yields the largest eigengap, $\\lambda_q - \\lambda_p$. When we have $q = r$ and $p = r + 1$, the eigengap is minimal. Thus, it is the worst situation for the algorithm to escape from a saddle point. We give the following proposition characterizing the time for the algorithm to escape from a saddle point.\n\nProposition 3.6. Suppose that the algorithm starts around the saddle point corresponding to $\\mathcal{A}_r = \\{1, \\dots, r-1, r+1\\}$. Given a pre-specified $\\nu$ and $\\delta = O(\\eta^{1/2})$ for a sufficiently small $\\eta$, we need\n\n$$T_1 \\asymp \\frac{1}{\\lambda_r - \\lambda_{r+1}} \\log(K + 1) \\quad \\text{with } K = \\frac{2(\\lambda_r - \\lambda_{r+1}) \\eta^{-1} \\delta^2}{\\left[ \\Phi^{-1}\\left( \\frac{1 - \\nu/2}{2} \\right) \\right]^2 G_{rr}^2},$$\n\nsuch that $\\mathbb{P}\\left( \\gamma_{r,\\eta}^2(T_1) \\ge \\delta^2 \\right) \\ge 1 - \\nu$, where $\\Phi$ is the CDF of the standard Gaussian distribution.\n\nThe detailed proof is provided in Appendix B.4. This implies that, asymptotically, we need\n\n$$S_1 \\asymp \\frac{T_1}{\\eta} \\asymp \\frac{\\log(K + 1)}{\\eta(\\lambda_r - \\lambda_{r+1})}$$\n\niterations to escape from a saddle point, and the algorithm enters the second stage.\n\n\u2022 Stage 2: Traverse between Stationary Points. After the algorithm escapes from the saddle point, the gradient is dominant, and the noise is negligible. Thus, the algorithm behaves like an almost deterministic traverse between stationary points, which can be viewed as a two-step discretization of the ODE with an error of the order $O(\\eta)$ (Griffiths and Higham, 2010). Hence, we focus on $\\gamma_{i,\\eta}^2(t)$ to characterize the algorithmic behavior in this stage. Recall that we assume $\\mathcal{A}_r = \\{1, \\dots, r-1, r+1\\}$. When the algorithm escapes from the saddle point, we have $\\gamma_{r,\\eta}^2(T_1) \\ge \\delta^2$, which implies $\\sum_{i=r+1}^m \\gamma_{i,\\eta}^2(T_1) \\le 1 - \\delta^2$. The following proposition assumes that the algorithm starts at this initial condition.\n\nProposition 3.7. Restarting the counter of time, for a sufficiently small $\\eta$ and $\\delta = O(\\eta^{1/2})$, we need\n\n$$T_2 \\asymp \\frac{1}{\\lambda_r - \\lambda_{r+1}} \\log \\frac{1}{\\delta^2}$$\n\nsuch that $\\mathbb{P}\\left( \\sum_{i=r+1}^m \\gamma_{i,\\eta}^2(T_2) \\le \\delta^2 \\right) \\ge \\frac{3}{4}$.\n\nThe detailed proof is provided in Appendix B.5. 
This implies that, asymptotically, we need\n\n$$S_2 \\asymp \\frac{T_2}{\\eta} \\asymp \\frac{1}{\\eta(\\lambda_r - \\lambda_{r+1})} \\log \\frac{1}{\\delta^2}$$\n\niterations to reach the neighborhood of the global optima.\n\n\u2022 Stage 3: Converge to Global Optima. Similar to stage 1, we focus on $\\zeta_{ij,\\eta}(t)$ to characterize the dynamics of the algorithm around the global optima using an SDE approximation.\n\nTheorem 3.8. Suppose $U(0)$ is initialized around the global optima with $\\sum_{i=r+1}^m \\gamma_{i,\\eta}^2(0) = O(\\eta)$. Then as $\\eta \\to 0$, for $i = r+1, \\dots, m$ and $j = 1, \\dots, r$, $\\zeta_{ij,\\eta}(t)$ weakly converges to the solution of the following SDE\n\n$$d\\zeta_{ij} = K_{ij} \\zeta_{ij} \\, dt + G_{ij} \\, dB_t \\quad \\text{with } K_{ij} \\in [\\lambda_i - \\lambda_1, \\lambda_i - \\lambda_r] \\text{ and } G_{ij}^2 < \\infty, \\qquad (3.5)$$\n\nwhere $B_t$ is a standard Brownian motion.\n\nThe detailed proof is provided in Appendix B.6. The analytical solution of (3.5) is\n\n$$\\zeta_{ij}(t) = \\zeta_{ij}(0) e^{K_{ij} t} + G_{ij} \\int_0^t e^{-K_{ij}(s - t)} \\, dB(s).$$\n\nWe then establish the following proposition.\n\nProposition 3.9. For a sufficiently small $\\epsilon$ and $\\eta$, $\\delta = O(\\eta^{1/2})$, restarting the counter of time, we need\n\n$$T_3 \\asymp \\frac{1}{\\lambda_r - \\lambda_{r+1}} \\log K' \\quad \\text{with } K' = \\frac{8(\\lambda_r - \\lambda_{r+1}) \\delta^2}{(\\lambda_r - \\lambda_{r+1}) \\epsilon - 4 \\eta r G_m},$$\n\nsuch that we have $\\mathbb{P}\\left( \\sum_{i=r+1}^m \\gamma_{i,\\eta}^2(T_3) \\le \\epsilon \\right) \\ge \\frac{3}{4}$, where $G_m = \\max_{1 \\le j \\le r} \\sum_{i=r+1}^m G_{ij}^2$.\n\nThe detailed proof is provided in Appendix B.7. The subscript $m$ in $G_m$ highlights its dependence on the dimension $m$. Proposition 3.9 implies that, asymptotically, we need\n\n$$S_3 \\asymp \\frac{T_3}{\\eta} \\asymp \\frac{\\log K'}{\\eta(\\lambda_r - \\lambda_{r+1})}$$\n\niterations to converge to an $\\epsilon$-optimal solution in the third stage. Combining all the results in three stages, we know that after $T_1 + T_2 + T_3$ time, the algorithm converges to an $\\epsilon$-optimal solution asymptotically. This further leads us to a more refined result in the following corollary.\n\nCorollary 3.10. 
For a sufficiently small $\\epsilon$, we choose\n\n$$\\eta \\asymp \\frac{(\\lambda_r - \\lambda_{r+1}) \\epsilon}{5 r G_m}.$$\n\nSuppose we start the algorithm near a saddle point, then we need $T = T_1 + T_2 + T_3$ such that\n\n$$\\mathbb{P}\\left( \\left\\| \\cos \\Theta\\left( \\bar{E}_r, \\bar{U}^\\eta(T) \\right) \\right\\|_F^2 \\le \\epsilon \\right) \\ge \\frac{3}{4}.$$\n\nThe detailed proof is provided in Appendix B.8. Recall that we choose the block size $h$ of downsampling to be $h = O(\\kappa_\\rho \\log \\frac{1}{\\eta})$. Thus, the asymptotic sample complexity satisfies\n\n$$N \\asymp \\frac{T h}{\\eta} \\asymp \\frac{r G_m}{\\epsilon (\\lambda_r - \\lambda_{r+1})^2} \\log^2 \\frac{r G_m}{\\epsilon (\\lambda_r - \\lambda_{r+1})}.$$\n\nFrom the perspective of statistical recovery, the obtained estimator $\\hat{U}$ enjoys a near-optimal asymptotic rate of convergence\n\n$$\\| \\cos \\Theta(\\hat{U}, U^*) \\|_F^2 \\asymp \\frac{r G_m \\log N}{(\\lambda_r - \\lambda_{r+1})^2 N / \\kappa_\\rho},$$\n\nwhere $N$ is the number of data points.\n\n4 Numerical Experiments\n\nWe demonstrate the effectiveness of our proposed algorithm using both simulated and real datasets.\n\n\u2022 Simulated Data. We first verify our analysis of streaming PCA problems for time series using a simulated dataset. We choose a Gaussian VAR model with dimension $m = 16$. The random vectors 
The random vectors\n$\\epsilon_k$ are independently sampled from $\\mathcal{N}(0, S)$, where\n\n$$S = \\mathrm{diag}(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3).$$\n\nWe choose the coefficient matrix $A = V^\\top D V$, where $V \\in \\mathbb{R}^{16 \\times 16}$ is an orthogonal matrix that we\nrandomly generate, and $D = 0.1 D_0$ is a diagonal matrix satisfying\n\n$$D_0 = \\mathrm{diag}(0.68, 0.68, 0.69, 0.70, 0.70, 0.70, 0.72, 0.72, 0.72, 0.72, 0.72, 0.72, 0.80, 0.80, 0.85, 0.90).$$\n\nBy solving the discrete Lyapunov equation $\\Sigma = A \\Sigma A^\\top + S$, we calculate the covariance matrix of\nthe stationary distribution, which satisfies $\\Sigma = U^\\top \\Lambda U$, where $U \\in \\mathbb{R}^{16 \\times 16}$ is orthogonal and\n\n$$\\Lambda = \\mathrm{diag}(3.0175, 3.0170, 3.0160, 1.0077, 1.0070, 1.0061, 1.0058, 1.0052,\n1.0052, 1.0052, 1.0052, 1.0051, 1.0049, 1.0049, 1.0047, 1.0047).$$\n\nWe aim to find the leading principal components of $\\Sigma$ corresponding to the 3 largest eigenvalues.\nThus, the eigengap is $\\lambda_3 - \\lambda_4 = 2.0083$. We initialize the solution at the saddle point whose column\nspan is the subspace spanned by the eigenvectors corresponding to 3.0175, 3.0170 and 1.0077. The\nstep size is $\\eta = 3 \\times 10^{-5}$, and the algorithm runs with $8 \\times 10^5$ total samples. The trajectories of\nthe principal angle over 20 independent simulations with block size $h = 4$ are shown in Figure 1a.\nWe can clearly distinguish three different stages. Figures 1c and 1d illustrate that entries of principal\nangles, $\\zeta_{33}$ in stage 1 and $\\zeta_{42}$ in stage 3, are Ornstein-Uhlenbeck processes. Specifically, the estimated\ndistributions of $\\zeta_{33}$ and $\\zeta_{42}$ over 100 simulations follow Gaussian distributions. We can check that\nthe variance of $\\zeta_{33}$ increases in stage 1 as the number of iterations increases, while the variance of $\\zeta_{42}$ in stage 3\napproaches a fixed value. 
All these simulated results are consistent with our theoretical analysis.\n\n[Figure 1 panels: (a) Solution trajectories, with Stage 1, Stage 2, and Stage 3 marked; (b) Different block sizes; (c) Distribution of $\\zeta_{33}(t)$; (d) Distribution of $\\zeta_{42}(t)$]\n\nFigure 1: Illustrations of various algorithmic behaviors in simulated examples: (a) presents three\nstages of the algorithm; (b) compares the performance of different block sizes; (c) and (d) demonstrate\nthe Ornstein-Uhlenbeck processes of $\\zeta_{33}$ in stage 1 and $\\zeta_{42}$ in stage 3.\nWe further compare the performance of different block sizes of downsampling with step size annealing.\nWe keep using the Gaussian VAR model with $D = 0.9 D_0$ and\n\n$$S = \\mathrm{diag}(1.45, 1.45, 1.45, 1.45, 1.45, 1.45, 1.45, 1.45, 1.45, 1.45, 1.45, 1.45, 1.45, 1.455, 1.455, 1.455).$$\n\nThe eigengap is $\\lambda_3 - \\lambda_4 = 0.005$. We run the algorithm with $5 \\times 10^5$ samples and the chosen step\nsizes vary according to the number of samples $k$. Specifically, we set the step size $\\eta = \\eta_0 \\times \\frac{h}{4000}$ if\n$k < 2 \\times 10^4$, $\\eta = \\eta_0 \\times \\frac{h}{8000}$ if $k \\in [2 \\times 10^4, 5 \\times 10^4)$, $\\eta = \\eta_0 \\times \\frac{h}{48000}$ if $k \\in [5 \\times 10^4, 10 \\times 10^4)$,\nand $\\eta = \\eta_0 \\times \\frac{h}{120000}$ if $k \\ge 10 \\times 10^4$. We choose $\\eta_0$ in $\\{0.125, 0.25, 0.5, 1, 2\\}$ and report the final\nprincipal angles achieved by different block sizes $h$ in Table 1. Figure 1b presents the averaged\nprincipal angle over 5 simulations with $\\eta_0 = 0.5$. As can be seen, choosing $h = 4$ yields the\nbest performance. 
Specifically, the performance becomes better as $h$ increases from 1 to around 4.\nHowever, the performance becomes worse when $h = 16$ because of the lack of iterations.\n\nTable 1: The final principal angles achieved by different block sizes with varying $\\eta_0$.\n\nBlock size   $\\eta_0 = 0.125$   $\\eta_0 = 0.25$   $\\eta_0 = 0.5$   $\\eta_0 = 1$   $\\eta_0 = 2$\n$h = 1$      0.7775   0.3595   0.2320   0.2449   0.3773\n$h = 2$      0.7792   0.3569   0.2080   0.2477   0.2290\n$h = 4$      0.7892   0.3745   0.1130   0.3513   0.4730\n$h = 6$      0.7542   0.3655   0.1287   0.3317   0.3983\n$h = 8$      0.7982   0.3933   0.2828   0.3820   0.4102\n$h = 16$     0.7783   0.4324   0.3038   0.5647   0.6526\n\n\u2022 Real Data. We adopt the Air Quality dataset (De Vito et al., 2008), which contains 9358 instances\nof hourly averaged concentrations of 9 different gases in a heavily polluted area. We remove\nmeasurements with missing data. We aim to estimate the first 2 principal components of the series.\nWe randomly initialize the algorithm, and choose the block size of downsampling to be 1, 3, 5, 10,\nand 60. Figure 2 shows the projection of each data point onto the leading and the second principal\ncomponents. We also present the result of projecting data points onto the eigenspace of the sample\ncovariance matrix, indicated by Batch in Figure 2. All the projections have been rotated such that\nthe leading principal component is parallel to the horizontal axis. As can be seen, when $h = 1$, the\nprojection yields some distortion in the circled area. When $h = 3$ and $h = 5$, the projection results are\nquite similar to the Batch result. As $h$ increases, however, the projection displays obvious distortion\nagain compared to the Batch result. The concentrations of gases are naturally time dependent. Thus,\nwe deduce that the distortion for $h = 1$ comes from the data dependency, while for the case $h = 60$,\nthe distortion comes from the lack of updates. 
This phenomenon coincides with our simulated data\nexperiments.\n\n[Figure 2 panels: $h = 1$, $h = 5$, $h = 10$, $h = 30$, $h = 60$, Batch]\n\nFigure 2: Projections of air quality data onto the leading and the second principal components with\ndifferent block sizes of downsampling. We highlight the distortions for $h = 1$ and $h = 60$.\n\n5 Discussions\n\nWe remark that our analysis characterizes how our proposed algorithm escapes from the saddle point.\nThis is not analyzed in the related work, Allen-Zhu and Li (2016), since they use random initialization.\nNote that our analysis also applies to random initialization, in which case it directly starts with the second stage.\nOur analysis is inspired by diffusion approximations in existing applied probability literature (Glynn,\n1990; Freidlin and Wentzell, 1998; Kushner and Yin, 2003; Ethier and Kurtz, 2009), which aim to\ncapture the uncertainty of stochastic algorithms for general optimization problems. Without explicitly\nspecifying the problem structures, these analyses usually cannot lead to concrete convergence\nguarantees. In contrast, we dig into the optimization landscape of the streaming PCA problem.\nThis eventually allows us to precisely characterize the algorithmic dynamics and provide concrete\nconvergence guarantees, which further lead to a deeper understanding of the uncertainty in nonconvex\nstochastic optimization.\nThe block size $h$ of downsampled Oja\u2019s algorithm is based on the mixing property of the time series.\nWe believe estimating the mixing coefficient is an interesting problem. The procedure in Hsu et al.\n(2015) estimates the mixing time of Markov chains, which may be adapted to our time series\nsetting. Moreover, estimating the covariance matrix of the stationary distribution is also interesting\nbut challenging. We leave them for future investigation.\n\nReferences\nALLEN-ZHU, Z. (2017). Natasha 2: Faster non-convex optimization than sgd. arXiv preprint\n\narXiv:1708.08694 .\n\nALLEN-ZHU, Z. 
and LI, Y. (2016). First ef\ufb01cient convergence for streaming k-pca: a global,\n\ngap-free, and near-optimal rate. arXiv preprint arXiv:1607.07837 .\n\nBOTTOU, L. (1998). Online learning and stochastic approximations. On-line learning in neural\n\nnetworks 17 142.\n\nBRADLEY, R. C. ET AL. (2005). Basic properties of strong mixing conditions. a survey and some\n\nopen questions. Probability surveys 2 107\u2013144.\n\nCHEN, Z., YANG, L. F., LI, C. J. and ZHAO, T. (2017). Online partial least square optimization:\nDropping convexity for better ef\ufb01ciency and scalability. In International Conference on Machine\nLearning.\n\nCHUNG, K. L. (2004). On a stochastic approximation method. In Chance And Choice: Memorabilia.\n\nWorld Scienti\ufb01c, 79\u201399.\n\nDANG, C. D. and LAN, G. (2015). Stochastic block mirror descent methods for nonsmooth and\n\nstochastic optimization. SIAM Journal on Optimization 25 856\u2013881.\n\nDE VITO, S., MASSERA, E., PIGA, M., MARTINOTTO, L. and DI FRANCIA, G. (2008). On \ufb01eld\ncalibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario.\nSensors and Actuators B: Chemical 129 750\u2013757.\n\nDUCHI, J. C., AGARWAL, A., JOHANSSON, M. and JORDAN, M. I. (2012a). Ergodic mirror descent.\n\nSIAM Journal on Optimization 22 1549\u20131578.\n\nDUCHI, J. C., BARTLETT, P. L. and WAINWRIGHT, M. J. (2012b). Randomized smoothing for\n\nstochastic optimization. SIAM Journal on Optimization 22 674\u2013701.\n\nDURRETT, R. (2010). Probability: theory and examples. Cambridge university press.\nETHIER, S. N. and KURTZ, T. G. (2009). Markov processes: characterization and convergence, vol.\n\n282. John Wiley & Sons.\n\nFREIDLIN, M. I. and WENTZELL, A. D. (1998). Random perturbations. In Random Perturbations\n\nof Dynamical Systems. Springer, 15\u201343.\n\nGE, R., HUANG, F., JIN, C. and YUAN, Y. (2015). Escaping from saddle points\u2014online stochastic\n\ngradient for tensor decomposition. 
In Conference on Learning Theory.\n\nGE, R., LEE, J. D. and MA, T. (2016). Matrix completion has no spurious local minimum. In\n\nAdvances in Neural Information Processing Systems.\n\nGLYNN, P. W. (1990). Diffusion approximations. Handbooks in Operations research and management\n\nScience 2 145\u2013198.\n\nGRIFFITHS, D. F. and HIGHAM, D. J. (2010). Numerical methods for ordinary differential equations:\n\ninitial value problems. Springer Science & Business Media.\n\nHALL, E. C., RASKUTTI, G. and WILLETT, R. (2016). Inference of high-dimensional autoregressive\n\ngeneralized linear models. arXiv preprint arXiv:1605.02693 .\n\nHOMEM-DE MELLO, T. (2008). On rates of convergence for stochastic optimization problems under\nnon\u2013independent and identically distributed sampling. SIAM Journal on Optimization 19 524\u2013551.\nHSU, D. J., KONTOROVICH, A. and SZEPESV\u00c1RI, C. (2015). Mixing time estimation in reversible\nmarkov chains from a single sample path. In Advances in neural information processing systems.\nJAIN, P., JIN, C., KAKADE, S. M., NETRAPALLI, P. and SIDFORD, A. (2016). Streaming pca:\nMatching matrix bernstein and near-optimal \ufb01nite sample guarantees for oja\u2019s algorithm. In\n\nConference on Learning Theory.\n\nKUSHNER, H. and YIN, G. G. (2003). Stochastic approximation and recursive algorithms and\n\napplications, vol. 35. Springer Science & Business Media.\n\nLEE, J. D., PANAGEAS, I., PILIOURAS, G., SIMCHOWITZ, M., JORDAN, M. I. and RECHT, B.\n(2017). First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406 .\nLI, X., WANG, Z., LU, J., ARORA, R., HAUPT, J., LIU, H. and ZHAO, T. (2016). Symmetry, saddle\npoints, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296 .\nNEMIROVSKI, A., JUDITSKY, A., LAN, G. and SHAPIRO, A. (2009). Robust stochastic approxima-\n\ntion approach to stochastic programming. SIAM Journal on optimization 19 1574\u20131609.\n\nREDDI, S. 
J., SRA, S., POCZOS, B. and SMOLA, A. J. (2016). Proximal stochastic methods for\nnonsmooth nonconvex \ufb01nite-sum optimization. In Advances in Neural Information Processing\nSystems.\n\nROBBINS, H. and MONRO, S. (1951). A stochastic approximation method. The annals of mathemat-\n\nical statistics 400\u2013407.\n\nSACKS, J. (1958). Asymptotic distribution of stochastic approximation procedures. The Annals of\n\nMathematical Statistics 29 373\u2013405.\n\nSHALEV-SHWARTZ, S., SINGER, Y., SREBRO, N. and COTTER, A. (2011). Pegasos: Primal\n\nestimated sub-gradient solver for svm. Mathematical programming 127 3\u201330.\n\nSHAMIR, O. and ZHANG, T. (2013). Stochastic gradient descent for non-smooth optimization:\nConvergence results and optimal averaging schemes. In International Conference on Machine\nLearning.\n\nSREBRO, N. and JAAKKOLA, T. S. (2004). Linear dependent dimensionality reduction. In Advances\n\nin Neural Information Processing Systems.\n\nSUN, J., QU, Q. and WRIGHT, J. (2015). Complete dictionary recovery over the sphere. In Sampling\n\nTheory and Applications (SampTA), 2015 International Conference on. IEEE.\n\nSUN, J., QU, Q. and WRIGHT, J. (2016). A geometric analysis of phase retrieval. In Information\n\nTheory (ISIT), 2016 IEEE International Symposium on. IEEE.\n\nTJ\u00d8STHEIM, D. (1990). Non-linear time series and markov chains. Advances in Applied Probability\n\n22 587\u2013611.", "award": [], "sourceid": 1790, "authors": [{"given_name": "Minshuo", "family_name": "Chen", "institution": "Georgia Tech"}, {"given_name": "Lin", "family_name": "Yang", "institution": "Princeton University"}, {"given_name": "Mengdi", "family_name": "Wang", "institution": "Princeton University"}, {"given_name": "Tuo", "family_name": "Zhao", "institution": "Georgia Tech"}]}