{"title": "State Aggregation Learning from Markov Transition Data", "book": "Advances in Neural Information Processing Systems", "page_first": 4486, "page_last": 4495, "abstract": "State aggregation is a popular model reduction method rooted in optimal control. It reduces the complexity of engineering systems by mapping the system's states into a small number of meta-states. The choice of the aggregation map often depends on the data analysts' knowledge and is largely ad hoc. In this paper, we propose a tractable algorithm that estimates the probabilistic aggregation map from the system's trajectory. We adopt a soft-aggregation model, where each meta-state has a signature raw state, called an anchor state. This model includes several common state aggregation models as special cases. Our proposed method is a simple two-step algorithm: the first step is a spectral decomposition of the empirical transition matrix, and the second step conducts a linear transformation of the singular vectors to find their approximate convex hull. It outputs the aggregation distributions and disaggregation distributions for each meta-state in explicit form, which are not obtainable by classical spectral methods. On the theoretical side, we prove sharp error bounds for estimating the aggregation and disaggregation distributions and for identifying anchor states. The analysis relies on a new entry-wise deviation bound for singular vectors of the empirical transition matrix of a Markov process, which is of independent interest and cannot be deduced from the existing literature. 
The application of our method to Manhattan traffic data successfully generates a data-driven state aggregation map with nice interpretations.", "full_text": "State Aggregation Learning from Markov Transition Data\n\nYaqi Duan\n\nPrinceton University\n\nyaqid@princeton.edu\n\nZheng Tracy Ke\n\nHarvard University\n\nzke@fas.harvard.edu\n\nMengdi Wang\n\nPrinceton University\n\nmengdiw@princeton.edu\n\nAbstract\n\nState aggregation is a popular model reduction method rooted in optimal control. It reduces the complexity of engineering systems by mapping the system's states into a small number of meta-states. The choice of the aggregation map often depends on the data analysts' knowledge and is largely ad hoc. In this paper, we propose a tractable algorithm that estimates the probabilistic aggregation map from the system's trajectory. We adopt a soft-aggregation model, where each meta-state has a signature raw state, called an anchor state. This model includes several common state aggregation models as special cases. Our proposed method is a simple two-step algorithm: the first step is a spectral decomposition of the empirical transition matrix, and the second step conducts a linear transformation of the singular vectors to find their approximate convex hull. It outputs the aggregation distributions and disaggregation distributions for each meta-state in explicit form, which are not obtainable by classical spectral methods. On the theoretical side, we prove sharp error bounds for estimating the aggregation and disaggregation distributions and for identifying anchor states. 
The analysis relies on a new entry-wise deviation bound for singular vectors of the empirical transition matrix of a Markov process, which is of independent interest and cannot be deduced from the existing literature. The application of our method to Manhattan traffic data successfully generates a data-driven state aggregation map with nice interpretations.\n\n1 Introduction\n\nState aggregation is a long-standing approach for model reduction of complicated systems. It is widely used as a heuristic to reduce the complexity of control systems and reinforcement learning (RL). The earliest idea of state aggregation is to aggregate \u201csimilar\u201d states into a small number of subsets through a partition map. However, the partition is often handpicked by practitioners based on domain-specific knowledge or exact knowledge about the dynamical system [31, 6]. Alternatively, the partition can be chosen via discretization of the state space in accordance with some previously known similarity metric or feature functions [33]. Prior knowledge of the dynamical system is often required in order to handpick the aggregation without deteriorating its performance. A principled approach for finding the best state aggregation structure from data is lacking.\n\nIn this paper, we propose a model-based approach to learning a probabilistic aggregation structure. We adopt the soft state aggregation model, a flexible model for Markov systems. It allows one to represent each state using a membership distribution over latent variables (see Section 2 for details). Such models have been used for modeling large Markov decision processes, where the memberships can be used as state features to significantly reduce their complexity [32, 36]. 
When the membership distributions are degenerate, it reduces to the more conventional hard state aggregation model, and so our method is also applicable to finding a hard partition of the state space.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe soft aggregation model is parameterized by p aggregation distributions and r disaggregation distributions, where p is the total number of states in the Markov chain and r is the number of (latent) meta-states. Each aggregation distribution contains the probabilities of transitioning from one raw state to the different meta-states, and each disaggregation distribution contains the probabilities of transitioning from one meta-state to the different raw states. Our goal is to use sample trajectories of a Markov process to estimate these aggregation/disaggregation distributions. The estimated aggregation/disaggregation distributions can be used to estimate the transition kernel, sample from the Markov process, and plug into downstream tasks in optimal control and reinforcement learning (see Section 5 for an example). In the special case when the system admits a hard aggregation, these distributions naturally produce a partition map of the states.\n\nOur method is a two-step algorithm. The first step is the same as the vanilla spectral method, where we extract the first r left and right singular vectors of the empirical transition matrix. The second step is a novel linear transformation of the singular vectors. The rationale of the second step is as follows: although the left (right) singular vectors are not valid estimates of the aggregation (disaggregation) distributions, their linear span is a valid estimate of the linear span of the aggregation (disaggregation) distributions. Consequently, the left (right) singular vectors differ from the targeted aggregation (disaggregation) distributions only by a linear transformation. 
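This subspace argument is easy to check numerically. Below is a minimal NumPy sketch (our own illustration, not the paper's code; the sizes p, r and the random synthetic U, V are assumptions): it builds a rank-r transition matrix P = UV^T, extracts the top-r right singular vectors H, and verifies that H = VL for some invertible r-by-r matrix L, so that V can be recovered as HL^{-1} once L is known.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 30, 4  # illustrative sizes, not from the paper

U = rng.dirichlet(np.ones(r), size=p)        # aggregation distributions (rows sum to 1)
V = rng.dirichlet(np.ones(p), size=r).T      # disaggregation distributions (columns sum to 1)
P = U @ V.T                                  # rank-r transition matrix

# First r right singular vectors of P.
_, _, Vt = np.linalg.svd(P)
H = Vt[:r].T                                 # p x r

# Span(H) = Span(V): there is an r x r matrix L with H = V L.
L = np.linalg.lstsq(V, H, rcond=None)[0]
assert np.allclose(V @ L, H, atol=1e-8)      # singular vectors = linear transform of V

# Hence V is recoverable from H once L is identified.
assert np.allclose(H @ np.linalg.inv(L), V, atol=1e-6)
```

The second assertion is exactly the two-step idea: the singular vectors alone are not the disaggregation distributions, but they miss the target only by the invertible map L.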
We estimate this linear transformation by leveraging a geometric structure associated with the singular vectors. Our method requires no prior knowledge of the meta-states and provides a data-driven approach to learning the aggregation map.\n\nOur contributions.\n\n1. We introduce a notion of \u201canchor state\u201d for the identifiability of soft state aggregation. It is an analog of the notions of \u201cseparable feature\u201d in nonnegative matrix factorization [13] and \u201canchor word\u201d in topic modeling [4]. The introduction of \u201canchor states\u201d not only ensures model identifiability but also greatly improves the interpretability of the meta-states (see Section 2 and Section 5). Interestingly, hard aggregation in fact assumes that all states are anchor states. Our framework instead assumes there exists one anchor state for each meta-state.\n\n2. We propose an efficient method for estimating the aggregation/disaggregation distributions of a soft state aggregation model from Markov transition data. In contrast, classical spectral methods are not able to estimate these distributions directly.\n\n3. We prove statistical error bounds for the total variation distance between the estimated aggregation/disaggregation distributions and the ground truth. The estimation errors depend on the size of the state space, the number of meta-states, and the mixing time of the process. We also prove a sample complexity bound for accurately recovering all anchor states. To the best of our knowledge, this is the first statistical guarantee for soft state aggregation learning.\n\n4. At the core of our analysis is an entry-wise large deviation bound for the singular vectors of the empirical transition matrix of a Markov process. This connects to the recent interest in entry-wise analysis of empirical eigenvectors [2, 23, 24, 38, 11, 14]. Such analysis is known to be challenging, and techniques that work for one type of random matrix often do not work for another. 
Unfortunately, our desired results cannot be deduced from any existing literature, and we have to derive everything from scratch (see Section 4). Our large-deviation bound provides a convenient technical tool for the analysis of spectral methods on Markov data and is of independent interest.\n\n5. We apply our method to a Manhattan taxi-trip dataset, with interesting discoveries. The estimated state aggregation model extracts meaningful traffic modes, and the output anchor regions capture popular landmarks of Manhattan, such as Times Square and WTC-Exchange. Plugging the aggregation map into a reinforcement learning (RL) experiment, taxi-driving policy optimization, significantly improves the performance. These results validate that our method is practically useful.\n\nConnection to literature. While classical spectral methods have been used for aggregating states in Markov processes [34, 37], these methods do not directly estimate a state aggregation model. It was shown in [37] that spectral decomposition can reliably recover the principal subspace of a Markov transition kernel. Unfortunately, the singular vectors themselves are not valid estimates of the aggregation/disaggregation distributions: the population quantity that the singular vectors estimate is strictly different from the targeted aggregation/disaggregation distributions (see Section 3). Our method is inspired by the connection between soft state aggregation, nonnegative matrix factorization (NMF) [26, 16], and topic modeling [7]. Our algorithm is connected to the spectral method in [22] for topic modeling. The method in [22] is a general approach that uses spectral decomposition for nonnegative matrix factorization. Whether it can be adapted to state aggregation learning and how accurately it estimates the soft-aggregation model was unknown. 
In particular, [22] worked on a topic model, where the data matrix has column-wise independence, and their analysis heavily relies on this property. Unfortunately, our data matrix has column-wise dependence, so we are unable to use their techniques. We build our analysis from the ground up.\n\nThere are recent works on statistical guarantees of learning a Markov model [15, 18, 28, 37, 35]. They focus on estimating the transition matrix, whereas our focus is to estimate the aggregation/disaggregation distributions. Given an estimator of the transition matrix, obtaining the aggregation/disaggregation distributions is non-trivial (for example, simply performing a vanilla PCA on the estimator does not work). To the best of our knowledge, our result is the first statistical guarantee for estimating the aggregation/disaggregation distributions. We can also use our estimator of the aggregation/disaggregation distributions to form an estimator of the transition matrix, and it achieves the known minimax rate for a range of parameter settings. See Section 4 for more discussion.\n\n2 Soft State Aggregation Model with Anchor States\n\nWe say that a Markov chain X0, X1, . . . , Xn admits a soft state aggregation with r meta-states if there exist random variables Z0, Z1, . . . , Zn−1 ∈ {1, 2, . . . , r} such that\n\nP(Xt+1 | Xt) = Σ_{k=1}^r P(Zt = k | Xt) · P(Xt+1 | Zt = k),    (1)\n\nfor all t with probability 1. Here, P(Zt | Xt) and P(Xt+1 | Zt) are independent of t and are referred to as the aggregation distributions and disaggregation distributions, respectively. The soft state aggregation model has been discussed in the literature (e.g., [32, 37]), where r is presumably much smaller than p. See [5, Section 6.3.7] for a textbook review. This decomposition means that one can map the states into meta-states while preserving the system's dynamics. 
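To make the factorization in (1) concrete, here is a hedged NumPy sketch (our own illustration, not from the paper; sizes and the random model are assumptions): rows of U are aggregation distributions P(Z_t = · | X_t = i), columns of V are disaggregation distributions P(X_{t+1} = · | Z_t = k), and P = UV^T is then a valid rank-r transition matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 20, 3  # p raw states, r meta-states (illustrative sizes)

# Aggregation distributions: row i of U is P(Z_t = . | X_t = i).
U = rng.dirichlet(np.ones(r), size=p)            # p x r, rows sum to 1
# Disaggregation distributions: column k of V is P(X_{t+1} = . | Z_t = k).
V = rng.dirichlet(np.ones(p), size=r).T          # p x r, columns sum to 1

# Soft state aggregation: summing over the latent meta-state gives P = U V^T.
P = U @ V.T

assert np.allclose(P.sum(axis=1), 1.0)           # each row of P is a distribution
assert np.linalg.matrix_rank(P) <= r             # transitions factor through r meta-states
```

Sampling the chain through the latent meta-state (draw Z_t from row X_t of U, then X_{t+1} from column Z_t of V) produces exactly the transition law P.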
In the special case where each aggregation distribution is degenerate (we say a discrete distribution is degenerate if only one outcome is possible), it reduces to the hard state aggregation model, or lumpable Markov model.\n\nThe soft state aggregation model has a matrix form. Let P ∈ R^{p×p} be the transition matrix, where Pij = P(Xt+1 = j | Xt = i). We introduce U ∈ R^{p×r} and V ∈ R^{p×r}, where Uik = P(Zt = k | Xt = i) and Vjk = P(Xt+1 = j | Zt = k). Each row of U is an aggregation distribution, and each column of V is a disaggregation distribution. Then, (1) is equivalent to\n\nP = UV⊤, where U1r = 1p and V⊤1p = 1r    (2)\n\n(1s denotes the vector of 1's of dimension s).\n\nHere, U and V are not identifiable without additional conditions. We assume that each meta-state has a signature raw state, defined either through the aggregation process or the disaggregation process.\n\nDefinition 1 (Anchor State). A state i is called an \u201caggregation anchor state\u201d of the meta-state k if Uik = 1 and Uis = 0 for all s ≠ k. A state j is called a \u201cdisaggregation anchor state\u201d of the meta-state k if Vjk > 0 and Vjs = 0 for all s ≠ k.\n\nAn aggregation anchor state transitions to only one meta-state, and a disaggregation anchor state can be reached from only one meta-state. Since (2) is in fact a nonnegative matrix factorization (NMF), the definitions of anchor states are natural analogs of \u201cseparable features\u201d in NMF [13]. They are also natural analogs of \u201cpure documents\u201d and \u201canchor words\u201d in topic modeling. We note that in a hard state aggregation model, every state is an anchor state by default.\n\nThroughout this paper, we take the following assumption:\n\nAssumption 1. 
There exists at least one disaggregation anchor state for each meta-state.\n\nBy well-known results in NMF [13], this assumption guarantees that U and V are uniquely defined by (2), provided that P has rank r. Our results can be extended to the case where each meta-state has an aggregation anchor state (see the remark in Section 3). For simplicity, from now on, we call a disaggregation anchor state an anchor state for short.\n\nThe introduction of anchor states not only guarantees identifiability but also enhances the interpretability of the model. This is demonstrated in an application to New York City taxi traffic data. We model the taxi traffic by a finite-state Markov chain, where each state is a pick-up/drop-off location in the city. Figure 1 illustrates the estimated soft state aggregation model (see Section 5 for details). The estimated anchor states coincide with notable landmarks in Manhattan, such as the Times Square area, the museum area on Park Avenue, etc. Hence, each meta-state (whose disaggregation distribution is plotted via a heat map over Manhattan in (c)) can be nicely interpreted as a representative traffic mode with exclusive destinations (e.g., the traffic to Times Square, the traffic to WTC-Exchange, the traffic to the museum park, etc.). In contrast, if we do not exploit anchor states but simply use PCA to conduct state aggregation, the obtained meta-states have no clear association with notable landmarks and are thus hard to interpret. The interpretability of our model also translates to better performance in downstream tasks in reinforcement learning (see Section 5).\n\nFigure 1: Soft state aggregation learned from NYC taxi data. Left to right: (a) Illustration of 100 taxi trips (O: pick-up location, △: drop-off location). (b) A principal component of P (heat map), lacking interpretability. (c) Disaggregation distribution of P corresponding to the Times Square anchor region. 
(d) Ten anchor regions identified by Alg. 1, coinciding with landmarks of NYC.\n\n3 An Unsupervised State Aggregation Algorithm\n\nGiven a Markov trajectory {X0, X1, . . . , Xn} from a state-transition system, let N ∈ R^{p×p} be the matrix of empirical state-transition counts, i.e., Nij = Σ_{t=0}^{n−1} 1{Xt = i, Xt+1 = j}. Our algorithm takes as input the matrix N and the number of meta-states r, and it estimates (a) the disaggregation distributions V, (b) the aggregation distributions U, and (c) the anchor states. See Algorithm 1. Part (a) is the core of the algorithm, which we explain below.\n\nInsight 1: Disaggregation distributions are linear transformations of the right singular vectors. We consider an oracle case, where the transition matrix P is given and we hope to retrieve V from P. Let H = [h1, . . . , hr] ∈ R^{p×r} be the matrix containing the first r right singular vectors of P. Let Span(·) denote the column space of a matrix. By linear algebra, Span(H) = Span(P⊤) = Span(VU⊤) = Span(V). Hence, the columns of H and the columns of V are two different bases of the same r-dimensional subspace. It follows immediately that there exists L ∈ R^{r×r} such that H = VL. On the one hand, this indicates that the singular vectors are not valid estimates of the disaggregation distributions, as each singular vector hk is a linear combination of multiple disaggregation distributions. On the other hand, it suggests a promising two-step procedure for recovering V from P: (i) obtain the right singular vectors H; (ii) identify the matrix L and retrieve V = HL^{−1}.\n\nInsight 2: The linear transformation L is estimable given the existence of anchor states. The estimation of L hinges on a geometric structure induced by the anchor state assumption [13]: let C be a simplicial cone with r supporting rays, where the directions of the supporting rays are specified by the r rows of the matrix L. 
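A small NumPy experiment (our own illustration, with planted anchor states as an assumption, not the paper's code) can confirm this geometry: since H = VL, the j-th row of H equals Σ_k V_{jk} L_{k,·}, so for an anchor state j of meta-state k it is a positive multiple of the k-th row of L, i.e., it lies on a supporting ray of the cone.

```python
import numpy as np

rng = np.random.default_rng(2)
p, r = 30, 3  # illustrative sizes
# Disaggregation matrix V with planted anchor states: state j (< r) can be
# reached only from meta-state j, i.e. row j of V has a single nonzero entry.
V = rng.uniform(0.1, 1.0, size=(p, r))
V[:r] = np.eye(r)
V /= V.sum(axis=0)                       # columns are distributions
U = rng.dirichlet(np.ones(r), size=p)    # aggregation distributions
P = U @ V.T

_, _, Vt = np.linalg.svd(P)
H = Vt[:r].T                             # p x r right singular vectors, H = V L
L = np.linalg.lstsq(V, H, rcond=None)[0]

# Row j of H equals V[j, j] * L[j, :] for an anchor state j: it lies on the
# supporting ray of the simplicial cone spanned by the rows of L.
for j in range(r):
    cos = H[j] @ L[j] / (np.linalg.norm(H[j]) * np.linalg.norm(L[j]))
    assert abs(abs(cos) - 1.0) < 1e-8
```

Non-anchor rows of H, being mixtures of several rows of L with positive weights, fall strictly inside the cone.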
If j is an anchor state, then the j-th row of H lies on one supporting ray of this simplicial cone. If j is not an anchor state, then the j-th row of H lies in the interior of the simplicial cone. See the left panel of Figure 2 for an illustration with r = 3. Once we identify the r supporting rays of this simplicial cone, we immediately obtain the desired matrix L.\n\nInsight 3: Normalization of the eigenvectors is the key to estimating L under noise corruption. In the real case, where N instead of P is given, we can only obtain a noisy version of H. With noise corruption, estimating the supporting rays of a simplicial cone is very challenging. [22] discovered that a particular row-wise normalization of H manages to \u201cproject\u201d the simplicial cone onto a simplex with r vertices. Then, for all anchor states of one meta-state, their corresponding rows collapse to one single point in the noiseless case (and a tight cluster in the noisy case). The task reduces to estimating the vertices of a simplex, which is much easier to handle under noise corruption. This particular normalization is called SCORE [20]. It re-scales each row of H by the first coordinate of this row. After re-scaling, the first coordinate is always 1, so it is eliminated; the normalized rows then have (r − 1) coordinates. See the right panel of Figure 2. Once we identify the r vertices of this simplex, we can use them to reconstruct L in closed form [22].\n\nFigure 2: Geometric structure of anchor states. Left: Each dot is a row of the matrix H = [h1, h2, h3]. The data points are contained in a simplicial cone with three supporting rays. Right: Each dot is a re-scaled row of H given by SCORE, where the first coordinate is dropped, so every row lies in a plane. The data points are contained in a simplex with three vertices.\n\nAlgorithm 1 Learning the Soft State Aggregation Model.\nInput: empirical state-transition counts N, number of meta-states r, anchor state threshold δ0\n\n1. Estimate the matrix of disaggregation distributions V.\n(i) Conduct SVD on Ñ = N[diag(N⊤1p)]^{−1/2} ∈ R^{p×p}, and let ĥ1, . . . , ĥr denote the first r right singular vectors. Obtain the matrix D̂ = [diag(ĥ1)]^{−1}[ĥ2, . . . , ĥr] ∈ R^{p×(r−1)}.\n(ii) Run an existing vertex finding algorithm on the rows of D̂, and let b̂1, . . . , b̂r be the output vertices. (In our numerical experiments, we use the vertex hunting algorithm in [21].)\n(iii) For 1 ≤ j ≤ p, compute\n\nŵ∗j = argmin_{q ∈ R^r} ‖d̂j − Σ_{k=1}^r qk b̂k‖^2 + (1 − Σ_{k=1}^r qk)^2.\n\nSet the negative entries of ŵ∗j to zero and renormalize it to have unit ℓ1-norm. The resulting vector is denoted ŵj. Let Ŵ = [ŵ1, ŵ2, . . . , ŵp]⊤ ∈ R^{p×r}. Obtain the matrix [diag(ĥ1)][diag(N⊤1p)]^{1/2}Ŵ and renormalize each of its columns to have unit ℓ1-norm. The resulting matrix is V̂.\n\n2. Estimate the matrix of aggregation distributions U. Let P̂ = [diag(N1p)]^{−1}N be the empirical transition probability matrix. Estimate U by\n\nÛ = P̂V̂(V̂⊤V̂)^{−1}.\n\n3. Estimate the set of anchor states. Let ŵj be as in Step (iii). Let\n\nA = {1 ≤ j ≤ p : max_{1≤k≤r} ŵj(k) ≥ 1 − δ0}.\n\nOutput: estimates V̂ and Û, set of anchor states A\n\nThese insights cast the estimation of V as a simplex finding problem: given data d̂1, . . . , d̂p ∈ R^{r−1}, suppose they are noisy observations of non-random vectors d1, . . .
, dp ∈ R^{r−1} (in our case, each dj is a row of the matrix H after SCORE normalization), where these non-random vectors are contained in a simplex with r vertices b1, . . . , br, with at least one dj located on each vertex. We aim to estimate the vertices b1, . . . , br. This is a well-studied problem in the literature, also known as linear unmixing or archetypal analysis. There are a number of existing algorithms [8, 3, 29, 9, 21]. For example, the successive projection algorithm [3] has a complexity of O(pr^4).\n\nAlgorithm 1 follows the insights above but is more sophisticated. Due to the space limit, we relegate the detailed explanation to the Appendix. It has a complexity of O(p^3), where the main cost comes from the SVD.\n\nRemark (The case with aggregation anchor states). The anchor states here are the disaggregation anchor states in Definition 1. If instead each meta-state has an aggregation anchor state, there is a similar geometric structure associated with the left singular vectors. We can modify our algorithm to first use the left singular vectors to estimate U and then estimate V and the anchor states.\n\n4 Main Statistical Results\n\nOur main results are twofold. The first is row-wise large-deviation bounds for the empirical singular vectors. The second is statistical guarantees for learning the soft state aggregation model. Throughout this section, suppose we observe a trajectory {X0, X1, . . . , Xn} from an ergodic Markov chain with p states, where the transition matrix P satisfies (2) with r meta-states. Let π ∈ R^p denote the stationary distribution. Define the mixing time τ∗ = min{k ≥ 1 : max_{1≤i≤p} ‖(P^k)i,· − π⊤‖1 ≤ 1/2}. 
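As a hedged illustration of this definition (our own sketch, not from the paper; the two-state chain is an arbitrary example), the mixing time τ∗ can be computed directly by powering P and checking the ℓ1 distance of each row to the stationary distribution:

```python
import numpy as np

def mixing_time(P, tol=0.5, k_max=10_000):
    """Smallest k with max_i || (P^k)_{i,.} - pi ||_1 <= tol (tol = 1/2 in the paper)."""
    p = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(w))])
    pi /= pi.sum()
    Pk = np.eye(p)
    for k in range(1, k_max + 1):
        Pk = Pk @ P
        if np.abs(Pk - pi).sum(axis=1).max() <= tol:
            return k
    return None

# A lazy two-state chain (stationary distribution (2/3, 1/3)) mixes in 3 steps.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
tau = mixing_time(P)
assert tau == 3
```

For ergodic chains the eigenvalue 1 is the largest in real part, so the argmax picks the stationary left eigenvector; the sign is fixed by the normalization.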
We assume there are constants c1, C1, C̄1, c2, c3, C4 > 0 such that the following conditions hold:\n\n(a) The stationary distribution π satisfies c1 p^{−1} ≤ πj ≤ C1 p^{−1}, for j = 1, 2, . . . , p.\n(b) The stationary distribution on meta-states satisfies (U⊤π)k ≤ C̄1 r^{−1}, for k = 1, 2, . . . , r.\n(c) λmin(U⊤U) ≥ c2 p r^{−1} and λmin(V⊤V) ≥ c2 p^{−1} r.\n(d) The first two singular values of [diag(π)]P[diag(π)]^{−1/2} satisfy σ1 − σ2 ≥ c3 p^{−1/2}.\n(e) The entries of the r-by-r matrix U⊤PV satisfy max_{k,l}(U⊤PV)kl / min_{k,l}(U⊤PV)kl ≤ C4.\n\nConditions (a)-(b) require that the Markov chain has a balanced number of visits to each state and to each meta-state when reaching stationarity. Such conditions are often imposed for learning a Markov model [37, 28]. Condition (c) prevents the aggregation (disaggregation) distributions from being highly collinear, so that each of them can be accurately identified from the remaining ones. Condition (d) is a mild eigen-gap condition, which is necessary for consistent estimation of eigenvectors by PCA [2, 22]. Condition (e) says that the meta-states are reachable from one another and that meta-state transitions cannot be overly imbalanced.\n\n4.1 Row-wise large-deviation bounds for singular vectors\n\nAt the core of the analysis of any spectral method (the classical PCA or our unsupervised algorithm) is a characterization of the errors of approximating population eigenvectors by empirical eigenvectors. If we choose the loss function to be the Euclidean norm between two vectors, this reduces to deriving a bound for the spectral norm of the noise matrix [12] and is often manageable. 
However, the Euclidean norm bound is useless for our problem: in order to obtain the total-variation bounds for estimating the aggregation/disaggregation distributions, we need sharp error bounds for each entry of the eigenvector. Recently, there has been active research on entry-wise analysis of eigenvectors [2, 23, 24, 38, 11, 14]. Since eigenvectors depend on the data matrix in a complicated and highly nonlinear form, such analysis is well known to be challenging. More importantly, there is no universal technique that works for all problems, and such bounds are obtained in a problem-by-problem manner (e.g., for Gaussian covariance matrices [23, 24], network adjacency matrices [2, 21], and topic matrices [22]). As an addition to this nascent literature, we develop such results for the transition count matrix of a Markov chain. The analysis is challenging because the entries of the count matrix are dependent on each other.\n\nRecall that Ñ is the re-scaled transition count matrix introduced in Algorithm 1 and ĥ1, . . . , ĥr are its first r right singular vectors (our technique also applies to the original count matrix N and the empirical transition matrix P̂). Theorem 1 and Theorem 2 deal with the leading singular vector and the remaining ones, respectively. (Õ means \u201cbounded up to a logarithmic factor of n, p\u201d.)\n\nTheorem 1 (Entry-wise perturbation bounds for ĥ1). Suppose the regularity conditions (a)-(e) hold. There exists a parameter ω ∈ {±1} such that if n = Ω̃(τ∗p), then with probability at least 1 − n^{−1}, max_{1≤j≤p} |ω ĥ1(j) − h1(j)| = Õ((σ1 − σ2)^{−1}(1 + √(τ∗p/n)) √(τ∗/(np))).\n\nTheorem 2 (Row-wise perturbation bounds for Ĥ). Suppose the regularity conditions (a)-(e) hold. For 1 ≤ s ≤ t ≤ r, let H∗ = [hs, . . . , ht], Ĥ∗ = [ĥs, . . . , ĥt], and ∆∗ = min{σ_{s−1} − σs, σt − σ_{t+1}}, where σ0 = +∞ and σ_{r+1} = 0. If n = Ω̃(τ∗p), then with probability 1 − n^{−1}, there is an orthogonal matrix Ω∗ such that max_{1≤j≤p} ‖ej⊤(Ĥ∗Ω∗ − H∗)‖2 = Õ(∆∗^{−1}(1 + √(τ∗p/(nr))) √(τ∗r/(np))).\n\n4.2 Statistical guarantees of soft state aggregation\n\nWe study the error of estimating U and V, as well as the error of recovering the set of anchor states. Algorithm 1 plugs in an existing algorithm for the simplex finding problem. We make the following assumption:\n\nAssumption 2 (Efficiency of simplex finding). Given data d̂1, . . . , d̂p ∈ R^{r−1}, suppose they are noisy observations of non-random vectors d1, . . . , dp, where these non-random vectors are contained in a simplex with r vertices b1, . . . , br, with at least one dj located on each vertex. The simplex finding algorithm outputs b̂1, . . . , b̂r such that max_{1≤k≤r} ‖b̂k − bk‖ ≤ C max_{1≤j≤p} ‖d̂j − dj‖.\n\nSeveral existing simplex finding algorithms satisfy this assumption, such as the successive projection algorithm [3], the vertex hunting algorithm [21, 22], and the algorithm of archetypal analysis [19]. Since this is not the main contribution of this paper, we refer the readers to the above references for details. In our numerical experiments, we use the vertex hunting algorithm in [21, 22].\n\nFirst, we provide total-variation bounds between the estimated individual aggregation/disaggregation distributions and the ground truth. Write V = [V1, . . . , Vr] and U = [u1, . . .
, up](cid:62), where each\nVk \u2208 Rp is a disaggregation distribution and each ui \u2208 Rr is an aggregation distribution.\nTheorem 3 (Error bounds for estimating V). Suppose the regularity conditions (a)-(e) hold and\n\nTheorem 4 (Error bounds for estimating U). Suppose the regularity conditions (a)-(e) hold and\n\nAssumptions 1 and 2 are satis\ufb01ed. When n = (cid:101)\u2126(cid:0)\u03c4\u2217p\n2 r(cid:1), with probability at least 1 \u2212 n\u22121, the\nk=1(cid:13)(cid:13)(cid:98)Vk \u2212 Vk(cid:13)(cid:13)1 = (cid:101)O(cid:16)(cid:0)1 + p(cid:112)\u03c4\u2217/n(cid:1)(cid:112) \u03c4\u2217pr\nn (cid:17).\nr(cid:80)r\nestimate (cid:98)V given by Algorithm 1 satis\ufb01es 1\nAssumptions 1 and 2 are satis\ufb01ed. When n = (cid:101)\u2126(cid:0)\u03c4\u2217p\n2 r(cid:1), with probability at least 1 \u2212 n\u22121, the\nn (cid:17).\nj=1(cid:13)(cid:13)(cid:98)uj \u2212 uj(cid:13)(cid:13)1 = (cid:101)O(cid:16)r\n2(cid:0)1 + p(cid:112)\u03c4\u2217/n(cid:1)(cid:112) \u03c4\u2217pr\np(cid:80)p\nestimate (cid:98)U given by Algorithm 1 satis\ufb01es 1\nSecond, we provide sample complexity guarantee for the exact recovery of anchor states. To eliminate\nfalse positives, we need a condition that the non-anchor states are not too \u2018close\u2019 to an anchor state;\nthis is captured by the quantity \u03b4 below. (Note \u03b4j = 0 for anchor states j \u2208 A\u2217.)\nTheorem 5 (Exact recovery of anchor states). Suppose the regularity conditions (a)-(e) hold and\nAssumptions 1 and 2 are satis\ufb01ed. Let A\u2217 be the set of (disaggregation) anchor states. De\ufb01ne\n\u03b4j = 1 \u2212 max1\u2264k\u2264r PX0\u223c\u03c0(Z0 = k | X1 = j) and \u03b4 = minj /\u2208A\u2217 \u03b4j. Suppose the threshold \u03b40 in\n\n3\n\n3\n\n3\n\n3\n\n0 \u03c4\u2217p\n\n2 r(cid:1), then P(A = A\u2217) \u2265 1 \u2212 n\u22121.\n\nWe connect our results to several lines of works in the literature. 
First, in the special case of r = 1,\nour problem reduces to learning a discrete distribution with p outcomes, where the minimax rate\n\nAlgorithm 1 satis\ufb01es \u03b40 = O(\u03b4). If n =(cid:101)\u2126(cid:0)\u03b4\u22122\nof total-variation distance is O((cid:112)p/n) [17]. Our bound matches with this rate when p = O(\u221an).\n\nHowever, our problem is much harder: each row of P is a mixture of r discrete distributions. Second,\nour setting is connected to the setting of learning a mixture of discrete distributions [30, 27] but is\ndifferent in important ways. Those works consider learning one mixture distribution, and the data\nare iid observations. Our problem is to estimate p mixture distributions, which share the same basis\ndistributions but have different mixing proportions, and our data are a single trajectory of a Markov\nchain. Third, our problem is connected to topic modeling [4, 22], where we may view the empirical\ntransition pro\ufb01le of each raw state as a \u2018document\u2019. However, in topic modeling, the documents\nare independent of each other, but the \u2018documents\u2019 here are highly dependent as they are generated\nfrom a single trajectory of a Markov chain. Last, we compare with the literature of estimating the\ntransition matrix P of a Markov model. Without low-rank assumptions on P, the minimax rate of\nthe total variation error is O(p/\u221an) [35] (also, see [25] and reference therein for related settings in\n\nhidden Markov models); with a low-rank structure on P, the minimax rate becomes O((cid:112)rp/n) [37].\nTo compare, we use our estimator of (U, V) to construct an estimator of P by(cid:98)P = (cid:98)U(cid:98)V(cid:62). When\nr is bounded and p = O(\u221an), this estimator achieves a total-variation error of O((cid:112)rp/n), which\n\nis optimal. 
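To make the two-step procedure concrete, the following is a minimal, illustrative sketch in Python. It is not the exact Algorithm 1 of this paper (which applies a more careful linear transformation of the singular vectors and also recovers the disaggregation matrix V); it only shows the shape of the pipeline: form the empirical transition matrix from a single trajectory, embed the states via a rank-r spectral decomposition, and find simplex vertices with the successive projection algorithm [3] playing the role of the simplex-finding subroutine in Assumption 2. All function names here are ours, not the paper's.

```python
import numpy as np

def empirical_transition(traj, p):
    """Row-normalized transition counts from a single trajectory X_0, ..., X_n."""
    N = np.zeros((p, p))
    for s, t in zip(traj[:-1], traj[1:]):
        N[s, t] += 1.0
    return N / np.maximum(N.sum(axis=1, keepdims=True), 1.0)

def successive_projection(D, r):
    """Successive projection: greedily pick the row of D with the largest norm,
    project all rows onto its orthogonal complement, and repeat.
    Returns the indices of r (approximate) simplex vertices."""
    R, idx = D.astype(float).copy(), []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=1)))
        idx.append(j)
        u = R[j] / np.linalg.norm(R[j])
        R -= np.outer(R @ u, u)  # remove the chosen direction from every row
    return idx

def soft_aggregation(traj, p, r):
    """Two-step sketch: spectral embedding, then simplex vertex finding."""
    P_hat = empirical_transition(traj, p)
    # Step 1: spectral decomposition of the empirical transition matrix.
    L, s, _ = np.linalg.svd(P_hat)
    H = L[:, :r] * s[:r]          # p x r embedding; row j represents state j
    # Step 2: vertices of the rows' approximate convex hull = anchor-like states.
    anchors = successive_projection(H, r)
    B = H[anchors]                # r x r matrix of vertex coordinates
    # Convex-combination weights of each state w.r.t. the vertices
    # (a rough proxy for the aggregation distributions, the rows of U).
    W, *_ = np.linalg.lstsq(B.T, H.T, rcond=None)
    W = np.clip(W.T, 0.0, None)
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    return anchors, W
```

Under the soft-aggregation model P = UV^T, each row of the spectral embedding is (up to noise) a convex combination of r fixed vertex vectors, and a state sitting exactly on a vertex is an anchor candidate; that is the geometric structure Step 2 exploits.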
At the same time, we want to emphasize that estimating (U, V) is a more challenging problem than estimating P, and we are not aware of any existing theoretical results for the former.

Simulation. We test our method on simulations (the settings are given in the appendix). The results are summarized in Figure 3. They suggest that (a) the rate of convergence in Theorem 3 is confirmed by numerical evidence, and (b) our method compares favorably with existing methods for estimating P (to the best of our knowledge, no other method directly estimates U and V, so we instead compare the estimates of P).

5 Analysis of NYC Taxi Data and Application to Reinforcement Learning

We analyze a dataset of 1.1 × 10^7 New York City yellow cab trips collected in January 2016 [1]. We treat each taxi trip as a sample transition generated by a city-wide Markov process over NYC, where the transition is from a pick-up location to a drop-off location. We discretize the map into p = 1922 cells so that the Markov process becomes a finite-state one.

Figure 3: Simulation results. Left: total-variation error on V (p is fixed and n varies). Middle: total-variation error on V (p/n is fixed). Both panels validate the scaling of √(p/n) in Theorem 3. Right: recovery error of P, where ÛV̂^⊤ is our method and P̄ is the spectral estimator [37] (note: that method cannot estimate U, V).

Figure 4: State aggregation results. (a) Estimated anchor regions and partition of NYC, for r = 4 and r = 10. (b) Estimated disaggregation distribution (V̂k) and aggregation likelihood (Ûk) for two meta-states (midtown and downtown).

Anchor regions and partition. We apply Algorithm 1 to the taxi-trip data with r = 4 and r = 10. The algorithm identifies sets of anchor states that are close to the vertices, as well as the columns of Û and V̂ corresponding to each vertex (anchor region). We further use the estimated Û, V̂ to find a partition of the city. Recall that in Algorithm 1, each state is projected onto a simplex and can thus be represented as a convex combination of the simplex's vertices (see Figure 2). We assign each state to the cluster corresponding to the largest weight in this convex combination. In this way, we cluster the 1922 locations into a small number of regions. The partition results are shown in Figure 4(a), where anchor regions are marked within each cluster.

Estimated aggregation and disaggregation distributions. Let Û, V̂ be the estimated aggregation and disaggregation matrices. We use heat maps to visualize their columns. Take r = 10 for example. We pick two meta-states, with anchor states in the downtown and midtown areas, respectively, and plot in Figure 4(b) the corresponding columns of Û and V̂. Each column of V̂ is a disaggregation distribution, and each column of Û can be thought of as a likelihood function for transiting to the corresponding meta-state. The heat maps reveal the leading "modes" of the traffic dynamics.

Aggregation distributions used as features for RL. Soft state aggregation can be used to reduce the complexity of reinforcement learning (RL) [32]. Aggregation/disaggregation distributions provide features to parameterize high-dimensional policies, in conjunction with feature-based RL methods [10, 36]. Next, we experiment with using the aggregation distributions as features for RL.

Consider the taxi-driving policy optimization problem.
The driver's objective is to maximize the daily revenue: a Markov decision process where the driver chooses driving directions in real time based on location. We compute the optimal policy using feature-based RL [10] and simulated NYC traffic. The algorithm takes as input 27 estimated aggregation distributions as state features. For comparison, we also use a hard partition of the city, handpicked according to 27 NYC districts. RL using aggregation distributions as features achieves a daily revenue of $230.57, while the method using the handpicked partition achieves $209.14. Figure 5 plots the optimal driving policy. This experiment suggests that (1) state aggregation learning provides features for RL automatically; (2) using aggregation distributions as features leads to better RL performance than using handpicked features.

Figure 5: The optimal driving policy learned by feature-based RL with estimated aggregation distributions as state features.
Arrows point out the most favorable directions, and the thickness of each arrow is proportional to the favorability of that direction.

References

[1] NYC Taxi and Limousine Commission (TLC) trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. Accessed June 11, 2018.

[2] Emmanuel Abbe, Jianqing Fan, Kaizheng Wang, and Yiqiao Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565, 2017.

[3] Mário César Ugulino Araújo, Teresa Cristina Bezerra Saldanha, Roberto Kawakami Harrop Galvão, Takashi Yoneyama, Henrique Caldas Chame, and Valeria Visani. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometrics and Intelligent Laboratory Systems, 57(2):65–73, 2001.

[4] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In Foundations of Computer Science (FOCS), pages 1–10, 2012.

[5] Dimitri P Bertsekas. Dynamic programming and optimal control.
Athena Scientific, Belmont, MA, 2007.

[6] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996.

[7] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[8] Joseph W Boardman, Fred A Kruse, and Robert O Green. Mapping target signatures via partial unmixing of AVIRIS data. 1995.

[9] C-I Chang, C-C Wu, Weimin Liu, and Y-C Ouyang. A new growing method for simplex-based endmember extraction algorithm. IEEE Transactions on Geoscience and Remote Sensing, 44(10):2804–2819, 2006.

[10] Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear π learning using state and action features. Proceedings of the 35th International Conference on Machine Learning, 2018.

[11] Yuxin Chen, Jianqing Fan, Cong Ma, and Kaizheng Wang. Spectral method and regularized MLE are both optimal for top-k ranking. arXiv preprint arXiv:1707.09971, 2017.

[12] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

[13] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, pages 1141–1148, 2004.

[14] Justin Eldridge, Mikhail Belkin, and Yusu Wang. Unperturbed: spectral analysis beyond Davis-Kahan. arXiv preprint arXiv:1706.06516, 2017.

[15] Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Learning Markov distributions: Does estimation trump compression? In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2689–2693. IEEE, 2016.

[16] Nicolas Gillis. The why and how of nonnegative matrix factorization.
Regularization, Optimization, Kernels, and Support Vector Machines, 12(257), 2014.

[17] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under ℓ1 loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.

[18] Yi Hao, Alon Orlitsky, and Venkatadheeraj Pichapati. On learning Markov chains. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 648–657. Curran Associates, Inc., 2018.

[19] Hamid Javadi and Andrea Montanari. Non-negative matrix factorization via archetypal analysis. Journal of the American Statistical Association, (just-accepted):1–27, 2019.

[20] Jiashun Jin. Fast community detection by SCORE. Annals of Statistics, 43(1):57–89, 2015.

[21] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Estimating network memberships by simplex vertex hunting. arXiv:1708.07852, 2017.

[22] Zheng Tracy Ke and Minzhe Wang. A new SVD approach to optimal topic estimation. arXiv:1704.07016, 2017.

[23] Vladimir Koltchinskii and Karim Lounici. Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 52, pages 1976–2013. Institut Henri Poincaré, 2016.

[24] Vladimir Koltchinskii and Dong Xia. Perturbation of linear forms of singular vectors under Gaussian noise. In High Dimensional Probability VII, pages 397–423. Springer, 2016.

[25] Aryeh Kontorovich, Boaz Nadler, and Roi Weiss. On learning parametric-output HMMs. In International Conference on Machine Learning, pages 702–710, 2013.

[26] Daniel Lee and Sebastian Seung. Learning the parts of objects by non-negative matrix factorization.
Nature, 401(6755):788–791, 1999.

[27] Jian Li, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. Learning arbitrary statistical mixtures of discrete distributions. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 743–752. ACM, 2015.

[28] Xudong Li, Mengdi Wang, and Anru Zhang. Estimation of Markov chain via rank-constrained likelihood. Proceedings of the 35th International Conference on Machine Learning, 2018.

[29] José MP Nascimento and José MB Dias. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing, 43(4):898–910, 2005.

[30] Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. Learning mixtures of arbitrary distributions over large discrete domains. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, pages 207–224. ACM, 2014.

[31] David F Rogers, Robert D Plante, Richard T Wong, and James R Evans. Aggregation and disaggregation techniques and methodology in optimization. Operations Research, 39(4):553–582, 1991.

[32] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pages 361–368, 1995.

[33] John N Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.

[34] Marcus Weber and Susanna Kube. Robust Perron cluster analysis for various applications in computational life science. In International Symposium on Computational Life Science, pages 57–66. Springer, 2005.

[35] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic Markov chains. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, pages 904–930, 2019.

[36] Lin F Yang and Mengdi Wang.
Sample-optimal parametric Q-learning with linear transition models. Proceedings of the 36th International Conference on Machine Learning, 2019.

[37] Anru Zhang and Mengdi Wang. Spectral state compression of Markov processes. arXiv:1802.02920, 2018.

[38] Yiqiao Zhong and Nicolas Boumal. Near-optimal bounds for phase synchronization. SIAM Journal on Optimization, 28(2):989–1016, 2018.