{"title": "Supervised Learning for Dynamical System Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1963, "page_last": 1971, "abstract": "Recently there has been substantial interest in spectral methods for learning dynamical systems. These methods are popular since they often offer a good tradeoff between computational and statistical efficiency. Unfortunately, they can be difficult to use and extend in practice: e.g., they can make it difficult to incorporate prior information such as sparsity or structure. To address this problem, we present a new view of dynamical system learning: we show how to learn dynamical systems by solving a sequence of ordinary supervised learning problems, thereby allowing users to incorporate prior knowledge via standard techniques such as L1 regularization. Many existing spectral methods are special cases of this new framework, using linear regression as the supervised learner. We demonstrate the effectiveness of our framework by showing examples where nonlinear regression or lasso let us learn better state representations than plain linear regression does; the correctness of these instances follows directly from our general analysis.", "full_text": "Supervised Learning for Dynamical System Learning

Ahmed Hefny *
Carnegie Mellon University
Pittsburgh, PA 15213
ahefny@cs.cmu.edu

Carlton Downey †
Carnegie Mellon University
Pittsburgh, PA 15213
cmdowney@cs.cmu.edu

Geoffrey J. Gordon ‡
Carnegie Mellon University
Pittsburgh, PA 15213
ggordon@cs.cmu.edu

Abstract

Recently there has been substantial interest in spectral methods for learning dynamical systems. These methods are popular since they often offer a good tradeoff between computational and statistical efficiency. Unfortunately, they can be difficult to use and extend in practice: e.g., they can make it difficult to incorporate prior information such as sparsity or structure.
To address this problem, we present a new view of dynamical system learning: we show how to learn dynamical systems by solving a sequence of ordinary supervised learning problems, thereby allowing users to incorporate prior knowledge via standard techniques such as L1 regularization. Many existing spectral methods are special cases of this new framework, using linear regression as the supervised learner. We demonstrate the effectiveness of our framework by showing examples where nonlinear regression or lasso let us learn better state representations than plain linear regression does; the correctness of these instances follows directly from our general analysis.

1 Introduction

Likelihood-based approaches to learning dynamical systems, such as EM [1] and MCMC [2], can be slow and suffer from local optima. This difficulty has resulted in the development of so-called "spectral algorithms" [3], which rely on factorization of a matrix of observable moments; these algorithms are often fast, simple, and globally optimal.
Despite these advantages, spectral algorithms fall short in one important respect compared to EM and MCMC: the latter two methods are meta-algorithms or frameworks that offer a clear template for developing new instances incorporating various forms of prior knowledge. For spectral algorithms, by contrast, there is no clear template to go from a set of probabilistic assumptions to an algorithm. In fact, researchers often relax model assumptions to make the algorithm design process easier, potentially discarding valuable information in the process.
To address this problem, we propose a new framework for dynamical system learning, using the idea of instrumental-variable regression [4, 5] to transform dynamical system learning into a sequence of ordinary supervised learning problems. This transformation allows us to apply the rich literature on supervised learning to incorporate many types of prior knowledge.
Our new methods subsume a variety of existing spectral algorithms as special cases.
The remainder of this paper is organized as follows: first we formulate the new learning framework (Sec. 2). We then provide theoretical guarantees for the proposed methods (Sec. 4). Finally, we give two examples of how our techniques let us rapidly design new and useful dynamical system learning methods by encoding modeling assumptions (Sec. 5).

* This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.
† Supported by a grant from the PNC Center for Financial Services Innovation.
‡ Supported by NIH grant R01 MH 064537 and ONR contract N000141512365.

Figure 1: A latent-state dynamical system. Observation ot is determined by latent state st and noise εt.

Figure 2: Learning and applying a dynamical system with instrumental regression. The predictions from S1 provide training data to S2. At test time, we filter or predict using the weights from S2.

2 A framework for spectral algorithms

A dynamical system is a stochastic process (i.e., a distribution over sequences of observations) such that, at any time, the distribution of future observations is fully determined by a vector st called the latent state. The process is specified by three distributions: the initial state distribution P(s1), the state transition distribution P(st+1 | st), and the observation distribution P(ot | st). For later use, we write the observation ot as a function of the state st and random noise εt, as shown in Figure 1.
Given a dynamical system, one of the fundamental tasks is to perform inference, where we predict future observations given a history of observations.
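To make this setup concrete, the sketch below samples from a small discrete instance of such a system (a two-state HMM) and runs a standard Bayes-filter belief update; all probability values are illustrative choices of ours, not numbers from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small discrete instance of the three distributions above
# (all numbers are illustrative):
pi = np.array([0.6, 0.4])            # initial state distribution P(s1)
T = np.array([[0.9, 0.1],            # transitions P(s_{t+1} | s_t); rows sum to 1
              [0.2, 0.8]])
O = np.array([[0.8, 0.2],            # observations P(o_t | s_t); rows sum to 1
              [0.3, 0.7]])

def sample_sequence(length):
    """Sample (states, observations) from the latent-state system."""
    states, obs = [], []
    s = rng.choice(2, p=pi)
    for _ in range(length):
        states.append(s)
        obs.append(rng.choice(2, p=O[s]))
        s = rng.choice(2, p=T[s])
    return states, obs

def filter_belief(b, o):
    """One Bayes-filter step: condition the belief on o, then propagate."""
    b = b * O[:, o]
    b = b / b.sum()
    return b @ T

states, obs = sample_sequence(100)
b = pi.copy()
for o in obs:
    b = filter_belief(b, o)
```

The loop above is the recursive inference that the rest of this section re-expresses in terms of predictive states rather than latent-state beliefs.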
Typically this is accomplished by maintaining a distribution or belief over states bt|t−1 = P(st | o1:t−1), where o1:t−1 denotes the first t − 1 observations. bt|t−1 represents both our knowledge and our uncertainty about the true state of the system. Two core inference tasks are filtering and prediction.^1 In filtering, given the current belief bt = bt|t−1 and a new observation ot, we calculate an updated belief bt+1 = bt+1|t that incorporates ot. In prediction, we project our belief into the future: given a belief bt|t−1 we estimate bt+k|t−1 = P(st+k | o1:t−1) for some k > 0 (without incorporating any intervening observations).
The typical approach for learning a dynamical system is to explicitly learn the initial, transition, and observation distributions by maximum likelihood. Spectral algorithms offer an alternate approach to learning: they instead use the method of moments to set up a system of equations that can be solved in closed form to recover estimates of the desired parameters. In this process, they typically factorize a matrix or tensor of observed moments—hence the name "spectral."
Spectral algorithms often (but not always [6]) avoid explicitly estimating the latent state or the initial, transition, or observation distributions; instead they recover observable operators that can be used to perform filtering and prediction directly. To do so, they use an observable representation: instead of maintaining a belief bt over states st, they maintain the expected value of a sufficient statistic of future observations. Such a representation is often called a (transformed) predictive state [7].
In more detail, we define qt = qt|t−1 = E[ψt | o1:t−1], where ψt = ψ(ot:t+k−1) is a vector of future features.
The features are chosen such that qt determines the distribution of future observations P(ot:t+k−1 | o1:t−1).^2 Filtering then becomes the process of mapping a predictive state qt to qt+1 conditioned on ot, while prediction maps a predictive state qt = qt|t−1 to qt+k|t−1 = E[ψt+k | o1:t−1] without intervening observations.
A typical way to derive a spectral method is to select a set of moments involving ψt, work out the expected values of these moments in terms of the observable operators, then invert this relationship to get an equation for the observable operators in terms of the moments.

^1 There are other forms of inference in addition to filtering and prediction, such as smoothing and likelihood evaluation, but they are outside the scope of this paper.
We can then plug in an empirical estimate of the moments to compute estimates of the observable operators.
While effective, this approach can be statistically inefficient (the goal of being able to solve for the observable operators is in conflict with the goal of maximizing statistical efficiency) and can make it difficult to incorporate prior information (each new source of information leads to new moments and a different and possibly harder set of equations to solve). To address these problems, we show that we can instead learn the observable operators by solving three supervised learning problems.
The main idea is that, just as we can represent a belief about a latent state st as the conditional expectation of a vector of observable statistics, we can also represent any other distributions needed for prediction and filtering via their own vectors of observable statistics. Given such a representation, we can learn to filter and predict by learning how to map these vectors to one another.
In particular, the key intermediate quantity for filtering is the "extended and marginalized" belief P(ot, st+1 | o1:t−1)—or equivalently P(ot:t+k | o1:t−1). We represent this distribution via a vector ξt = ξ(ot:t+k) of features of the extended future. The features are chosen such that the extended state pt = E[ξt | o1:t−1] determines P(ot:t+k | o1:t−1). Given P(ot:t+k | o1:t−1), filtering and prediction reduce respectively to conditioning on and marginalizing over ot.
In many models (including Hidden Markov Models (HMMs) and Kalman filters), the extended state pt is linearly related to the predictive state qt—a property we exploit for our framework. That is,

pt = W qt    (1)

for some linear operator W.
For example, in a discrete system ψt can be an indicator vector representing the joint assignment of the next k observations, and ξt can be an indicator vector for the next k + 1 observations. The matrix W is then the conditional probability table P(ot:t+k | ot:t+k−1).
Our goal, therefore, is to learn this mapping W. Naïvely, we might try to use linear regression for this purpose, substituting samples of ψt and ξt in place of qt and pt since we cannot observe qt or pt directly. Unfortunately, due to the overlap between observation windows, the noise terms on ψt and ξt are correlated. So, naïve linear regression will give a biased estimate of W.
To counteract this bias, we employ instrumental regression [4, 5]. Instrumental regression uses instrumental variables that are correlated with the input qt but not with the noise εt:t+k. This property provides a criterion to denoise the inputs and outputs of the original regression problem: we remove that part of the input/output that is not correlated with the instrumental variables. In our case, since past observations o1:t−1 do not overlap with the future or extended future windows, they are not correlated with the noise εt:t+k, as can be seen in Figure 1.
Therefore, we can use history features ht = h(o1:t−1) as instrumental variables.
In more detail, by taking the expectation of (1) given ht, we obtain an instrument-based moment condition: for all t,

E[pt | ht] = E[W qt | ht]
E[E[ξt | o1:t−1] | ht] = W E[E[ψt | o1:t−1] | ht]
E[ξt | ht] = W E[ψt | ht]    (2)

Assuming that there are enough independent dimensions in ht that are correlated with qt, we maintain the rank of the moment condition when moving from (1) to (2), and we can recover W by least squares regression if we can compute E[ψt | ht] and E[ξt | ht] for sufficiently many examples t.
Fortunately, conditional expectations such as E[ψt | ht] are exactly what supervised learning algorithms are designed to compute. So, we arrive at our learning framework: we first use supervised

^2 For convenience we assume that the system is k-observable: that is, the distribution of all future observations is determined by the distribution of the next k observations. (Note: not by the next k observations themselves.)
At the cost of additional notation, this restriction could easily be lifted.

Model/Algorithm: Spectral Algorithm for HMM [3]
  future features ψt: U⊤e_ot, where e_ot is an indicator vector and U spans the range of qt (typically the top m left singular vectors of the joint probability table P(ot+1, ot))
  extended future features ξt: U⊤e_ot+1 ⊗ e_ot
  ffilter: Estimate a state normalizer from S1A output states.

Model/Algorithm: SSID for Kalman filters (time-dependent gain)
  future features ψt: xt and xt ⊗ xt, where xt = U⊤ot:t+k−1 for a matrix U that spans the range of qt (typically the top m left singular vectors of the covariance matrix Cov(ot:t+k−1, ot−k:t−1))
  extended future features ξt: yt and yt ⊗ yt, where yt is formed by stacking U⊤ot+1:t+k and ot
  ffilter: pt specifies a Gaussian distribution where conditioning on ot is straightforward.

Model/Algorithm: SSID for stable Kalman filters (constant gain)
  future features ψt: U⊤ot:t+k−1 (U obtained as above)
  extended future features ξt: ot and U⊤ot+1:t+k
  ffilter: Estimate the steady-state covariance by solving a Riccati equation [8]. pt together with the steady-state covariance specifies a Gaussian distribution where conditioning on ot is straightforward.

Model/Algorithm: Uncontrolled HSE-PSR [9]
  future features ψt: evaluation functional ks(ot:t+k−1, ·) for a characteristic kernel ks
  extended future features ξt: ko(ot, ·) ⊗ ko(ot, ·) and ψt+1 ⊗ ko(ot, ·)
  ffilter: Kernel Bayes rule [10].

Table 1: Examples of existing spectral algorithms reformulated as two-stage instrument regression with linear S1 regression. Here ot1:t2 is a vector formed by stacking observations ot1 through ot2 and ⊗ denotes the outer product.
Details and derivations can be found in the supplementary material.

learning to estimate E[ψt | ht] and E[ξt | ht], effectively denoising the training examples, and then use these estimates to compute W by finding the least squares solution to (2).
In summary, learning and inference of a dynamical system through instrumental regression can be described as follows:

• Model Specification: Pick features of history ht = h(o1:t−1), future ψt = ψ(ot:t+k−1) and extended future ξt = ξ(ot:t+k). ψt must be a sufficient statistic for P(ot:t+k−1 | o1:t−1). ξt must satisfy
  – E[ψt+1 | o1:t−1] = fpredict(E[ξt | o1:t−1]) for a known function fpredict.
  – E[ψt+1 | o1:t] = ffilter(E[ξt | o1:t−1], ot) for a known function ffilter.
• S1A (Stage 1A) Regression: Learn a (possibly non-linear) regression model to estimate ψ̄t = E[ψt | ht]. The training data for this model are (ht, ψt) across time steps t.^3
• S1B Regression: Learn a (possibly non-linear) regression model to estimate ξ̄t = E[ξt | ht]. The training data for this model are (ht, ξt) across time steps t.
• S2 Regression: Use the feature expectations estimated in S1A and S1B to train a model to predict ξ̄t = W ψ̄t, where W is a linear operator. The training data for this model are estimates of (ψ̄t, ξ̄t) obtained from S1A and S1B across time steps t.
• Initial State Estimation: Estimate an initial state q1 = E[ψ1] by averaging ψ1 across several example realizations of our time series.^4
• Inference: Starting from the initial state q1, we can maintain the predictive state qt = E[ψt | o1:t−1] through filtering: given qt we compute pt = E[ξt | o1:t−1] = W qt. Then, given the observation ot, we can compute qt+1 = ffilter(pt, ot).
Or, in the absence of ot, we can predict the next state qt+1|t−1 = fpredict(pt). Finally, by definition, the predictive state qt is sufficient to compute P(ot:t+k−1 | o1:t−1).^5

The process of learning and inference is depicted in Figure 2. Modeling assumptions are reflected in the choice of the statistics ψ, ξ and h as well as the regression models in stages S1A and S1B. Table 1 demonstrates that we can recover existing spectral algorithms for dynamical system learning using linear S1 regression. In addition to providing a unifying view of some successful learning algorithms, the new framework also paves the way for extending these algorithms in a theoretically justified manner, as we demonstrate in the experiments below.

^3 Our bounds assume that the training time steps t are sufficiently spaced for the underlying process to mix, but in practice, the error will only get smaller if we consider all time steps t.
^4 Assuming ergodicity, we can set the initial state to be the empirical average vector of future features in a single long sequence, (1/T) Σ_{t=1}^T ψt.
^5 It might seem reasonable to learn qt+1 = fcombined(qt, ot) directly, thereby avoiding the need to separately estimate pt and condition on ot. Unfortunately, fcombined is nonlinear for common models such as HMMs.

3 Related Work

This work extends predictive state learning algorithms for dynamical systems, which include spectral algorithms for Kalman filters [11], Hidden Markov Models [3, 12], Predictive State Representations (PSRs) [13, 14] and Weighted Automata [15]. It also extends kernel variants such as [9], which builds on [16].
All of the above work effectively uses linear regression or linear ridge regression (although not always in an obvious way).
One common aspect of predictive state learning algorithms is that they exploit the covariance structure between future and past observation sequences to obtain an unbiased observable state representation. Boots and Gordon [17] note the connection between this covariance and (linear) instrumental regression in the context of the HSE-HMM. We use this connection to build a general framework for dynamical system learning where the state space can be identified using arbitrary (possibly nonlinear) supervised learning methods. This generalization lets us incorporate prior knowledge to learn compact or regularized models; our experiments demonstrate that this flexibility lets us take better advantage of limited data.
Reducing the problem of learning dynamical systems with latent state to supervised learning bears similarity to Langford et al.'s sufficient posterior representation (SPR) [18], which encodes the state by the sufficient statistics of the conditional distribution of the next observation and represents system dynamics by three vector-valued functions that are estimated using supervised learning approaches. While SPR allows all of these functions to be non-linear, it involves a rather complicated training procedure involving multiple iterations of model refinement and model averaging, whereas our framework only requires solving three regression problems in sequence. In addition, the theoretical analysis of [18] only establishes the consistency of SPR learning assuming that all regression steps are solved perfectly. Our work, on the other hand, establishes convergence rates based on the performance of S1 regression.

4 Theoretical Analysis

In this section we present error bounds for two-stage instrumental regression.
These bounds hold regardless of the particular S1 regression method used, assuming that the S1 predictions converge to the true conditional expectations. The bounds imply that our overall method is consistent.
Let (xt, yt, zt) ∈ (X, Y, Z) be i.i.d. triplets of input, output, and instrumental variables. (Lack of independence will result in slower convergence in proportion to the mixing time of our process.) Let x̄t and ȳt denote E[xt | zt] and E[yt | zt]. And, let x̂t and ŷt denote Ê[xt | zt] and Ê[yt | zt] as estimated by the S1A and S1B regression steps. Here x̄t, x̂t ∈ X and ȳt, ŷt ∈ Y.
We want to analyze the convergence of the output of S2 regression—that is, of the weights W given by ridge regression between S1A outputs and S1B outputs:

Ŵλ = ( Σ_{t=1}^T ŷt ⊗ x̂t ) ( Σ_{t=1}^T x̂t ⊗ x̂t + λI_X )^{−1}    (3)

Here ⊗ denotes tensor (outer) product, and λ > 0 is a regularization parameter that ensures the invertibility of the estimated covariance.
Before we state our main theorem we need to quantify the quality of S1 regression in a way that is independent of the S1 functional form. To do so, we place a bound on the S1 error, and assume that this bound converges to zero: given the definition below, for each fixed δ, lim_{N→∞} η_{δ,N} = 0.
Definition 1 (S1 Regression Bound). For any δ > 0 and N ∈ N+, the S1 regression bound η_{δ,N} > 0 is a number such that, with probability at least (1 − δ/2), for all 1 ≤ t ≤ N:

‖x̂t − x̄t‖_X < η_{δ,N}
‖ŷt − ȳt‖_Y < η_{δ,N}

In many applications, X, Y and Z will be finite-dimensional real vector spaces: R^dx, R^dy and R^dz. However, for generality we state our results in terms of arbitrary reproducing kernel Hilbert spaces. In this case S2 uses kernel ridge regression, leading to methods such as HSE-PSRs. For this purpose, let Σx̄x̄ and Σȳȳ denote the (uncentered) covariance operators of x̄ and ȳ respectively: Σx̄x̄ = E[x̄ ⊗ x̄], Σȳȳ = E[ȳ ⊗ ȳ]. And, let R(Σx̄x̄) denote the closure of the range of Σx̄x̄.
With the above assumptions, Theorem 2 gives a generic error bound on S2 regression in terms of S1 regression. If X and Y are finite-dimensional and Σx̄x̄ has full rank, then using ordinary least squares (i.e., setting λ = 0) will give the same bound, but with λ in the first two terms replaced by the minimum eigenvalue of Σx̄x̄, and the last term dropped.
Theorem 2. Assume that ‖x̄‖_X, ‖ȳ‖_Y < c < ∞ almost surely. Assume W is a Hilbert-Schmidt operator, and let Ŵλ be as defined in (3). Then, with probability at least 1 − δ, for each xtest ∈ R(Σx̄x̄) s.t. ‖xtest‖_X ≤ 1, the error ‖Ŵλ xtest − W xtest‖_Y is bounded by

O( η_{δ,N} ( (1/λ) √(1 + √(log(1/δ)/N)) + √(log(1/δ)/N) / λ^{3/2} ) )    [error in S1 regression]
+ O( (log(1/δ)/√N) ( 1/λ + 1/λ^{3/2} ) )    [error from finite samples]
+ O( √λ )    [error from regularization]

We defer the proof to the supplementary material.
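The estimator (3) is short to state in code. Below is a minimal numpy sketch of the full two-stage procedure on synthetic data, with ordinary linear regression playing the role of S1; the data-generating model and all variable names here are our own illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dx, dy, dz = 2000, 3, 3, 5

# Synthetic instrumental-regression problem: z is the instrument; x and y
# share the noise term e, so regressing y on x directly is biased.
W_true = rng.normal(size=(dy, dx))
Z = rng.normal(size=(N, dz))
B = rng.normal(size=(dz, dx))
e = rng.normal(size=(N, dx))
X = Z @ B + e
Y = X @ W_true.T + 2.0 * e @ W_true.T   # noise on y is correlated with x

# S1: denoise inputs and outputs by linear regression on the instrument z.
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # estimate of E[x | z]
Y_hat = Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]   # estimate of E[y | z]

# S2: ridge regression between the S1 outputs, as in (3):
#   W_lam = (sum_t yhat_t xhat_t^T)(sum_t xhat_t xhat_t^T + lam I)^(-1)
lam = 1e-3
W_hat = (Y_hat.T @ X_hat) @ np.linalg.inv(X_hat.T @ X_hat + lam * np.eye(dx))

# For comparison: the naive estimate, biased by the shared noise.
W_naive = np.linalg.lstsq(X, Y, rcond=None)[0].T
```

In this construction W_hat approaches W_true as N grows, while W_naive retains a bias from the correlated noise regardless of N; this is the effect of the moment condition (2).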
The supplementary material also provides explicit finite-sample bounds (including expressions for the constants hidden by O-notation), as well as concrete examples of S1 regression bounds η_{δ,N} for practical regression models.
Theorem 2 assumes that xtest is in R(Σx̄x̄). For dynamical systems, all valid states satisfy this property. However, with finite data, estimation errors may cause the estimated state q̂t (i.e., xtest) to have a non-zero component in R⊥(Σx̄x̄). Lemma 3 bounds the effect of such errors: it states that, in a stable system, this component gets smaller as S1 regression performs better. The main limitation of Lemma 3 is the assumption that ffilter is L-Lipschitz, which essentially means that the model's estimated probability for ot is bounded below. There is no way to guarantee this property in practice; so, Lemma 3 provides suggestive evidence rather than a guarantee that our learned dynamical system will predict well.
Lemma 3. For observations o1:T, let q̂t be the estimated state given o1:t−1. Let q̃t be the projection of q̂t onto R(Σx̄x̄). Assume ffilter is L-Lipschitz on pt when evaluated at ot, and ffilter(pt, ot) ∈ R(Σx̄x̄) for any pt ∈ R(Σȳȳ). Given the assumptions of Theorem 2 and assuming that ‖q̂t‖_X ≤ R for all 1 ≤ t ≤ T, the following holds for all 1 ≤ t ≤ T with probability at least 1 − δ/2:

‖εt‖_X = ‖q̂t − q̃t‖_X = O( η_{δ,N} / √λ )

Since Ŵλ is bounded, the prediction error due to εt diminishes at the same rate as ‖εt‖_X.

5 Experiments and Results

We now demonstrate examples of tweaking the S1 regression to gain advantage.
In the first experiment we show that nonlinear regression can be used to reduce the number of parameters needed in S1, thereby improving statistical performance for learning an HMM. In the second experiment we show that we can encode prior knowledge as regularization.

5.1 Learning A Knowledge Tracing Model

In this experiment we attempt to model and predict the performance of students learning from an interactive computer-based tutor. We use the Bayesian knowledge tracing (BKT) model [19], which is essentially a 2-state HMM: the state st represents whether a student has learned a knowledge component (KC), and the observation ot represents the success/failure of solving the t-th question in a sequence of questions that cover this KC. Figure 3 summarizes the model. The events denoted by guessing, slipping, learning and forgetting typically have relatively low probabilities.

5.1.1 Data Description

We evaluate the model using the "Geometry Area (1996-97)" data available from DataShop [20]. This data was generated by students learning introductory geometry, and contains attempts by 59 students in 12 knowledge components.

Figure 3: Transitions and observations in BKT. Each node represents a possible value of the state or observation. Solid arrows represent transitions while dashed arrows represent observations.

As is typical for BKT, we consider a student's attempt at a question to be correct iff the student entered the correct answer on the first try, without requesting any hints from the help system. Each training sequence consists of a sequence of first attempts for a student/KC pair. We discard sequences of length less than 5, resulting in a total of 325 sequences.

5.1.2 Models and Evaluation

Under the (reasonable) assumption that the two states have distinct observation probabilities, this model is 1-observable.
Hence we define the predictive state to be the expected next observation, which results in the following statistics: ψt = ot and ξt = ot ⊗k ot+1, where ot is represented by a 2-dimensional indicator vector and ⊗k denotes the Kronecker product. Given these statistics, the extended state pt = E[ξt | o1:t−1] is a joint probability table of ot:t+1.
We compare three models that differ by history features and S1 regression method:
Spec-HMM: This baseline uses ht = ot−1 and linear S1 regression, making it equivalent to the spectral HMM method of [3], as detailed in the supplementary material.
Feat-HMM: This baseline represents ht by an indicator vector of the joint assignment of the previous b observations (we set b to 4) and uses linear S1 regression. This is essentially a feature-based spectral HMM [12]. It thus incorporates more history information compared to Spec-HMM at the expense of increasing the number of S1 parameters by O(2^b).
LR-HMM: This model represents ht by a binary vector of length b encoding the previous b observations and uses logistic regression as the S1 model. Thus, it uses the same history information as Feat-HMM but reduces the number of parameters to O(b) at the expense of inductive bias.
We evaluated the above models using 1000 random splits of the 325 sequences into 200 training and 125 testing. For each testing observation ot we compute the absolute error between actual and expected value (i.e., |δ_{ot=1} − P̂(ot = 1 | o1:t−1)|). We report the mean absolute error for each split. The results are displayed in Figure 4.^6 We see that, while incorporating more history information increases accuracy (Feat-HMM vs. Spec-HMM), being able to incorporate the same information using a more compact model gives an additional gain in accuracy (LR-HMM vs. Feat-HMM). We also compared the LR-HMM method to an HMM trained using expectation maximization (EM).
We found that the LR-HMM model is much faster to train than EM while being on par with it in terms of prediction error.^7

5.2 Modeling Independent Subsystems Using Lasso Regression

Spectral algorithms for Kalman filters typically use the left singular vectors of the covariance between history and future features as a basis for the state space. However, this basis hides any sparsity that might be present in our original basis. In this experiment, we show that we can instead use lasso (without dimensionality reduction) as our S1 regression algorithm to discover sparsity. This is useful, for example, when the system consists of multiple independent subsystems, each of which affects a subset of the observation coordinates.

^6 The differences have similar sign but smaller magnitude if we use RMSE instead of MAE.
^7 We used MATLAB's built-in logistic regression and EM functions.

Model       Training time (relative to Spec-HMM)
Spec-HMM    1
Feat-HMM    1.02
LR-HMM      2.219
EM          14.323

Figure 4: Experimental results: each graph compares the performance of two models (measured by mean absolute error) on 1000 train/test splits. The black line is x = y. Points below this line indicate that model y is better than model x. The table shows training time.

Figure 5: Left singular vectors of (left) the true linear predictor from ot−1 to ot (i.e., OTO+), (middle) the covariance matrix between ot and ot−1 and (right) S1 sparse regression weights. Each column corresponds to a singular vector (only absolute values are depicted). Singular vectors are ordered by their mean coordinate, interpreting absolute values as a probability distribution over coordinates.

To test this idea we generate a sequence of 30-dimensional observations from a Kalman filter.
Observation dimensions 1 through 10 and 11 through 20 are generated from two independent subsystems of state dimension 5. Dimensions 21 through 30 are generated from white noise. Each subsystem's transition and observation matrices have random Gaussian coordinates, with the transition matrix scaled to have a maximum eigenvalue of 0.95. States and observations are perturbed by Gaussian noise with covariances 0.01I and 1.0I respectively.
We estimate the state space basis using 1000 examples (assuming 1-observability) and compare the singular vectors of the past-to-future regression matrix to those obtained from the lasso regression matrix. The result is shown in Figure 5. Clearly, using lasso as the stage 1 regression results in a basis that better matches the structure of the underlying system.

6 Conclusion

In this work we developed a general framework for dynamical system learning using supervised learning methods. The framework relies on two key principles: first, we extend the idea of predictive state to include an extended state as well, allowing us to represent all of inference in terms of predictions of observable features. Second, we use past features as instruments in an instrumental regression, denoising state estimates that then serve as training examples for estimating system dynamics.
We have shown that this framework encompasses and provides a unified view of several previous successful dynamical system learning algorithms. We have also demonstrated that it can be used to extend existing algorithms to incorporate nonlinearity and regularizers, resulting in better state estimates. As future work, we would like to apply this framework to leverage additional techniques such as manifold embedding and transfer learning in stage 1 regression. We would also like to extend the framework to controlled processes.

References
[1] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss.
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.

[2] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.

[3] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

[4] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000.

[5] J. H. Stock and M. W. Watson. Introduction to Econometrics. Addison-Wesley Series in Economics. Addison-Wesley, 2011.

[6] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.

[7] Matthew Rosencrantz and Geoff Gordon. Learning low dimensional predictive representations. In ICML '04: Twenty-first International Conference on Machine Learning, pages 695–702, 2004.

[8] P. van Overschee and B. De Moor. Subspace identification for linear systems: theory, implementation, applications. Kluwer Academic Publishers, 1996.

[9] Byron Boots, Arthur Gretton, and Geoffrey J. Gordon. Hilbert space embeddings of predictive state representations. In Proc. 29th Intl. Conf. on Uncertainty in Artificial Intelligence (UAI), 2013.

[10] Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels.
Journal of Machine Learning Research, 14(1):3753–3783, 2013.

[11] Byron Boots. Spectral Approaches to Learning Predictive Representations. PhD thesis, Carnegie Mellon University, December 2012.

[12] Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2010), 2010.

[13] Byron Boots, Sajid Siddiqi, and Geoffrey Gordon. Closing the learning-planning loop with predictive state representations. In the International Journal of Robotics Research, volume 30, pages 954–956, 2011.

[14] Byron Boots and Geoffrey Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence (AAAI-2011), 2011.

[15] Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1386–1394, 2014.

[16] L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proc. 27th Intl. Conf. on Machine Learning (ICML), 2010.

[17] Byron Boots and Geoffrey Gordon. Two-manifold problems with applications to nonlinear system identification. In Proc. 29th Intl. Conf. on Machine Learning (ICML), 2012.

[18] John Langford, Ruslan Salakhutdinov, and Tong Zhang. Learning nonlinear dynamic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 593–600, 2009.

[19] Albert T. Corbett and John R. Anderson. Knowledge tracing: Modelling the acquisition of procedural knowledge. User Model. User-Adapt. Interact., 4(4):253–278, 1995.

[20] Kenneth R. Koedinger, R. S. J. Baker, K. Cunningham, A. Skogsholm, B.
Leber, and John Stamper. A data repository for the EDM community: The PSLC DataShop. Handbook of Educational Data Mining, pages 43–55, 2010.

[21] Le Song, Jonathan Huang, Alexander J. Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 961–968, 2009.