{"title": "Spectral Filtering for General Linear Dynamical Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 4634, "page_last": 4643, "abstract": "We give a polynomial-time algorithm for learning latent-state linear dynamical systems without system identification, and without assumptions on the spectral radius of the system's transition matrix. The algorithm extends the recently introduced technique of spectral filtering, previously applied only to systems with a symmetric transition matrix, using a novel convex relaxation to allow for the efficient identification of phases.", "full_text": "Spectral Filtering for\n\nGeneral Linear Dynamical Systems\n\nElad Hazan\n\nPrinceton University & Google AI Princeton\n\nehazan@cs.princeton.edu\n\nHolden Lee\n\nPrinceton University\n\nholdenl@princeton.edu\n\nKaran Singh\n\nPrinceton University & Google AI Princeton\n\nkarans@cs.princeton.edu\n\nCyril Zhang\n\nPrinceton University & Google AI Princeton\n\ncyril.zhang@cs.princeton.edu\n\nYi Zhang\n\nPrinceton University & Google AI Princeton\n\ny.zhang@cs.princeton.edu\n\nAbstract\n\nWe give a polynomial-time algorithm for learning latent-state linear dynamical\nsystems without system identi\ufb01cation, and without assumptions on the spectral\nradius of the system\u2019s transition matrix. The algorithm extends the recently in-\ntroduced technique of spectral \ufb01ltering, previously applied only to systems with\na symmetric transition matrix, using a novel convex relaxation to allow for the\nef\ufb01cient identi\ufb01cation of phases.\n\n1\n\nIntroduction\n\nLinear dynamical systems (LDSs) are a cornerstone of signal processing and time series analysis.\nThe problem of predicting the response signal arising from a LDS is a fundamental problem in\nmachine learning, with a history of more than half a century.\nAn LDS is given by matrices (\ud835\udc34, \ud835\udc35, \ud835\udc36, \ud835\udc37). 
Given a sequence of inputs {\ud835\udc65\ud835\udc61}, the output {\ud835\udc66\ud835\udc61} of the\nsystem is governed by the linear equations\n\n\u210e\ud835\udc61 = \ud835\udc34\u210e\ud835\udc61\u22121 + \ud835\udc35\ud835\udc65\ud835\udc61 + \ud835\udf02\ud835\udc61\n\ud835\udc66\ud835\udc61 = \ud835\udc36\u210e\ud835\udc61 + \ud835\udc37\ud835\udc65\ud835\udc61 + \ud835\udf09\ud835\udc61,\n\n(1)\n\nwhere \ud835\udf02\ud835\udc61, \ud835\udf09\ud835\udc61 are noise vectors, and \u210e\ud835\udc61 is a hidden (latent) state.\nRoweis and Ghahramani [RG99] show that special cases of this formulation capture a host of ma-\nchine learning models, including hidden Markov models, Gaussian mixture models, principal com-\nponent analysis, and linear Gaussian models. It has been observed numerous times in the literature\nthat if there is no hidden state, or if the transition matrices are known, then the formulation is essen-\ntially convex and amenable to ef\ufb01cient optimization.\nIn this paper we are concerned with the general and more challenging case, arguably the one which\nis more applicable as well, in which the hidden state is not observed, and the system dynamics\nare unknown to the learner. In this setting, despite the vast literature on the subject from various\ncommunities, there is a lack of provably ef\ufb01cient methods for learning the LDS without strong\ngenerative or other assumptions.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fBuilding on recent advances in spectral \ufb01ltering, we develop a novel convex relaxation for LDSs,\nresulting in an ef\ufb01cient algorithm for the LDS prediction problem in the general setting. Our al-\ngorithm makes online predictions which are close (in terms of mean squared error) to those of the\noptimal LDS in hindsight.\n\n1.1 Problem statement and our results\n\nAn LDS prediction problem is de\ufb01ned as follows. 
Iteratively for t = 1, 2, ..., T, the learner observes the input to the system x_t ∈ R^n. The learner then makes a prediction ŷ_t ∈ R^m, observes the true outcome y_t ∈ R^m, and suffers a loss ℓ(ŷ_t, y_t). For simplicity we consider the mean squared error ℓ(ŷ_t, y_t) = ‖ŷ_t − y_t‖², even though our techniques can handle any Lipschitz convex loss.
The goal of the online learner is to minimize its regret, the difference between its cumulative loss and that of the best LDS in hindsight, which predicts y*_1, ..., y*_T:

Regret(T) := Σ_{t=1}^T ‖ŷ_t − y_t‖² − Σ_{t=1}^T ‖y*_t − y_t‖².

We emphasize that the y*_t are not fixed vectors, but rather evolve according to a hidden state and equation (1), under the best possible transition matrices in terms of mean squared error fit to the data.
Our main result is a polynomial-time algorithm that predicts ŷ_t given all previous inputs and feedback (x_{1:t}, y_{1:t−1}), and attains a near-optimal regret bound of

Regret(T) ≤ Õ(√T) + K · L.

Here, L denotes the inevitable loss incurred by perturbations to the system, which cannot be anticipated by the learner and are allowed to be adversarial. 
This L can grow with time, and is usually assumed to be proportional to a small constant, say εT.
The constant inside the Õ(·), as well as K, depend polynomially on the dimensionality of the system, the norms of the inputs and outputs, and certain natural quantities related to the transition matrix A. Additionally, the running time of our algorithm is polynomial in all natural parameters of the problem.
In comparison to previous approaches, we note:

∙ Our algorithm is the first sample-efficient and polynomial-time algorithm with this guarantee. In the next section, we survey local-search algorithms that either only converge to local optima or require an exponential number of iterations in the worst case.

∙ The main feature is that the regret does not depend on the spectral radius ρ(A) of the system's hidden-state transition matrix. If one allows a dependence on the condition number, then simple linear-regression-based algorithms are known to obtain the same result, with time and sample complexity polynomial in 1/(1 − ρ(A)). (See Section 6 of [HMR16].)

1.2 Related work

The prediction problem of time series for linear dynamical systems was defined in the seminal work of Kalman [Kal60], who introduced the Kalman filter as a recursive least-squares solution for maximum likelihood estimation (MLE) of Gaussian perturbations to the system. 
For more background\nsee the classic survey [Lju98], and the extensive overview of recent literature in [HMR16].\nFor a linear dynamical system with no hidden state, the system is identi\ufb01able by a convex program\nand thus well understood (see [DMM+17, AYS11], who address sample complexity issues and\nregret for system identi\ufb01cation and linear-quadratic control in this setting).\nVarious exponential-time approaches have been proposed to learn the system in the case that the sys-\ntem is unknown. Regret bounds similar to ours are obtainable using the continuous multiplicative-\nweights algorithm (see [CBL06], as well as the EWOO algorithm in [HAK07]). These methods,\nmentioned brie\ufb02y in [HSZ17], basically amount to discretizing the entire parameter space of LDSs,\nand take time exponential in the system dimensions. Stronger guarantees are obtained in [KM17],\nthough still in exponential time.\n\n2\n\n\fGhahramani and Roweis [RG99] suggest using the EM algorithm to learn the parameters of an\nLDS. This approach remains widely used, but is inherently non-convex and can get stuck in local\nminima. Recently [HMR16] show that for a restricted class of systems, gradient descent (also widely\nused in practice, perhaps better known in this setting as backpropagation) guarantees polynomial\nconvergence rates and sample complexity in the batch setting. Their result applies essentially only\nto the SISO case, depends polynomially on the spectral gap, and requires the signal to be generated\nby an LDS.\nIn recent work, [HSZ17] show how to ef\ufb01ciently learn an LDS in the online prediction setting,\nwithout any generative assumptions, and without dependence on the condition number. Their\nnew methodology, however, was restricted to LDSs with symmetric transition matrices. For\nthe structural result, we use the same results from the spectral theory of Hankel matrices; see\n[BT17, Hil94, Cho83]. 
Obtaining provably efficient algorithms for the general case is significantly more challenging.
We make use of linear filtering, or linear regression on past observations as well as inputs, as a subroutine for future prediction. This technique is well-established in the context of autoregressive models for time-series prediction, which have been extensively studied in the learning and signal-processing literature; see e.g. [Ham94, BJR94, BD09, KM16, AHMS13, MW07].
The recent success of recurrent neural networks (RNNs) for tasks such as speech and language modeling has inspired a resurgence of interest in linear dynamical systems [HMR16, BK15].

2 Preliminaries

2.1 Setting

A linear dynamical system Θ = (A, B, C, D), with initial hidden state h_0 ∈ R^d, specifies a map from inputs x_1, ..., x_T ∈ R^n to outputs (responses) y_1, ..., y_T ∈ R^m, given by the recursive equations

h_t = A h_{t−1} + B x_t + η_t    (2)
y_t = C h_t + D x_t + ξ_t,    (3)

where A, B, C, D are matrices of appropriate dimension, and η_t, ξ_t are noise vectors.
We make the following assumptions to characterize the "size" of an LDS we are competing against:

1. Inputs and outputs are bounded: ‖x_t‖₂ ≤ R_x, ‖y_t‖₂ ≤ R_y.¹
2. The system is Lyapunov stable, i.e., the largest singular value of A is at most 1: ρ(A) ≤ 1.
3. 
A is diagonalizable by a matrix with small entries: A = ΨΛΨ⁻¹, with ‖Ψ‖_F ‖Ψ⁻¹‖_F ≤ R_Ψ. Intuitively, this holds if the eigenvectors corresponding to larger eigenvalues aren't close to linearly dependent. (Note that we do not need the spectral radius in assumption 2 to be bounded away from 1.)

4. B, C, D have bounded spectral norms: ‖B‖₂, ‖C‖₂, ‖D‖₂ ≤ R_Θ.
5. Let S = { α/|α| : α is an eigenvalue of A } be the set of phases of all eigenvalues of A. There exists a monic polynomial p(x) of degree τ such that p(ω) = 0 for all ω ∈ S, the L₁ norm of its coefficients is at most R₁, and the L∞ norm is at most R∞. We will explain this condition in Section 4.1.

In our regret model, the adversary chooses an LDS (A, B, C, D), and has a budget L. The dynamical system produces outputs given by the above equations, where the noise vectors η_t, ξ_t are chosen adversarially, subject to the budget constraint Σ_{t=1}^T ‖η_t‖² + ‖ξ_t‖² ≤ L.
The online prediction setting is then identical to that proposed in [HSZ17]. For each iteration t = 1, ..., T, the input x_t is revealed, and the learner must predict a response ŷ_t. 
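For concreteness, the generative process in equations (2) and (3) can be sketched in a few lines of numpy; the 2-dimensional rotation below is a hypothetical example (not from the paper) of a non-symmetric transition matrix with spectral radius exactly 1, the regime this paper targets.

```python
import numpy as np

def simulate_lds(A, B, C, D, xs, noise_std=0.0, seed=0):
    """Roll out h_t = A h_{t-1} + B x_t + eta_t,  y_t = C h_t + D x_t + xi_t."""
    rng = np.random.default_rng(seed)
    d, m = A.shape[0], C.shape[0]
    h = np.zeros(d)                      # initial hidden state h_0 = 0
    ys = []
    for x in xs:
        h = A @ h + B @ x + noise_std * rng.standard_normal(d)
        ys.append(C @ h + D @ x + noise_std * rng.standard_normal(m))
    return np.array(ys)

# A hypothetical 2-d rotation: non-symmetric, all eigenvalues on the unit circle.
theta = 0.5
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B, C, D = np.eye(2), np.eye(2), np.zeros((2, 2))
xs = [np.ones(2) for _ in range(100)]
ys = simulate_lds(A, B, C, D, xs)
```

Note that such a rotation is not handled by the symmetric-case analysis of [HSZ17], since its eigenvalues have nontrivial phase.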
Then, the true y_t is revealed, and the learner suffers a least-squares loss of ‖y_t − ŷ_t‖². Of course, if L scales with the time horizon T, it is information-theoretically impossible for an online algorithm to incur a loss sublinear in T, even under non-adversarial (e.g. Gaussian) perturbations. Thus, our end-to-end goal is to track the LDS with loss that scales with the total magnitude of the perturbations, independently of T.
This formulation is fundamentally a min-max problem: given a limited budget of perturbations, an adversary tries to maximize the error of the algorithm's predictions, while the algorithm seeks to be robust against any such adversary. This corresponds to the H∞ notion of robustness in the control theory literature; see Section 15.5 of [ZDG+96].

¹Note that no bound on ‖y_t‖ is required for the approximation theorem; R_y only appears in the regret bound.

2.2 Spectral filtering for time series

The spectral filtering technique is introduced in [HSZ17], which considers a spectral decomposition of the derivative of the impulse response function of an LDS with a symmetric transition matrix. A crucial object of consideration in spectral filtering is the set of wave-filters φ_1, ..., φ_k, which are the top k eigenvectors of the deterministic Hankel matrix Z_T ∈ R^{T×T}, whose entries are given by

Z(i, j) = 2 / ((i + j)³ − (i + j)).

Bounds on the ε-rank of positive semidefinite Hankel matrices can be found in [BT17]. Our algorithm will "compress" the input time series using a time-domain convolution of the input time series with filters derived from these eigenvectors.

2.3 Notation for matrix norms

We will consider a few "mixed" ℓ_p matrix norms of a 4-tensor M, whose elements are indexed by M(p, h, j, i) (the roles and bounds of these indices will be introduced later). For conciseness, whenever the norm of such a 4-tensor is taken, we use the mixed matrix norm

‖M‖_{2,q} := [ Σ_p ( Σ_{h,i,j} M(p, h, j, i)² )^{q/2} ]^{1/q},

and the limiting case

‖M‖_{2,∞} := max_p ( Σ_{h,i,j} M(p, h, j, i)² )^{1/2}.

These are the straightforward analogues of the matrix norms defined in [KSST12], and appear in the regularization of the online prediction algorithm.

3 Algorithm and main theorem

We begin by describing the algorithm in terms of a linear model y(Θ̂_t; x_{1:t}; y_{t−1:t−τ}), the details of which appear in Definition 2.

Algorithm 1 Phased wave-filtered regression
1: Input: time horizon T, parameters k, W, τ, R_Θ̂, regularization weight η.
2: Compute {(σ_j, φ_j)}_{j=1}^k, the top k eigenpairs of Z_T.
3: Initialize Θ̂_1 ∈ K arbitrarily.
4: for 
t = 1, ..., T do
5:   Predict ŷ_t := y(Θ̂_t; x_{1:t}; y_{t−1:t−τ}).
6:   Observe y_t. Suffer loss ‖y_t − ŷ_t‖².
7:   Solve the FTRL convex program:

     Θ̂_{t+1} ← argmin_{Θ̂ ∈ K} Σ_{u=0}^{t−1} ‖y(Θ̂; x_{1:u}, y_{u−1:u−τ}) − y_u‖² + (1/η) R(Θ̂).

8: end for

The central result in the paper is stated below.

Theorem 1 (Main; informal). Consider an LDS with noise (given by (2) and (3)) satisfying the assumptions in Section 2.1, where the total noise is bounded by L. Then there is a choice of parameters such that Algorithm 1 learns a linear model Θ̂ whose predictions ŷ_t satisfy

Σ_{t=1}^T ‖ŷ_t − y_t‖² ≤ Õ( poly(R, d′) √T + R∞² τ³ R_Θ² R_Ψ² L ),    (4)

where R₁, R_x, R_y, R_Θ, R_Ψ ≤ R and m, n, d ≤ d′.
To define the algorithm, we specify a reparameterization of linear dynamical systems. To this end, we define a pseudo-LDS, which pairs a subspace-restricted linear model of the impulse response with an autoregressive model:
Definition 2. 
A pseudo-LDS Θ̂ = (M, N, β, P) is given by two 4-tensors M, N ∈ R^{W×k×n×m}, a vector β ∈ R^τ, and matrices P_0, ..., P_{τ−1} ∈ R^{m×n}. The prediction made by Θ̂, which depends on the entire history of inputs x_{1:t} and the τ past outputs y_{t−1:t−τ}, is given by

y(Θ̂; x_{1:t}, y_{t−1:t−τ})(:) := Σ_{u=1}^τ β_u y_{t−u} + Σ_{j=0}^{τ−1} P_j x_{t−j}
    + Σ_{p=0}^{W−1} Σ_{h=1}^k Σ_{u=τ}^t Σ_{i=1}^n [ M(p, h, i, :) cos(2πup/W) + N(p, h, i, :) sin(2πup/W) ] σ_h^{1/4} φ_h(u) x_{t−u}(i).

Here, φ_1, ..., φ_k ∈ R^T are the top k eigenvectors, with eigenvalues σ_1, ..., σ_k, of Z_T. These can be computed using specialized methods [BLV98]. 
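For illustration, the filters can also be obtained with a dense eigendecomposition (a minimal numpy sketch; the specialized methods of [BLV98] are only needed at scale):

```python
import numpy as np

def wave_filters(T, k):
    """Top-k eigenpairs of the Hankel matrix Z_T with entries
    Z(i, j) = 2 / ((i + j)^3 - (i + j)) for 1 <= i, j <= T."""
    idx = np.arange(1, T + 1)
    s = idx[:, None] + idx[None, :]          # the matrix of i + j
    Z = 2.0 / (s**3 - s)
    sigma, phi = np.linalg.eigh(Z)           # eigenvalues in ascending order
    return sigma[::-1][:k], phi[:, ::-1][:, :k]  # top-k, descending

sigma, phi = wave_filters(T=128, k=8)
```

Consistent with the spectral bounds of [BT17], the returned eigenvalues decay rapidly, which is why a small number k of filters suffices.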
Some of the dimensions of these tensors are parameters to the algorithm, which we list here:

∙ Number of filters k.
∙ Phase discretization parameter W.
∙ Autoregressive parameter τ.

Additionally, we define the following:

∙ Regularizer R(M, N, β, P) := ‖M‖²_{2,q} + ‖N‖²_{2,q} + ‖β‖²_{q′} + Σ_{j=1}^τ ‖P_j‖²_F, where q = ln(W)/(ln(W) − 1) and q′ = ln(τ)/(ln(τ) − 1).
∙ Composite norm ‖(M, N, β, P)‖ := ‖M‖_{2,1} + ‖N‖_{2,1} + ‖β‖₁ + ( Σ_{j=1}^τ ‖P_j‖²_F )^{1/2}.
∙ Composite norm constraint R_Θ̂, and the corresponding set of pseudo-LDSs K = { Θ̂ : ‖Θ̂‖ ≤ R_Θ̂ }.

Crucially, y(Θ̂; x_{1:t}, y_{t−1:t−τ}) is linear in each of M, N, P, β; consequently, the least-squares loss ‖y(Θ̂; x_{1:t}, y_{t−1:t−τ}) − y_t‖² is convex, and can be minimized in polynomial time. To this end, our online prediction algorithm is follow-the-regularized-leader (FTRL), which requires the solution of a convex program at each iteration. 
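The mixed norms and the regularizer above are straightforward to compute; the following numpy sketch (with small, hypothetical tensor shapes) makes the definitions concrete:

```python
import numpy as np

def mixed_2q(M, q):
    """||M||_{2,q}: the l2 norm over (h, j, i) for each phase index p, then l_q over p."""
    per_phase = np.sqrt((M**2).sum(axis=(1, 2, 3)))   # one entry per p
    return np.linalg.norm(per_phase, ord=q)

def regularizer(M, N, beta, P, W, tau):
    q, qp = np.log(W) / (np.log(W) - 1), np.log(tau) / (np.log(tau) - 1)
    fro2 = sum(np.linalg.norm(Pj, 'fro')**2 for Pj in P)
    return (mixed_2q(M, q)**2 + mixed_2q(N, q)**2
            + np.linalg.norm(beta, ord=qp)**2 + fro2)

def composite_norm(M, N, beta, P):
    fro2 = sum(np.linalg.norm(Pj, 'fro')**2 for Pj in P)
    return mixed_2q(M, 1) + mixed_2q(N, 1) + np.abs(beta).sum() + np.sqrt(fro2)

# Hypothetical shapes: W=8 phases, k=4 filters, n=3 inputs, m=2 outputs, tau=5.
rng = np.random.default_rng(0)
M, N = rng.standard_normal((8, 4, 3, 2)), rng.standard_normal((8, 4, 3, 2))
beta = rng.standard_normal(5)
P = [rng.standard_normal((2, 3)) for _ in range(5)]
r = regularizer(M, N, beta, P, W=8, tau=5)
```

As a sanity check, ‖M‖_{2,2} coincides with the flat Frobenius norm of the whole tensor.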
We choose this regularization to obtain the strongest theoretical guarantee, and provide a brief note in Section 5 on alternatives to address performance issues.
At a high level, our algorithm works by first approximating the response of an LDS by an autoregressive model of order (τ, τ), then refining the approximation using wave-filters with a phase component. Specifically, the blocks of M and N corresponding to filter index h and phase index p specify the linear dependence of y_t on a certain convolution of the input time series, whose kernel is the pointwise product of φ_h and a sinusoid with period W/p. The structural result which drives the theorem is that the dynamics of any true LDS are approximated by such a pseudo-LDS, with reasonably small parameters and coefficients.
Note that the autoregressive component in our definition of a pseudo-LDS is slightly more restricted than multivariate autoregressive models: the coefficients β_j are scalar, rather than allowed to be arbitrary matrices. These options are interchangeable for our purposes, without affecting the asymptotic regret; we choose to use scalar coefficients for a more streamlined analysis.
The online prediction algorithm is fully specified in Algorithm 1; the parameter choices that give the best asymptotic theoretical guarantees are specified in the appendix, while typical realistic settings are outlined in Section 5.

4 Analysis

There are three parts to the analysis, which we outline in the following subsections: proving the approximability of an LDS by a pseudo-LDS, bounding the regret incurred by the algorithm against the best pseudo-LDS, and finally analyzing the effect of noise L. 
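The phase-modulated convolutions described above can be sketched naively as follows. This is an illustrative, unoptimized loop (a practical implementation would use FFT-based convolution); only the wave-filter features that M and N act on are computed, the β and P terms are omitted, and the index conventions (e.g. 1- vs 0-based φ_h(u)) follow the informal description rather than the appendix.

```python
import numpy as np

def phased_wave_features(X, phi, sigma, W, tau):
    """For inputs X = (x_1, ..., x_t) as rows, return cos/sin feature arrays of
    shape (W, k, n): for each phase p and filter h, the lagged sum of
    sigma_h^{1/4} phi_h(u) x_{t-u} modulated by cos/sin(2*pi*u*p / W)."""
    t, n = X.shape
    k = phi.shape[1]
    cos_f = np.zeros((W, k, n))
    sin_f = np.zeros((W, k, n))
    for p in range(W):
        for h in range(k):
            for u in range(tau, t):              # lags u = tau, ..., t-1
                w = sigma[h]**0.25 * phi[u - 1, h]
                ang = 2 * np.pi * u * p / W
                cos_f[p, h] += w * np.cos(ang) * X[t - 1 - u]   # x_{t-u}
                sin_f[p, h] += w * np.sin(ang) * X[t - 1 - u]
    return cos_f, sin_f

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 2))                 # hypothetical input series
phi = rng.standard_normal((12, 3))               # stand-in filters for the sketch
sigma = np.array([1.0, 0.5, 0.25])
cos_f, sin_f = phased_wave_features(X, phi, sigma, W=4, tau=2)
```

The features are linear in the input series, which is what makes the overall least-squares objective convex in (M, N, β, P).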
The full proofs are in Appendices A, B, and C, respectively.

4.1 Approximation theorem for general LDSs

We develop a more general analogue of the structural result from [HSZ17], which holds for systems with an asymmetric transition matrix A.
Theorem 3 (Approximation theorem; informal). Consider a noiseless LDS (given by (2) and (3) with η_t, ξ_t = 0) satisfying the assumptions in Section 2.1. There are k = O(poly log(T, R_Θ, R_Ψ, R₁, R_x, 1/ε)) and W = O(poly(τ, R_Θ, R_Ψ, R₁, R_x, T)), and a pseudo-LDS Θ̂ of norm O(poly(R_Θ, R_Ψ, R₁, τ, k)), such that Θ̂ approximates y_t to within ε for 1 ≤ t ≤ T:

‖y(Θ̂; x_{1:t}, y_{t−1:t−τ}) − y_t‖ ≤ ε.    (5)

For the formal statement (with precise bounds) and proof, see Appendix A.2. In this section we give some intuition for the conditions and an outline of the proof.
First, we explain the condition on the polynomial p. As we show in Appendix A.1, we can predict using a pure autoregressive model, without wavefilters, if we require p to have all eigenvalues of A as roots (i.e., p is divisible by the minimal polynomial of A). However, the coefficients of this polynomial could be very large. 
The size of these coefficients will appear in the bound for the main theorem, as using large coefficients in the predictor will make it sensitive to noise.
Requiring p only to have the phases of the eigenvalues of A as roots can decrease the coefficients significantly. As an example, suppose A has d/3 distinct eigenvalues with each of the phases 1, ω, and ω̄, and suppose their absolute values are close to 1. Then the minimal polynomial is approximately (x − 1)^{d/3}(x − ω)^{d/3}(x − ω̄)^{d/3}, which can have coefficients as large as exp(Ω(d)). On the other hand, for the theorem we can take p(x) = (x − 1)(x − ω)(x − ω̄), which has degree 3 and coefficients bounded by a constant. Intuitively, the wavefilters help if there are few distinct phases, or if the phases are well-separated (consider that if the phases were exactly the d-th roots of unity, p can be taken to be x^d − 1). Note that when the roots are real, we can take p = x − 1, and the analysis reduces to that of [HSZ17].
We now sketch a proof of Theorem 3. Motivation is given by the Cayley-Hamilton theorem, which says that if p is the characteristic polynomial of A, then p(A) = 0. 
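To make the coefficient comparison concrete, the following numpy sketch contrasts the phase polynomial x³ − 1 with a polynomial having all d = 60 eigenvalues as roots (the magnitudes 0.95–0.999 are hypothetical, chosen only to be close to 1):

```python
import numpy as np

# Phase polynomial: its roots are just the three phases 1, omega, conj(omega).
omega = np.exp(2j * np.pi / 3)
p = np.poly([1.0, omega, np.conj(omega)])   # coefficients of (x-1)(x-w)(x-w~)
assert np.allclose(p, [1, 0, 0, -1])        # = x^3 - 1: constant-size coefficients

# Polynomial with all d = 60 eigenvalues as roots: d/3 = 20 distinct
# magnitudes near 1 for each of the three phases.
mags = np.linspace(0.95, 0.999, 20)
roots = np.concatenate([mags * omega**j for j in range(3)])
minimal = np.poly(roots)
print(np.abs(minimal).max())                # blows up combinatorially in d
```

The largest coefficient of the second polynomial is driven by binomial-type sums over the near-unit roots, matching the exp(Ω(d)) growth described above.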
In particular, since p(A) = 0, the sequence h_t = A^t h_0 satisfies a linear recurrence of order τ = deg p: if p(x) = x^τ + Σ_{j=1}^τ β_j x^{τ−j}, then h_t + Σ_{j=1}^τ β_j h_{t−j} = 0.
If p has only the phases as roots, then h_t + Σ_{j=1}^τ β_j h_{t−j} ≠ 0, but it can be written in terms of the wavefilters. Consider for simplicity the one-dimensional (complex) LDS y_t = α y_{t−1} + x_t, and let α = rω with |ω| = 1. Suppose p(x) = x^τ + Σ_{j=1}^τ β_j x^{τ−j} with p(ω) = 0. In general the LDS is a "sum" of LDSs of this form. 
Summing the past \ud835\udf0f terms with coef\ufb01cients given by \ud835\udefd,\n\n\ud835\udc61=1 \ud835\udefd\ud835\udc57\u210e\ud835\udc61\u2212\ud835\udc57 = 0.\n\n3 (\ud835\udc65\u2212 \ud835\udf14)\n\n3 (\ud835\udc65\u2212 \ud835\udf14)\n\n\ud835\udc51\n\n\ud835\udc66\ud835\udc61 = \ud835\udc65\ud835\udc61 +\ud835\udefc\ud835\udc65\ud835\udc61\u22121 +\u00b7\u00b7\u00b7\n\n+\ud835\udefc\ud835\udf0f \ud835\udc65\ud835\udc61\u2212\ud835\udf0f\n\n+\u00b7\u00b7\u00b7\n+\u00b7\u00b7\u00b7 +\ud835\udefc\ud835\udf0f\u22121\ud835\udc65\ud835\udc61\u2212\ud835\udf0f +\u00b7\u00b7\u00b7 )\n...\n\n...\n\ud835\udc65\ud835\udc61\u2212\ud835\udf0f\n\n+\u00b7\u00b7\u00b7 )\n\n+\ud835\udefd1(\ud835\udc66\ud835\udc61\u22121 =\n...\n+\ud835\udefd\ud835\udf0f (\ud835\udc66\ud835\udc61\u2212\ud835\udf0f =\n\n\ud835\udc65\ud835\udc61\u22121\n\n6\n\n\fThe terms \ud835\udc65\ud835\udc61, . . . , \ud835\udc65\ud835\udc61\u2212\ud835\udf0f +1 can be taken care of by linear regression. Consider a term \ud835\udc65\ud835\udc57, \ud835\udc57 < \ud835\udc61 \u2212 \ud835\udf0f\nin this sum. The coef\ufb01cient is \ud835\udefc\ud835\udc57\u2212(\ud835\udc61\u2212\ud835\udf0f )(\ud835\udefc\ud835\udf0f + \ud835\udefd1\ud835\udefc\ud835\udf0f\u22121 + \u00b7\u00b7\u00b7 + \ud835\udefd\ud835\udf0f ). Because \ud835\udc5d(\ud835\udf14) = 0, this can be\nwritten as\n(6)\nFactoring out 1\u2212 \ud835\udc5f from each of these terms show that \ud835\udc66\ud835\udc61 + \ud835\udefd1\ud835\udc66\ud835\udc61\u22121 +\u00b7\u00b7\u00b7 + \ud835\udefd\ud835\udf0f \ud835\udc66\ud835\udc61\u2212\ud835\udf0f can be expressed\nas a function of a convolution of the vector ((1 \u2212 \ud835\udc5f)\ud835\udc5f\ud835\udc61\u22121\ud835\udf14\ud835\udc61\u22121) with \ud835\udc651:\ud835\udc47 . 
The wave\ufb01lters were\ndesigned precisely to approximate the vector \ud835\udf07(\ud835\udc5f) = ((1\u2212 \ud835\udc5f)\ud835\udc5f\ud835\udc61\u22121)1\u2264\ud835\udc61\u2264\ud835\udc47 well, hence \ud835\udc66\ud835\udc61 + \ud835\udefd1\ud835\udc66\ud835\udc61\u22121 +\n\u00b7\u00b7\u00b7 + \ud835\udefd\ud835\udf0f \ud835\udc66\ud835\udc61\u2212\ud835\udf0f can be approximated using the wave\ufb01lters multiplied by phase and convolved with \ud835\udc65.\nNote that the 1 \u2212 \ud835\udc5f is necessary in order to make the \ud835\udc3f2 norm of ((1 \u2212 \ud835\udc5f)\ud835\udc5f\ud835\udc61\u22121)1\u2264\ud835\udc61\u2264\ud835\udc47 bounded, and\nhence ensure the wave\ufb01lters have bounded coef\ufb01cients.\n\n\ud835\udefc\ud835\udc57\u2212(\ud835\udc61\u2212\ud835\udf0f )((\ud835\udefc\ud835\udf0f \u2212 \ud835\udf14\ud835\udf0f ) + \ud835\udefd1(\ud835\udefc\ud835\udf0f\u22121 \u2212 \ud835\udf14\ud835\udf0f\u22121) + \u00b7\u00b7\u00b7 ).\n\n4.2 Regret bound for pseudo-LDSs\n\nAs an intermediate step toward the main theorem, we show a regret bound on the total least-squares\nprediction error made by Algorithm 1, compared to the best pseudo-LDS in hindsight.\nTheorem 4 (FTRL regret bound; informal). Let \u02c6\ud835\udc66*\n\ud835\udc47 denote the predictions made by the\n\ufb01xed pseudo-LDS minimizing the total squared-norm error. Then, there is a choice of parameters\nfor which the decision set \ud835\udca6 contains all LDSs which obey the assumptions from Section 2.1, for\nwhich the predictions \u02c6\ud835\udc661, . . . , \u02c6\ud835\udc66\ud835\udc47 made by Algorithm 1 satisfy\n\n1, . . . 
$$\sum_{t=1}^{T} \|\hat{y}_t - y_t\|^2 \;-\; \sum_{t=1}^{T} \|\hat{y}^*_t - y_t\|^2 \;\le\; \tilde{O}\!\left(\mathrm{poly}(R, d')\sqrt{T}\right),$$

where $R_1, R_x, R_y, R_\Theta, R_\Psi \le R$ and $m, n, d \le d'$.

The regret bound follows by applying the standard regret bound of follow-the-regularized-leader (see, e.g., [Haz16]). However, special care must be taken to ensure that the gradient and diameter factors incur only a $\mathrm{polylog}(T)$ factor, noting that the discretization parameter $W$ (one of the dimensions of $M$ and $N$) must depend polynomially on $T/\epsilon$ in order for the class of pseudo-LDSs to approximate true LDSs up to error $\epsilon$. To this end, we use a modification of the strongly convex matrix regularizer found in [KSST12], resulting in a regret bound with logarithmic dependence on $W$.

Intuitively, this is possible due to the $d$-sparsity (and thus $\ell_1$ boundedness) of the phases of true LDSs, which transfers to an $\ell_1$ bound (in the phase dimension only) on pseudo-LDSs that compete with LDSs of the same size. This allows us to formulate a second convex relaxation, on top of that of wave-filtering, for simultaneous identification of eigenvalue phase and magnitude. For the complete theorem statement and proof, see Appendix B.

We note that the regret analysis can be used directly with the approximation result for autoregressive models (Theorem 1), without wave-filtering.
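To make the FTRL machinery concrete, the following is a minimal one-dimensional instance of follow-the-regularized-leader with squared losses; it is a stand-in illustration, not the paper's pseudo-LDS learner, and all names and constants are ours. With an $\ell_2$ regularizer, the FTRL iterate has a closed ridge-regression form on the observed prefix, and the measured regret against the best fixed predictor is sublinear in $T$:

```python
import random

# Toy FTRL with squared losses (an illustration, not the paper's Algorithm 1):
# predict y_t ~ w_t * x_t, where w_t minimizes past losses plus an L2
# regularizer, i.e. ridge regression on the prefix (closed form in 1-D).
random.seed(1)
T = 2000
w_star, lam = 1.7, 1.0
xs = [random.gauss(0, 1) for _ in range(T)]
ys = [w_star * x + random.gauss(0, 0.1) for x in xs]

sxx = sxy = 0.0
loss_alg = 0.0
for x, y in zip(xs, ys):
    w_t = sxy / (lam + sxx)        # FTRL iterate, computed before seeing (x, y)
    loss_alg += (w_t * x - y) ** 2
    sxx += x * x                   # accumulate sufficient statistics
    sxy += x * y

# Best fixed predictor in hindsight (ordinary least squares over all T rounds).
w_best = sxy / sxx
loss_best = sum((w_best * x - y) ** 2 for x, y in zip(xs, ys))

regret = loss_alg - loss_best
assert 0 < regret < T ** 0.5       # sublinear in T
```

For squared losses the closed form makes each FTRL step cheap; logarithmic-regret behavior for strongly convex online losses is discussed in [HAK07].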
This way, one can straightforwardly obtain a sublinear regret bound against autoregressive models with bounded coefficients. However, for the reasons discussed in Section 4.1, the wave-filtering technique affords us a much stronger end-to-end result.

4.3 Pseudo-LDSs compete with true LDSs

Theorem 3 shows that there exists a pseudo-LDS approximating the actual LDS to within $\epsilon$ in the noiseless case. We next need to analyze the best approximation when there is noise. We show in Appendix C (Lemma 14) that if the noise is bounded ($\sum_{t=1}^{T} \|\eta_t\|_2^2 + \|\xi_t\|_2^2 \le L$), we incur an additional term equal to the size of the perturbation $\sqrt{L}$ times a competitive ratio depending on the dynamical system, for a total of $R_\infty \tau^2 R_\Theta R_\Psi \sqrt{L}$. We show this by proving that any noise has a bounded effect on the predictions of the pseudo-LDS.²

Letting $\hat{y}^*_t$ be the predictions of the best pseudo-LDS, we have

$$\sum_{t=1}^{T} \|\hat{y}_t - y_t\|_2^2 = \left(\sum_{t=1}^{T} \|\hat{y}_t - y_t\|_2^2 - \sum_{t=1}^{T} \|\hat{y}^*_t - y_t\|_2^2\right) + \sum_{t=1}^{T} \|\hat{y}^*_t - y_t\|_2^2. \tag{7}$$

²In other words, the prediction error of the pseudo-LDS is stable to noise, and we bound its $H_\infty$ norm.

The first term is the regret, bounded by Theorem 4, and the second term is bounded by the discussion above, giving the bound in
Theorem 1. For the complete proof, see Appendix C.2.

5 Experiments

We exhibit two experiments on synthetic time series, which are generated by randomly-generated ill-conditioned LDSs. In both cases, $A \in \mathbb{R}^{10 \times 10}$ is a block-diagonal matrix, whose 2-by-2 blocks are rotation matrices $[\cos\theta\ {-\sin\theta};\ \sin\theta\ \cos\theta]$ for phases $\theta$ drawn uniformly at random. This comprises a hard case for direct system identification: there are long-term time dependencies between input and output, and the optimization landscape is non-convex, with many local minima. Here, $B \in \mathbb{R}^{10 \times 10}$ and $C \in \mathbb{R}^{2 \times 10}$ are random matrices of standard i.i.d. Gaussians. In the first experiment, the inputs $x_t$ are i.i.d. spherical Gaussians; in the second, the inputs are Gaussian block impulses.

Figure 1: Performance of Algorithm 1 on synthetic 10-dimensional LDSs. For clarity, error plots are smoothed by a median filter. Blue = ours, yellow = EM, red = SSID, black = true responses, green = inputs, dotted lines = "guess the previous output" baseline. Horizontal axis is time. Left: Gaussian inputs; SSID fails to converge, while EM finds a local optimum. Right: Block impulse inputs; both baselines find local optima.

We make a few straightforward modifications to Algorithm 1, for practicality. First, we replace the scalar autoregressive parameters with matrices $\beta_j \in \mathbb{R}^{m \times m}$. Also, for performance reasons, we use ridge regularization instead of the prescribed pseudo-LDS regularizer with composite norm constraint.
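The randomly-generated ill-conditioned systems described above can be re-created as follows. This is a hypothetical sketch of the stated setup, not the authors' experiment code; the rollout $h_t = A h_{t-1} + B x_t$, $y_t = C h_t$ omits the noise terms and the $D$ matrix for simplicity:

```python
import math
import random

# Sketch of the synthetic systems from Section 5 (our re-creation, not the
# authors' code): A in R^{10x10} is block-diagonal with 2x2 rotation blocks at
# uniformly random phases; B in R^{10x10} and C in R^{2x10} are i.i.d. Gaussian.
random.seed(0)
n, m, T = 10, 2, 50

A = [[0.0] * n for _ in range(n)]
for b in range(n // 2):
    theta = random.uniform(0, 2 * math.pi)
    i = 2 * b
    A[i][i], A[i][i + 1] = math.cos(theta), -math.sin(theta)
    A[i + 1][i], A[i + 1][i + 1] = math.sin(theta), math.cos(theta)

B = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
C = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

def matvec(M, v):
    return [sum(M_ij * v_j for M_ij, v_j in zip(row, v)) for row in M]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

# Roll out h_t = A h_{t-1} + B x_t, y_t = C h_t with i.i.d. Gaussian inputs.
h = [0.0] * n
outputs = []
for _ in range(T):
    x = [random.gauss(0, 1) for _ in range(n)]
    h = vadd(matvec(A, h), matvec(B, x))
    outputs.append(matvec(C, h))

# A is orthogonal (every eigenvalue on the unit circle), so the hidden state
# never decays: exactly the regime where spectral-radius assumptions fail.
AtA = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
assert all(abs(AtA[i][j] - (1.0 if i == j else 0.0)) < 1e-9
           for i in range(n) for j in range(n))
assert len(outputs) == T and len(outputs[0]) == m
```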
We choose an autoregressive parameter of $\tau = d = 10$ (in accordance with the theory), and $W = 100$.

As shown in Figure 1, our algorithm significantly outperforms the baseline methods of system identification followed by Kalman filtering. The EM and subspace identification (SSID; see [VODM12]) algorithms find local optima; in the experiment with Gaussian inputs, the latter failed to converge (left).

We note that while the main online algorithm from [HSZ17] is significantly faster than baseline methods, ours is not. The reason is that we incur at least an extra factor of $W$ to compute and process the additional convolutions. To remove this phase discretization bottleneck, many heuristics are available for phase identification; see Chapter 6 of [Lju98].

6 Conclusion

We gave the first, to the best of our knowledge, polynomial-time algorithm for prediction in the general LDS setting without dependence on the spectral radius parameter of the underlying system. Our algorithm combines several techniques, namely the recently introduced wave-filtering method, as well as convex relaxation and linear filtering.

One important future direction is to improve the regret in the setting of (non-adversarial) Gaussian noise. In this setting, if the LDS is explicitly identified, the best predictor is the Kalman filter, which, when unrolled, depends on feedback for all previous time steps, and only incurs a cost $O(L)$ from noise in (4).
It is of great theoretical and practical interest to compete directly with the Kalman filter without system identification.

References

[AHMS13] Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 172-184, 2013.

[AYS11] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1-26, 2011.

[BD09] P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer, 2nd edition, 2009.

[BJR94] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice-Hall, 3rd edition, 1994.

[BK15] David Belanger and Sham Kakade. A linear dynamical system model for text. In International Conference on Machine Learning, pages 833-842, 2015.

[BLV98] Daniel L. Boley, Franklin T. Luk, and David Vandevoorde. A fast method to diagonalize a Hankel matrix. Linear Algebra and its Applications, 284(1-3):41-52, 1998.

[BT17] Bernhard Beckermann and Alex Townsend. On the singular values of matrices with displacement structure. SIAM Journal on Matrix Analysis and Applications, 38(4):1227-1248, 2017.

[CBL06] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[Cho83] Man-Duen Choi. Tricks or treats with the Hilbert matrix. The American Mathematical Monthly, 90(5):301-312, 1983.

[DMM+17] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.

[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Mach.
Learn., 69(2-3):169-192, December 2007.

[Ham94] J. Hamilton. Time Series Analysis. Princeton Univ. Press, 1994.

[Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

[Hil94] David Hilbert. Ein Beitrag zur Theorie des Legendre'schen Polynoms. Acta Mathematica, 18(1):155-159, 1894.

[HMR16] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.

[HSZ17] Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6705-6715, 2017.

[Kal60] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35-45, 1960.

[KM16] Vitaly Kuznetsov and Mehryar Mohri. Time series prediction and online learning. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1190-1213, Columbia University, New York, New York, USA, 23-26 Jun 2016.

[KM17] Vitaly Kuznetsov and Mehryar Mohri. Discriminative state space models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5671-5679. Curran Associates, Inc., 2017.

[KSST12] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865-1890, 2012.

[Lju98] Lennart Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1998.

[MW07] Taesup Moon and Tsachy Weissman. Competitive on-line linear FIR MMSE filtering.
In IEEE International Symposium on Information Theory - Proceedings, pages 1126-1130, July 2007.

[RG99] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305-345, 1999.

[VODM12] Peter Van Overschee and B. L. De Moor. Subspace Identification for Linear Systems. Springer Science & Business Media, 2012.

[ZDG+96] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and Optimal Control, volume 40. Prentice Hall, New Jersey, 1996.