{"title": "Discriminative State Space Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5671, "page_last": 5679, "abstract": "In this paper, we introduce and analyze Discriminative State-Space Models for forecasting non-stationary time series. We provide data-dependent generalization guarantees for learning these models based on the recently introduced notion of discrepancy. We provide an in-depth analysis of the complexity of such models. Finally, we also study the generalization guarantees for several structural risk minimization approaches to this problem and provide an efficient implementation for one of them which is based on a convex objective.", "full_text": "Discriminative State-Space Models\n\nVitaly Kuznetsov\nGoogle Research\n\nNew York, NY 10011, USA\n\nvitaly@cims.nyu.edu\n\nMehryar Mohri\n\nCourant Institute and Google Research\n\nNew York, NY 10011, USA\n\nmohri@cims.nyu.edu\n\nAbstract\n\nWe introduce and analyze Discriminative State-Space Models for forecasting non-\nstationary time series. We provide data-dependent generalization guarantees for\nlearning these models based on the recently introduced notion of discrepancy. We\nprovide an in-depth analysis of the complexity of such models. We also study the\ngeneralization guarantees for several structural risk minimization approaches to\nthis problem and provide an ef\ufb01cient implementation for one of them which is\nbased on a convex objective.\n\n1\n\nIntroduction\n\nTime series data is ubiquitous in many domains including such diverse areas as \ufb01nance, economics,\nclimate science, healthcare, transportation and online advertisement. The \ufb01eld of time series analysis\nconsists of many different problems, ranging from analysis to classi\ufb01cation, anomaly detection, and\nforecasting. 
In this work, we focus on the problem of forecasting, which is probably one of the most challenging and important problems in the field.
Traditionally, time series analysis and time series prediction, in particular, have been approached from the perspective of generative modeling: a particular generative parametric model is postulated that is assumed to generate the observations, and these observations are then used to estimate the unknown parameters of the model. Autoregressive models are among the most commonly used types of generative models for time series [Engle, 1982, Bollerslev, 1986, Brockwell and Davis, 1986, Box and Jenkins, 1990, Hamilton, 1994]. These models typically assume that the stochastic process that generates the data is stationary up to some known transformation, such as differencing or composition with natural logarithms.
In many modern real-world applications, the stationarity assumption does not hold, which has led to the development of more flexible generative models that can account for non-stationarity in the underlying stochastic process. State-Space Models [Durbin and Koopman, 2012, Commandeur and Koopman, 2007, Kalman, 1960] provide a flexible framework that captures many such generative models as special cases, including autoregressive models, hidden Markov models, Gaussian linear dynamical systems and many other models. This framework typically assumes that the time series Y is a noisy observation of some dynamical system S that is hidden from the practitioner:

Y_t = h(S_t) + ε_t,    S_t = g(S_{t−1}) + η_t,    for all t.    (1)

In (1), h, g are some unknown functions estimated from data, {ε_t}, {η_t} are sequences of random variables, and {S_t} is an unobserved sequence of states of a hidden dynamical system.¹ While this class of models provides a powerful and flexible framework for time series analysis, the theoretical learning properties of these models are not sufficiently well understood.
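For concreteness, the generative formulation (1) can be instantiated as a linear-Gaussian state-space model. The sketch below is our own illustration: the linear maps g(s) = a·s, h(s) = b·s and the Gaussian noise are assumptions of this example, not requirements of the framework.

```python
import random

def simulate_lgssm(T, a=0.9, b=1.0, state_noise=0.1, obs_noise=0.2, seed=0):
    """Simulate (1) with linear g(s) = a*s, h(s) = b*s and Gaussian noise.

    All parameter values are illustrative choices, not from the paper.
    """
    rng = random.Random(seed)
    s, ys = 0.0, []
    for _ in range(T):
        s = a * s + rng.gauss(0.0, state_noise)       # S_t = g(S_{t-1}) + eta_t
        ys.append(b * s + rng.gauss(0.0, obs_noise))  # Y_t = h(S_t) + eps_t
    return ys

series = simulate_lgssm(100)
```

A generative approach would now fit a, b and the noise variances to the observed series; the discriminative view studied in this paper instead makes no assumption on how the observations were produced.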
The statistical guarantees available in the literature rely on strong assumptions about the noise terms (e.g. {ε_t} and {η_t} are Gaussian white noise). Furthermore, these results are typically asymptotic and require the model to be correctly specified. This last requirement places a significant burden on a practitioner since the choice of the hidden state-space is often a challenging problem and typically requires extensive domain knowledge.
In this work, we introduce and study Discriminative State-Space Models (DSSMs). We provide the precise mathematical definition of this class of models in Section 2. Roughly speaking, a DSSM follows the same general structure as in (1) and consists of a state predictor g and an observation predictor h. However, no assumption is made about the form of the stochastic process used to generate observations. This family of models includes existing generative models and other state-based discriminative models (e.g. RNNs) as special cases, but also consists of some novel algorithmic solutions explored in this paper.
The material we present is organized as follows. In Section 3, we generalize the notion of discrepancy, recently introduced by Kuznetsov and Mohri [2015], to derive learning guarantees for DSSMs. We show that our results can be viewed as a generalization of those of these authors. Our notion of discrepancy is finer, taking into account the structure of state-space representations, and leads to tighter learning guarantees. Additionally, our results provide the first high-probability generalization guarantees for state-space models with possibly incorrectly specified models. Structural Risk Minimization (SRM) for DSSMs is analyzed in Section 4.

¹ A more general formulation is given in terms of the distribution of Y_t: p_h(Y_t|S_t) p_g(S_t|S_{t−1}).

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
As mentioned above, the choice of the\nstate-space representation is a challenging problem since it requires carefully balancing the accuracy\nof the model on the training sample with the complexity of DSSM to avoid over\ufb01tting. We show\nthat it is possible to adaptively learn a state-space representation in a principled manner using the\nSRM technique. This requires analyzing the complexity of several families of DSSMs of interest\nin Appendix B. In Section 5, we use our theory to design an ef\ufb01cient implementation of our SRM\ntechnique. Remarkably, the resulting optimization problem turns out to be convex. This should be\ncontrasted with traditional SSMs that are often derived via Maximum Likelihood Estimation (MLE)\nwith a non-convex objective. We conclude with some promising preliminary experimental results in\nAppendix D.\n\n2 Preliminaries\n\nyt = h(Xt, st),\n\nst = g(Xt, st1)\n\nIn this section, we introduce the general scenario of time series prediction as well as the broad family\nof DSSMs considered in this paper.\nWe study the problem of time series forecasting in which the learner observes a realization\n(X1, Y1), . . . , (XT , YT ) of some stochastic process, with (Xt, Yt) 2Z = X\u21e5Y . We assume\nand\nthat the learner has access to a family of observation predictors H = {h : X\u21e5S!Y}\nstate predictors G = {g : X\u21e5S!S}\n, where S is some pre-de\ufb01ned space. We refer to any pair\nf = (h, g) 2H\u21e5G = F as a DSSM, which is used to make predictions as follows:\n(2)\nObserve that this formulation includes the hypothesis sets used in (1) as special cases. In our setting,\nh and g both accept an additional argument x 2X . In practice, if Xt = (Yt1, . . . , Ytp) 2X = Y p\nfor some p, then Xt represents some recent history of the stochastic process that is used to make a\nprediction of Yt. More generally, X may also contain some additional side information. 
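Once h and g are fixed, the prediction rule (2) is a simple forward recursion. A minimal sketch, assuming callables h(x, s) and g(x, s) and an initial state s0 (the function and variable names are ours):

```python
def dssm_forecast(h, g, xs, s0):
    """Roll the DSSM recursion (2): s_t = g(x_t, s_{t-1}), y_t = h(x_t, s_t)."""
    s, preds = s0, []
    for x in xs:
        s = g(x, s)            # state update
        preds.append(h(x, s))  # observation prediction at the new state
    return preds

# Toy instance: the state counts time steps; the observation predictor adds it to x.
preds = dssm_forecast(lambda x, s: x + s, lambda x, s: s + 1, [10, 20, 30], 0)
# preds == [11, 22, 33]
```

The point of the recursion is that the state s threads through time, so a single pair (h, g) induces a prediction for every step of the realization.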
Elements of the output space Y may further be multi-dimensional, which covers both multi-variate time series forecasting and multi-step forecasting.
The performance of the learner is measured using a bounded loss function L : H × S × Z → [0, M], for some upper bound M > 0. A commonly used loss function is the squared loss: L(h, s, z) = (h(x, s) − y)².
The objective of the learner is to use the observed realization of the stochastic process up to time T to determine a DSSM f = (h, g) ∈ F that has the smallest expected loss at time T + 1, conditioned on the given realization of the stochastic process:²

L_{T+1}(f | Z_1^T) = E[L(h, s_{T+1}, Z_{T+1}) | Z_1^T],    (3)

where s_t for all t is specified by g via the recursive computation in (2). We will use the notation a_s^r to denote (a_s, a_{s+1}, . . . , a_r).
In the rest of this section, we will introduce the tools needed for the analysis of this problem. The key technical tool that we require is the notion of state-space discrepancy:

disc(s) = sup_{h ∈ H} ( E[L(h, s_{T+1}, Z_{T+1}) | Z_1^T] − (1/T) Σ_{t=1}^T E[L(h, s_t, Z_t) | Z_1^{t−1}] ),    (4)

where, for simplicity, we used the shorthand s = s_1^{T+1}. This definition is a strict generalization of the q-weighted discrepancy of Kuznetsov and Mohri [2015]. In particular, redefining L(h, s, z) = s · L̃(h, z) and setting s_t = T q_t for 1 ≤ t ≤ T and s_{T+1} = 1 recovers the definition of q-weighted discrepancy.

² An alternative performance metric commonly considered in the time series literature is the averaged generalization error L_{T+1}(f) = E[L(f, s_{T+1}, Z_{T+1})]. The path-dependent generalization error that we consider in this work is a finer measure of performance since it only takes into consideration the realized history of the stochastic process, as opposed to an average trajectory.
The discrepancy disc de\ufb01nes an integral probability pseudo-metric on the space of\nprobability distributions that serves as a measure of the non-stationarity of the stochastic process\nZ with respect to both the loss function L and the hypothesis set H, conditioned on the given state\nsequence s. For example, if the process Z is i.i.d., then we simply have disc(s) = 0 provided that\ns is a constant sequence. See [Cortes et al., 2017, Kuznetsov and Mohri, 2014, 2017, 2016, Zimin\nand Lampert, 2017] for further examples and bounds on discrepancy in terms of other divergences.\nHowever, the most important property of the discrepancy disc(s) is that, as shown in Appendix C,\nunder some additional mild assumptions, it can be estimated from data.\nThe learning guarantees that we present are given in terms of data-dependent measures of sequential\ncomplexity, such as expected sequential covering number [Rakhlin et al., 2010], that are modi\ufb01ed to\naccount for the state-space structure in the hypothesis set. The following de\ufb01nition of a complete\nbinary tree is used throughout this paper: a Z-valued complete binary tree z is a sequence (z1, . . . , zT )\nof T mappings zt : {\u00b11}t1 !Z , t 2 [1, T ]. A path in the tree is = (1, . . . , T1) 2 {\u00b11}T1.\nWe write zt() instead of zt(1, . . . , t1) to simplify the notation. Let R = R0\u21e5G be any function\nclass where G is a family of state predictors and R0 = {r : Z\u21e5S! R}. A set V of R-valued trees\nof depth T is a sequential \u21b5-cover (with respect to `p norm) of R on a tree z of depth T if for all\n(r, g) 2R and all 2 {\u00b11}T , there is v 2 V such that\n\n\" 1\n\nT\n\nTXt=1vt() r(zt(), st)p# 1\n\np\n\n\uf8ff \u21b5,\n\nwhere st = g(zt(), st1). The (sequential) covering number Np(\u21b5,R, z) on a given tree z is\nde\ufb01ned to be the size of the minimal sequential cover. We call Np(\u21b5,R) = supz Np(\u21b5,R, z) the\nmaximal covering number. 
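The complete binary trees used in the definition of sequential covering can be made concrete with a small data structure: level t stores the mapping z_t from sign prefixes (σ_1, . . . , σ_{t−1}) to values, and evaluating along a path σ reads one value per level. This representation is our own choice for illustration, not part of the paper's formalism.

```python
def evaluate_path(tree, path):
    """tree: a depth-T complete binary tree given as a list of T dicts,
    where tree[t] maps each sign prefix (sigma_1, ..., sigma_t) to z_{t+1}(sigma).
    Returns (z_1(sigma), ..., z_T(sigma)) along the path sigma."""
    return [tree[t][tuple(path[:t])] for t in range(len(tree))]

# Depth-2 example: z_1 is a constant; z_2 depends on the first sign sigma_1.
z = [{(): 'a'}, {(1,): 'b', (-1,): 'c'}]
assert evaluate_path(z, [1, -1]) == ['a', 'b']
assert evaluate_path(z, [-1, 1]) == ['a', 'c']
```

This makes explicit why z_t depends only on the first t − 1 signs of the path: the dictionary at level t is keyed by prefixes of length t − 1.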
See Figure 1 for an example.\nWe de\ufb01ne the expected covering number to be Ez\u21e0T (p)[Np(\u21b5,R, z)], where T (p) denotes the\ndistribution of z implicitly de\ufb01ned via the following sampling procedure. Given a stochastic process\ndistributed according to the distribution p with pt(\u00b7|zt1\n) denoting the conditional distribution at\ntime t, sample Z1, Z01 from p1 independently. In the left child of the root sample Z2, Z02 according to\np2(\u00b7|Z1) and in the right child according to p2(\u00b7|Z02) all independent from each other. For a node that\ncan be reached by a path (1, . . . , t), we draw Zt, Z0t according to pt(\u00b7|S1(1), . . . , St1(t1)),\nwhere St(1) = Zt and St(1) = Z0t. Expected sequential covering numbers are a \ufb01ner measure of\ncomplexity since they directly take into account the distribution of the underlying stochastic process.\nFor further details on sequential complexity measures, we refer the reader to [Littlestone, 1987,\nRakhlin et al., 2010, 2011, 2015a,b].\n\n1\n\n3 Theory\n\nIn this section, we present our generalization bounds for learning with DSSMs. For our \ufb01rst result,\nwe assume that the sequence of states s (or equivalently state predictor g) is \ufb01xed and we are only\nlearning the observation predictor h.\nTheorem 1. Fix s 2S T +1. For any > 0, with probability at least 1 , for all h 2H and all\n\u21b5> 0, the following inequality holds:\n\nL(f|ZT\n\n1 ) \uf8ff\n\n1\nT\n\nTXt=1\n\nL(h, Xt, st) + disc(s) + 2\u21b5 + Ms 2 log Ev\u21e0T (P)[N1(\u21b5,Rs,v)]\n\n\n\n,\n\nT\n\n3\n\n\fwhere Rs = {(z, s) 7! L(h, s, z) : h 2H}\u21e5{ s}.\nThe proof of Theorem 1 (as well as the proofs of all other results in this paper) is given in Appendix A.\nNote that this result is a generalization of the learning guarantees of Kuznetsov and Mohri [2015].\n\nIndeed, setting s = (T q1, . . . , T qT , 1) for some weight vector q and L(h, s, z) = seL(h, z) recovers\nCorollary 2 of Kuznetsov and Mohri [2015]. 
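As noted above, disc(s) can be estimated from data under mild additional assumptions (Appendix C). A crude illustration for a finite set of candidate predictors replaces the conditional expectations in (4) by empirical averages, comparing losses over a recent window against the whole sample. This windowed proxy is an assumption of the sketch in the spirit of the estimation results cited above, not the actual Appendix C estimator.

```python
def empirical_discrepancy(losses_by_h, window):
    """losses_by_h: for each candidate h, the observed losses L(h, s_t, Z_t), t = 1..T.
    Proxy for disc(s): sup over h of (recent-window average loss - full average loss)."""
    gaps = []
    for losses in losses_by_h:
        full = sum(losses) / len(losses)
        recent = sum(losses[-window:]) / window
        gaps.append(recent - full)
    return max(gaps)

# Two candidates: one with stationary losses, one whose losses drift upward.
d = empirical_discrepancy([[1, 1, 1, 1], [0, 0, 2, 2]], window=2)
# d == 1.0, driven by the drifting candidate
```

A near-zero value is consistent with a stationary regime along the state sequence s, while a large value signals that the recent distribution of losses differs from the historical one.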
Zimin and Lampert [2017] show that, under some\nadditional assumptions on the underlying stochastic process (e.g. Markov processes, uniform mar-\ntingales), it is possible to choose these weights to guarantee that the discrepancy disc(s) is small.\nAlternatively, Kuznetsov and Mohri [2015] show that if the distribution of the stochastic process\nat times T + 1 and [T s, T ] is suf\ufb01ciently close (in terms of discrepancy) then disc(s) can be\nestimated from data. In Theorem 5 in Appendix C, we show that this property holds for arbitrary\nstate sequences s. Therefore, one can use the bound of Theorem 1 that can be computed from data to\nsearch for the predictor h 2H that minimizes this quantity. The quality of the result will depend\non the given state-space sequence s. Our next result shows that it is possible to learn h 2H and s\ngenerated by some state predictor g 2G jointly.\nTheorem 2. For any > 0, with probability at least 1 , for all f = (h, g) 2H\u21e5G\nand all\n\u21b5> 0, the following inequality holds:\nL(h, Xt, st) + disc(s) + 2\u21b5 + Ms 2 log Ev\u21e0T (P)[N1(\u21b5,R,v)]\n\nL(f|ZT\n\n1 ) \uf8ff\n\n1\nT\n\nT\n\n,\n\n\n\nTXt=1\n\nwhere st = g(Xt, st1) for all t and R = {(z, s) 7! L(h, s, z) : h 2H}\u21e5G .\nThe cost of this signi\ufb01cantly more general result is a slightly larger complexity term N1(\u21b5,R, v) \nN1(\u21b5,Rs, v). This bound is also much tighter than the one that can be obtained by applying the\nresult of Kuznetsov and Mohri [2015] directly to F = H\u21e5G , which would lead to the same bound\nas in Theorem 2 but with disc(s) replaced by supg2G disc(s). Not only supg2G disc(s) is an upper\nbound on disc(s), but it is possible to construct examples that lead to learning bounds that are too\nloose. Consider the stochastic process generated as follows. Let X be uniformly distributed on\n{\u00b11}. Suppose Y1 = 1 and Yt = Yt1 for all t > 1 if X = 1 and Yt = Yt1 for all t > 1\notherwise. 
In other words, Y is either periodic or a constant state sequence. If L is the squared\nloss, for G = {g1, g2} with g1(s) = s and g2(s) = s and H = {h} with h(s) = s, for odd T ,\nsupg2G disc(s) 1/2. On the other hand, the bound in terms of disc(s) is much \ufb01ner and helps\nus select g such that disc(s) = 0 for that g. This example shows that even for simple deterministic\ndynamics our learning bounds are \ufb01ner than existing ones.\nSince the guarantees of Theorem 2 are data-dependent and hold uniformly over F, they allow us to\nseek a solution f 2F that would directly optimize this bound and that could be computed from the\ngiven sample. As our earlier example shows, the choice of the family of state predictors G is crucial\nto achieve good guarantees. For instance, if G = {g1} then it may be impossible to have a non-trivial\nbound. In other words, if the family of state predictors is not rich enough, then, it may not be possible\nto handle the non-stationarity of the data. On the other hand, if G is chosen to be too large, then, the\ncomplexity term may be too large. In Section 4, we present an SRM technique that enables us to\nlearn the state-space representation and adapt to non-stationarity in a principled way.\n\n4 Structural Risk Minimization\n\nSuppose we are given a sequence of families of observation predictors H1 \u21e2H 2 \u21e2\u00b7\u00b7\u00b7H n . . . and\na sequence of families of state predictors G1 \u21e2G 2 \u00b7\u00b7\u00b7 Gn . . . Let Rk = {(s, z) 7! L(h, s, z) : h 2\nHk}\u21e5G k and R = [1k=1Rk. Consider the following objective function:\n\nL(h, st, Zt) +( s) + Bk + Mr log k\nwhere (s) is any upper bound on disc(s) and Bk is any upper bound on Mr 2 log\n\n.\nWe are presenting an estimatable upper bound on disc(s) in Appendix C, which provides one\n\nEv\u21e0T (P)[N1(\u21b5,Rk ,v)]\n\nTXt=1\n\nF (h, g, k) =\n\n1\nT\n\n(5)\n\nT\n\n,\n\n\n\nT\n\n4\n\n\fparticular choice for (s). 
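Concretely, once upper bounds Φ(s) and B_k are available for each family in the ladder, the SRM rule (5)–(6) reduces to a penalized minimization over k. A schematic sketch, where emp_loss, disc_bound and complexity_bound stand in for (1/T) Σ L(h, s_t, Z_t), Φ(s) and B_k respectively, and are assumed precomputed for the within-family minimizer at each k:

```python
import math

def srm_select(candidates, T, M=1.0):
    """candidates[k-1] = (emp_loss, disc_bound, complexity_bound) for the
    minimizer within H_k x G_k. Returns the index k minimizing the SRM
    objective (5): emp + disc + B_k + M * sqrt(log(k) / T)."""
    def objective(k):
        emp, disc, B = candidates[k - 1]
        return emp + disc + B + M * math.sqrt(math.log(k) / T)
    return min(range(1, len(candidates) + 1), key=objective)

# Larger families fit better but pay a larger complexity penalty.
k = srm_select([(0.5, 0.2, 0.1), (0.3, 0.1, 0.15), (0.1, 0.1, 0.5)], T=100)
# k == 2: the middle family balances fit and complexity
```

The sqrt(log(k)/T) term is the price for taking a union bound over the countable ladder of families, matching the log k term in (5).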
In Appendix B, we also prove upper bounds on the expected sequential\ncovering numbers for several families of hypothesis. Then, we de\ufb01ne the SRM solution as follows:\n\n\n\nT\n\n1 ) \uf8ffL T +1(f\u21e4|ZT\n\nLT +1(eh,eg|ZT\n\n(eh,eg,ek) = argminh,g2Hk\u21e5Gk,k1 F (h, g, k).\n\n1 ) + 2(s\u21e4) + 2\u21b5 + 2Bk(f\u21e4) + Mr log k(f\u21e4)\n\n(6)\nWe also de\ufb01ne f\u21e4 by f\u21e4 = (h\u21e4, g\u21e4) 2 argminf2F LT +1(f|ZT\n1 ). Then, the following result holds.\nTheorem 3. For any > 0, with probability at least 1 , for all \u21b5> 0, the following bound holds:\n+ 2Ms log 2\nT\nwhere s\u21e4t = g\u21e4(Xt, s\u21e4t1), and where k(f\u21e4) is the smallest integer k such that f\u21e4 2H k \u21e5G k.\nTheorem 3 provides a learning guarantee for the solution of SRM problem (5). This result guarantees\nfor the SRM solution a performance close to that of the best-in-class model f\u21e4 modulo a penalty\nterm that includes the discrepancy (of the best-in-class state predictor), similar to the guarantees\nof Section 3. This guarantee can be viewed as a worst case bound when we are unsure if the\nstate-space predictor captures the non-stationarity of the problem correctly. However, in most cases,\nby introducing a state-space representation, we hope that it will help us model (at least to some\ndegree) the non-stationarity of the underlying stochastic process. In what follows, we present a\nmore optimistic best-case analysis which shows that, under some additional mild assumptions on\nthe complexity of the hypothesis space with respect to stochastic process, we can simultaneously\nsimplify the SRM optimization and give tighter learning guarantees for this modi\ufb01ed version.\nAssumption 1 (Stability of state trajectories). 
Assume that there is a decreasing function r such that\nfor any \u270f> 0 and > 0, with probability 1 , if h\u21e4, g\u21e4 = argmin(h,g)2F LT +1(h, g|ZT\n1 ) and\n(h, g) 2F is such that\n\n,\n\nthen, the following holds:\n\n\n\n1\nT\n\nTXt=1\nLT +1(h, g|ZT\n\nLt(h, g|Zt1\n\n1\n\n) L t(h\u21e4, g\u21e4|Zt1\n\n1\n\n1 ) L T +1(h\u21e4, g\u21e4|ZT\n\n1 ) \uf8ff r(\u270f).\n\n) \uf8ff \u270f,\n\n(7)\n\n(8)\n\nRoughly speaking, this assumption states that, given a sequence of states s1, . . . , sT generated by g\nsuch that the performance of some observation predictor h along this sequence of states is close to\nthe performance of the ideal pair h\u21e4 along the ideal sequence generated by g\u21e4, the performance of h\nin the near future (at state sT +1) will remain close to that of h\u21e4 (in state s\u21e4T +1). Note that, in most\ncases of interest, r has the form r(\u270f) = a\u270f, for some a > 0.\nConsider the following optimization problem which is similar to (5) but omits the discrepancy upper\nbound :\n\nF0(h, g, k) =\n\n1\nT\n\nL(h, st, Zt) + Bk + Mr log k\n\nT\n\nTXt=1\n\n,\n\n(9)\n\nWe will refer to F0 as an optimistic SRM objective and we let (h0, g0) be a minimizer of F0. Then,\nwe have the following learning guarantee.\nTheorem 4. Under Assumption 1, for any > 0, with probability at least 1 , for all \u21b5> 0, the\ninequality LT +1(h0, g0|ZT\n\n1 ) < r(\u270f) holds with\n\n1 ) L T +1(f\u21e4|ZT\n\u270f = 2\u21b5 + 2Bk(f\u21e4) + Mr log k(f\u21e4)\n\nT\n\n+ 2Ms log 2\n\nT\n\n\n\n,\n\nwhere s\u21e4t = g\u21e4(Xt, s\u21e4t1), and where k(f\u21e4) is the smallest integer k such that f\u21e4 2H k \u21e5G k.\nWe remark that a \ufb01ner analysis can be used to show that Assumption 1 only need to be satis\ufb01ed for\nk \uf8ff k(f\u21e4) for the Theorem 4. Furthermore, observe that for linear functions r(\u270f) = a\u270f, one recovers\na guarantee similar to the bound in Theorem 3, but the discrepancy term is omitted making this result\ntighter. 
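The linear, separable example can be checked numerically in the scalar AR(1) case: for a candidate a = a* + δ, the excess squared loss at step t is (δ y_{t−1})², so a small average excess along the sample forces a small next-step excess, with ratio controlled by the data radius and variance as in the bound above. A noise-free sketch with constants of our own choosing:

```python
def ar1_excess(a_star=0.8, delta=0.05, T=20, y0=1.0):
    """Noise-free AR(1): y_t = a* y_{t-1}. For a = a* + delta, the excess
    squared loss of a over a* at step t is (delta * y_{t-1})**2."""
    ys = [y0]
    for _ in range(T):
        ys.append(a_star * ys[-1])
    per_step = [(delta * y) ** 2 for y in ys[:-1]]   # excess at t = 1..T
    avg_excess = sum(per_step) / T                   # left-hand side of (7)
    next_excess = (delta * ys[-1]) ** 2              # left-hand side of (8)
    radius = max(y * y for y in ys[:-1])             # r = sup of squared data
    variance = sum(y * y for y in ys[:-1]) / T       # empirical variance term
    return avg_excess, next_excess, radius, variance

avg, nxt, r, var = ar1_excess()
# nxt <= (r / var) * avg, i.e. r(eps) is linear in eps here
```

So in this separable case Assumption 1 holds with a linear rate r(ε) = (r/σ) ε, matching the remark that r is typically of the form r(ε) = a ε.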
This result suggests that in the optimistic scenarios where our hypothesis set contains a good\n\n5\n\n\f1\n\n1\n\n.\n\nstate predictor that can capture the data non-stationarity, it is possible to achieve a tighter guarantee\nthat avoids the pessimistic discrepancy term. Note that, increasing the capacity of the family of\nstate predictors makes it easier to \ufb01nd such a good state predictor but it also may make the learning\nproblem harder and lead to the violation of Assumption 1. This further motivates the use of an SRM\ntechnique for this problem to \ufb01nd the right balance between capturing the non-stationarity in data and\nthe complexity of the models that are being used. Theorem 4 formalizes this intuition by providing\ntheoretical guarantees for this approach.\nWe now consider several illustrative examples showing that this assumption holds in a variety of\ncases of interest. In all our examples, we will use the squared loss but it is possible to generalize all\nof them to other suf\ufb01ciently regular losses.\nLinear models. Let F be de\ufb01ned by F = {f : y 7! w\u00b7 (y),kwk \uf8ff \u21e4} for some \u21e4 > 0 and some\nfeature map . Consider a separable case where Yt = w\u21e4 \u00b7 (Yt1\ntp) + \u270ft, where \u270ft represents white\nnoise. One can verify that the following equality holds:\n] =h(w w\u21e4) \u00b7 (Yt1\ntp)i2\ntp) Yt)|Yt1\n) = E[(w \u00b7 (Yt1\nIn view of that, it follows that (7) is equal to\nTXt=1h(w w\u21e4) \u00b7 (Yt1\ntp)i2\nTXt=1\n(wj w\u21e4j )2 j(Yt1\ntp)2\n1 ) =h(w w\u21e4) \u00b7 (YT\n\nfor any coordinate j 2 [1, N ]. Thus, for any coordinate j 2 [1, N ], by H\u00f6lder\u2019s inequality, we have\n\n1 ) L T +1(h\u21e4, g\u21e4|ZT\n\nTp+1)i2\n\nLt(w|Zt1\n\n\uf8ff r\u270f\n\n1\nj\n\n1\nT\n\n1\nT\n\n\n\nNXj=1\n\n,\n\nt=1 j(Yt1\n\nwhere j = 1\ntp)2 is the empirical variance of the j-th coordinate and where r =\nsupy (y)2 is the empirical `1-norm radius of the data. 
The special case where is the identity\nmap covers standard autoregressive models. These often serve as basic building blocks for other\nstate-space models, as discussed below. More generally, other feature maps may be induced by\na positive de\ufb01nite kernel K. Alternatively, we may take as our hypothesis set F the convex hull of\nall decision trees of certain depth d. In that case, we can view each coordinate j as the output of a\nparticular decision tree on the given input.\nLinear trend models. For simplicity, in this example, we consider univariate time series with linear\ntrend. However, this can be easily generalized to the multi-variate setting with different trend models.\nDe\ufb01ne G as G = {s 7! s + c : |c|\uf8ff \u21e4} for some \u21e4 > 0 and let H be a singleton consisting of the\nidentity map. Assume that Yt = c\u21e4t + \u270ft, where \u270ft is white noise. As in the previous example, it is\neasy to check that Lt(h, g|Zt1\n) = |c c\u21e4|2t2. Therefore, in this case, one can show that (7) reduces\n3 (T + 1)(2T + 1)|c c\u21e4|2 and therefore, if \u270f = O(p1/T ), then we have |c c\u21e4|2 = O(1/T 5/2)\nto 1\nand thus (8) is |c c\u21e4|2(T + 1)2 = O(p1/T ).\n\nPeriodic signals. We study a multi-resolution setting where the time series of interest are modeled\nas a linear combination of periodic signals at different frequencies. We express this as a state-space\nmodel as follows. De\ufb01ne\n\n1\n\nLT +1(h, g|ZT\nT PT\n\nAd =\uf8ff 1 1\n0 ,\n\nId1\n\nwhere 1 is d 1-dimensional row vector of 1s, 0 is d 1-dimensional column vector of 0 and Id1 is\nan identity matrix. It easy to verify that, under the map s 7! Ads, the sequence s1 \u00b7 e1, s2 \u00b7 e1 . . . , st \u00b7\ne1 . . ., where \u270f1 = (1, 0, . . . , 0)T , is a periodic sequence with period d. Let D = d1, . . . , dk be\nany collection of positive integers and let A be a block-diagonal matrix with Ad1, . . . , Adk on the\ndiagonal. We set G = {s 7! A \u00b7 s} and H = {s 7! 
w \u00b7 s : kwk < \u21e4}, where we also restrict ws to\nbe non-zero only at coordinates 1, 1 + d1, 1 + d1 + d2, . . . , 1 +Pk1\nj=1 dk1. Once again, to simplify\nour presentation, we assume that Yt satis\ufb01es Yt = w\u21e4 \u00b7 st + \u270ft. Using arguments similar to those of\nthe previous examples, one can show that (7) is lower bounded by (wj w\u21e4j )2 1\nt=1 st,j for any\ncoordinate j. Therefore, as before, if (7) is upper bounded by \u270f> 0, then (8) is upper bounded by\nr\u270fPN\n\n, where r is the maximal radius of any state and j a variance of j-th state sequence.\n\nT PT\n\n1\nj\n\nj=1\n\n6\n\n\fTrajectory ensembles. Note that, in our previous example, we did not exploit the fact that the\nsequences were periodic. Indeed, our argument holds for any g that generates a multi-dimensional\ntrajectory h 2H = {s 7! w \u00b7 s : kwk < \u21e4} which can be interpreted as learning an ensemble of\ndifferent state-space trajectories.\nStructural Time Series Models (STSMs). STSMs are a popular family of state-space models that\ncombine all of the previous examples. For this model, we use (h, g) 2H\u21e5G that have the following\nstructure: h(xt, g(st)) = w\u00b7 (xt)+ct+w0\u00b7st, where st is a vector of periodic sequences described\nin the previous examples and xt is the vector representing the most recent history of the time series.\nNote that our formulation is very general and allows for arbitrary feature maps that can correspond\neither to kernel-based or tree-based models. Arguments similar to those given in previous examples\nshow that Assumption 1 holds in this case.\nShifting parameters. We consider the non-realizable case where H is a set of linear models but\nwhere the data is generated according to the following procedure. The \ufb01rst T /2 rounds obey the\nformula Yt = w0Yt1 + \u270ft, the subsequent rounds the formula Yt = w\u21e4Yt1 + \u270ft. Note that, in this\ncase, we have | 1\n)| = 0. 
However, if w0 and w\u21e4 are suf\ufb01ciently\nfar apart, it is possible to show that there is a constant lower bound on LT +1(w0|ZT\n1 ).\n1 )LT +1(w\u21e4|ZT\nOne approach to making Assumption 1 hold for this stochastic process is to choose H such that the\nresulting learning problem is separable. However, that requires us to know the exact nature of the\nunderlying stochastic process. An alternative agnostic approach, is to consider a sequence of states\n(or equivalently weights) that can assign different weights qt to different training points.\nFinally, observe that our learning guarantees in Section 3 and 4 are expressed in terms of the expected\nsequential covering numbers of the family of DSSMs that we are seeking to learn. A priori, it is\nnot clear if it is possible to control the complexity of such models in a meaningful way. However,\nin Appendix B, we present explicit upper bounds on the expected sequential covering numbers of\nseveral families of DSSMs, including several of those discussed above: linear models, tree-based\nhypothesis, and trajectory ensembles.\n\n) L t(w\u21e4|Zt1\n\nt=1 Lt(w0|Zt1\n\nT PT\n\n1\n\n1\n\n5 Algorithmic Solutions\n\nThe generic SRM procedures described in Section 4 can lead to the design of a range of different\nalgorithmic solutions for forecasting time series, depending on the choice of the families Hk and\nFk. The key challenge for the design of an algorithm design in this setting is to come up with a\ntractable procedure for searching through sets of increasing complexity. In this section, we describe\none such procedure that leads to a boosting-style algorithm. Our algorithm learns a structural time\nseries model by adaptively adding various structural subcomponents to the model in order to balance\nmodel complexity and the ability of the model to handle non-stationarity in data. 
We refer to our algorithm as Boosted Structural Time Series Models (BOOSTSM).
We will discuss BOOSTSM in the context of the squared loss, but most of our results can be straightforwardly extended to other convex loss functions. The hypothesis set used by our algorithm admits the following form: H = {(x, s) ↦ w · Φ(x) + w′ · s : ‖w‖₁ ≤ Λ, ‖w′‖₁ ≤ Λ′}. Each coordinate Φ_j of Φ is a binary-valued decision tree that maps its inputs to a bounded set. For simplicity, we also assume that Λ = Λ′ = 1. We choose G to be any set of state trajectories. For instance, this set may include periodic or trend sequences as described in Section 4.
Note that, to make the discussion concrete, we impose an ℓ₁-constraint on the parameter vectors, but other regularization penalties are also possible. The particular choice of the regularization defined by H also leads to sparser solutions, which is an additional advantage given that our state-space representation is high-dimensional.
For the squared loss and the aforementioned H, the optimistic SRM objective (9) is given by

F(w, w′) = (1/T) Σ_{t=1}^T ( y_t − w · Φ(x_t) − w′ · s_t )² + λ( ‖w‖₁ + ‖w′‖₁ ),    (10)

where we omit the log(k) term because the index k in our setting tracks the maximal depth of the trees and it suffices to restrict the search to the case k < T as, for deeper trees, we can achieve zero empirical error. With this upper bound on k, the O(√(log T / T)) term is small and hence not included in the objective.

BOOSTSM(S = ((x_t, y_t))_{t=1}^T)
 1   f_0 ← 0
 2   for k ← 1 to K do
 3       j ← argmin_j ε_{k,j} + λ sgn(w_j)
 4       j′ ← argmin_{j′} δ_{k,j′} + λ sgn(w′_{j′})
 5       if ε_{k,j} + λ sgn(w_j) ≤ δ_{k,j′} + λ sgn(w′_{j′}) then
 6           η_k ← argmin_η F(w + η e_j, w′)
 7           f_k ← f_{k−1} + η_k Φ_j
 8       else η_k ← argmin_η F(w, w′ + η e_{j′})
 9           f_k ← f_{k−1} + η_k s_{·,j′}
10   return f_K

Figure 1: Pseudocode of the BOOSTSM algorithm.
On lines 3 and 4, two candidates are selected to be added to the ensemble: a state trajectory with index j′ or a tree-based predictor with index j. Both of these minimize their subgradients within their family of weak learners. Subgradients are defined via (11). The candidate with the smaller subgradient is added to the ensemble. The weight of the new ensemble member is found via a line search (lines 6 and 8).

The regularization penalty is directly derived from the bounds on the expected sequential covering numbers of H given in Appendix B, in Lemma 4 and Lemma 5.
Observe that (10) is a convex objective function. Our BOOSTSM algorithm is defined by the application of coordinate descent to this objective. Figure 1 gives its pseudocode. The algorithm proceeds in K rounds. At each round, we either add a new predictor tree or a new state-space trajectory to the model, depending on which results in a greater decrease in the objective. In particular, with the following definitions:

ε_{k,j} = −(1/T) Σ_{t=1}^T ( y_t − f_{k−1}(x_t, s_t) ) Φ_j(x_t),    δ_{k,j′} = −(1/T) Σ_{t=1}^T ( y_t − f_{k−1}(x_t, s_t) ) s_{t,j′},    (11)

the subgradient in the tree-space direction j at round k is given by ε_{k,j} + λ sgn(w_{k,j}). We use the notation w_k to denote the tree-space parameter vector after k − 1 rounds. Similarly, the subgradient in the trajectory-space direction j′ is given by δ_{k,j′} + λ sgn(w′_{k,j′}), where w′_k represents the trajectory-space parameter vector after k − 1 rounds.
By standard results in optimization theory [Luo and Tseng, 1992], BOOSTSM admits a linear convergence guarantee.

6 Conclusion

We introduced a new family of models for forecasting non-stationary time series, Discriminative State-Space Models. This family includes existing generative models and other state-based discriminative models (e.g. RNNs) as special cases, but also covers several novel algorithmic solutions explored in this paper.
We presented an analysis of the problem of learning DSSMs in the most general setting of non-stationary stochastic processes and proved finite-sample data-dependent generalization bounds. These learning guarantees are novel even for traditional state-space models, since the existing guarantees are only asymptotic and require the model to be correctly specified. We fully analyzed the complexity of several DSSMs that are useful in practice. Finally, we also studied the generalization guarantees of several structural risk minimization approaches to this problem and provided an efficient implementation of one such algorithm, which is based on a convex objective. We report some promising preliminary experimental results in Appendix D.

Acknowledgments

This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662, as well as a Google Research Award.

References

Rakesh D. Barve and Philip M. Long. On the complexity of learning from drifting distributions. In COLT, 1996.

Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 1986.

George Edward Pelham Box and Gwilym Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990.

Peter J. Brockwell and Richard A. Davis. Time Series: Theory and Methods. Springer-Verlag, New York, 1986.

J. J. F. Commandeur and S. J. Koopman. An Introduction to State Space Time Series Analysis. OUP Oxford, 2007.

Corinna Cortes, Giulia DeSalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Multi-armed bandits with non-stationary rewards. CoRR, abs/1710.10657, 2017.

Victor H. De la Peña and Evarist Giné. Decoupling: From Dependence to Independence: Randomly Stopped Processes, U-statistics and Processes, Martingales and Beyond. Probability and Its Applications. Springer, New York, 1999.

J. Durbin and S. J. Koopman. Time Series Analysis by State Space Methods: Second Edition. Oxford Statistical Science Series.
OUP Oxford, 2012.

Robert Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987–1007, 1982.

James D. Hamilton. Time Series Analysis. Princeton, 1994.

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D), 1960.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for time series prediction with non-stationary processes. In ALT, 2014.

Vitaly Kuznetsov and Mehryar Mohri. Learning theory and algorithms for forecasting non-stationary time series. In Advances in Neural Information Processing Systems 28, pages 541–549, 2015.

Vitaly Kuznetsov and Mehryar Mohri. Time series prediction and on-line learning. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, 2016.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93–117, 2017.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, Berlin, 1991.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 1987.

Zhi-Quan Luo and Paul Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NIPS, 2011.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 2015a.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. JMLR, 16(1), January 2015b.

Alexander Zimin and Christopher H. Lampert. Learning theory for conditional risk minimization. In AISTATS, 2017.