{"title": "Learning Theory and Algorithms for Forecasting Non-stationary Time Series", "book": "Advances in Neural Information Processing Systems", "page_first": 541, "page_last": 549, "abstract": "We present data-dependent learning bounds for the general scenario of non-stationary non-mixing stochastic processes. Our learning guarantees are expressed in terms of a data-dependent measure of sequential complexity and a discrepancy measure that can be estimated from data under some mild assumptions. We use our learning bounds to devise new algorithms for non-stationary time series forecasting for which we report some preliminary experimental results.", "full_text": "Learning Theory and Algorithms for\nForecasting Non-Stationary Time Series\n\nVitaly Kuznetsov\nCourant Institute\n\nNew York, NY 10011\n\nvitaly@cims.nyu.edu\n\nMehryar Mohri\n\nCourant Institute and Google Research\n\nNew York, NY 10011\nmohri@cims.nyu.edu\n\nAbstract\n\nWe present data-dependent learning bounds for the general scenario of non-\nstationary non-mixing stochastic processes. Our learning guarantees are expressed\nin terms of a data-dependent measure of sequential complexity and a discrepancy\nmeasure that can be estimated from data under some mild assumptions. We use\nour learning bounds to devise new algorithms for non-stationary time series fore-\ncasting for which we report some preliminary experimental results.\n\n1\n\nIntroduction\n\nTime series forecasting plays a crucial role in a number of domains ranging from weather fore-\ncasting and earthquake prediction to applications in economics and \ufb01nance. 
The classical statistical approaches to time series analysis are based on generative models such as the autoregressive moving average (ARMA) models, or their integrated versions (ARIMA) and several other extensions [Engle, 1982, Bollerslev, 1986, Brockwell and Davis, 1986, Box and Jenkins, 1990, Hamilton, 1994]. Most of these models rely on strong assumptions about the noise terms, often assumed to be i.i.d. random variables sampled from a Gaussian distribution, and the guarantees provided in their support are only asymptotic.

An alternative non-parametric approach to time series analysis consists of extending the standard i.i.d. statistical learning theory framework to that of stochastic processes. In much of this work, the process is assumed to be stationary and suitably mixing [Doukhan, 1994]. Early work along this approach consisted of the VC-dimension bounds for binary classification given by Yu [1994] under the assumption of stationarity and β-mixing. Under the same assumptions, Meir [2000] presented bounds in terms of covering numbers for regression losses, and Mohri and Rostamizadeh [2009] proved general data-dependent Rademacher complexity learning bounds. Vidyasagar [1997] showed that PAC learning algorithms in the i.i.d. setting preserve their PAC learning property in the β-mixing stationary scenario. A similar result was proven by Shalizi and Kontorovitch [2013] for mixtures of β-mixing processes and by Berti and Rigo [1997] and Pestov [2010] for exchangeable random variables. Alquier and Wintenberger [2010] and Alquier et al. [2014] also established PAC-Bayesian learning guarantees under weak dependence and stationarity.

A number of algorithm-dependent bounds have also been derived for the stationary mixing setting. Lozano et al. [2006] studied the convergence of regularized boosting. Mohri and Rostamizadeh [2010] gave data-dependent generalization bounds for stable algorithms for φ-mixing and β-mixing stationary processes.
Steinwart and Christmann [2009] proved fast learning rates for regularized algorithms with α-mixing stationary sequences, and Modha and Masry [1998] gave guarantees for certain classes of models under the same assumptions.

However, stationarity and mixing are often not valid assumptions. For example, even for Markov chains, which are among the most widely used types of stochastic processes in applications, stationarity does not hold unless the Markov chain is started with an equilibrium distribution. Similarly, long memory models such as ARFIMA may not be mixing, or mixing may be arbitrarily slow [Baillie, 1996]. In fact, it is possible to construct first-order autoregressive processes that are not mixing [Andrews, 1983]. Additionally, the mixing assumption is defined only in terms of the distribution of the underlying stochastic process and ignores the loss function and the hypothesis set used. This suggests that mixing may not be the right property to characterize learning in the setting of stochastic processes.

A number of attempts have been made to relax the assumptions of stationarity and mixing. Adams and Nobel [2010] proved asymptotic guarantees for stationary ergodic sequences. Agarwal and Duchi [2013] gave generalization bounds for asymptotically stationary (mixing) processes in the case of stable on-line learning algorithms. Kuznetsov and Mohri [2014] established learning guarantees for fully non-stationary β- and φ-mixing processes.

In this paper, we consider the general case of non-stationary non-mixing processes. We are not aware of any prior work providing generalization bounds in this setting. In fact, our bounds appear to be novel even when the process is stationary (but not mixing). The learning guarantees that we present hold for both bounded and unbounded memory models.
Deriving generalization bounds for unbounded memory models even in the stationary mixing case was an open question prior to our work [Meir, 2000]. Our guarantees cover the majority of approaches used in practice, including various autoregressive and state space models.

The key ingredients of our generalization bounds are a data-dependent measure of sequential complexity (expected sequential covering number or sequential Rademacher complexity [Rakhlin et al., 2010]) and a measure of discrepancy between the sample and target distributions. Kuznetsov and Mohri [2014] also give generalization bounds in terms of discrepancy. However, unlike the result of Kuznetsov and Mohri [2014], our analysis does not require any mixing assumptions, which are hard to verify in practice. More importantly, under some additional mild assumptions, the discrepancy measure that we propose can be estimated from data, which leads to data-dependent learning guarantees for the non-stationary non-mixing case.

We devise new algorithms for non-stationary time series forecasting that benefit from our data-dependent guarantees. The parameters of generative models such as ARIMA are typically estimated via the maximum likelihood technique, which often leads to non-convex optimization problems. In contrast, our objective is convex and leads to an optimization problem with a unique global solution that can be found efficiently. Another issue with standard generative models is that they address non-stationarity in the data via a differencing transformation, which does not always lead to a stationary process. In contrast, we address the problem of non-stationarity in a principled way using our learning guarantees.

The rest of this paper is organized as follows. The formal definition of the time series forecasting learning scenario as well as that of several key concepts is given in Section 2. In Section 3, we introduce and prove our new generalization bounds.
In Section 4, we give data-dependent learning bounds based on the empirical discrepancy. These results, combined with a novel analysis of kernel-based hypotheses for time series forecasting (Appendix B), are used to devise new forecasting algorithms in Section 5. In Appendix C, we report the results of preliminary experiments using these algorithms.

2 Preliminaries

We consider the following general time series prediction setting where the learner receives a realization (X_1, Y_1), \ldots, (X_T, Y_T) of some stochastic process, with (X_t, Y_t) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}. The objective of the learner is to select out of a specified family H a hypothesis h: \mathcal{X} \to \mathcal{Y} that achieves a small generalization error \mathbb{E}[L(h(X_{T+1}), Y_{T+1}) \mid Z_1, \ldots, Z_T] conditioned on observed data, where L: \mathcal{Y} \times \mathcal{Y} \to [0, \infty) is a given loss function. The path-dependent generalization error that we consider in this work is a finer measure of the generalization ability than the averaged generalization error \mathbb{E}[L(h(X_{T+1}), Y_{T+1})] = \mathbb{E}[\mathbb{E}[L(h(X_{T+1}), Y_{T+1}) \mid Z_1, \ldots, Z_T]], since it only takes into consideration the realized history of the stochastic process and does not average over the set of all possible histories. The results that we present in this paper also apply to the setting where the time parameter t can take non-integer values and the prediction lag is an arbitrary number l \geq 0. That is, the error is defined by \mathbb{E}[L(h(X_{T+l}), Y_{T+l}) \mid Z_1, \ldots, Z_T], but for notational simplicity we set l = 1.

Our setup covers a large number of scenarios commonly used in practice. The case \mathcal{X} = \mathcal{Y}^p corresponds to a large class of autoregressive models. Taking \mathcal{X} = \cup_{p=1}^{\infty} \mathcal{Y}^p leads to growing memory models which, in particular, include state space models.
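As an illustration of the autoregressive case \mathcal{X} = \mathcal{Y}^p, the sliding-window embedding that turns a scalar series into training pairs can be sketched as follows (a minimal sketch; the window length p and the helper name are ours, not from the paper):

```python
def autoregressive_pairs(series, p):
    """Turn a scalar series y_1, ..., y_T into pairs (x_t, y_t) with
    x_t = (y_{t-p}, ..., y_{t-1}), matching the case X = Y^p."""
    return [(tuple(series[t - p:t]), series[t]) for t in range(p, len(series))]

# Each x_t is the previous p observations; y_t is the value to forecast.
pairs = autoregressive_pairs([1.0, 2.0, 3.0, 4.0, 5.0], p=2)
```

Growing memory models (\mathcal{X} = \cup_p \mathcal{Y}^p) would instead pass the entire prefix series[:t] as x_t.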
More generally, \mathcal{X} may contain both the history of the process {Y_t} and some additional side information.

To simplify the notation, in the rest of the paper, we will use the shorter notation f(z) = L(h(x), y), for any z = (x, y) \in \mathcal{Z}, and introduce the family F = {(x, y) \to L(h(x), y): h \in H} containing such functions f. We will assume a bounded loss function, that is, |f| \leq M for all f \in F for some M \in \mathbb{R}_+. Finally, we will use the shorthand Z_a^b to denote a sequence of random variables Z_a, Z_{a+1}, \ldots, Z_b.

The key quantity of interest in the analysis of generalization is the following supremum of the empirical process:

\Phi(Z_1^T) = \sup_{f \in F} \Big( \mathbb{E}[f(Z_{T+1}) \mid Z_1^T] - \sum_{t=1}^{T} q_t f(Z_t) \Big),   (1)

where q_1, \ldots, q_T are real numbers, which in the standard learning scenarios are chosen to be uniform. In our general setting, different Z_t's may follow different distributions, thus distinct weights could be assigned to the errors made on different sample points depending on their relevance to forecasting the future Z_{T+1}. The generalization bounds that we present below are for an arbitrary sequence q = (q_1, \ldots, q_T) which, in particular, covers the case of uniform weights. Remarkably, our bounds do not even require the non-negativity of q.

Our generalization bounds are expressed in terms of data-dependent measures of sequential complexity such as expected sequential covering number or sequential Rademacher complexity [Rakhlin et al., 2010]. We give a brief overview of the notion of sequential covering number and refer the reader to the aforementioned reference for further details. We adopt the following definition of a complete binary tree: a \mathcal{Z}-valued complete binary tree z is a sequence (z_1, \ldots, z_T) of T mappings z_t: \{\pm 1\}^{t-1} \to \mathcal{Z}, t \in [1, T]. A path in the tree is \sigma = (\sigma_1, \ldots, \sigma_{T-1}). To simplify the notation, we will write z_t(\sigma) instead of z_t(\sigma_1, \ldots, \sigma_{t-1}), even though z_t depends only on the first t - 1 elements of \sigma. The following definition generalizes the classical notion of covering numbers to the sequential setting. A set V of \mathbb{R}-valued trees of depth T is a sequential \alpha-cover (with respect to the q-weighted \ell_p norm) of a function class G on a tree z of depth T if, for all g \in G and all \sigma \in \{\pm 1\}^T, there is v \in V such that

\Big( \sum_{t=1}^{T} |v_t(\sigma) - g(z_t(\sigma))|^p \Big)^{1/p} \leq \|q\|_q^{-1} \alpha,

where \|\cdot\|_q is the dual norm. The (sequential) covering number N_p(\alpha, G, z) of a function class G on a given tree z is defined to be the size of the minimal sequential cover. The maximal covering number is then taken to be N_p(\alpha, G) = \sup_z N_p(\alpha, G, z). One can check that in the case of uniform weights this definition coincides with the standard definition of sequential covering numbers. Note that this is a purely combinatorial notion of complexity which ignores the distribution of the process in the given learning problem.

Data-dependent sequential covering numbers can be defined as follows. Given a stochastic process distributed according to the distribution p, with p_t(\cdot \mid z_1^{t-1}) denoting the conditional distribution at time t, we sample a \mathcal{Z} \times \mathcal{Z}-valued tree of depth T according to the following procedure. Draw two independent samples Z_1, Z'_1 from p_1; in the left child of the root draw Z_2, Z'_2 according to p_2(\cdot \mid Z_1) and in the right child according to p_2(\cdot \mid Z'_1). More generally, for a node that can be reached by a path (\sigma_1, \ldots, \sigma_t), we draw Z_t, Z'_t according to p_t(\cdot \mid S_1(\sigma_1), \ldots, S_{t-1}(\sigma_{t-1})), where S_t(1) = Z_t and S_t(-1) = Z'_t.
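The tree-sampling procedure just described can be sketched for a simple binary Markov chain (a minimal illustration; the chain, the dictionary representation, and the function names are ours, not from the paper):

```python
import random

def sample_tree(depth, history, cond):
    """Sample a (Z, Z')-valued complete binary tree: at each node draw
    Z_t, Z'_t from the conditional distribution given the path-dependent
    history; the left subtree extends the history with Z_t (sigma_t = +1),
    the right subtree with Z'_t (sigma_t = -1)."""
    if depth == 0:
        return None
    z, z2 = cond(history), cond(history)
    return {"pair": (z, z2),
            "left": sample_tree(depth - 1, history + [z], cond),
            "right": sample_tree(depth - 1, history + [z2], cond)}

def markov_cond(history):
    """Binary Markov chain on {0, 1}: stay in the previous state w.p. 2/3."""
    if not history:
        return random.randint(0, 1)
    prev = history[-1]
    return prev if random.random() < 2 / 3 else 1 - prev

def count_nodes(node):
    return 0 if node is None else (
        1 + count_nodes(node["left"]) + count_nodes(node["right"]))

random.seed(0)
tree = sample_tree(5, [], markov_cond)  # depth T = 5, hence 2^5 - 1 = 31 nodes
```

The resulting tree z (keeping, say, the first component of each pair) is what the expected covering number E_{z ~ T(p)}[N_p(α, G, z)] averages over.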
Let z denote the tree formed using the Z_t's and define the expected covering number to be \mathbb{E}_{z \sim T(p)}[N_p(\alpha, G, z)], where T(p) denotes the distribution of z. In a similar manner, one can define other measures of complexity such as sequential Rademacher complexity and the Littlestone dimension [Rakhlin et al., 2015] as well as their data-dependent counterparts [Rakhlin et al., 2011].

The final ingredient needed for expressing our learning guarantees is the notion of discrepancy between the target distribution and the distribution of the sample:

\Delta = \sup_{f \in F} \Big( \mathbb{E}[f(Z_{T+1}) \mid Z_1^T] - \sum_{t=1}^{T} q_t\, \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] \Big).   (2)

The discrepancy \Delta is a natural measure of the non-stationarity of the stochastic process Z with respect to both the loss function L and the hypothesis set H. In particular, note that if the process Z is i.i.d., then we simply have \Delta = 0, provided that the q_t's form a probability distribution. It is also possible to give bounds on \Delta in terms of other natural distances between distributions. For instance, Pinsker's inequality yields

\Delta \leq M \Big\| P_{T+1}(\cdot \mid Z_1^T) - \sum_{t=1}^{T} q_t P_t(\cdot \mid Z_1^{t-1}) \Big\|_{TV} \leq M \sqrt{ \tfrac{1}{2} D\Big( P_{T+1}(\cdot \mid Z_1^T) \,\Big\|\, \sum_{t=1}^{T} q_t P_t(\cdot \mid Z_1^{t-1}) \Big) },

where \|\cdot\|_{TV} is the total variation distance and D(\cdot \| \cdot) the relative entropy, P_{t+1}(\cdot \mid Z_1^t) the conditional distribution of Z_{t+1}, and \sum_{t=1}^{T} q_t P_t(\cdot \mid Z_1^{t-1}) the mixture of the sample marginals. Alternatively, if the target distribution at lag l, P = P_{T+l}, is the stationary distribution of an asymptotically stationary process Z [Agarwal and Duchi, 2013, Kuznetsov and Mohri, 2014], then for q_t = 1/T we have

\Delta \leq \frac{M}{T} \sum_{t=1}^{T} \big\| P - P_{t+l}(\cdot \mid Z_1^t) \big\|_{TV} \leq M \phi(l),

where \phi(l) = \sup_s \sup_z \| P - P_{l+s}(\cdot \mid z_1^s) \|_{TV} is the coefficient of asymptotic stationarity. The process is asymptotically stationary if \lim_{l \to \infty} \phi(l) = 0.
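The Pinsker step used in the bound above can be checked numerically for discrete distributions (a minimal sketch with made-up distributions; the function names are ours):

```python
import math

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats (matching supports assumed)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]    # stand-in for the target conditional distribution
q = [0.75, 0.25]  # stand-in for the mixture of sample marginals
tv = tv_distance(p, q)
pinsker_bound = math.sqrt(0.5 * kl_divergence(p, q))  # TV <= sqrt(D/2)
```

Here tv is 0.25 and the Pinsker bound is about 0.268, so the inequality holds with some slack.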
However, the most important property of the discrepancy \Delta is that, as shown later in Section 4, it can be estimated from data under some additional mild assumptions. Kuznetsov and Mohri [2014] also give generalization bounds for non-stationary mixing processes in terms of a related notion of discrepancy. It is not known if the discrepancy measure used in [Kuznetsov and Mohri, 2014] can be estimated from data.

3 Generalization Bounds

In this section, we prove new generalization bounds for forecasting non-stationary time series. The first step consists of using decoupled tangent sequences to establish concentration results for the supremum of the empirical process \Phi(Z_1^T). Given a sequence of random variables Z_1^T, we say that Z'^T_1 is a decoupled tangent sequence if Z'_t is distributed according to P(\cdot \mid Z_1^{t-1}) and is independent of Z_t. It is always possible to construct such a sequence of random variables [De la Peña and Giné, 1999]. The next theorem is the main result of this section.

Theorem 1. Let Z_1^T be a sequence of random variables distributed according to p. Fix \epsilon > 2\alpha > 0. Then, the following holds:

\Pr\big( \Phi(Z_1^T) - \Delta \geq \epsilon \big) \leq \mathbb{E}_{v \sim T(p)}\big[ N_1(\alpha, F, v) \big] \exp\Big( -\frac{(\epsilon - 2\alpha)^2}{2 M^2 \|q\|_2^2} \Big).

Proof. The first step is to observe that, since the difference of the suprema is upper bounded by the supremum of the difference, it suffices to bound the probability of the following event:

\sup_{f \in F} \sum_{t=1}^{T} q_t \big( \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] - f(Z_t) \big) \geq \epsilon.

By Markov's inequality, for any \lambda > 0, the following inequality holds:

\Pr\Big( \sup_{f \in F} \sum_{t=1}^{T} q_t \big( \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] - f(Z_t) \big) \geq \epsilon \Big) \leq \exp(-\lambda\epsilon)\, \mathbb{E}\Big[ \exp\Big( \lambda \sup_{f \in F} \sum_{t=1}^{T} q_t \big( \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] - f(Z_t) \big) \Big) \Big].

Since Z'^T_1 is a tangent sequence, the following equalities hold: \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] = \mathbb{E}[f(Z'_t) \mid Z_1^{t-1}] = \mathbb{E}[f(Z'_t) \mid Z_1^T]. Using these equalities and Jensen's inequality, we obtain the following:

\mathbb{E}\Big[ \exp\Big( \lambda \sup_{f \in F} \sum_{t=1}^{T} q_t \big( \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] - f(Z_t) \big) \Big) \Big]
= \mathbb{E}\Big[ \exp\Big( \lambda \sup_{f \in F} \mathbb{E}\Big[ \sum_{t=1}^{T} q_t \big( f(Z'_t) - f(Z_t) \big) \,\Big|\, Z_1^T \Big] \Big) \Big]
\leq \mathbb{E}\Big[ \exp\Big( \lambda \sup_{f \in F} \sum_{t=1}^{T} q_t \big( f(Z'_t) - f(Z_t) \big) \Big) \Big],

where the last expectation is taken over the joint measure of Z_1^T and Z'^T_1. Applying Lemma 5 (Appendix A), we can further bound this expectation by

\mathbb{E}_{(z,z') \sim T(p)}\, \mathbb{E}_{\sigma}\Big[ \exp\Big( \lambda \sup_{f \in F} \sum_{t=1}^{T} \sigma_t q_t \big( f(z'_t(\sigma)) - f(z_t(\sigma)) \big) \Big) \Big]
\leq \mathbb{E}_{(z,z') \sim T(p)}\, \mathbb{E}_{\sigma}\Big[ \exp\Big( \lambda \sup_{f \in F} \sum_{t=1}^{T} \sigma_t q_t f(z'_t(\sigma)) + \lambda \sup_{f \in F} \sum_{t=1}^{T} (-\sigma_t) q_t f(z_t(\sigma)) \Big) \Big]
\leq \tfrac{1}{2}\, \mathbb{E}_{(z,z')}\, \mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \sup_{f \in F} \sum_{t=1}^{T} \sigma_t q_t f(z'_t(\sigma)) \Big) \Big] + \tfrac{1}{2}\, \mathbb{E}_{(z,z')}\, \mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \sup_{f \in F} \sum_{t=1}^{T} (-\sigma_t) q_t f(z_t(\sigma)) \Big) \Big]
= \mathbb{E}_{z \sim T(p)}\, \mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \sup_{f \in F} \sum_{t=1}^{T} \sigma_t q_t f(z_t(\sigma)) \Big) \Big],

where for the second inequality we used Young's inequality and for the last equality we used symmetry. Given z, let C denote the minimal \alpha-cover with respect to the q-weighted \ell_1-norm of F on z. Then, the following bound holds:

\sup_{f \in F} \sum_{t=1}^{T} \sigma_t q_t f(z_t(\sigma)) \leq \max_{c \in C} \sum_{t=1}^{T} \sigma_t q_t c_t(\sigma) + \alpha.

By the monotonicity of the exponential function,

\mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \sup_{f \in F} \sum_{t=1}^{T} \sigma_t q_t f(z_t(\sigma)) \Big) \Big] \leq \exp(2\lambda\alpha)\, \mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \max_{c \in C} \sum_{t=1}^{T} \sigma_t q_t c_t(\sigma) \Big) \Big] \leq \exp(2\lambda\alpha) \sum_{c \in C} \mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \sum_{t=1}^{T} \sigma_t q_t c_t(\sigma) \Big) \Big].

Since c_T(\sigma) depends only on \sigma_1, \ldots, \sigma_{T-1}, by Hoeffding's bound,

\mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \sum_{t=1}^{T} \sigma_t q_t c_t(\sigma) \Big) \Big] = \mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \sum_{t=1}^{T-1} \sigma_t q_t c_t(\sigma) \Big)\, \mathbb{E}_{\sigma_T}\Big[ \exp\big( 2\lambda \sigma_T q_T c_T(\sigma) \big) \Big] \Big] \leq \mathbb{E}_{\sigma}\Big[ \exp\Big( 2\lambda \sum_{t=1}^{T-1} \sigma_t q_t c_t(\sigma) \Big) \Big] \exp\big( 2\lambda^2 q_T^2 M^2 \big),

and iterating this inequality and using the union bound, we obtain the following:

\Pr\Big( \sup_{f \in F} \sum_{t=1}^{T} q_t \big( \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] - f(Z_t) \big) \geq \epsilon \Big) \leq \mathbb{E}_{v \sim T(p)}\big[ N_1(\alpha, F, v) \big] \exp\big( -\lambda(\epsilon - 2\alpha) + 2\lambda^2 M^2 \|q\|_2^2 \big).

Optimizing over \lambda completes the proof.

An immediate consequence of Theorem 1 is the following result.

Corollary 2. For any \delta > 0, with probability at least 1 - \delta, for all f \in F and all \alpha > 0,

\mathbb{E}[f(Z_{T+1}) \mid Z_1^T] \leq \sum_{t=1}^{T} q_t f(Z_t) + \Delta + 2\alpha + M \|q\|_2 \sqrt{ 2 \log \frac{\mathbb{E}_{v \sim T(p)}[N_1(\alpha, F, v)]}{\delta} }.

We are not aware of other finite sample bounds in the non-stationary non-mixing case. In fact, our bounds appear to be novel even in the stationary non-mixing case. Using chaining techniques, Theorem 1 and Corollary 2 can be further improved, and we will present these results in the full version of this paper.

While Rakhlin et al. [2015] give high probability bounds for a quantity different from the quantity of interest in time series prediction,

\sup_{f \in F} \Big( \sum_{t=1}^{T} q_t \big( \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] - f(Z_t) \big) \Big),   (3)

their analysis of this quantity can also be used in our context to derive high probability bounds for \Phi(Z_1^T) - \Delta. However, this approach results in bounds that are in terms of purely combinatorial notions such as maximal sequential covering numbers N_1(\alpha, F). While at first sight this may seem like a minor technical detail, the distinction is crucial in the setting of time series prediction. Consider the following example. Let Z_1 be drawn from a uniform distribution on {0, 1} and Z_t \sim p(\cdot \mid Z_{t-1}), with p(\cdot \mid y) a distribution over {0, 1} such that p(x \mid y) = 2/3 if x = y and 1/3 otherwise. Let G be defined by G = {g(x) = 1_{x \leq \theta}: \theta \in [0, 1]}. Then, one can check that \mathbb{E}_{v \sim T(p)}[N_1(\alpha, G, v)] = 2, while N_1(\alpha, G) \geq 2^T. The data-dependent bounds of Theorem 1 and Corollary 2 highlight the fact that the task of time series prediction lies in between the familiar i.i.d. scenario and the adversarial on-line learning setting.

However, the key component of our learning guarantees is the discrepancy term \Delta. Note that in the general non-stationary case, the bounds of Theorem 1 may not converge to zero, due to the discrepancy between the target and sample distributions. This is also consistent with the lower bounds of Barve and Long [1996] that we discuss in more detail in Section 4. However, convergence can be established in some special cases. In the i.i.d. case our bounds reduce to the standard covering-number learning guarantees. In the drifting scenario, with Z_1^T being a sequence of independent random variables, our discrepancy measure coincides with the one used and studied in [Mohri and Muñoz Medina, 2012]. Convergence can also be established in asymptotically stationary and stationary mixing cases. However, as we show in Section 4, the most important advantage of our bounds is that the discrepancy measure we use can be estimated from data.

4 Estimating Discrepancy

In Section 3, we showed that the discrepancy \Delta is crucial for forecasting non-stationary time series. In particular, if we could select a distribution q over the sample Z_1^T that would minimize the discrepancy \Delta and use it to weight training points, then we would have a better learning guarantee for an algorithm trained on this weighted sample. In some special cases, the discrepancy \Delta can be computed analytically. However, in general, we do not have access to the distribution of Z_1^T and hence we need to estimate the discrepancy from the data. Furthermore, in practice, we never observe Z_{T+1} and it is not possible to estimate \Delta without some further assumptions. One natural assumption is that the distribution P_t of Z_t does not change drastically with t on average. Under this assumption, the last s observations Z_{T-s+1}^T are effectively drawn from a distribution close to P_{T+1}.
More precisely, we can write

\Delta \leq \sup_{f \in F} \Big( \frac{1}{s} \sum_{t=T-s+1}^{T} \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] - \sum_{t=1}^{T} q_t\, \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] \Big) + \sup_{f \in F} \Big( \mathbb{E}[f(Z_{T+1}) \mid Z_1^T] - \frac{1}{s} \sum_{t=T-s+1}^{T} \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] \Big).

We will assume that the second term, denoted by \Delta_s, is sufficiently small and will show that the first term can be estimated from data. But, we first note that our assumption is necessary for learning in this setting. Observe that

\sup_{f \in F} \Big( \mathbb{E}[f(Z_{T+1}) \mid Z_1^T] - \mathbb{E}[f(Z_r) \mid Z_1^{r-1}] \Big) \leq \sum_{t=r}^{T} \sup_{f \in F} \Big( \mathbb{E}[f(Z_{t+1}) \mid Z_1^t] - \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] \Big) \leq M \sum_{t=r}^{T} \big\| P_{t+1}(\cdot \mid Z_1^t) - P_t(\cdot \mid Z_1^{t-1}) \big\|_{TV},

for all r = T-s+1, \ldots, T. Therefore, we must have

\Delta_s \leq \frac{1}{s} \sum_{t=T-s+1}^{T} \sup_{f \in F} \Big( \mathbb{E}[f(Z_{T+1}) \mid Z_1^T] - \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] \Big) \leq \frac{s+1}{2} M \gamma,

where \gamma = \sup_t \| P_{t+1}(\cdot \mid Z_1^t) - P_t(\cdot \mid Z_1^{t-1}) \|_{TV}. Barve and Long [1996] showed that [\gamma\, \mathrm{VC\text{-}dim}(H)]^{1/3} is a lower bound on the generalization error in the setting of binary classification where Z_1^T is a sequence of independent but not identically distributed random variables (drifting). This setting is a special case of the more general scenario that we are considering.

The following result shows that we can estimate the first term in the upper bound on \Delta.

Theorem 3. Let Z_1^T be a sequence of random variables. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all \alpha > 0:

\sup_{f \in F} \Big( \sum_{t=1}^{T} (p_t - q_t)\, \mathbb{E}[f(Z_t) \mid Z_1^{t-1}] \Big) \leq \sup_{f \in F} \Big( \sum_{t=1}^{T} (p_t - q_t) f(Z_t) \Big) + B,

where B = 2\alpha + M \|q - p\|_2 \sqrt{ 2 \log \frac{\mathbb{E}_{z \sim T(p)}[N_1(\alpha, F, z)]}{\delta} } and where p is the uniform distribution over the last s points.

The proof of this result is given in Appendix A. Theorem 1 and Theorem 3 combined with the union bound yield the following result.

Corollary 4. Let Z_1^T be a sequence of random variables. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f \in F and all \alpha > 0:

\mathbb{E}[f(Z_{T+1}) \mid Z_1^T] \leq \sum_{t=1}^{T} q_t f(Z_t) + \widetilde{\Delta} + \Delta_s + 4\alpha + M \big( \|q\|_2 + \|q - p\|_2 \big) \sqrt{ 2 \log \frac{2\, \mathbb{E}_{v \sim T(p)}[N_1(\alpha, F, v)]}{\delta} },

where \widetilde{\Delta} = \sup_{f \in F} \big( \sum_{t=1}^{T} (p_t - q_t) f(Z_t) \big).

5 Algorithms

In this section, we use our learning guarantees to devise algorithms for forecasting non-stationary time series. We consider a broad family of kernel-based hypothesis classes with regression losses. We present the full analysis of this setting in Appendix B, including novel bounds on the sequential Rademacher complexity. The learning bounds of Theorem 1 can be generalized to hold uniformly over q at the price of an additional term in O\big( \|q - u\|_1 \sqrt{ \log_2 \log_2 \|q - u\|_1^{-1} } \big). We prove this result in Theorem 8 (Appendix B). Suppose L is the squared loss and H = {x \to w \cdot \Phi(x): \|w\|_H \leq \Lambda}, where \Phi: \mathcal{X} \to H is a feature mapping from \mathcal{X} to a Hilbert space H. By Lemma 6 (Appendix B), we can bound the complexity term in our generalization bounds by

O\Big( (\log^3 T)\, \frac{\Lambda r}{\sqrt{T}} + (\log^3 T)\, \|q - u\|_1 \Big),

where K is a PDS kernel associated with H such that \sup_x K(x, x) \leq r and u is the uniform distribution over the sample. Then, we can formulate a joint optimization problem over both q and w based on the learning guarantee of Theorem 8, which holds uniformly over all q:

\min_{0 \leq q \leq 1,\, w} \Big\{ \sum_{t=1}^{T} q_t \big( w \cdot \Phi(x_t) - y_t \big)^2 + \lambda_1 \sum_{t=1}^{T} d_t q_t + \lambda_2 \|w\|_H^2 + \lambda_3 \|q - u\|_1 \Big\}.   (4)

Here, we have upper bounded the empirical discrepancy term by \sum_{t=1}^{T} d_t q_t, with each d_t defined by \sup_{\|w'\| \leq \Lambda} \big| \sum_{s=1}^{T} p_s (w' \cdot \Phi(x_s) - y_s)^2 - (w' \cdot \Phi(x_t) - y_t)^2 \big|. Each d_t can be precomputed using DC-programming. For general loss functions, the DC-programming approach only guarantees convergence to a stationary point.
However, for the squared loss, our problem can be cast as an instance of the trust region problem, which can be solved globally using the DCA algorithm of Tao and An [1998]. Note that problem (4) is not jointly convex in q and w. However, using the dual problem associated to w, it can be rewritten as follows:

\min_{0 \leq q \leq 1} \Big\{ \max_{\alpha} \Big( -\lambda_2 \sum_{t=1}^{T} \frac{\alpha_t^2}{q_t} - \alpha^T K \alpha + 2\lambda_2 \alpha^T Y \Big) + \lambda_1 (d \cdot q) + \lambda_3 \|q - u\|_1 \Big\},   (5)

where d = (d_1, \ldots, d_T)^T, K is the kernel matrix and Y = (y_1, \ldots, y_T)^T. We use the change of variables r_t = 1/q_t and further upper bound \lambda_3 \|q - u\|_1 by \lambda'_3 \|r - T^2 u\|_2, which follows from |q_t - u_t| = |q_t u_t (r_t - T)| and Hölder's inequality. Then, this yields the following optimization problem:

\min_{r \in D} \Big\{ \max_{\alpha} \Big( -\lambda_2 \sum_{t=1}^{T} r_t \alpha_t^2 - \alpha^T K \alpha + 2\lambda_2 \alpha^T Y \Big) + \lambda_1 \sum_{t=1}^{T} \frac{d_t}{r_t} + \lambda_3 \|r - T^2 u\|_2^2 \Big\},   (6)

where D = {r: r_t \geq 1, t \in [1, T]}. The optimization problem (6) is convex since D is a convex set and the first term in (6) is convex as a maximum of convex (linear) functions of r. This problem can be solved using standard descent methods, where, at each iteration, we solve a standard QP in \alpha, which admits a closed-form solution. Parameters \lambda_1, \lambda_2, and \lambda_3 are selected through cross-validation.

An alternative simpler algorithm, based on the data-dependent bounds of Corollary 4, consists of first finding a distribution q minimizing the (regularized) discrepancy and then using that q to find a hypothesis minimizing the (regularized) weighted empirical risk. This leads to the following two-stage procedure. First, we find a solution q* of the following convex optimization problem:

\min_{q \geq 0} \Big\{ \sup_{\|w'\| \leq \Lambda} \Big( \sum_{t=1}^{T} (p_t - q_t) \big( w' \cdot \Phi(x_t) - y_t \big)^2 \Big) + \lambda_1 \|q - u\|_1 \Big\},   (7)

where \lambda_1 and \Lambda are parameters that can be selected via cross-validation.
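The second stage of this two-stage procedure is a weighted kernel ridge regression on the q*-weighted sample, and it admits a closed-form dual solution: setting the gradient of \sum_t q^*_t (w \cdot \Phi(x_t) - y_t)^2 + \lambda_2 \|w\|^2 to zero and writing w = \sum_t a_t \Phi(x_t) gives the linear system (Q K + \lambda_2 I) a = Q y with Q = diag(q*). A minimal sketch with a linear kernel and made-up data (the solver and variable names are ours, not from the paper):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def weighted_krr(K, y, q, lam):
    """Solve (Q K + lam I) a = Q y for the dual variables a (q_t > 0)."""
    n = len(y)
    A = [[q[i] * K[i][j] + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, [q[i] * y[i] for i in range(n)])

# Toy sample with a linear kernel K[i][j] = x_i * x_j; recent points up-weighted.
xs = [1.0, 2.0, 3.0]
K = [[xi * xj for xj in xs] for xi in xs]
y = [1.1, 1.9, 3.2]
q = [0.1, 0.3, 0.6]
a = weighted_krr(K, y, q, lam=0.1)

def predict(x):
    return sum(ai * xi * x for ai, xi in zip(a, xs))
```

With uniform weights q_t = 1/T this reduces to standard kernel ridge regression, since (QK + λI)a = Qy becomes (K + Tλ I)a = y.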
Our generalization bounds hold for arbitrary weights q, but we restrict them here to being positive sequences. Note that other regularization terms such as \|q\|_2^2 and \|q - p\|_2^2 from the bound of Corollary 4 can be incorporated in the optimization problem, but we discard them to minimize the number of parameters. This problem can be solved using standard descent optimization methods, where, at each step, we use DC-programming to evaluate the supremum over w'. Alternatively, one can upper bound the supremum by \sum_{t=1}^{T} q_t d_t and then solve the resulting optimization problem.

The solution q* of (7) is then used to solve the following (weighted) kernel ridge regression problem:

\min_{w} \Big\{ \sum_{t=1}^{T} q^*_t \big( w \cdot \Phi(x_t) - y_t \big)^2 + \lambda_2 \|w\|_H^2 \Big\}.   (8)

Note that, in order to guarantee the convexity of this problem, we require q* \geq 0.

6 Conclusion

We presented a general theoretical analysis of learning in the broad scenario of non-stationary non-mixing processes, the realistic setting for a variety of applications. We discussed in detail several algorithms benefitting from the learning guarantees presented. Our theory can also provide a finer analysis of several existing algorithms and help devise alternative principled learning algorithms.

Acknowledgments

This work was partly funded by NSF IIS-1117591 and CCF-1535987, and the NSERC PGS D3.

References

T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4):1345-1367, 2010.
A. Agarwal and J. Duchi. The generalization ability of online algorithms for dependent data. Information Theory, IEEE Transactions on, 59(1):573-587, 2013.
P. Alquier and O. Wintenberger. Model selection for weakly dependent time series forecasting. Technical Report 2010-39, Centre de Recherche en Economie et Statistique, 2010.
P. Alquier, X. Li, and O. Wintenberger. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modelling, 1:65-93, 2014.
D. Andrews. First order autoregressive processes and strong mixing. Cowles Foundation Discussion Papers 664, Cowles Foundation for Research in Economics, Yale University, 1983.
R. Baillie. Long memory processes and fractional integration in econometrics. Journal of Econometrics, 73(1):5-59, 1996.
R. D. Barve and P. M. Long. On the complexity of learning from drifting distributions. In COLT, 1996.
P. Berti and P. Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32(4):385-391, 1997.
T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 1986.
G. E. P. Box and G. Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990.
P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer-Verlag, New York, 1986.
V. H. De la Peña and E. Giné. Decoupling: From Dependence to Independence: Randomly Stopped Processes, U-statistics and Processes, Martingales and Beyond. Probability and its Applications. Springer, New York, 1999.
P. Doukhan. Mixing: Properties and Examples. Lecture Notes in Statistics. Springer-Verlag, New York, 1994.
R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987-1007, 1982.
J. D. Hamilton. Time Series Analysis. Princeton, 1994.
V. Kuznetsov and M. Mohri. Generalization bounds for time series prediction with non-stationary processes. In ALT, 2014.
A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, pages 819-826, 2006.
R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, pages 5-34, 2000.
D. Modha and E. Masry. Memory-universal prediction of stationary random processes. Information Theory, IEEE Transactions on, 44(1):117-133, January 1998.
M. Mohri and A. Muñoz Medina. New analysis and algorithm for learning with drifting distributions. In ALT, 2012.
M. Mohri and A. Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In NIPS, 2009.
M. Mohri and A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789-814, 2010.
V. Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. In GRC, 2010.
A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010.
A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NIPS, 2011.
A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 2015.
C. Shalizi and A. Kontorovitch. Predictive PAC learning and process decompositions. In NIPS, 2013.
I. Steinwart and A. Christmann. Fast learning from non-i.i.d. observations. In NIPS, 2009.
P. D. Tao and L. T. H. An. A D.C. optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476-505, 1998.
M. Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag New York, Inc., 1997.
B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94-116, 1994.
", "award": [], "sourceid": 371, "authors": [{"given_name": "Vitaly", "family_name": "Kuznetsov", "institution": "Courant Institute"}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": "Courant Institute and Google"}]}