{"title": "Structural Risk Minimization for Nonparametric Time Series Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 308, "page_last": 314, "abstract": null, "full_text": "Structural Risk Minimization for \n\nNonparametric Time Series Prediction \n\nRon Meir* \n\nDepartment of Electrical Engineering \n\nTechnion, Haifa 32000, Israel \n\nrmeir@dumbo.technion.ac.il \n\nAbstract \n\nThe problem of time series prediction is studied within the uniform con(cid:173)\nvergence framework of Vapnik and Chervonenkis. The dependence in(cid:173)\nherent in the temporal structure is incorporated into the analysis, thereby \ngeneralizing the available theory for memoryless processes. Finite sam(cid:173)\nple bounds are calculated in terms of covering numbers of the approxi(cid:173)\nmating class, and the tradeoff between approximation and estimation is \ndiscussed. A complexity regularization approach is outlined, based on \nVapnik's method of Structural Risk Minimization, and shown to be ap(cid:173)\nplicable in the context of mixing stochastic processes. \n\n1 Time Series Prediction and Mixing Processes \n\nA great deal of effort has been expended in recent years on the problem of deriving robust \ndistribution-free error bounds for learning, mainly in the context of memory less processes \n(e.g. [9]). On the other hand, an extensive amount of work has been devoted by statisticians \nand econometricians to the study of parametric (often linear) models of time series, where \nthe dependence inherent in the sample, precludes straightforward application of many of \nthe standard results form the theory of memoryless processes. In this work we propose \nan extension of the framework pioneered by Vapnik and Chervonenkis to the problem of \ntime series prediction. Some of the more elementary proofs are sketched, while the main \ntechnical results will be proved in detail in the full version of the paper. 
\nConsider a stationary stochastic process X = {..., X_{-1}, X_0, X_1, ...}, where X_i is a random variable defined over a compact domain in R and such that |X_i| ≤ B with probability 1, for some positive constant B. The problem of one-step prediction, in the mean square sense, can then be phrased as that of finding a function f(·) of the infinite past, such that E|X_0 - f(X_{-∞}^{-1})|^2 is minimal, where we use the notation X_i^j = (X_i, X_{i+1}, ..., X_j), j ≥ i. \n\n*This work was supported in part by a grant from the Israel Science Foundation. \n\nIt is well known that the optimal predictor in this case is given by the conditional mean, E[X_0 | X_{-∞}^{-1}]. While this solution, in principle, settles the issue of optimal prediction, it does not settle the issue of actually computing the optimal predictor. First of all, note that to compute the conditional mean, the probabilistic law generating the stochastic process X must be known. Furthermore, the requirement of knowing the full past, X_{-∞}^{-1}, is of course rather stringent. In this work we consider the more practical situation, where a finite sub-sequence X_1^N = (X_1, X_2, ..., X_N) is observed, and an optimal prediction is needed, conditioned on this data. Moreover, for each finite sample size N we allow the predictors to be based only on a finite lag vector of size d. Ultimately, in order to achieve full generality one may let d → ∞ as N → ∞ in order to obtain the optimal predictor. We first consider the problem of selecting an empirical estimator from a class of functions F_{d,n} : R^d → R, where n is a complexity index of the class (for example, the number of computational nodes in a feedforward neural network with a single hidden layer), and |f| ≤ B for f ∈ F_{d,n}. 
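As a concrete illustration of the finite-lag setup, the following sketch (hypothetical helper name, not part of the paper) builds the d-dimensional lag vectors X_{i-d}^{i-1} and one-step targets X_i from an observed sequence X_1^N:

```python
def lag_matrix(x, d):
    # Build lag-d inputs and one-step targets from a scalar series:
    # inputs[k] = (X_{k+1}, ..., X_{k+d}), targets[k] = X_{k+d+1},
    # mirroring the paper's d-dimensional lag vectors X_{i-d}^{i-1}.
    if len(x) <= d:
        raise ValueError('need more than d observations')
    inputs = [tuple(x[k:k + d]) for k in range(len(x) - d)]
    targets = list(x[d:])
    return inputs, targets
```

Any candidate predictor f in F_{d,n} can then be fit to these (input, target) pairs by empirical error minimization.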
Consider then an empirical predictor f_{d,n,N}(X_{i-d}^{i-1}), i > N, for X_i based on the finite data set X_1^N and depending on the d-dimensional lag vector X_{i-d}^{i-1}, where f_{d,n,N} ∈ F_{d,n}. It is possible to split the error incurred by this predictor into three terms, each possessing a rather intuitive meaning. It is the competition between these terms which determines the optimal solution, for a fixed amount of data. First, define the loss of a functional predictor f : R^d → R as L(f) = E|X_i - f(X_{i-d}^{i-1})|^2, and let f*_{d,n} be the optimal function in F_{d,n} minimizing this loss. Furthermore, denote the optimal lag-d predictor by f*_d, and its associated loss by L*_d. We are then able to split the loss of the empirical predictor f_{d,n,N} into three basic components, \n\nL(f_{d,n,N}) = (L_{d,n,N} - L*_{d,n}) + (L*_{d,n} - L*_d) + L*_d,   (1) \n\nwhere L_{d,n,N} = L(f_{d,n,N}). The third term, L*_d, is related to the error incurred in using a finite memory model (of lag size d) to predict a process with potentially infinite memory. We do not at present have any useful upper bounds for this term, which is related to the rate of convergence in the martingale convergence theorem, which to the best of our knowledge is unknown for the type of mixing processes we study in this work. The second term in (1) is related to the so-called approximation error, given by E|f*_d(X_{i-d}^{i-1}) - f*_{d,n}(X_{i-d}^{i-1})|^2, to which it can be immediately related through the inequality ||a|^p - |b|^p| ≤ p|a - b| |max(a, b)|^{p-1}. This term measures the excess error incurred by selecting a function f from a class of limited complexity F_{d,n}, while the optimal lag-d predictor f*_d may be arbitrarily complex. Of course, in order to bound this term we will have to make some regularity assumptions about the latter function. Finally, the first term in (1) represents the so-called estimation error, and is the only term which depends on the data X_1^N. Similarly to the problem of regression for i.i.d. 
data, we expect that the approximation and estimation terms lead to conflicting demands on the choice of the complexity, n, of the functional class F_{d,n}. Clearly, in order to minimize the approximation error the complexity should be made as large as possible. However, doing this will cause the estimation error to increase, because of the larger freedom in choosing a specific function in F_{d,n} to fit the data. Moreover, in the case of time series there is an additional complication resulting from the fact that the misspecification error L*_d is minimized by choosing d to be as large as possible, while this has the effect of increasing both the approximation as well as the estimation errors. We thus expect that some optimal values of d and n exist for each sample size N. \n\nUp to this point we have not specified how to select the empirical estimator f_{d,n,N}. In this work we follow the ideas of Vapnik [8], which have been studied extensively in the context of i.i.d. observations, and restrict our selection to that hypothesis which minimizes the empirical error, given by L_N(f) = (N-d)^{-1} Σ_{i=d+1}^N |X_i - f(X_{i-d}^{i-1})|^2. For this function it is easy to establish (see for example [8]) that (L_{d,n,N} - L*_{d,n}) ≤ 2 sup_{f ∈ F_{d,n}} |L(f) - L_N(f)|. The main distinction from the i.i.d. case, of course, is that the random variables appearing in the empirical error, L_N(f), are no longer independent. It is clear at this point that some assumptions are needed regarding the stochastic process X, in order that a law of large numbers may be established. In any event, it is obvious that the standard approach of using randomization and symmetrization as in the i.i.d. case [3] will not work here. To circumvent this problem, two approaches have been proposed. The first makes use of the so-called method of sieves together with extensions of the Bernstein inequality to dependent data [6]. 
\nThe second approach, to be pursued here, is based on mapping the problem onto one characterized by an i.i.d. process [10], and the utilization of the standard results for the latter case. \n\nIn order to have some control of the estimation error discussed above, we will restrict ourselves in this work to the class of so-called mixing processes. These are processes for which the 'future' depends only weakly on the 'past', in a sense that will now be made precise. Following the definitions and notation of Yu [10], which will be utilized in the sequel, let σ_1^t = σ(X_1^t) and σ_{t+m}^∞ = σ(X_{t+m}^∞) be the sigma-algebras of events generated by the random variables X_1^t = (X_1, X_2, ..., X_t) and X_{t+m}^∞ = (X_{t+m}, X_{t+m+1}, ...), respectively. We then define β_m, the coefficient of absolute regularity, as β_m = sup_{t ≥ 1} E sup {|P(B | σ_1^t) - P(B)| : B ∈ σ_{t+m}^∞}, where the expectation is taken with respect to σ_1^t = σ(X_1^t). A stochastic process is said to be β-mixing if β_m → 0 as m → ∞. We note that there exist many other definitions of mixing (see [2] for details). The motivation for using the β-mixing coefficient is that it is the weakest form of mixing for which uniform laws of large numbers can be established. In this work we consider two types of processes for which this coefficient decays to zero, namely algebraically decaying processes for which β_m ≤ β̄ m^{-r}, β̄, r > 0, and exponentially mixing processes for which β_m ≤ β̄ exp{-b m^κ}, β̄, b, κ > 0. Note that for Markov processes mixing implies exponential mixing, so that at least in this case, there is no loss of generality in assuming that the process is exponentially mixing. Note also that the usual i.i.d. process may be obtained from either the exponentially or algebraically mixing process, by taking the limit κ → ∞ or r → ∞, respectively. 
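Returning to the empirical error L_N(f) = (N-d)^{-1} Σ_{i=d+1}^N |X_i - f(X_{i-d}^{i-1})|^2 introduced above, it can be computed directly; a minimal sketch (function name is illustrative, not from the paper):

```python
def empirical_loss(f, x, d):
    # L_N(f) = (N - d)^{-1} * sum_{i=d+1}^{N} |X_i - f(X_{i-d}^{i-1})|^2,
    # where f maps a length-d lag vector to a prediction of the next value.
    N = len(x)
    if N <= d:
        raise ValueError('need more than d observations')
    errs = [(x[i] - f(x[i - d:i])) ** 2 for i in range(d, N)]
    return sum(errs) / (N - d)
```

For example, the 'persistence' predictor f(v) = v[-1] simply repeats the last observed value; the empirically optimal predictor f_{d,n,N} is the member of F_{d,n} making this quantity smallest.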
\n\nIn this section we follow the approach taken by Yu [10] in deriving uniform laws of large numbers for mixing processes, extending her mainly asymptotic results to finite sample behavior, and somewhat broadening the class of processes considered by her. The basic idea in [10], as in many related approaches, involves the construction of an independent-block sequence, which is shown to be 'close' to the original process in a well-defined probabilistic sense. We briefly recapitulate the construction, slightly modifying the notation in [10] to fit in with the present paper. Divide the sequence X_1^N into 2μ_N blocks, each of size a_N; we assume for simplicity that N = 2μ_N a_N. The blocks are then numbered according to their order in the block-sequence. For 1 ≤ j ≤ μ_N define H_j = {i : 2(j-1)a_N + 1 ≤ i ≤ (2j-1)a_N} and T_j = {i : (2j-1)a_N + 1 ≤ i ≤ (2j)a_N}. Denote the random variables corresponding to the H_j and T_j indices as X(j) = {X_i : i ∈ H_j} and X'(j) = {X_i : i ∈ T_j}. The sequence of H-blocks is then denoted by X^{a_N} = {X(j)}_{j=1}^{μ_N}. Now, construct a sequence of independent and identically distributed (i.i.d.) blocks {Ξ(j)}_{j=1}^{μ_N}, where Ξ(j) = {ξ_i : i ∈ H_j}, such that the sequence is independent of X_1^N and each block has the same distribution as the block X(j) from the original sequence. Because the process is stationary, the blocks Ξ(j) are not only independent but also identically distributed. The basic idea in the construction of the independent block sequence is that it is 'close', in a well-defined sense, to the original blocked sequence X^{a_N}. Moreover, by appropriately selecting the number of blocks, μ_N, depending on the mixing nature of the sequence, one may relate properties of the original sequence X_1^N to those of the independent block sequence Ξ^{a_N} (see Lemma 4.1 in [10]). Let F be a class of bounded functions, such that 0 ≤ f ≤ B for any f ∈ F. 
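The alternating-block construction can be sketched as follows, using 0-based indices and assuming N = 2μ_N a_N exactly (helper name is ours, not the paper's):

```python
def block_indices(N, a_N):
    # Split indices 0..N-1 into mu_N pairs of alternating blocks of
    # length a_N: H[j] are the blocks whose independent copies form
    # the surrogate sequence, T[j] the interleaved separating blocks.
    mu_N = N // (2 * a_N)
    H = [list(range(2 * j * a_N, (2 * j + 1) * a_N)) for j in range(mu_N)]
    T = [list(range((2 * j + 1) * a_N, (2 * j + 2) * a_N)) for j in range(mu_N)]
    return H, T
```

The T-blocks act as buffers: because they are a_N steps wide, the H-blocks are nearly independent for a β-mixing process, which is what allows the comparison with a genuinely independent block sequence.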
In order to relate the uniform deviations (with respect to F) of the original sequence X_1^N to those of the independent-block sequence Ξ^{a_N}, use is made of Lemma 4.1 from [10]. We also utilize Lemma 4.2 from [10] and modify it so that it holds for finite sample size. Consider the block-independent sequence Ξ^{a_N} and define Ẽ_{μ_N} f̃ = μ_N^{-1} Σ_{j=1}^{μ_N} f̃(Ξ(j)), where f̃(Ξ(j)) = Σ_{i ∈ H_j} f(ξ_i), j = 1, 2, ..., μ_N, is a sequence of independent random variables such that |f̃| ≤ a_N B. In the remainder of the paper we use variables with a tilde above them to denote quantities related to the transformed block sequence. Finally, we use the symbol E_N to denote the empirical average with respect to the original sequence, namely E_N f = (N-d)^{-1} Σ_{i=d+1}^N f(X_i). The following result can be proved by a simple extension of Lemma 4.2 in [10]. \n\nLemma 1.1 Suppose F is a permissible class of bounded functions, |f| ≤ B for f ∈ F. Then \n\nP {sup_{f ∈ F} |E_N f - Ef| > ε} ≤ 2P {sup_{f ∈ F} |Ẽ_{μ_N} f̃ - Ẽf̃| > a_N ε} + 2μ_N β_{a_N}.   (2) \n\nThe main merit of Lemma 1.1 is in the transformation of the problem from the domain of dependent processes, implicit in the quantity |E_N f - Ef|, to one characterized by independent processes, implicit in the term Ẽ_{μ_N} f̃ - Ẽf̃ corresponding to the independent blocks. The price paid for this transformation is the extra term 2μ_N β_{a_N} which appears on the r.h.s. of the inequality appearing in Lemma 1.1. \n\n2 Error Bounds \n\nThe development in Section 1 was concerned with a scalar stochastic process X. In order to use the results in the context of time series, we first define a new vector-valued process X' = {..., X'_{-1}, X'_0, X'_1, ...} where X'_i = (X_i, X_{i-1}, ..., X_{i-d}) ∈ R^{d+1}. For this sequence the β-mixing coefficients obey the inequality β_m(X') ≤ β_{m-d}(X). 
Let F be a space of functions mapping R^d → R, and for each f ∈ F let the loss function be given by ℓ_f(X_{i-d}^i) = |X_i - f(X_{i-d}^{i-1})|^2. The loss space is given by L_F = {ℓ_f : f ∈ F}. It is well known in the theory of empirical processes (see [7] for example), that in order to obtain upper bounds on uniform deviations of i.i.d. sequences, use must be made of the so-called covering number of the function class F, with respect to the empirical l_{1,N} norm, given by l_{1,N}(f, g) = N^{-1} Σ_{i=1}^N |f(X_i) - g(X_i)|. Similarly, we denote the empirical norm with respect to the independent block sequence by l̃_{1,μ_N}, where l̃_{1,μ_N}(f̃, g̃) = μ_N^{-1} Σ_{j=1}^{μ_N} |f̃(X(j)) - g̃(X(j))|, and where f̃(X(j)) = Σ_{i ∈ H_j} f(X_i) and similarly for g̃. Following common practice we denote the ε-covering number of the functional space F using the metric ρ by N(ε, F, ρ). \n\nDefinition 1 Let L_F be a class of real-valued functions from R^D → R, D = d + 1. For each ℓ_f ∈ L_F and x = (x_1, x_2, ..., x_{a_N}), x_i ∈ R^D, let ℓ̃_f(x) = Σ_{i=1}^{a_N} ℓ_f(x_i). Then define L̃_F = {ℓ̃_f : ℓ_f ∈ L_F}, where ℓ̃_f : R^{a_N D} → R_+. \n\nIn order to obtain results in terms of the covering numbers of the space L_F rather than L̃_F, which corresponds to the transformed sequence, we need the following lemma, which is not hard to prove. \n\nLemma 2.1 For any ε > 0 \n\nN(ε, L̃_F, l̃_{1,μ_N}) ≤ N(ε/a_N, L_F, l_{1,N}). \n\nPROOF The result follows by a sequence of simple inequalities, showing that l̃_{1,μ_N}(f̃, g̃) ≤ a_N l_{1,N}(f, g). \n\nWe now present the main result of this section, namely an upper bound for the uniform deviations of mixing processes, which in turn yields upper bounds on the error incurred by the empirically optimal predictor f_{d,n,N}. \n\nTheorem 2.1 Let X = {..., X_{-1}, X_0, X_1, ...} be a bounded stationary β-mixing stochastic process, with |X_i| ≤ B, and let F be a class of bounded functions, f : R^d → [0, B]. 
\nFor each sample size N, let f̂_N be the function in F which minimizes the empirical error, and f* be the function in F minimizing the true error L(f). Then, \n\nwhere ε' = ε/128B. \n\nPROOF The theorem is established by making use of Lemma 1.1, and the basic results from the theory of uniform convergence for i.i.d. processes, together with Lemma 2.1 relating the covering numbers of the spaces L̃_F and L_F. The covering numbers of L_F and F are easily related using N(ε, L_F, L_1(P)) ≤ N(ε/2B, F, L_1(P)). \n\nUp to this point we have not specified μ_N and a_N, and the result is therefore quite general. In order to obtain weak consistency we require that the r.h.s. of (3) converge to zero for each ε > 0. This immediately yields the following conditions on μ_N (and thus also on a_N through the condition 2 a_N μ_N = N). \n\nCorollary 2.1 Under the conditions of Theorem 2.1, and the added requirements that d = o(a_N) and N(ε, F, l_{1,N}) < ∞, the following choices of μ_N are sufficient to guarantee the weak consistency of the empirical predictor f̂_N: \n\nμ_N ~ N^{κ/(1+κ)}   (exponential mixing),   (4) \n\nμ_N ~ N^{s/(1+s)}, 0 < s < r   (algebraic mixing),   (5) \n\nwhere the notation a_N ~ b_N implies that Ω(b_N) ≤ a_N ≤ O(b_N). \n\nPROOF Consider first the case of exponential mixing. In this case the r.h.s. of (3) clearly converges to zero because of the finiteness of the covering number. The fastest rate of convergence is achieved by balancing the two terms in the equation, leading to the choice μ_N ~ N^{κ/(1+κ)}. In the case of algebraic mixing, the second term on the r.h.s. of (3) is of the order O(μ_N a_N^{-r}), where we have used d = o(a_N). Since μ_N a_N ~ N, a sufficient condition to guarantee that this term converges to zero is that μ_N ~ N^{s/(1+s)}, 0 < s < r, as was claimed. 
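Suppressing constants, the growth rates of Corollary 2.1 give a concrete recipe for the number of blocks μ_N and the block length a_N; in the sketch below (names and interface are ours), `exponent` stands for κ in the exponential case and for some s with 0 < s < r in the algebraic case:

```python
def choose_mu(N, kind, exponent):
    # mu_N ~ N^{kappa/(1+kappa)} for exponential mixing,
    # mu_N ~ N^{s/(1+s)}, 0 < s < r, for algebraic mixing.
    if kind not in ('exponential', 'algebraic'):
        raise ValueError('unknown mixing kind')
    mu = max(1, round(N ** (exponent / (1.0 + exponent))))
    a = N // (2 * mu)  # block length from N = 2 mu_N a_N, up to rounding
    return mu, a
```

Faster mixing (larger κ or s) permits more, shorter blocks, and hence a larger effective i.i.d. sample size μ_N.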
\nIn order to derive bounds on the expected error, we need to make an assumption concerning the covering number of the space F. In particular, we know from the work of Haussler [4] that the covering number is upper bounded as follows \n\nN(ε, F, L_1(P)) ≤ e(Pdim(F) + 1) (2eB/ε)^{Pdim(F)}, \n\nfor any measure P. Thus, assuming the finiteness of the pseudo-dimension of F guarantees a finite covering number. \n\n3 Structural Risk Minimization \n\nThe results in Section 2 provide error bounds for estimators formed by minimizing the empirical error over a fixed class of d-dimensional functions. It is clear that the complexity of the class of functions plays a crucial role in the procedure. If the class is too rich, manifested by very large covering numbers, clearly the estimation error term will be very large. On the other hand, biasing the class of functions by restricting its complexity leads to poor approximation rates. A well-known strategy for overcoming this dilemma is obtained by considering a hierarchy of functional classes with increasing complexity. For any given sample size, the optimal trade-off between estimation and approximation can then be determined by balancing the two terms. Such a procedure was developed in the late seventies by Vapnik [8], and termed by him structural risk minimization (SRM). Other more recent approaches, collectively termed complexity regularization, have been extensively studied in recent years (e.g. [1]). It should be borne in mind, however, that in the context of time series there is an added complexity that does not exist in the case of regression. Recall that the results derived in Section 2 assumed some fixed lag vector d. In general the optimal value of d is unknown, and could in fact be infinite. 
In order to achieve optimal performance in a nonparametric setting, it is crucial that the size of the lag be chosen adaptively as well. This added complexity needs to be incorporated into the SRM framework, if optimal performance in the face of unknown memory size is to be achieved. \n\nLet F_{d,n}, d, n ∈ N, be a sequence of function classes, and define F = ∪_{d=1}^∞ ∪_{n=1}^∞ F_{d,n}. For any F_{d,n} let N_1(ε, F_{d,n}) denote its covering number, which from [4] is upper bounded by Cε^{-Pdim(F_{d,n})}. We observe in passing that Lugosi and Nobel [5] have recently considered situations where the pseudo-dimension Pdim(F_{d,n}) is unknown, and the covering number is estimated empirically from the data. Although this line of thought is potentially very useful, we do not pursue it here, but rather assume that upper bounds on the pseudo-dimensions of F_{d,n} are known, as is the case for many classes of functions used in practice (see for example [9]). \n\nIn line with the standard approach in [8] we introduce a new empirical function, which takes into account both the empirical error as well as the complexity costs penalizing overly complex models (large complexity index n and lag size d). Let \n\nL̃_{d,n,N}(f) = L_N(f) + Δ_n(d, n, N) + Δ_d(d, N),   (6) \n\nwhere L_N(f) is the empirical error of the predictor f and the complexity penalties Δ are given by \n\nΔ_n(d, n, N) = sqrt{ (log N_1(ε, F_{d,n}) + c_n) / (μ_N / 64(2B)^4) },   (7) \n\nΔ_d(d, N) = sqrt{ c_d / (μ_N / 64(2B)^4) }.   (8) \n\nThe specific form and constants in these definitions are chosen with hindsight, so as to achieve the optimal rates of convergence in Theorem 3.1 below. The constants c_n and c_d are positive constants obeying Σ_{n=1}^∞ e^{-c_n} ≤ 1 and similarly for c_d. A possible choice is c_n = 2 log n + 1 and c_d = 2 log d + 1. The value of μ_N can be chosen in accordance with Corollary 2.1. \n\nLet f̂_{d,n,N} minimize the empirical error L_N(f) within the class of functions F_{d,n}. We assume that the classes F_{d,n} are compact, so that such a minimizer exists. 
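The penalized selection rule can be sketched as follows; the penalty below uses a square-root form of the ratio appearing in (7)-(8), with the log-covering numbers and constants supplied by the caller, and all names here are illustrative rather than the paper's:

```python
import math

def penalty(log_cover, c_dn, mu_N, B=1.0):
    # Complexity penalty of the form suggested by (7)-(8):
    # sqrt((log N_1 + c) / (mu_N / (64 * (2B)^4))).
    return math.sqrt((log_cover + c_dn) / (mu_N / (64.0 * (2 * B) ** 4)))

def srm_select(emp_loss, pen, candidates):
    # Pick the (d, n) pair minimizing empirical error plus penalty,
    # i.e. the structural risk minimization criterion of (6).
    return min(candidates, key=lambda k: emp_loss[k] + pen[k])
```

A rich class (large log-covering number) must buy its lower empirical error at the price of a larger penalty, which is how the criterion balances estimation against approximation.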
Further, let f̂_N be the function in F minimizing the complexity penalized loss (6), namely \n\nL̃_{d,n,N}(f̂_N) = min_{d ≥ 1} min_{n ≥ 1} L̃_{d,n,N}(f̂_{d,n,N}).   (9) \n\nThe following basic result establishes the consistency of the structural risk minimization approach, and yields upper bounds on its performance. \n\nTheorem 3.1 Let F_{d,n}, d, n ∈ N, be a sequence of functional classes, where f ∈ F_{d,n} is a mapping from R^d to R. The expected loss of the function f̂_N, selected according to the SRM principle, is upper bounded by \n\nEL(f̂_N) ≤ min_{d,n} { inf_{f ∈ F_{d,n}} L(f) + c_1 ... \n\nThe main merit of Theorem 3.1 is the demonstration that the SRM procedure achieves an optimal balance between approximation and estimation, while retaining its nonparametric attributes. In particular, if the optimal lag-d predictor f*_d belongs to F_{d,n_0} for some n_0, the SRM predictor would converge to it at the same rate as if n_0 were known in advance. The same type of adaptivity is obtained with respect to the lag size d. The nonparametric rates of convergence of the SRM predictor will be discussed in the full paper. \n\nReferences \n\n[1] A. Barron. Complexity Regularization with Application to Artificial Neural Networks. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 561-576. Kluwer Academic Press, 1991. \n\n[2] L. Gyorfi, W. Hardle, P. Sarda, and P. Vieu. Nonparametric Curve Estimation from Time Series. Springer Verlag, New York, 1989. \n\n[3] D. Haussler. Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Information and Computation, 100:78-150, 1992. \n\n[4] D. Haussler. Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension. J. Combinatorial Theory, Series A, 69:217-232, 1995. \n\n[5] G. Lugosi and A. Nobel. Adaptive Model Selection Using Empirical Complexities. 
\nSubmitted to Annals of Statistics, 1996. \n\n[6] D. Modha and E. Masry. Memory Universal Prediction of Stationary Random Processes. IEEE Trans. Inform. Theory, January 1998. \n\n[7] D. Pollard. Convergence of Empirical Processes. Springer Verlag, New York, 1984. \n\n[8] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1992. \n\n[9] M. Vidyasagar. A Theory of Learning and Generalization. Springer Verlag, New York, 1996. \n\n[10] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, 22:94-116, 1994. ", "award": [], "sourceid": 1475, "authors": [{"given_name": "Ron", "family_name": "Meir", "institution": null}]}