Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)

Structural Risk Minimization for Nonparametric Time Series Prediction

Ron Meir
The problem of time series prediction is studied within the uniform convergence framework of Vapnik and Chervonenkis. The dependence inherent in the temporal structure is incorporated into the analysis, thereby generalizing the available theory for memoryless processes. Finite sample bounds are calculated in terms of covering numbers of the approximating class, and the tradeoff between approximation and estimation is discussed. A complexity regularization approach is outlined, based on Vapnik's method of Structural Risk Minimization, and shown to be applicable in the context of mixing stochastic processes.
1 Time Series Prediction and Mixing Processes
A great deal of effort has been expended in recent years on the problem of deriving robust distribution-free error bounds for learning, mainly in the context of memoryless processes (e.g. [9]). On the other hand, an extensive amount of work has been devoted by statisticians and econometricians to the study of parametric (often linear) models of time series, where the dependence inherent in the sample precludes straightforward application of many of the standard results from the theory of memoryless processes. In this work we propose an extension of the framework pioneered by Vapnik and Chervonenkis to the problem of time series prediction. Some of the more elementary proofs are sketched, while the main technical results will be proved in detail in the full version of the paper.

Consider a stationary stochastic process $X = \{\ldots, X_{-1}, X_0, X_1, \ldots\}$, where $X_i$ is a random variable defined over a compact domain in $\mathbf{R}$ and such that $|X_i| \le B$ with probability 1, for some positive constant $B$. The problem of one-step prediction, in the mean square sense, can then be phrased as that of finding a function $f(\cdot)$ of the infinite past, such that $E|X_0 - f(X_{-\infty}^{-1})|^2$ is minimal, where we use the notation $X_i^j = (X_i, X_{i+1}, \ldots, X_j)$, $j \ge i$.
* This work was supported in part by a grant from the Israel Science Foundation.
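As a concrete illustration of this setup (not part of the original paper), the following Python sketch simulates a bounded stationary process and estimates the mean-square one-step prediction error $E|X_0 - f(\text{past})|^2$ of a simple candidate predictor by time-averaging; the clipped AR(1) process, the bound $B$, and the predictor `f` below are our own illustrative assumptions.

```python
import numpy as np

# Minimal sketch: simulate a bounded stationary process and estimate the
# mean-square one-step prediction error of a candidate predictor f.
rng = np.random.default_rng(0)
B = 1.0        # assumed bound |X_i| <= B, as in the text
N = 10_000

# Example process (our assumption): an AR(1) clipped to [-B, B] so that
# boundedness holds with probability 1.
x = np.zeros(N)
for t in range(1, N):
    x[t] = np.clip(0.6 * x[t - 1] + 0.3 * rng.standard_normal(), -B, B)

def f(past):
    """A simple candidate predictor using only the most recent value."""
    return 0.6 * past[-1]

# Time-average of the squared prediction error as a stand-in for the
# expectation (reasonable for stationary ergodic processes).
errors = [(x[t] - f(x[:t])) ** 2 for t in range(1, N)]
print("estimated mean-square prediction error:", np.mean(errors))
```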
It is well known that the optimal predictor in this case is given by the conditional mean, $E[X_0 \mid X_{-\infty}^{-1}]$. While this solution, in principle, settles the issue of optimal prediction, it does not settle the issue of actually computing the optimal predictor. First of all, note that to compute the conditional mean, the probabilistic law generating the stochastic process $X$ must be known. Furthermore, the requirement of knowing the full past, $X_{-\infty}^{-1}$, is of course rather stringent. In this work we consider the more practical situation, where a finite sub-sequence $X_1^N = (X_1, X_2, \ldots, X_N)$ is observed, and an optimal prediction is needed, conditioned on this data. Moreover, for each finite sample size $N$ we allow the predictors to be based only on a finite lag vector of size $d$. Ultimately, in order to achieve full generality one may let $d \to \infty$ as $N \to \infty$ in order to obtain the optimal predictor.

We first consider the problem of selecting an empirical estimator from a class of functions $\mathcal{F}_{d,n}: \mathbf{R}^d \to \mathbf{R}$, where $n$ is a complexity index of the class (for example, the number of computational nodes in a feedforward neural network with a single hidden layer), and $|f| \le B$ for $f \in \mathcal{F}_{d,n}$. Consider then an empirical predictor $\hat f_{d,n,N}(X_{i-d}^{i-1})$, $i > N$, for $X_i$, based on the finite data set $X_1^N$ and depending on the $d$-dimensional lag vector $X_{i-d}^{i-1}$, where $\hat f_{d,n,N} \in \mathcal{F}_{d,n}$. It is possible to split the error incurred by this predictor into three terms, each possessing a rather intuitive meaning. It is the competition between these terms which determines the optimal solution, for a fixed amount of data. First, define the loss of a functional predictor $f: \mathbf{R}^d \to \mathbf{R}$ as $L(f) = E|X_i - f(X_{i-d}^{i-1})|^2$, and let $f_{d,n}^*$ be the optimal function in $\mathcal{F}_{d,n}$ minimizing this loss. Furthermore, denote the optimal lag $d$ predictor by $f_d^*$, and its associated loss by $L_d^*$. We are then able to split the loss of the empirical predictor $\hat f_{d,n,N}$ into three basic components,
$$L(\hat f_{d,n,N}) = \big(\hat L_{d,n,N} - L^*_{d,n}\big) + \big(L^*_{d,n} - L^*_d\big) + L^*_d, \qquad (1)$$
where $\hat L_{d,n,N} = L(\hat f_{d,n,N})$. The third term, $L^*_d$, is related to the error incurred in using a finite memory model (of lag size $d$) to predict a process with potentially infinite memory. We do not at present have any useful upper bounds for this term, which is related to the rate of convergence in the martingale convergence theorem, which to the best of our knowledge is unknown for the type of mixing processes we study in this work. The second term in (1) is related to the so-called approximation error, $E|f_d^*(X_{i-d}^{i-1}) - f_{d,n}^*(X_{i-d}^{i-1})|^2$, to which it can be immediately related through the inequality $\big||a|^p - |b|^p\big| \le p\,|a - b|\,|\max(a, b)|^{p-1}$. This term measures the excess error incurred by selecting a function $f$ from a class of limited complexity $\mathcal{F}_{d,n}$, while the optimal lag $d$ predictor $f_d^*$ may be arbitrarily complex. Of course, in order to bound this term we will have to make some regularity assumptions about the latter function. Finally, the first term in (1) represents the so-called estimation error, and is the only term which depends on the data $X_1^N$.

Similarly to the problem of regression for i.i.d. data, we expect that the approximation and estimation terms lead to conflicting demands on the choice of the complexity, $n$, of the functional class $\mathcal{F}_{d,n}$. Clearly, in order to minimize the approximation error the complexity should be made as large as possible. However, doing this will cause the estimation error to increase, because of the larger freedom in choosing a specific function in $\mathcal{F}_{d,n}$ to fit the data. Moreover, in the case of time series there is an additional complication resulting from the fact that the misspecification error $L^*_d$ is minimized by choosing $d$ to be as large as possible, while this has the effect of increasing both the approximation as well as the estimation errors. We thus expect that some optimal values of $d$ and $n$ exist for each sample size $N$.

Up to this point we have not specified how to select the empirical estimator $\hat f_{d,n,N}$. In this work we follow the ideas of Vapnik [8], which have been studied extensively in the context of i.i.d. observations, and restrict our selection to that hypothesis which minimizes the empirical error, given by $\hat L_N(f) = \frac{1}{N-d} \sum_{i=d+1}^{N} |X_i - f(X_{i-d}^{i-1})|^2$. For this function it is easy to establish (see for example [8]) that $(\hat L_{d,n,N} - L^*_{d,n}) \le 2 \sup_{f \in \mathcal{F}_{d,n}} |L(f) - \hat L_N(f)|$. The main distinction from the i.i.d. case, of course, is that the random variables appearing in the empirical error are no longer independent.
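As a rough illustration of the empirical error $\hat L_N(f)$ and the empirical-risk-minimization selection rule just described, the following Python sketch computes $\hat L_N(f)$ from lag-$d$ vectors and selects the minimizer over a small toy candidate class; the linear candidates and the data-generating sequence are our own illustrative assumptions, standing in for a richer class $\mathcal{F}_{d,n}$ such as a single-hidden-layer network.

```python
import numpy as np

# Sketch of the empirical error
#   L_N(f) = 1/(N-d) * sum_{i=d+1}^{N} |X_i - f(X_{i-d}, ..., X_{i-1})|^2
# and empirical risk minimization over a small candidate class.

def empirical_loss(f, x, d):
    """Average squared one-step prediction error of f over lag-d vectors."""
    n = len(x)
    errs = [(x[i] - f(x[i - d:i])) ** 2 for i in range(d, n)]  # N - d terms
    return float(np.mean(errs))

def erm(candidates, x, d):
    """Return the candidate predictor minimizing the empirical loss."""
    return min(candidates, key=lambda f: empirical_loss(f, x, d))

# Toy candidate class (our assumption): a handful of fixed linear lag-d
# predictors, standing in for a richer parametric class F_{d,n}.
d = 3
weight_sets = [np.array([0.0, 0.0, 0.9]),
               np.array([0.0, 0.3, 0.6]),
               np.array([0.2, 0.3, 0.4])]
candidates = [(lambda lag, w=w: float(np.dot(w, lag))) for w in weight_sets]

# Toy bounded data stream, purely for illustration.
rng = np.random.default_rng(1)
x = np.clip(np.cumsum(rng.standard_normal(500)) * 0.05, -1.0, 1.0)

f_hat = erm(candidates, x, d)
print("empirical loss of selected predictor:", empirical_loss(f_hat, x, d))
```

The quantity bounded in the last display, $\sup_{f \in \mathcal{F}_{d,n}} |L(f) - \hat L_N(f)|$, is exactly the gap between the population loss and the empirical average computed by `empirical_loss`, taken uniformly over the class.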