{"title": "Random Projection Filter Bank for Time Series Data", "book": "Advances in Neural Information Processing Systems", "page_first": 6562, "page_last": 6572, "abstract": "We propose Random Projection Filter Bank (RPFB) as a generic and simple approach to extract features from time series data. RPFB is a set of randomly generated stable autoregressive filters that are convolved with the input time series to generate the features. These features can be used by any conventional machine learning algorithm for solving tasks such as time series prediction, classification with time series data, etc. Different filters in RPFB extract different aspects of the time series, and together they provide a reasonably good summary of the time series. RPFB is easy to implement, fast to compute, and parallelizable. We provide an error upper bound indicating that RPFB provides a reasonable approximation to a class of dynamical systems. The empirical results in a series of synthetic and real-world problems show that RPFB is an effective method to extract features from time series.", "full_text": "Random Projection Filter Bank for Time Series Data

Amir-massoud Farahmand
Mitsubishi Electric Research Laboratories (MERL)
Cambridge, MA, USA
farahmand@merl.com

Sepideh Pourazarm
Mitsubishi Electric Research Laboratories (MERL)
Cambridge, MA, USA
sepid@bu.edu

Daniel Nikovski
Mitsubishi Electric Research Laboratories (MERL)
Cambridge, MA, USA
nikovski@merl.com

Abstract

We propose Random Projection Filter Bank (RPFB) as a generic and simple approach to extract features from time series data. RPFB is a set of randomly generated stable autoregressive filters that are convolved with the input time series to generate the features. These features can be used by any conventional machine learning algorithm for solving tasks such as time series prediction, classification with time series data, etc.
Different filters in RPFB extract different aspects of the time series, and together they provide a reasonably good summary of the time series. RPFB is easy to implement, fast to compute, and parallelizable. We provide an error upper bound indicating that RPFB provides a reasonable approximation to a class of dynamical systems. The empirical results in a series of synthetic and real-world problems show that RPFB is an effective method to extract features from time series.

1 Introduction

This paper introduces Random Projection Filter Bank (RPFB) for feature extraction from time series data. RPFB generates a feature vector that summarizes the input time series by projecting the time series onto the span of a set of randomly generated dynamical filters. The output of RPFB can then be used as the input to any conventional estimator (e.g., ridge regression, SVM, and Random Forest [Hastie et al., 2001; Bishop, 2006; Wasserman, 2007]) to solve problems such as time series prediction, and classification and fault prediction with time series input data. RPFB is easy to implement, is fast to compute, and can be parallelized easily.

RPFB consists of a set of randomly generated filters (i.e., dynamical systems that receive inputs), which are convolved with the input time series. The filters are stable autoregressive (AR) filters, so they can capture information from the distant past of the time series. This is in contrast with the more conventional approach of considering only a fixed window of the past time steps, which may not capture all relevant information.
RPFB is inspired by random projection methods [Vempala, 2004; Baraniuk and Wakin, 2009], which reduce the input dimension while preserving important properties of the data, e.g., being an approximate isometric map. It is also closely related to Random Kitchen Sink [Rahimi and Recht, 2009] for approximating a potentially infinite-dimensional reproducing kernel Hilbert space (RKHS) with a finite set of randomly selected features. RPFB can be thought of as the dynamical system (or filter) extension of these methods. RPFB is also related to the methods in the Reservoir Computing literature [Lukoševičius and Jaeger, 2009], such as the Echo State Network and the Liquid State Machine, in which a recurrent neural network (RNN) with random weights provides a feature vector to a trainable output layer. The difference between RPFB and these methods is that we do not use an RNN as the underlying excitable dynamical system, but a set of AR filters.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The algorithmic contribution of this work is the introduction of RPFB as a generic and simple-to-use feature extraction method for time series data (Section 3). RPFB is a particularly suitable choice for industrial applications where the available computational power is limited, e.g., a fault prognosis system for an elevator that has only a micro-controller available. For these industrial applications, the use of powerful methods such as various adaptable RNN architectures [Hochreiter and Schmidhuber, 1997; Cho et al., 2014; Oliva et al., 2017; Goodfellow et al., 2016], which learn the feature extractor itself, might be computationally infeasible.

The theoretical contribution of this work is the finite sample analysis of RPFB for the task of time series prediction (Section 4). The theory has two main components.
The first is a filter approximation error result, which provides an error guarantee on how well one might approximate a certain class of dynamical systems with a set of randomly generated filters. The second component is a statistical result providing a finite-sample guarantee for time series prediction with a generic class of linear systems. Combining these two, we obtain a finite-sample guarantee for the use of RPFB for time series prediction of a certain class of dynamical systems.

Finally, we empirically study RPFB alongside several standard estimators on a range of synthetic and real-world datasets (Section 5). Our synthetic data is based on Autoregressive Moving Average (ARMA) processes. This lets us closely study various aspects of the method. Moving to real-world problems, we apply RPFB to the fault diagnosis problem from ball bearing vibration measurements. We compare the performance of RPFB with that of the fixed-window history-based approach, as well as LSTM, and we obtain promising empirical results. Due to space limitations, most of the development of the theory and experimental results are reported in the supplementary material, which is an extended version of this paper. For more empirical studies, especially in the context of fault detection and prognosis, refer to Pourazarm et al. [2017].

2 Learning from Time Series Data

Consider a sequence $(X_1, Y_1), \dots, (X_T, Y_T)$ of dependent random variables with $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$. Depending on how $X_t$ and $Y_t$ are defined, we can describe different learning/decision-making problems. For example, suppose that $Y_t = f^*(X_t) + \varepsilon_t$, in which $f^*$ is an unknown function of the current value of $X_t$, and $\varepsilon_t$ is independent of the history $X_{1:t} = (X_1, \dots, X_t)$ and has zero expectation, i.e., $\mathbb{E}[\varepsilon_t] = 0$.
Finding an estimate $\hat{f}$ of $f^*$ using data is the standard regression (or classification) problem, depending on whether $\mathcal{Y} \subset \mathbb{R}$ (regression) or $\mathcal{Y} = \{0, 1, \dots, c-1\}$ (classification). For example, suppose that we are given a dataset of $m$ time series $\mathcal{D}_m = \{(X_{i,1}, Y_{i,1}), \dots, (X_{i,T_i}, Y_{i,T_i})\}_{i=1}^m$, each of which might have a varying length $T_i$. There are many methods to define an estimator for $f^*$, e.g., K-Nearest Neighbourhood, decision tree, SVM, and various neural networks [Hastie et al., 2001; Bishop, 2006; Wasserman, 2007; Goodfellow et al., 2016]. An important class of estimators is based on (regularized) empirical risk minimization (ERM):

$$\hat{f} \leftarrow \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \frac{1}{T_i} \sum_{t=1}^{T_i} l(f(X_{i,t}), Y_{i,t}) + \lambda J(f). \tag{1}$$

Here $\mathcal{F}: \mathcal{X} \to \mathcal{Y}'$ is a function space (e.g., an RKHS with the domain $\mathcal{X}$; with $\mathcal{Y}' = \mathbb{R}$). The loss function is $l: \mathcal{Y}' \times \mathcal{Y} \to [0, \infty)$, and it determines the decision problem that is being solved, e.g., $l(y_1, y_2) = |y_1 - y_2|^2$ for the squared loss commonly used in regression. The optional regularizer (or penalizer) $J(f)$ controls the complexity of the function space, e.g., it can be $\|f\|_{\mathcal{F}}^2$ when $\mathcal{F}$ is an RKHS. The difference between this scenario and more conventional scenarios in supervised learning and statistics is that here the input data does not satisfy the usual independence assumption anymore. Learning with dependent input data has been analyzed before [Steinwart et al., 2009; Steinwart and Christmann, 2009; Mohri and Rostamizadeh, 2010; Farahmand and Szepesvári, 2012].

More generally, however, $Y_t$ is not a function of only $X_t$, but is a function of the history $X_{1:t}$, possibly contaminated by a (conditionally) independent noise: $Y_t = f^*(X_{1:t}) + \varepsilon_t$.
In the learning problem, $f^*$ is an unknown function. The special case of $f^*(X_{1:t}) = f^*(X_t)$ is the same as the previous setting.

To learn an estimator by directly using the history $X_{1:t}$ is challenging, as it is a time-varying vector with an ever-increasing dimension. A standard approach to deal with this issue is to use a fixed-window history-based estimator, which shall be explained next (cf. Kakade et al. [2017] for some recent theoretical results). RPFB is an alternative approach, which we describe in Section 3.

In the fixed-window history-based approach (or window-based, for short), we only look at a fixed window of the immediate past values of $X_{1:t}$. That is, we use samples in the form of $Z_t \triangleq X_{t-H+1:t}$ with a finite integer $H$ that determines the length of the window. For example, the regularized least-squares regression estimator would then be

$$\hat{f} \leftarrow \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \frac{1}{T_i - H} \sum_{t=H}^{T_i} |f(X_{i,t-H+1:t}) - Y_{i,t}|^2 + \lambda J(f), \tag{2}$$

which should be compared to (1).

A problem with this approach is that for some stochastic processes, a fixed-sized window of length $H$ is not enough to capture all information about the process. As a simple illustrative example, consider a simple moving average MA(1) univariate random process (i.e., $\mathcal{X} = \mathbb{R}$):

$$X_t = U(t) + bU(t-1) = (1 + bz^{-1})U_t, \qquad b \in (-1, 1),$$

in which $z^{-1}$ is the time-delay operator (cf. Z-transform, Oppenheim et al. 1999), i.e., $z^{-1}X_t = X_{t-1}$. Suppose that $U_t = U(t)$ ($t = 1, 2, \dots$) is an unobservable random process that drives $X_t$. For example, it might be an independent and identically distributed (i.i.d.) Gaussian noise, which we do not observe (so it is our latent variable).
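The information lost by truncating the history can be made concrete with a small simulation (our own illustration, not from the paper). For this MA(1) example, the optimal one-step predictor turns out to be $b\sum_{k\ge 0}(-b)^k X_{t-k}$ (derived next); truncating the sum to a window of length $H$ leaves the residual $b(-b)^H U_{t-H}$, so the excess mean squared error over the noise floor is $b^{2(H+1)}$ and shrinks only geometrically in $H$:

```python
import numpy as np

rng = np.random.default_rng(0)
b = 0.9                      # MA(1) coefficient; |b| near 1 makes the distant past matter
T = 20000
U = rng.normal(size=T + 1)   # the unobserved driving noise
X = U[1:] + b * U[:-1]       # X_t = U_t + b U_{t-1}

def window_mse(H):
    """MSE of the truncated predictor b * sum_{k<H} (-b)^k X_{t-k} for X_{t+1}."""
    coeffs = b * (-b) ** np.arange(H)
    preds = np.convolve(X, coeffs)[: len(X)]  # preds[t] uses only the last H observations
    return np.mean((X[1:] - preds[:-1]) ** 2)

short_mse, long_mse = window_mse(2), window_mse(50)
```

With $b = 0.9$, the window of length $H = 2$ pays roughly $0.9^6 \approx 0.53$ extra error variance, while $H = 50$ is indistinguishable from the noise floor of 1.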
To predict $Y_t = X_{t+1}$ given the previous observations $X_{1:t}$, we write $U_t = \frac{X_t}{1 + bz^{-1}}$, so

$$X_{t+1} = U_{t+1} + bU_t = U_{t+1} + \frac{b}{1 + bz^{-1}} X_t = U_{t+1} + b \sum_{k \ge 0} (-b)^k X_{t-k}. \tag{3}$$

This means that $X_t$ is an autoregressive process AR($\infty$). The prediction of $X_{t+1}$ requires the value of $U_{t+1}$, which is unavailable at time $t$, and all the past values $X_{1:t}$. Since $U_{t+1}$ is unavailable, we cannot use it in our estimate; this is the intrinsic difficulty of prediction. On the other hand, the values of $X_{1:t}$ are available to us, and we can use them to predict $X_{t+1}$. But if we use a fixed-horizon window of the past values (i.e., only use $X_{t-H+1:t}$ for a finite $H \ge 1$), we would miss some information that could potentially be used. This loss of information is more prominent when the magnitude of $b$ is close to 1. This example shows that even for a simple MA(1) process with unobserved latent variables, a fixed-horizon window is not a complete summary of the stochastic process.

More generally, suppose that we have a univariate linear ARMA process

$$A(z^{-1}) X_t = B(z^{-1}) U_t, \tag{4}$$

with $A$ and $B$ both being polynomials in $z^{-1}$.¹ The random process $U_t$ is not available to us, and we want to design a predictor (filter) for $X_{t+1}$ based on the observed values $X_{1:t}$. Suppose that $A$ and $B$ are of degree more than 1, so we can write $A(z^{-1}) = 1 + z^{-1} A'(z^{-1})$ and $B(z^{-1}) = 1 + z^{-1} B'(z^{-1})$.² Assuming that $A$ and $B$ are both invertible, we use (4) to get $U_t = B^{-1}(z^{-1}) A(z^{-1}) X_t$. Also we can write (4) as $(1 + z^{-1} A'(z^{-1})) X_{t+1} = (1 + z^{-1} B'(z^{-1})) U_{t+1} = U_{t+1} + B'(z^{-1}) U_t$.
Therefore, we have

$$X_{t+1} = U_{t+1} + \left[ \frac{B'(z^{-1}) A(z^{-1})}{B(z^{-1})} - A'(z^{-1}) \right] X_t = U_{t+1} + \frac{B'(z^{-1}) - A'(z^{-1})}{B(z^{-1})} X_t. \tag{5}$$

So if the unknown noise process $U_t$ has a zero mean (i.e., $\mathbb{E}[U_t \mid U_{1:t-1}] = 0$), the estimator

$$\hat{X}_{t+1}(X_{1:t}) = \frac{B'(z^{-1}) - A'(z^{-1})}{B(z^{-1})} X_t \tag{6}$$

is unbiased, i.e., $\hat{X}_{t+1}(X_{1:t}) = \mathbb{E}[X_{t+1} \mid X_{1:t}]$.

¹We assume that $A$ and $B$ both have roots within the unit circle, i.e., they are stable.
²The fact that both of these polynomials have a leading term of 1 does not matter in this argument.

If we knew the model of the dynamical system ($A$ and $B$), we could design the filter (6) to provide an unbiased prediction for the future values of $X_{t+1}$. If the learning problem is such that it requires us to know an estimate of the future observations of the dynamical system, this scheme would allow us to design such an estimator. The challenge here is that we often do not know $A$ and $B$ (or their analogues for other types of dynamical systems). Estimating $A$ and $B$ for a general dynamical system is a difficult task. The use of maximum likelihood-based approaches is prone to local minima since $U$ is not known, and one has to use EM-like algorithms, cf. White et al. [2015] and references therein. Here we suggest a simple alternative based on the idea of projecting the signal onto the span of randomly generated dynamical systems. This is RPFB, which we describe next.

3 Random Projection Filter Bank

The idea behind RPFB is to randomly generate many simple dynamical systems that can approximate dynamical systems such as the optimal filter in (6) with a high accuracy. Denote the linear filter in (6) as

$$\frac{B'(z^{-1}) - A'(z^{-1})}{B(z^{-1})} = \frac{p(z^{-1})}{q(z^{-1})},$$

for two polynomials $p$ and $q$, both in $z^{-1}$.
Suppose that $\deg(q) = \deg(B) = d_q$ and $\deg(A) = d_A$; then $\deg(p) = d_p = \max\{d_A - 1, d_q - 1\}$. Assume that $q$ has roots $z_1, \dots, z_{d_q} \in \mathbb{C}$ without any multiplicity. This means that

$$q(z^{-1}) = \prod_{i=1}^{d_q} (z^{-1} - z_i).$$

In complex analysis in general, and in control engineering and signal processing in particular, the roots of $q$ are known as the poles of the dynamical system and the roots of $p$ are its zeros. Any discrete-time linear time-invariant (LTI) dynamical system has such a frequency domain representation.³

We have two cases of either $d_p < d_q$ or $d_p \ge d_q$. We focus on the first case and describe RPFB and the intuition behind it. Afterwards we will discuss the second case.

Case 1: Suppose that $d_p < d_q$, which implies that $d_A - 1 < d_q$. We may write

$$\frac{p(z^{-1})}{q(z^{-1})} = \sum_{i=1}^{d_q} \frac{b_i}{1 - z_i z^{-1}}, \tag{7}$$

for some choice of $b_i$s. This means that we can write (5) as

$$X_{t+1} = U_{t+1} + \frac{B'(z^{-1}) - A'(z^{-1})}{B(z^{-1})} X_t = U_{t+1} + \sum_{i=1}^{d_q} \frac{b_i}{1 - z_i z^{-1}} X_t.$$

That is, if we knew the set of complex poles $\mathcal{Z}_p = \{z_1, \dots, z_{d_q}\}$ and their corresponding coefficients $\mathcal{B}_p = \{b_1, \dots, b_{d_q}\}$, we could provide an unbiased estimate of $X_{t+1}$ based on $X_{1:t}$. From now on, we assume that the underlying unknown system is a stable one, that is, $|z_i| \le 1$.

Random projection filter bank is based on randomly generating many simple stable dynamical systems, which is equivalent to generating many random poles within the unit circle. Since any stable LTI filter has a representation (7) (or a similar one in Case 2), we can approximate the true dynamical system as a linear combination of randomly generated poles (i.e., filters).
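The expansion (7) is the classical partial-fraction decomposition; for simple roots, the coefficients are given by $b_i = p(z_i)/\prod_{j \neq i}(z_i - z_j)$. A quick numerical sanity check of that identity, written in the variable $w = z^{-1}$ (our own illustration; the first-order terms of the form $1/(1 - z_i z^{-1})$ used in the paper follow after rewriting each term):

```python
import numpy as np

# Simple roots of q, and a numerator p with deg(p) < deg(q) = 3.
z = np.array([0.5, -0.3, 0.8 + 0.1j])     # the roots z_1, z_2, z_3
p = np.array([2.0, -1.0, 0.5])            # p(w) = 2 w^2 - w + 0.5 (highest power first)

# Partial-fraction coefficients b_i = p(z_i) / prod_{j != i} (z_i - z_j).
b = np.array([np.polyval(p, zi) / np.prod(zi - np.delete(z, i))
              for i, zi in enumerate(z)])

w0 = 0.123 + 0.456j                       # arbitrary evaluation point
lhs = np.polyval(p, w0) / np.prod(w0 - z) # p(w0) / q(w0)
rhs = np.sum(b / (w0 - z))                # sum_i b_i / (w0 - z_i)
```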
If the number of filters is large enough, the approximation will be accurate.

To be more precise, we cover the set $\{z \in \mathbb{C} : |z| \le 1\}$ with $N(\varepsilon)$ random points $\mathcal{N}_\varepsilon = \{Z'_1, \dots, Z'_{N(\varepsilon)}\}$ such that for any $z_i \in \mathcal{Z}_p$, there exists a $Z' \in \mathcal{N}_\varepsilon$ with $|z_i - Z'(z_i)| < \varepsilon$. Roughly speaking, we require $N(\varepsilon) = O(\varepsilon^{-2})$ random points to cover the unit circle with the resolution of $\varepsilon$. We then define the RPFB as the following set of AR filters, denoted by $\phi(z^{-1})$:⁴

$$\phi(z^{-1}): z^{-1} \mapsto \left( \frac{1}{1 - Z'_1 z^{-1}}, \dots, \frac{1}{1 - Z'_{N(\varepsilon)} z^{-1}} \right). \tag{8}$$

³For continuous-time systems, we may use the Laplace transform instead of the Z-transform, and have similar representations.

With a slight abuse of notation, we use $\phi(X)$ to refer to the (multivariate) time series generated after passing a signal $X = (X_1, \dots, X_t)$ through the set of filters $\phi(z^{-1})$. More concretely, this means that we convolve the signal $X$ with the impulse response of each of the filters $\frac{1}{1 - Z'_i z^{-1}}$ ($i = 1, \dots, N(\varepsilon)$). Recall that the impulse response of $\frac{1}{1 - a z^{-1}}$ is the sequence $(a^t)_{t \ge 0}$, and the convolution $X * Y$ between two sequences $(X_t)_{t \ge 0}$ and $(Y_t)_{t \ge 0}$ is a new sequence

$$(X * Y)_t = \sum_{\tau} X_\tau Y_{t-\tau}. \tag{9}$$

We use $[\phi(X)]_i \in \mathbb{C}^{N(\varepsilon)}$ to refer to the $i$-th time-step of the multivariate signal $\phi(X_{1:i})$.

The intuition of why this is a good construction is that whenever $|z_1 - z_2|$ is small, the behaviour of the filter $\frac{1}{1 - z_1 z^{-1}}$ is similar to that of the filter $\frac{1}{1 - z_2 z^{-1}}$.
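In code, the filter bank (8) never needs an explicit convolution: each filter $1/(1 - Z' z^{-1})$ is the one-step recursion $s_t = x_t + Z' s_{t-1}$, whose impulse response is exactly $(Z'^t)_{t \ge 0}$. A minimal sketch (our own; the name `rpfb_features` and the default values are illustrative):

```python
import numpy as np

def rpfb_features(x, n_filters=16, eps0=0.1, seed=0):
    """Project a univariate series onto randomly generated stable AR(1) filters.

    Each filter 1 / (1 - Z' z^{-1}) with |Z'| <= 1 - eps0 is run as the
    recursion s_t = x_t + Z' * s_{t-1}, i.e., the convolution of x with the
    impulse response (Z'^t)_{t>=0}. Row t of the result is the feature
    vector [phi(X)]_t.
    """
    rng = np.random.default_rng(seed)
    # Poles drawn uniformly in the complex disk of radius 1 - eps0.
    radius = (1.0 - eps0) * np.sqrt(rng.uniform(size=n_filters))
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n_filters)
    poles = radius * np.exp(1j * angle)

    feats = np.zeros((len(x), n_filters), dtype=complex)
    state = np.zeros(n_filters, dtype=complex)
    for t, xt in enumerate(x):
        state = xt + poles * state   # one step of every AR(1) filter at once
        feats[t] = state
    return feats, poles
```

In practice one would pair each complex pole with its conjugate (Remark 1) so that the features stay real valued.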
So whenever N\u03b5 provides a good coverage of the unit circle,\nthere exists a sequence (b(cid:48)\n\n1\u2212z1z\u22121 is similar to\n\nj) such that the dynamical system\n\n1\n\n1\n\n= \u03c6(z\u22121)b(cid:48) =\n\np(cid:48)(z\u22121)\nq(cid:48)(z\u22121)\nq (7). As this is a linear model, parameters b(cid:48) can be estimated using\nbehaves similar to the unknown p\nordinary least-squares regression, ridge regression, Lasso, etc. For example, the ridge regression\nm(cid:88)\nestimator for b(cid:48) is\n\nb(cid:48)\n1 \u2212 Z(cid:48)\njz\u22121\n\nTi(cid:88)\n\nj=1\n\nj\n\n|[\u03c6(Xi)]tb \u2212 Xi,t+1|2 + \u03bb(cid:107)b(cid:107)2\n2 .\n\n\u02c6b \u2190 argmin\n\n1\nm\n\n1\nTi\n\ni=1\n\nt=1\n\nN (\u03b5)(cid:88)\n\nAfter obtaining \u02c6b, we de\ufb01ne\n\nb\n\n\u02c6bj\n1 \u2212 Z(cid:48)\njz\u22121 X1:t,\nwhich is an estimator of \u02c6X(X1:t) (6), i.e., \u02c6X(X1:t) \u2248 \u02dcX(X1:t; \u02c6b).\nCase 2: Suppose that dp \u2265 dq, which implies that dA \u2212 1 \u2265 dq. Then, we may write\n\n\u02dcX(X1:t; \u02c6b) =\n\nj=1\n\np(z\u22121)\nq(z\u22121)\n\n= R(z\u22121) +\n\n\u03c1(z\u22121)\nq(z\u22121)\n\n,\n\nwhere \u03c1 and R are obtained by the Euclidean division of p by q, i.e., p(z\u22121) = R(z\u22121)q(z\u22121) +\n\u03c1(z\u22121) and deg(R) \u2264 dA \u2212 1 \u2212 dq and deg(\u03c1) < dq. We can write:\n\ndA\u22121\u2212dq(cid:88)\n\np(z\u22121)\nq(z\u22121)\n\n=\n\ndq(cid:88)\n\n\u03bdjz\u2212j +\n\nbi\n\n1 \u2212 ziz\u22121 .\n\n(10)\n\nj=0\n\ni=1\n\nThis is similar to (7) of Case 1, with the addition of lag terms. If we knew the set of complex poles\nand their corresponding coef\ufb01cients as well as the coef\ufb01cients of the residual lag terms, \u03bdj, we could\nprovide an unbiased estimate of Xt+1 based on X1:t. Since we do not know the location of poles, we\nrandomly generate them as before. 
For this case, the feature set (8) should be expanded to

$$\phi(z^{-1}): z^{-1} \mapsto \left( \left[ 1, z^{-1}, \dots, z^{-(d_A - 1 - d_q)} \right], \frac{1}{1 - Z'_1 z^{-1}}, \dots, \frac{1}{1 - Z'_{N(\varepsilon)} z^{-1}} \right), \tag{11}$$

which consists of a history window of length $d_A - 1 - d_q$ and the random projection filters. The regressor should then estimate both the $b_i$s and the $\nu_j$s in (10).

⁴One could generate different types of filters, for example those with nonlinearities, but in this work we focus on linear AR filters to simplify the analysis.

RPFB is not limited to time series prediction with a linear combination of filtered signals. One may use the generated features as the input to any other estimator too. RPFB can also be used for other problems, such as classification with time series. Algorithm 1 shows how RPFB is used alongside a regularized empirical risk minimization algorithm.

Algorithm 1 Random Projection Filter Bank
// $\mathcal{D}_m = \{(X_{i,1}, Y_{i,1}), \dots, (X_{i,T_i}, Y_{i,T_i})\}_{i=1}^m$: Input data
// $l: \mathcal{Y}' \times \mathcal{Y} \to \mathbb{R}$: Loss function
// $\mathcal{F}$: Function space
// $n$: Number of filters in the random projection filter bank
Draw $Z'_1, \dots, Z'_n$ uniformly at random within the unit circle
Define the filters $\phi(z^{-1}) = \left( \frac{1}{1 - Z'_1 z^{-1}}, \dots, \frac{1}{1 - Z'_n z^{-1}} \right)$
for $i = 1$ to $m$ do
    Pass the $i$-th time series through all the random filters $\phi(z^{-1})$, i.e., $X'_{i,1:T_i} = \phi(z^{-1}) * X_{i,1:T_i}$
end for
Find the estimator using the extracted features $(X'_{i,1:T_i})$, e.g., by solving the regularized empirical risk minimization:

$$\hat{f} \leftarrow \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{m} \sum_{t=1}^{T_i} l(f(X'_{i,t}), Y_{i,t}) + \lambda J(f). \tag{12}$$

return $\hat{f}$ and $\phi$
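Algorithm 1 draws complex poles; as Remark 1 below notes, each complex pole $Z'$ is in practice paired with its conjugate so that the resulting second-order filter $\frac{1}{(1 - Z' z^{-1})(1 - \bar{Z}' z^{-1})}$ has a real-valued output. Expanding the denominator gives $1 - 2\,\mathrm{Re}(Z') z^{-1} + |Z'|^2 z^{-2}$, i.e., the real recursion $s_t = x_t + 2\,\mathrm{Re}(Z') s_{t-1} - |Z'|^2 s_{t-2}$. A small sketch (our own; the function name is illustrative):

```python
import numpy as np

def conjugate_pair_filter(x, pole):
    """Filter 1 / ((1 - Z' z^{-1})(1 - conj(Z') z^{-1})) as a real recursion.

    The denominator expands to 1 - 2 Re(Z') z^{-1} + |Z'|^2 z^{-2}, giving
    s_t = x_t + 2 Re(Z') s_{t-1} - |Z'|^2 s_{t-2}, which stays real for real input.
    """
    a1, a2 = 2.0 * pole.real, -abs(pole) ** 2
    s = np.zeros(len(x))
    for t in range(len(x)):
        s[t] = x[t] + a1 * (s[t - 1] if t >= 1 else 0.0) \
                    + a2 * (s[t - 2] if t >= 2 else 0.0)
    return s
```

The output agrees with cascading the two conjugate first-order filters, whose combined output has a vanishing imaginary part.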
The inputs to the algorithm are the time series data $\mathcal{D}_m$, with appropriate target values created depending on the problem, the pointwise loss function $l$, the function space $\mathcal{F}$ of the hypotheses (e.g., linear, RKHS, etc.), and the number of filters $n$ in the RPFB. The first step is to create the RPFB by randomly selecting $n$ stable AR filters. We then pass each time series in the dataset through the filter bank in order to create filtered features, i.e., the features are created by convolving the input time series with the filters' impulse responses. Finally, taking into account the problem type (regression or classification) and the function space, we apply conventional machine learning algorithms to estimate $\hat{f}$. Here we present a regularized empirical risk minimizer (12) as an example, but other choices are possible too, e.g., decision trees or K-NN. We note that the use of $\phi(z^{-1}) * X_{i,1:T_i}$ in the description of the algorithm should be interpreted as the convolution of the impulse response of $\phi(z^{-1})$, which is in the time domain, with the input signal.

Remark 1. In practice, whenever we pick a complex pole $Z' = a + jb$ with $j = \sqrt{-1}$, we also pick its complex conjugate $\bar{Z}' = a - jb$ in order to form a single filter $\frac{1}{(1 - Z' z^{-1})(1 - \bar{Z}' z^{-1})}$. This guarantees that the output of this second-order filter is real valued.

Remark 2. RPFB is described for a univariate time series $X_t \in \mathbb{R}$. To deal with multivariate time series (i.e., $X_t \in \mathbb{R}^d$ with $d > 1$), we may consider each dimension separately and pass each one through RPFB. The filters in RPFB can be the same or different for each dimension. The state of the filters, of course, depends on their input, so it would be different for each dimension.
If we have $n$ filters and a $d$-dimensional time series, the resulting vector $X'_{i,t}$ in Algorithm 1 would be $nd$-dimensional. Randomly choosing multivariate filters is another possibility, which is a topic of future research.

Remark 3. The Statistical Recurrent Unit (SRU), recently introduced by Oliva et al. [2017], has some similarities to RPFB. SRU uses a set of exponential moving averages at various time scales to summarize a time series, which are basically AR(1) filters with real-valued poles. SRU is more complex, and potentially more expressive, than RPFB, as it has several adjustable weights. On the other hand, it does not have the simplicity of RPFB. Moreover, it does not yet come with the same level of theoretical justification as RPFB has.

4 Theoretical Guarantees

This section provides a finite-sample statistical guarantee for a time series predictor that uses RPFB to extract features. We specifically focus on an empirical risk minimization-based (ERM) estimator. Note that Algorithm 1 is not restricted to the time series prediction problem, or to the use of an ERM-based estimator, but the result of this section is. We only briefly present the results, and refer the reader to the same section in the supplementary material for more detail, including the proofs and more discussion.

Consider the time series $(X_1, X_2, \dots)$ with $X_t \in \mathcal{X} \subset [-B, B]$ for some finite $B > 0$. We denote $\mathcal{X}^* = \cup_{t \ge 1} \mathcal{X}^t$.
The main object of interest in time series prediction is the conditional expectation of $X_{t+1}$ given $X_{1:t}$, which we denote by $h^*$, i.e.,⁵

$$h^*(X_{1:t}) = \mathbb{E}[X_{t+1} \mid X_{1:t}]. \tag{13}$$

We assume that $h^*$ belongs to the space of linear dynamical systems that have $M \in \mathbb{N}$ stable poles, all with magnitude less than $1 - \varepsilon_0$ for some $\varepsilon_0 > 0$, and a $\Lambda$-bounded $\ell_p$-norm on the weights:

$$\mathcal{H}_{\varepsilon_0, M, p, \Lambda} \triangleq \left\{ \sum_{i=1}^{M} \frac{w_i}{1 - z_i z^{-1}} : |z_i| \le 1 - \varepsilon_0,\ \|w\|_p \le \Lambda \right\}. \tag{14}$$

If the values of $\varepsilon_0$, $M$, $p$, or $\Lambda$ are clear from context, we might refer to $\mathcal{H}_{\varepsilon_0, M, p, \Lambda}$ by $\mathcal{H}$. Given a function (or filter) $h \in \mathcal{H}$, here $h(x_{1:t})$ refers to the output at time $t$ of convolving a signal $x_{1:t}$ through $h$.

To define RPFB, we randomly draw $n \ge M$ independent complex numbers $Z'_1, \dots, Z'_n$ uniformly within a complex circle with radius $1 - \varepsilon_0$, i.e., $|Z'_i| \le 1 - \varepsilon_0$ (cf. Algorithm 1). The RPFB is

$$\phi(z^{-1}) = \left( \frac{1}{1 - Z'_1 z^{-1}}, \dots, \frac{1}{1 - Z'_n z^{-1}} \right).$$

Given these random poles, we define the following approximation (filter) spaces

$$\tilde{\mathcal{H}}_\Lambda = \tilde{\mathcal{H}}_{n, p, \Lambda} = \left\{ \sum_{i=1}^{n} \frac{\alpha_i}{1 - Z'_i z^{-1}} : \|\alpha\|_p \le \Lambda \right\}. \tag{15}$$

Consider that we have a sequence $(X_1, X_2, \dots, X_T, X_{T+1}, X_{T+2})$. By denoting $Y_t = X_{t+1}$, we define $((X_1, Y_1), \dots, (X_T, Y_T), (X_{T+1}, Y_{T+1}))$.
We assume that $|X_t|$ is $B$-bounded almost surely. Define the estimator $\hat{h}$ by solving the following ERM:

$$\hat{h}' \leftarrow \operatorname*{argmin}_{h \in \tilde{\mathcal{H}}_\Lambda} \frac{1}{T} \sum_{t=1}^{T} |h(X_{1:t}) - Y_t|^2, \qquad \hat{h} \leftarrow \text{Tr}_B[\hat{h}']. \tag{16}$$

Here $\text{Tr}_B[\hat{h}']$ truncates the values of $\hat{h}'$ at the level of $\pm B$. So $\hat{h}$ belongs to the following space

$$\tilde{\mathcal{H}}_{\Lambda, B} = \left\{ \text{Tr}_B[\tilde{h}] : \tilde{h} \in \tilde{\mathcal{H}}_\Lambda \right\}. \tag{17}$$

A central object in our result is the notion of discrepancy, introduced by Kuznetsov and Mohri [2015]. Discrepancy captures the non-stationarity of the process with respect to the function space.⁶

Definition 1 (Discrepancy—Kuznetsov and Mohri 2015). For a stochastic process $X_1, X_2, \dots$, a function space $\mathcal{H}: \mathcal{X}^* \to \mathbb{R}$, and $T \in \mathbb{N}$, define

$$\Delta_T(\mathcal{H}) \triangleq \sup_{h \in \mathcal{H}} \left\{ \mathbb{E}\left[ |h(X_{1:T+1}) - Y_{T+1}|^2 \,\middle|\, X_{1:T+1} \right] - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ |h(X_{1:t}) - Y_t|^2 \,\middle|\, X_{1:t} \right] \right\}.$$

If the value of $T$ is clear from the context, we may use $\Delta(\mathcal{H})$ instead. The following is the main theoretical result of this work.

⁵We use $h$ instead of $f$ to emphasize that the discussion is only for the time series prediction problem, and not the general estimation problem with a time series.
⁶Our definition is a simplified version of the original definition (by selecting $q_t = 1/T$ in their notation).

Theorem 1. Consider the time series $(X_1, \dots, X_{T+2})$, and assume that $|X_t| \le B$ (a.s.). Without loss of generality suppose that $B \ge 1$. Let $0 < \varepsilon_0 < 1$, $M \in \mathbb{N}$, and $\Lambda > 0$, and assume that the conditional expectation $h^*(X_{1:t}) = \mathbb{E}[X_{t+1} \mid X_{1:t}]$ belongs to the class of linear filters $\mathcal{H}_{\varepsilon_0, M, 2, \Lambda}$ (14).
Set an integer $n \ge M$ for the number of random projection filters, and let $\tilde{\mathcal{H}}_\Lambda = \tilde{\mathcal{H}}_{n, 2, \Lambda}$ (15) and the truncated space be $\tilde{\mathcal{H}}_{\Lambda, B}$ (17). Consider the estimator $\hat{h}$ defined in (16). Without loss of generality assume that $\Lambda \ge \varepsilon_0 \sqrt{n}$ and $T \ge 2$. Fix $\delta > 0$. It then holds that there exist constants $c_1, c_2 > 0$ such that with probability at least $1 - \delta$, we have

$$\left| \hat{h}(X_{1:T+1}) - h^*(X_{1:T+1}) \right|^2 \le \frac{c_1 B^2 \Lambda}{\varepsilon_0} \sqrt{\frac{n \log(1/\delta)}{T}}\, \log^3(T) + \frac{c_2 B^2 \Lambda^2 \log\!\left(\frac{20n}{\delta}\right)}{\varepsilon_0^4\, n} + 2\Delta(\tilde{\mathcal{H}}_{\Lambda, B}).$$

The upper bound has three terms: the estimation error, the filter approximation error, and the discrepancy. The $O(\sqrt{n/T})$ term corresponds to the estimation error. It decreases as the length $T$ of the time series increases. As we increase the number of filters $n$, the upper bound shows an increase of the estimation error. This is a manifestation of the effect of the input dimension on the error of the estimator. The $O(n^{-1})$ term provides an upper bound on the filter approximation error. This error decreases as we add more filters. This indicates that RPFB provides a good approximation to the space of dynamical systems $\mathcal{H}_{\varepsilon_0, M, 2, \Lambda}$ (14). Both terms show a proportional dependency on the magnitude $B$ of the random variables in the time series, and an inversely proportional dependency on the minimum distance $\varepsilon_0$ of the poles to the unit circle. Intuitively, this is partly because the output of a pole becomes more sensitive to its input as it gets closer to the unit circle. Finally, the discrepancy term $\Delta(\tilde{\mathcal{H}}_{\Lambda, B})$ captures the non-stationarity of the process, and has been discussed in detail by Kuznetsov and Mohri [2015].
Understanding the conditions under which the discrepancy gets close to zero is an interesting topic for future research.

By setting the number of RP filters to $n = \frac{T^{1/3} \Lambda^{2/3}}{\varepsilon_0^2}$, and under the condition that $\Lambda \le T$, we can simplify the upper bound to

$$\left| \hat{h}(X_{1:T+1}) - h^*(X_{1:T+1}) \right|^2 \le \frac{c B^2 \Lambda^{4/3} \log^3(T) \sqrt{\log(1/\delta)}}{\varepsilon_0^2\, T^{1/3}} + 2\Delta(\tilde{\mathcal{H}}_{\Lambda, B}),$$

which holds with probability at least $1 - \delta$, for some constant $c > 0$. As $T \to \infty$, the error converges to the level of the discrepancy term.

We would like to comment that in the statistical part of the proof, instead of using the independent block technique of Yu [1994] to analyze mixing processes [Doukhan, 1994], which is a common technique used in much prior work, such as Meir [2000]; Mohri and Rostamizadeh [2009, 2010]; Farahmand and Szepesvári [2012], we rely on the more recent notions of sequential complexities [Rakhlin et al., 2010, 2014] and the discrepancy [Kuznetsov and Mohri, 2015] of the function space-stochastic process couple.

This theorem is for Case 1 in Section 3, but a similar result also holds for Case 2. We also mention that, as the values of $M$, $\varepsilon_0$, and $\Lambda$ of the true dynamical system space $\mathcal{H}_{\varepsilon_0, M, 2, \Lambda}$ are often unknown, the choice of the number of filters $n$ in RPFB, the size of the space $M$, etc., cannot be selected based on them. Instead, one should use a model selection procedure to pick appropriate values for these parameters.

5 Experiments

We use a ball bearing fault detection problem to empirically study RPFB and compare it with a fixed-window history-based approach.
The supplementary material provides several other experiments, including the application of an LSTM to the very same problem, a close comparison of RPFB with the fixed-window history-based approach on an ARMA time series prediction problem, and a heart rate classification problem. For further empirical studies, especially in the context of fault detection and prognosis, refer to Pourazarm et al. [2017].
Reliable operation of rotating equipment (e.g., turbines) depends on the condition of its bearings, which makes detecting whether a bearing is faulty and requires maintenance crucially important. We consider a bearing vibration dataset provided by the Machinery Failure Prevention Technology (MFPT) Society in our experiments.7 Fault detection of bearings is an example of an industrial application where computational resources are limited and fast methods are required; e.g., only a micro-controller or a cheap CPU, and not a GPU, might be available.

Figure 1: (Bearing Dataset) Classification error on the test dataset using RPFB and fixed-window history-based feature sets. The RPFB results are averaged over 20 independent randomly selected RPFB. The error bars show one standard error.

The dataset consists of three univariate time series corresponding to a baseline (good condition/class 0), an outer race fault (class 1), and an inner race fault (class 2). The goal is to find a classifier that predicts the class label at the current time t given the vibration time series X_{1:t}. In a real-world scenario, we would train the classifier on a set of previously recorded time series, and later let it operate on a new time series observed from a device, predicting the class label at each time step as new data arrives. Here, however, we split each of the three time series into training and testing subsets.
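To make the feature-extraction step concrete, the following is a minimal sketch (not the authors' code) of RPFB with real first-order AR filters. The restriction to real first-order poles, the pole range, and all function names are our illustrative assumptions; the paper allows more general stable autoregressive filters.

```python
import random

def make_rpfb(n_filters, eps0=0.1, seed=0):
    # Draw n_filters random stable AR(1) filters. Each filter is a single
    # real pole p with |p| <= 1 - eps0, so the recursion
    # y[t] = x[t] + p * y[t-1] is stable (impulse response decays geometrically)
    # and the pole stays at distance >= eps0 from the unit circle.
    rng = random.Random(seed)
    return [rng.uniform(-(1.0 - eps0), 1.0 - eps0) for _ in range(n_filters)]

def rpfb_features(x, poles):
    # Convolve the input series x with every filter; the feature vector at
    # time t is the vector of the filters' outputs at time t.
    states = [0.0] * len(poles)
    feats = []
    for xt in x:
        states = [xt + p * s for p, s in zip(poles, states)]
        feats.append(list(states))
    return feats

# Example: summarize a short series with 4 random filters.
poles = make_rpfb(n_filters=4, eps0=0.1, seed=42)
series = [0.0, 1.0, 0.0, -1.0, 0.5]
features = rpfb_features(series, poles)  # len(series) rows, one column per pole
```

Each row of `features` could then be standardized and passed to a conventional classifier, as in the experiment described next.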
More concretely, we first pass each time series through RPFB (or define a fixed window of its past H values). We then split the processed time series, whose dimension is the number of RPFB filters or the size of the window, into training and testing sets. We select the first 3333 time steps to define the training set, and the next 3333 data points as the testing set. As we have three classes, this makes the sizes of the training and testing sets both equal to 10K. We process each dimension of the features to have zero mean and unit variance for both feature types. We perform 20 independent runs of RPFB, each with a new set of randomly selected filters.
Figure 1 shows the classification error of three different classifiers (Logistic Regression (LR) with ℓ2 regularization, Random Forest (RF), and Support Vector Machine (SVM) with a Gaussian kernel) on both feature types, with varying feature sizes. We observe that as the number of features increases, the error of all classifiers decreases. It is also noticeable that the error heavily depends on the type of classifier, with SVM being the best over the whole range of the number of features. Using RPFB instead of the fixed-window history-based features generally improves the performance of LR and SVM, but not of RF. Refer to the supplementary material for more detail on the experiment.

6 Conclusion

This paper introduced Random Projection Filter Bank (RPFB) as a simple and effective method for feature extraction from time series data. RPFB comes with a finite-sample error upper bound guarantee for a class of linear dynamical systems.
We believe that RPFB should be a part of the toolbox for time series processing.
A future research direction is to better understand other dynamical system spaces, beyond the linear one considered here, and to design other variants of RPFB beyond those defined by stable linear autoregressive filters. Another direction is to investigate the behaviour of the discrepancy factor.

7Available from http://www.mfpt.org/faultdata/faultdata.htm.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful feedback.

References

Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51–77, 2009.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

Paul Doukhan. Mixing: Properties and Examples, volume 85 of Lecture Notes in Statistics. Springer-Verlag, Berlin, 1994.

Amir-massoud Farahmand and Csaba Szepesvári. Regularized least-squares regression: Learning from a β-mixing sequence. Journal of Statistical Planning and Inference, 142(2):493–505, 2012.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Sham Kakade, Percy Liang, Vatsal Sharan, and Gregory Valiant. Prediction with a short memory. arXiv:1612.02526v2, 2017.
Vitaly Kuznetsov and Mehryar Mohri. Learning theory and algorithms for forecasting non-stationary time series. In Advances in Neural Information Processing Systems (NIPS - 28), pages 541–549. Curran Associates, Inc., 2015.

Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.

Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, 2000.

Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems 21, pages 1097–1104. Curran Associates, Inc., 2009.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research (JMLR), 11:789–814, 2010. ISSN 1532-4435.

Junier B. Oliva, Barnabás Póczos, and Jeff Schneider. The statistical recurrent unit. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 2671–2680. PMLR, August 2017.

Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck. Discrete-Time Signal Processing. Prentice Hall, second edition, 1999.

Sepideh Pourazarm, Amir-massoud Farahmand, and Daniel N. Nikovski. Fault detection and prognosis of time series data with random projection filter bank. In Annual Conference of the Prognostics and Health Management Society (PHM), pages 242–252, 2017.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems (NIPS - 21), pages 1313–1320, 2009.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari.
Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems (NIPS - 23), 2010.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 2014.

Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Advances in Neural Information Processing Systems (NIPS - 22), pages 1768–1776. Curran Associates, Inc., 2009. URL http://papers.nips.cc/paper/3736-fast-learning-from-non-iid-observations.pdf.

Ingo Steinwart, Don Hush, and Clint Scovel. Learning from dependent observations. Journal of Multivariate Analysis, 100(1):175–194, 2009.

Santosh S. Vempala. The Random Projection Method. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 2004. ISBN 9780821837931.

Larry Wasserman. All of Nonparametric Statistics (Springer Texts in Statistics). Springer, 2007.

Martha White, Junfeng Wen, Michael Bowling, and Dale Schuurmans. Optimal estimation of multivariate ARMA models. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), 2015.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994.