{"title": "Learning filter widths of spectral decompositions with wavelets", "book": "Advances in Neural Information Processing Systems", "page_first": 4601, "page_last": 4612, "abstract": "Time series classification using deep neural networks, such as convolutional neural networks (CNN), operate on the spectral decomposition of the time series computed using a preprocessing step. This step can include a large number of hyperparameters, such as window length, filter widths, and filter shapes, each with a range of possible values that must be chosen using time and data intensive cross-validation procedures. We propose the wavelet deconvolution (WD) layer as an efficient alternative to this preprocessing step that eliminates a significant number of hyperparameters. The WD layer uses wavelet functions with adjustable scale parameters to learn the spectral decomposition directly from the signal. Using backpropagation, we show the scale parameters can be optimized with gradient descent. Furthermore, the WD layer adds interpretability to the learned time series classifier by exploiting the properties of the wavelet transform. In our experiments, we show that the WD layer can automatically extract the frequency content used to generate a dataset. The WD layer combined with a CNN applied to the phone recognition task on the TIMIT database achieves a phone error rate of 18.1\\%, a relative improvement of 4\\% over the baseline CNN. Experiments on a dataset where engineered features are not available showed WD+CNN is the best performing method. 
Our results show that the WD layer can improve neural network based time series classifiers both in accuracy and interpretability by learning directly from the input signal.", "full_text": "Learning \ufb01lter widths of spectral decompositions with\n\nwavelets\n\nHaidar Khan\n\nDepartment of Computer Science\nRensselaer Polytechnic Institute\n\nTroy, NY 12180\nkhanh2@rpi.edu\n\nB\u00fclent Yener\n\nDepartment of Computer Science\nRensselaer Polytechnic Institute\n\nTroy, NY 12180\nyener@rpi.edu\n\nAbstract\n\nTime series classi\ufb01cation using deep neural networks, such as convolutional neu-\nral networks (CNN), operate on the spectral decomposition of the time series\ncomputed using a preprocessing step. This step can include a large number of\nhyperparameters, such as window length, \ufb01lter widths, and \ufb01lter shapes, each\nwith a range of possible values that must be chosen using time and data intensive\ncross-validation procedures. We propose the wavelet deconvolution (WD) layer\nas an ef\ufb01cient alternative to this preprocessing step that eliminates a signi\ufb01cant\nnumber of hyperparameters. The WD layer uses wavelet functions with adjustable\nscale parameters to learn the spectral decomposition directly from the signal. Using\nbackpropagation, we show the scale parameters can be optimized with gradient\ndescent. Furthermore, the WD layer adds interpretability to the learned time series\nclassi\ufb01er by exploiting the properties of the wavelet transform. In our experiments,\nwe show that the WD layer can automatically extract the frequency content used\nto generate a dataset. The WD layer combined with a CNN applied to the phone\nrecognition task on the TIMIT database achieves a phone error rate of 18.1%, a rel-\native improvement of 4% over the baseline CNN. Experiments on a dataset where\nengineered features are not available showed WD+CNN is the best performing\nmethod. 
Our results show that the WD layer can improve neural network based\ntime series classi\ufb01ers both in accuracy and interpretability by learning directly\nfrom the input signal.\n\n1\n\nIntroduction\n\nThe spectral decomposition of signals plays an integral role in problems involving time series classi\ufb01-\ncation or prediction using machine learning. Effective spectral decomposition requires knowledge\nabout the relevant frequency ranges present in an input signal. Since this information is usually\nunknown, it is encoded as a set of hyperparameters that are hand-tuned for the problem of interest.\nThis approach can be summarized by the application of \ufb01lters to a signal and transformation to\nthe time/frequency domain in a preprocessing step with the short-time Fourier transform (STFT)\n[27], wavelet transform [10], or empirical mode decomposition [16]. The resulting time series of\nfrequency components is then used for the classi\ufb01cation or prediction task. Examples of this approach\nare present across the spectrum of problems involving time series, including \ufb01nancial time series\nprediction [7], automatic speech recognition [41, 2, 38], and biological time series analysis [4, 24].\nAs the parameters of a spectral decomposition are important for time-series problems and are\ngenerally not transferable, it is useful to develop methods to ef\ufb01ciently optimize the parameters for\neach problem. Currently these methods are dominated by cross-validation procedures which incur\nheavy costs in both computation time and data when used to optimize a large set of hyperparameters.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: (a) The typical setup in a signal classi\ufb01cation problem using deep neural networks with\nconvolution and max-pooling layers applied to a preprocessed spectrogram followed by fully con-\nnected classi\ufb01cation layers. 
(b) The proposed setup where the convolutional network operates directly\non the input signal via the wavelet deconvolution layer, eliminating the preprocessing step and\nassociated hyperparameters and learning the spectral decomposition using gradient descent. This is\naccomplished by convolving the input signal and a set of wavelet \ufb01lters with learnable scales.\n\nWe propose a method to ef\ufb01ciently optimize the parameters of the spectral decomposition based on\nthe wavelet transform in a neural network framework.\nOur proposed method, called the wavelet deconvolution (WD) layer, learns the spectral decomposition\nrelevant to the classi\ufb01cation task with backpropagation and gradient descent. This approach results\nin a reduction in hyperparameters and a model that is interpretable using properties of the wavelet\ntransform.\nThis rest of this paper is organized as follows. In Section 2, we introduce the wavelet transform and\nrelevant background. In Section 3, we describe the wavelet deconvolution layer and show how the\nscale parameters can be learned with gradient descent. Section 4 covers related work. We present our\nexperiments and results in Section 5. We conclude in Section 7.\n\n2 Wavelet transform\n\nGiven a signal x(t) de\ufb01ned over t = 1...T , we begin by describing the continuous wavelet transform\n(CWT) of the signal [14, 30]. The transform is de\ufb01ned by the choice of a mother wavelet function\n\u03a8 that is scaled to form a set of wavelet functions, each of which is convolved with the signal. 
The mother wavelet function is chosen such that it has small local support and satisfies the zero mean and normalization properties [10]:\n\n\\int_{-\\infty}^{\\infty} \\psi(t)\\,dt = 0\n\n\\int_{-\\infty}^{\\infty} |\\psi(t)|^2\\,dt = \\int_{-\\infty}^{\\infty} \\psi(t)\\psi^*(t)\\,dt = 1\n\nA common choice of mother wavelet function, called the Ricker or Mexican Hat wavelet, is given by the second derivative of a Gaussian:\n\n\\psi(t) = \\frac{2}{\\sqrt{3\\sigma}\\,\\pi^{1/4}} \\left( \\frac{t^2}{\\sigma^2} - 1 \\right) e^{-t^2/\\sigma^2}\n\nwhich satisfies both conditions. By scaling this function by s and translating by b, we can define wavelet functions \\psi_{s,b}:\n\n\\psi_{s,b}(t) = \\frac{1}{\\sqrt{s}} \\psi\\left( \\frac{t - b}{s} \\right)\n\nNote that s > 0 is required and negative scales are undefined. The CWT of a signal, which decomposes x as a set of coefficients defined at each scale s and translation b, can then be written as:\n\nW_x(s, b) = \\frac{1}{\\sqrt{s}} \\int_{-\\infty}^{\\infty} \\psi\\left( \\frac{t - b}{s} \\right) x(t)\\,dt\n\nSince \\psi has small local support, the integral (or sum) can be efficiently calculated. This transforms the signal x from a one-dimensional domain to a two-dimensional domain of time and scales by convolution with a wavelet function at each scale s.\nThe scale parameters must be chosen according to prior knowledge about the problem and the input signal. Converting from frequencies to scales is one method of guiding the choice of scales. For example, a conversion from scales to frequencies can be estimated using the center frequency of the mother wavelet F_c: F_s = F_c / s [10]. However, converting from scales to frequencies is not useful unless prior knowledge about the signal is available or assumptions are made on the relevant frequency content of the signal. We are interested in the setting where this is not the case, i.e. 
the relevant\nfrequency content of the signal is not known, and show how the scale parameters can be learned for a\ngiven problem using a combination of the wavelet transform and convolutional neural networks [22].\n\n3 Wavelet deconvolutions\n\nIn this discussion, we focus on the context of machine learning problems on time series using neural\nnetworks. The setting is summarized as follows: we are given a set of data x1, x2, ...xn and targets\ny1, y2, ...yn. Each xi is a (possibly multivariate) discrete time series signal that is preprocessed and\npassed to a neural network that learns to predict the target yi. In many applications of interest, the\nneural network is a convolutional neural network and the preprocessing step takes the form of a\ndecomposition of the signal into the time and frequency domain.\nWe replace the preprocessing step of converting the signal from the time domain to the time/frequency\ndomain with a layer in the neural network. This layer, called the wavelet deconvolution (WD) layer,\ncalculates the wavelet transform (WT) on an input signal in the forward pass, feeding the transformed\nsignal to subsequent layers. It also computes the gradients of the loss function with respect to the\nscale parameters in the backward pass, allowing the network to adapt the transform to the problem it\nis tasked to solve.\nThe bene\ufb01ts of adding the WD layer as the \ufb01rst layer in a network include:\n\n\u2022 Learning the relevant spectral content in the input with backpropagation\n\u2022 Implicitly adapting the kernel support (\ufb01lter size) via the scale parameter\n\u2022 Reducing the number of hyperparameters to be optimized with expensive (in time and data)\n\ncross-validation procedures\n\nAnother bene\ufb01t of this approach can be seen by considering the case of learning an optimal\ntime/frequency decomposition of the signal using only CNN. 
Theoretically, a CNN could learn an optimal decomposition of the signal from the input in the time domain [11]. However, this would require careful selection of the correct filter sizes and costly data and training time. The WD layer circumvents these costs by fixing the form of the decomposition of the signal as the WT and learning the filter sizes. We note that any parametrized time-frequency decomposition of the signal can replace the WT in this method provided the parameters are differentiable with respect to the error. A further line of research could be relaxing the parameterization and allowing the layer to learn an empirical mode decomposition from the data, such as the Hilbert-Huang transform [16]; however, we leave this as future work.\nWe now describe the details of the WD layer and show that the gradients of the scales can be calculated using backpropagation. The single-channel case is presented here but the extension to a multi-channel signal is obtained by applying the transform to each channel. Given an input signal x \\in R^N with N samples and a set of scales s \\in R^M with s > 0, the forward pass of the WD layer produces the signal z \\in R^{N \\times M}:\n\nz_i = x * \\psi_{s_i} \\quad \\forall i = 1...M\n\nWe can equivalently write the convolution (*) as a summation:\n\nz_{ij} = \\sum_{k=1}^{K} \\psi_{s_i,k}\\, x_{j+k} \\quad \\text{for } i = 1...M \\text{ and } j = 1...N\n\nwhere \\psi_{s_i} is the wavelet function at scale s_i discretized over a grid of K points:\n\n\\psi_{s_i,t} = \\frac{2}{\\sqrt{3 s_i}\\,\\pi^{1/4}} \\left( \\frac{t^2}{s_i^2} - 1 \\right) e^{-t^2/s_i^2}, \\quad t \\in \\left\\{ -\\frac{K-1}{2}, \\ldots, 0, \\ldots, \\frac{K-1}{2} \\right\\}\n\nThe backward pass of the WD layer involves calculating \\delta E / \\delta s_i, where E is the differentiable loss function being minimized by the network. Typically the loss function is the mean squared error or categorical cross-entropy. 
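As a concrete illustration, the forward pass above can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: the filter width K, the example scales, and the use of "same"-mode convolution are illustrative assumptions.

```python
import numpy as np

def ricker(s, K):
    """Discretized wavelet at scale s on a grid of K points, following the paper's
    form psi_{s,t} = 2/(sqrt(3s)*pi^(1/4)) * (t^2/s^2 - 1) * exp(-t^2/s^2)."""
    t = np.arange(K) - (K - 1) / 2.0
    return (2.0 / (np.sqrt(3.0 * s) * np.pi ** 0.25)) \
        * (t ** 2 / s ** 2 - 1.0) * np.exp(-t ** 2 / s ** 2)

def wd_forward(x, scales, K=101):
    """Forward pass of the WD layer: one convolution z_i = x * psi_{s_i} per scale."""
    return np.stack([np.convolve(x, ricker(s, K), mode="same") for s in scales])

x = np.random.randn(256)                    # a length-N input signal
z = wd_forward(x, scales=[1.0, 2.0, 4.0])   # hypothetical scale values
print(z.shape)                              # (M, N) = (3, 256)
```

Here the output is stacked as (scales, time); the paper writes z as N x M, which is the same array transposed.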
Backpropagation on the layers following the WD layer yields \\delta E / \\delta z_{ij}, which is the gradient with respect to the output of the WD layer. We can write the partial derivative of E with respect to each scale parameter s_i as:\n\n\\frac{\\delta E}{\\delta s_i} = \\sum_{k=1}^{K} \\frac{\\delta E}{\\delta \\psi_{s_i,k}} \\frac{\\delta \\psi_{s_i,k}}{\\delta s_i}\n\nThe gradient with respect to the filter \\psi_{s_i,k} can be written using \\delta E / \\delta z_{ij}:\n\n\\frac{\\delta E}{\\delta \\psi_{s_i,k}} = \\sum_{j=1}^{N} \\frac{\\delta E}{\\delta z_{ij}} \\frac{\\delta z_{ij}}{\\delta \\psi_{s_i,k}} = \\sum_{j=1}^{N} \\frac{\\delta E}{\\delta z_{ij}} x_{j+k}\n\nDefining A, M, G and their partial derivatives as:\n\nA = \\frac{2}{\\sqrt{3 s_i}\\,\\pi^{1/4}}, \\quad \\frac{\\delta A}{\\delta s_i} = -\\frac{3}{\\pi^{1/4}} (3 s_i)^{-3/2}\n\nM = \\frac{t_k^2}{s_i^2} - 1, \\quad \\frac{\\delta M}{\\delta s_i} = -\\frac{2 t_k^2}{s_i^3}\n\nG = e^{-t_k^2 / s_i^2}, \\quad \\frac{\\delta G}{\\delta s_i} = \\frac{2 t_k^2}{s_i^3} e^{-t_k^2 / s_i^2}\n\nThen the gradient of \\psi_{s_i,k} = A M G with respect to the scale s_i is:\n\n\\frac{\\delta \\psi_{s_i,k}}{\\delta s_i} = A \\left( M \\frac{\\delta G}{\\delta s_i} + G \\frac{\\delta M}{\\delta s_i} \\right) + M G \\frac{\\delta A}{\\delta s_i}\n\nFinally, we can write:\n\n\\frac{\\delta E}{\\delta s_i} = \\sum_{k=1}^{K} \\left[ \\frac{\\left( \\frac{4 t_k^4}{s_i^4} - \\frac{9 t_k^2}{s_i^2} + 1 \\right) e^{-t_k^2 / s_i^2}}{\\pi^{1/4} \\sqrt{3 s_i^3}} \\sum_{j=1}^{N} \\frac{\\delta E}{\\delta z_{ij}} x_{j+k} \\right]\n\nThe gradients of the loss with respect to the scale parameters, \\delta E / \\delta s_i, are used to update the scales with gradient descent steps:\n\ns_i' = s_i - \\gamma \\frac{\\delta E}{\\delta s_i}\n\nwhere \\gamma is the learning rate of the optimizer. In order for the wavelet function \\psi_s to be defined we include the constraint s_i > 0 for i = 1...M.\nIt is not immediately clear how the scale parameters s_1...s_M should be initialized for problems where the relevant frequency content is unknown. 
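The closed-form expression for the wavelet's derivative with respect to the scale can be sanity-checked against a numerical derivative of the discretized wavelet. This is a NumPy sketch under the same definitions; the scale value, grid size, and step size are arbitrary choices.

```python
import numpy as np

def ricker(s, K):
    """Discretized wavelet psi_{s,t} at scale s on a grid of K points."""
    t = np.arange(K) - (K - 1) / 2.0
    return (2.0 / (np.sqrt(3.0 * s) * np.pi ** 0.25)) \
        * (t ** 2 / s ** 2 - 1.0) * np.exp(-t ** 2 / s ** 2)

def dricker_ds(s, K):
    """Closed-form d(psi_{s,t})/ds from the derivation above:
    (4t^4/s^4 - 9t^2/s^2 + 1) * exp(-t^2/s^2) / (pi^(1/4) * sqrt(3 s^3))."""
    t = np.arange(K) - (K - 1) / 2.0
    return (4 * t ** 4 / s ** 4 - 9 * t ** 2 / s ** 2 + 1) \
        * np.exp(-t ** 2 / s ** 2) / (np.pi ** 0.25 * np.sqrt(3 * s ** 3))

# Compare against central finite differences; the discrepancy should be tiny.
s, K, eps = 3.0, 101, 1e-5
numeric = (ricker(s + eps, K) - ricker(s - eps, K)) / (2 * eps)
print(np.max(np.abs(numeric - dricker_ds(s, K))))
```

The chain rule through the convolution then multiplies this per-tap derivative by the upstream gradients, exactly as in the final expression above.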
We show empirically that a random initialization of s\nsuch that the whole space of possible frequencies can be explored suf\ufb01ces. This can be done by\ndividing the range of frequencies present in the signal into bins with the number of bins equal to the\nnumber of scales. The bins, or frequency bands, can be of equal size (the typical case) or variable\nsize (depending on the prior knowledge of the signal). The bins are then converted to ranges in the\nscale domain from which the initial scales are randomly chosen.\nAn interesting feature of the WD layer is the \ufb02exibility with respect to the width of the \ufb01lter (K).\nSince at small scales the support of \u03c8 is small as well, the width can be chosen to be small for some\ncomputational improvement. In fact, K can be set dynamically according to the current value of the\nscale e.g. K = min(10s, N ).\nAnother bene\ufb01t to this approach is that a level of interpretability is added to the network. This can be\nachieved by examining the values of the scale parameters after training as they reveal the frequency\ncontent important to the classi\ufb01cation task. In our experiments we show that the scale parameters\nconverge to the true frequencies used to generate an arti\ufb01cial dataset.\n\n4 Related work\n\nImproving the performance of deep neural networks, particularly CNN, by using the network to learn\nfeatures from close to the raw input has been proven to be a successful approach [15, 21, 28, 29, 34].\nThere are two main directions to this line of research, each with advantages and challenges.\nOne direction involves applying convolutional \ufb01lters directly to raw input signals, assuming multiple\nlayers of convolutions and max-pooling will be able to learn the appropriate features [32, 18, 3, 25,\n40, 39]. While this is theoretically feasible, the network architecture selection and optimization\nbecome complicated as the number of layers is increased. 
Interpretability also suffers as the stack of\nfeatures is dif\ufb01cult to interpret.\nThe other direction involves tuning the parameters of the preprocessing step by gradient descent. For\nexample, using backpropagation to calculate the gradients of the mel-\ufb01lter banks commonly used in\nautomatic speech recognition. The gradients are then used to optimize the shape of the \ufb01lters [31]. By\njointly optimizing the feature extraction steps with the rest of the network, the feature extraction can\nbe modi\ufb01ed to be optimal for the classi\ufb01cation task at hand. However, the drawback to this approach\nis that it requires a set of hand-crafted features with parameters that are differentiable with respect to\nthe loss function. In addition, the shapes of the hand crafted \ufb01lters are distorted by the gradients after\nmany update steps since each point in the \ufb01lter is updated independently. This causes the resulting\n\ufb01lters to be uninterpretable and sometimes unstable [31].\nOur work combines these two approaches by assuming a standard form for the feature extraction with\nprovable qualities, i.e. the wavelet transform, and modifying the parameters of the transform using\ngradient descent. This combination simpli\ufb01es the optimization process and circumvents the need for\na pre-designed feature extraction step.\nWavelet coef\ufb01cients and features extracted from wavelet coef\ufb01cients have been used to train convo-\nlutional neural networks previously [23, 36, 19]. However, this work is the \ufb01rst, to the best of our\nknowledge, to optimize the scale parameters with gradient descent.\n\n5 Results\n\nOur experiments on arti\ufb01cial and real-world datasets show that including the wavelet deconvolution\nlayer as the \ufb01rst layer of a neural network leads to improved accuracy as well as a reduction in\ntunable hyperparameters. 
Furthermore, we show that the learned scales in the WD layer converge\n\n5\n\n\fto the frequencies present in the signal, adding interpretability to the learned model. In all of our\nexperiments, we implement and optimize the networks using the Tensor\ufb02ow library [1] with the\nKeras wrapper [9]. 1\n\n5.1 Arti\ufb01cial data\n\nWe generate an arti\ufb01cial dataset to compare the performance of the WD layer to a CNN and verify that\nthe learned scale parameters converge to the frequencies present in the signal. This dataset consists\nof randomly generated signals, with each signal containing three frequency components separated by\ntime. The signals are separated into two classes based on the ordering of the frequency components;\none class contains signals with components ordered from low frequency to high frequency while\nthe other class contains signals with components ordered from high to low. Clearly, these two\nclasses cannot be classi\ufb01ed using a simple Fourier transform as the temporal order of the frequency\ncomponents is important. Fig 2 shows examples from each class and demonstrates that they are\nindistinguishable using the Fourier transform. The purpose of this experiment and the design of this\ndataset is to show the WD layer can learn the spectral content relevant to a task.\n\nFigure 2: A positive (Class1) example and negative (Class2) example from the arti\ufb01cial two class\ndataset are shown in the \ufb01rst row. Examples from Class1 are generated by changing the frequency of\nthe signal from low to high over time. Examples from Class2 are generated by changing the frequency\nof the signal from high to low over time. The plots in the second row show the Fourier Transform of\nthe two examples. Without accounting for the frequency content over time, the two examples look\nidentical.\n\nWe train two networks on examples from each class and compare their performance. 
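The construction of such a two-class dataset can be sketched as follows. This is an illustrative reconstruction, not the authors' generator: the segment length, sampling rate, noise level, and the three frequencies are assumed values.

```python
import numpy as np

def make_example(freqs, n_per_segment=128, fs=1000.0):
    """Concatenate pure-tone segments, one per frequency, in the given order."""
    t = np.arange(n_per_segment) / fs
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])

def make_dataset(n_examples, freqs=(5.0, 20.0, 80.0), seed=0):
    """Label 0: components ordered low -> high (Class1); label 1: high -> low (Class2).
    The magnitude spectra of the two classes are identical (up to noise), so the
    temporal order of the components is the only discriminative information."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_examples):
        label = int(rng.integers(0, 2))
        order = sorted(freqs) if label == 0 else sorted(freqs, reverse=True)
        X.append(make_example(order) + 0.1 * rng.standard_normal(3 * 128))
        y.append(label)
    return np.stack(X), np.array(y)

X, y = make_dataset(8)
print(X.shape, y.shape)  # (8, 384) (8,)
```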
The baseline network is a 4-layer CNN with max-pooling [21] ending with a single unit for classification. The other network replaces the first layer with a WD layer while maintaining the same number of parameters. Both networks are optimized with Adam [20] using a fixed learning rate of 0.001 and a batch size of 4. Fig 3 shows the learning dynamics of both of these networks on this problem as well as a comparison of their performance. The network with the WD layer learns much faster than the CNN, thanks to its flexibility in learning filter sizes and scales, and it achieves a near perfect AUC-ROC score. Additionally, we observed that the scale parameters learned by the WD network converged to the frequencies of the signal components. We experimented with different initializations of the scale parameters to verify this behavior was consistent. This is shown in the third panel in Fig 3.\n\n1 Code implementing the WD layer can be found at https://github.com/haidark/WaveletDeconv\n\nFigure 3: Top left: ROC curve of the WD network (blue) and the convolutional network (red) on the synthetic dataset. The WD network achieves a perfect score on the test dataset. Top right: Validation loss over training epochs for the WD network (blue) and the convolutional network (red). The WD network learns much faster than the CNN and achieves a lower loss value overall. Bottom: The width parameters of the WD network over training epochs (solid lines) from several random initializations alongside the true frequencies used to generate the synthetic dataset (dashed lines). We observed that the width parameters converged to the true frequency components, indicating that the WD layer is able to uncover the relevant frequency information for a problem.\n\n5.2 TIMIT\n\nThe TIMIT dataset is a well-known phone recognition benchmark dataset for automatic speech recognition (ASR). Performance on the TIMIT dataset has steadily improved over the years due to many iterations in engineering features for the phone recognition problem. Prior to the rebirth of deep neural networks in ASR, the community converged to HMM-GMM models trained on features based on mel-filters applied to the speech signal and transformed using a discrete cosine transform (DCT) [41]. More recent work improves performance by omitting the DCT step and applying CNN directly to the output of the mel-filter banks [15, 2, 17, 28]. Our goal is to extend this further by removing the mel-filter banks and attempting to learn the appropriate filters using the WD layer. Clearly, this is a difficult task as the mel-filter banks represent the result of decades of research and experimentation. Our motivation is to show that the WD layer is adaptable to different problem spaces and to provide an approach that circumvents the need for extensive feature engineering.\n\nTable 1: Best reported PER on the TIMIT dataset without context dependence\n\nMethod | PER (Phone Error Rate)\nDNN with ReLU units [37] | 20.8\nDNN + RNN [12] | 18.8\nCNN [38] | 18.9\nWD + CNN (this work) | 18.1\nLSTM RNN [13] | 17.7\nHierarchical CNN [38] | 16.5\n\nTo ensure a fair comparison to previous results, we replicated the non-hierarchical CNN given in [38], a 4-layer network with 1 convolutional layer followed by 3 fully connected layers. In our network, we removed the mel-filter bank preprocessing steps and added the WD layer as the first layer, using the speech signal directly as input to the WD layer. The WD layer passes the wavelet transform of the signal along with the \u2206 and \u2206\u2206 values to the CNN for classification in the forward pass. In the backward pass, the gradients of the scale parameters are calculated using the gradients from the CNN. 
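In standard ASR front ends, the \u2206 and \u2206\u2206 change values are local regression coefficients over a few neighboring frames; a common formulation (an assumption here, since the paper does not specify its exact variant) is \u2206_t = \u03a3_n n(c_{t+n} - c_{t-n}) / (2 \u03a3_n n\u00b2), with \u2206\u2206 obtained by applying the same operator to \u2206. A NumPy sketch:

```python
import numpy as np

def delta(feats, n=2):
    """Delta (regression) coefficients over a 2n+1 frame window.
    feats: array of shape (num_frames, num_coeffs); edges padded by repetition."""
    denom = 2 * sum(i * i for i in range(1, n + 1))
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    out = np.zeros_like(feats, dtype=float)
    for i in range(1, n + 1):
        out += i * (padded[n + i : n + i + len(feats)]
                    - padded[n - i : n - i + len(feats)])
    return out / denom

# e.g. 50 frames of 41 per-frame features (wavelet or filter-bank outputs),
# stacked with their delta and delta-delta values.
wt = np.random.randn(50, 41)
stacked = np.concatenate([wt, delta(wt), delta(delta(wt))], axis=1)
print(stacked.shape)  # (50, 123)
```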
We also use the optimization parameters presented in [38], except we replace SGD with the Adam [20] optimizer. A minibatch size of 100 was used. The learning rate was set to 0.001 initially and halved when the validation error plateaued. The dataset in its benchmark form consists of 3696 spoken sentences used for training and 192 sentences for testing. The sentences are segmented into frames and labeled by phones from a set of 39 different phones. 10% of the 3696 training sentences are used as a validation set for hyperparameter optimization and early stopping for regularization. The standard set of features used by CNN-based approaches consists of 41 mel-filter bank features extracted from the signal along with their \u2206 and \u2206\u2206 change values. The results using this network are shown in Table 1 alongside other strong performing methods.\nAlthough the WD layer does not outperform the best CNN-based approach using the hand-crafted signal decomposition, it is clear that the approach is competitive, achieving 18.1% PER despite removing all of the engineered features. This is expected, as the mel-filter bank based decomposition is well suited to this speech recognition task. Removing the mel-filter bank features puts the WD+CNN model at a significant disadvantage compared to the other methods because the model must first learn the appropriate preprocessing steps.\nHowever, these results show that with minimal engineering effort and a reduction in tunable hyperparameters, the WD layer offers an effective alternative to the mel-filter bank features. Introducing the WD layer to the CNN eliminated 7 tunable hyperparameters from the preprocessing step of the baseline CNN. This is significant as it shows the WD layer can learn a set of features equivalent to a carefully crafted feature extraction process directly from the data. 
By plotting the frequency response of the learned wavelets, as shown in Fig 4, we observe that they resemble the triangular filters of the mel-filter bank.\n\nFigure 4: Visualization of 10 learned wavelet filters from the network trained on the TIMIT dataset. The left pane shows the scaled wavelet functions and the right pane shows the frequency response of each of the learned wavelet filters. The learned wavelet filters closely resemble the triangular mel-filter banks commonly used in automatic speech recognition systems both in shape and spacing across the frequency spectrum.\n\n5.3 UCR Haptics dataset\n\nIn this set of experiments, we evaluate the performance of the WD layer on a dataset for which an engineered preprocessing stage is not known. As shown in Table 2, methods using hand-crafted features perform poorly on the dataset.\nThe Haptics dataset is part of the UCR Time Series Classification archive [8]. The data consists of time series of length 1092 divided into 5 classes. The data is provided mean-centered and normalized, and a training/testing split of the data is set. We further split the training data with 10-fold cross-validation for early stopping and selecting the number of filters. We train the following 7-layer network on the dataset: a WD layer with 7 scales, three convolutional+maxpooling layers with 32 filters each and 2x5 kernels (the first dimension is scales and the second dimension is time), and three fully connected layers with 100, 25, and 5 units. Dropout [35] with p = 0.3 was added after every layer for regularization. The nonlinear activations after every layer were ReLU [26]. The network was trained using Adam [20] with default parameters: lr = 0.001, \u03b21 = 0.9, and \u03b22 = 0.999. 
The network was trained for 1000 epochs with a batch size of 2 and the weights with the best validation loss were saved.\nThe results in Table 2 show that the WD layer achieves the best performance with an error of 0.425, improving on the next best performing method by an absolute 2.4%. The second best performing method, the Fully Convolutional Network (FCN) [40], is a network of 1-D convolutional units that also does not require any preprocessing or feature extraction steps and has a similar number of parameters to our method. Other methods such as Dynamic Time Warping [5], COTE [6], and BOSS [33] depend on feature extraction steps which may not be suitable to this task. We believe the improvement shown here, especially with respect to other CNN-based methods with similar model complexities, shows that the WD layer learns a spectral decomposition of the time series which results in improved classification accuracy.\n\nTable 2: Testing error on the Haptics dataset\n\nMethod | Test Error\nDTW [5] | 0.623\nBOSS [33] | 0.536\nResNet [40] | 0.495\nCOTE [6] | 0.488\nFCN [40] | 0.449\nWD + CNN (this work) | 0.425\n\n6 Discussion\n\nWe demonstrate that the WD layer provides a powerful and flexible approach to learning the parameters of the spectral decomposition of a signal. Combined with the backpropagation algorithm for calculating gradients with respect to a loss function, the WD layer can automatically set the filter widths to maximize classification accuracy. Although any parameterized transform can be used, there are two benefits to using the wavelet transform that are not realized by other transforms. Firstly, the wavelet functions are differentiable with respect to the scales, which allows optimization with the backpropagation algorithm. 
Secondly, the scale parameters control both the target frequency and the filter width, allowing a multiscale decomposition of the signal within a single layer of the network.\nOne challenge to the optimization of the WD layer using stochastic gradient descent (SGD) with a fixed learning rate is that the scale parameters can change too slowly relative to their magnitude, so convergence can be slow. This is caused by the multiscale nature of the wavelet transform. When the magnitude of the scale parameter is small, small changes to the parameter capture changes in high frequencies effectively. At lower frequencies, when the magnitude of the scale parameter is large, many steps are required. Fortunately, more advanced optimization techniques with variable and per-parameter learning rates, such as Adam [20] and Adadelta [42], circumvent this problem. We found that using Adam (a standard choice for deep neural networks) with the default parameters greatly sped up training over using SGD with a fixed learning rate. Thus, this method requires a variable learning rate in order to effectively learn the scale parameters.\n\n7 Conclusion\n\nIn this paper, we used the wavelet transform and convolutional neural networks to learn the parameters of a spectral decomposition for the classification of signals. By learning the wavelet scales of the wavelet transform with backpropagation and gradient descent, we avoid having to choose the parameters of the spectral decomposition using cross-validation. We showed that the decomposition learned by backpropagation equaled or outperformed hand-selected spectral decompositions. In addition, the learned scale parameters reveal the frequency content of the signal important to the classification task, adding a layer of interpretability to the deep neural network. 
As future work, we plan to investigate\nhow to extend the WD layer to signals in higher dimensions, such as images and video, as well as\ngeneralizing the wavelet transform to empirical mode decompositions.\n\nAcknowledgments\n\nThis work was supported in part by NSF Award #1302231.\n\nReferences\n[1] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado,\nAndy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey\nIrving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg,\nDan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens,\n\n9\n\n\fBenoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda\nViegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.\nTensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint\narXiv:1603.04467, 2016.\n\n[2] Ossama Abdel-Hamid, Mohamed Abdel-rahman, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Con-\nvolutional Neural Networks for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and\nLanguage Processing, 22(10):1533\u20131545, 2014.\n\n[3] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. Applying Convolutional\nNeural Networks concepts to hybrid NN-HMM model for speech recognition. In 2012 IEEE International\nConference on Acoustics, Speech and Signal Processing (ICASSP), pages 4277\u20134280. IEEE, mar 2012.\n\n[4] Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and B\u00fclent Yener. Multiway analysis of\n\nepilepsy tensors. Bioinformatics, 23(13):i10\u2013i18, 2007.\n\n[5] Ghazi Al-Naymat, Sanjay Chawla, and Javid Taheri. SparseDTW: A novel approach to speed up dynamic\ntime warping. 
Conferences in Research and Practice in Information Technology Series, 101:117–127, 2009.

[6] Anthony Bagnall, Jason Lines, Jon Hills, and Aaron Bostrom. Time-series classification with COTE: The collective of transformation-based ensembles. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 1548–1549, 2016.

[7] Peter Brockwell and Richard Davis. Introduction to Time Series and Forecasting. Springer, 2002.

[8] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR Time Series Classification Archive, July 2015.

[9] François Chollet. keras, 2015.

[10] Ingrid Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5):961–1005, 1990.

[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, 2016.

[12] Gábor Gosztolya, Tamás Grósz, László Tóth, and David Imseng. Building context-dependent DNN acoustic models using Kullback-Leibler divergence-based state tying. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pages 4570–4574, August 2015.

[13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, May 2013.

[14] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

[15] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6):82–97, November 2012.

[16] Norden E. Huang and Zhaohua Wu. A Review on Hilbert-Huang Transform: Method and Its Applications. Reviews of Geophysics, 46:1–23, 2008.

[17] Po-Sen Huang, Haim Avron, Tara N. Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pages 205–209, May 2014.

[18] Navdeep Jaitly and Geoffrey E. Hinton. Learning a Better Representation of Speech Sound Waves using Restricted Boltzmann Machines. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5884–5887, 2011.

[19] Haidar Khan, Lara Marcuse, Madeline Fields, Kalina Swann, and Bülent Yener. Focal onset seizure prediction using convolutional networks. IEEE Transactions on Biomedical Engineering, 2017.

[20] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, December 2014.

[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, May 2017.

[22] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.

[23] Piotr Mirowski, Deepak Madhavan, Yann LeCun, and Ruben Kuzniecky. Classification of patterns of EEG synchronization for seizure prediction. Clinical Neurophysiology, 120(11):1927–1940, 2009.

[24] Piotr W. Mirowski, Yann LeCun, Deepak Madhavan, and Ruben Kuzniecky. Comparing SVM and convolutional networks for epileptic seizure prediction from intracranial EEG.
In Proceedings of the 2008 IEEE Workshop on Machine Learning for Signal Processing (MLSP 2008), pages 244–249, 2008.

[25] Abdel-rahman Mohamed, George Dahl, and Geoffrey Hinton. Deep Belief Networks for Phone Recognition. Scholarpedia, 4(5):1–9, 2009.

[26] Vinod Nair and Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, 2010.

[27] Alan V. Oppenheim and Ronald W. Schafer. Discrete-Time Signal Processing. Pearson Education, 2014.

[28] Dimitri Palaz, Ronan Collobert, and Mathew Magimai-Doss. End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks. arXiv preprint arXiv:1312.2137, 2013.

[29] Dimitri Palaz, Mathew Magimai-Doss, and Ronan Collobert. Analysis of CNN-based speech recognition system using raw speech as input. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 11–15, 2015.

[30] Robi Polikar. The engineer's ultimate guide to wavelet analysis: the wavelet tutorial. Available at http://www.public.iastate.edu/~rpolikar/WAVELETS/WTtutorial.html, 14(2):81–87, 1996.

[31] Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, and Bhuvana Ramabhadran. Learning Filter Banks Within a Deep Neural Network Framework. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 297–302, December 2013.

[32] Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, and Oriol Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 1–5, 2015.

[33] Patrick Schäfer. The BOSS is concerned with time series classification in the presence of noise.
Data Mining and Knowledge Discovery, 29(6):1505–1530, 2015.

[34] Patrice Y. Simard, Dave Steinkraus, and John C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., volume 1, pages 958–963. IEEE Computer Society, 2003.

[35] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[36] Sebastian Stober, Avital Sternin, Adrian M. Owen, and Jessica A. Grahn. Deep Feature Learning for EEG Recordings. arXiv, pages 1–24, November 2015.

[37] László Tóth. Phone recognition with deep sparse rectifier neural networks. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pages 6985–6989, 2013.

[38] László Tóth. Convolutional deep maxout networks for phone recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 1078–1082, 2014.

[39] Zoltán Tüske, Pavel Golik, Ralf Schlüter, and Hermann Ney. Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 890–894, 2014.

[40] Zhiguang Wang, Weizhong Yan, and Tim Oates. Time Series Classification from Scratch with Deep Neural Networks: A Strong Baseline. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 1578–1585, 2017.

[41] Dong Yu and Li Deng. Automatic Speech Recognition, Signals and Communication Technology. Springer, London, 2015.

[42] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method.
arXiv preprint arXiv:1212.5701, 2012.