{"title": "Neural Network Modeling of Speech and Music Signals", "book": "Advances in Neural Information Processing Systems", "page_first": 779, "page_last": 785, "abstract": null, "full_text": "Neural Network Modeling of Speech and Music \n\nSignals \n\nAxel Robel \n\nTechnical University Berlin, Einsteinufer 17, Sekr. EN-8, 10587 Berlin, Germany \nTel: +49-30-31425699, FAX: +49-30-31421143, email: roebel@kgw.tu-berlin.de \n\nAbstract \n\nTime series prediction is one of the major applications of neural net(cid:173)\nworks. After a short introduction into the basic theoretical foundations \nwe argue that the iterated prediction of a dynamical system may be in(cid:173)\nterpreted as a model of the system dynamics. By means of RBF neural \nnetworks we describe a modeling approach and extend it to be able to \nmodel instationary systems. As a practical test for the capabilities of the \nmethod we investigate the modeling of musical and speech signals and \ndemonstrate that the model may be used for synthesis of musical and \nspeech signals. \n\n1 Introduction \n\nSince the formulation of the reconstruction theorem by Takens [10] it has been clear that \na nonlinear predictor of a dynamical system may be directly derived from a systems time \nseries. The method has been investigated extensively and with good success for the pre(cid:173)\ndiction of time series of nonlinear systems. Especially the combination of reconstruction \ntechniques and neural networks has shown good results [12]. \n\nIn our work we extend the ideas of predicting nonlinear systems by the more demanding \ntask of building system models, which are able to resynthesize the systems time series. In \nthe case of chaotic or strange attractors the resynthesis of identical time series is known to \nbe impossible. However, the modeling of the underlying attractor leads to the possibility \nto resynthesis time series which are consistent with the system dynamics. 
Moreover, the models may be used for the analysis of the system dynamics, for example the estimation of the Lyapunov exponents [6]. In the following we investigate the modeling of music and speech signals, where the system dynamics are known to be nonstationary. Therefore, we develop an extension of the modeling approach such that we are able to handle nonstationary systems. \n\nIn the following, we first give a short review of state space reconstruction from time series by delay coordinate vectors, a method introduced by Takens [10] and later extended by Sauer et al. [9]. Then we explain the structure of the neural networks we used in the experiments and the enhancements necessary to model nonstationary dynamics. As an example we apply the neural models to a saxophone tone and a speech signal and demonstrate that the signals may be resynthesized using the neural models. Furthermore, we discuss some of the problems and outline further developments of the application. \n\n2 Reconstructing attractors \n\nAssume an n-dimensional dynamical system f(.) evolving on an attractor A. A has fractal dimension d, which is often considerably smaller than n. The system state z is observed through a sequence of measurements h(z), resulting in a time series of measurements y_t = h(z(t)). Under weak assumptions concerning h(.) and f(.), the fractal embedding theorem [9] ensures that, for D > 2d, the set of all delayed coordinate vectors \n\nx_t = (y_t, y_{t-T}, ..., y_{t-(D-1)T})    (1) \n\nwith an arbitrary delay time T forms an embedding of A in the D-dimensional reconstruction space. We call the minimal D which yields an embedding of A the embedding dimension D_e. Because an embedding preserves characteristic features of A, in particular because it is one to one, it may be employed for building a system model. 
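The construction of the delayed coordinate vectors of eq. (1) can be sketched in a few lines; this is a minimal illustration, and the function name and NumPy-based layout are ours, not the paper's:

```python
import numpy as np

def delay_embedding(y, D, T):
    """Build the delayed coordinate vectors of eq. (1).

    Each row of the result is (y_t, y_{t-T}, ..., y_{t-(D-1)T}),
    one reconstruction-space point per usable sample t.
    """
    y = np.asarray(y, dtype=float)
    start = (D - 1) * T  # first index with a full history of D lagged samples
    return np.stack(
        [y[start - j * T: len(y) - j * T] for j in range(D)],
        axis=1,
    )
```

For example, `delay_embedding(np.arange(10.0), D=3, T=2)` yields six 3-dimensional points whose first row is `(4, 2, 0)`, i.e. the sample at t=4 together with its lags at t-2 and t-4.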
For this purpose the reconstruction of the attractor is used to uniquely identify the system's state, thereby establishing the possibility of uniquely predicting the system's evolution. The prediction function may be represented by a hypersurface over the attractor in a (D+1)-dimensional space. By iterating this prediction function we obtain a vector valued system model which, however, is valid only on the respective attractor. \n\nFor the reconstruction of nonstationary system dynamics we confine ourselves to the case of slowly varying parameters and model the nonstationary system using a sequence of attractors. \n\n3 RBF neural networks \n\nThere are different topologies of neural networks that may be employed for time series modeling. In our investigation we used radial basis function networks, which have shown considerably better scaling properties than networks with sigmoid activation functions when the number of hidden units is increased [8]. As proposed by Verleysen et al. [11], we initialize the network using a vector quantization procedure and then apply backpropagation training to fine-tune the network parameters. The tuning of the parameters yields an improvement factor of about ten in prediction error compared to the standard RBF network approach [8, 3]. Compared to earlier results [7], the normalization of the hidden layer activations yields a small improvement in the stability of the models. \n\nThe resulting network function for m-dimensional vector valued output is of the form \n\nN(x) = sum_j w_j exp(-(c_j - x)^2 / sigma_j^2) / sum_l exp(-(c_l - x)^2 / sigma_l^2) + b    (2) \n\nFig. 1: Input/output structure of the neural model: the delayed coordinates x(i), x(i-T), x(i-2T), x(i-3T) together with the control input k(i) are mapped to the predicted samples x(i+1), ..., x(i+T). 
where sigma_j represents the standard deviation of the j-th Gaussian, the input x and the centers c_j are n-dimensional vectors, and b and the w_j are m-dimensional parameters of the network. Networks of the form of eq. (2) with a finite number of hidden units are able to approximate arbitrarily closely all continuous mappings R^n -> R^m [4]. This universal approximation property is the foundation of using neural networks for time series modeling, where we denote them as neural models. In the context of the previous section the neural models approximate the system's prediction function. \n\nTo be able to represent nonstationary dynamics, we extend the network according to figure 1 to have an additional input that enables the control of the actual mapping \n\nN(x, k(i))    (3) \n\nThis model is close to the Hidden Control Neural Network described in [2]. From the universal approximation properties of the RBF networks stated above it follows that eq. (3) with an appropriate control sequence k(i) is able to approximate any sequence of functions. In the context of time series prediction the value i represents the actual sample time. The control sequence may be optimized during training, as described in [2]. The optimization of k(i) requires prohibitively large computational power if the number of different control values, i.e. the domain of k, is large. However, as long as the system's nonstationarity is described by a smooth function of time, we argue that it is possible to select k(i) to be a fixed linear function of i. With the preselected k(i), the training of the network adapts the parameters c_j and sigma_j such that the model evolution closely follows the system's nonstationarity. \n\n4 Neural models \n\nAs shown in figure 1, we use the delayed coordinate vectors and a selected control sequence to train the network to predict the sequence of the following T time samples. The vector valued prediction avoids the need for a further interpolation of the predicted samples. Otherwise, an interpolation would be necessary to obtain the original sample frequency, which, because the Nyquist frequency is not regarded in choosing T, is not straightforward to achieve. \n\nAfter training we initialize the network input with the first input vector (x_0, k(0)) of the time series and iterate the network function, shifting the network input and using the latest output to complete the new input. The control input may be copied from the training phase to resynthesize the training signal, or may be varied to emulate another sequence of system dynamics. \n\nThe question that has to be posed in this context concerns the stability of the model. Due to the prediction error of the model, the iteration will soon leave the reconstructed attractor. Because there exists no training data from the neighborhood of the attractor, the minimization of the prediction error of the network does not guarantee the stability of the model [5]. Nevertheless, as we will see in the examples, the neural models are stable for at least some parameters D and T. \n\nDue to the high density of training data, the method for stabilizing dynamical models presented in [5] is difficult to apply in our situation. Another approach to increase the model stability is to lower the gradient of the prediction function in the directions normal to the attractor. This may be obtained by disturbing the network input during training with a small noise level. While conceptually straightforward, we found that this method is only partly successful. 
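The input perturbation just described amounts to adding small Gaussian noise to the delay coordinate inputs before each training pass. A minimal sketch, assuming training inputs X (delay vectors) and targets Y as NumPy arrays and a noise level taken from the saxophone experiment of section 5.1; the generator function itself is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_batches(X, Y, sigma=0.0005, epochs=10):
    """Yield (input, target) pairs with Gaussian noise added to the inputs.

    Perturbing the inputs flattens the learned prediction function in the
    directions normal to the attractor; the targets are left untouched.
    sigma = 0.0005 is the standard deviation used for the saxophone model.
    """
    for _ in range(epochs):
        yield X + rng.normal(0.0, sigma, X.shape), Y
```

Each epoch then trains on a slightly different cloud of points around the reconstructed attractor instead of the attractor itself.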
While the resulting prediction function is smoother in the neighborhood of the attractor, the prediction error for training with noise is considerably higher than expected from the noise-free results, such that the overall effect often is negative. To circumvent the problems of training with noise, further investigations will consider an optimization function with a regularization term that directly penalizes high derivatives of the network with respect to the input units [1]. The stability of the models is a major subject of further research. \n\n5 Practical results \n\nWe have applied our method to two acoustic time series: a single saxophone tone, consisting of 16000 samples sampled at 32kHz, and a speech signal of the word manna(1). The latter time series consists of 23000 samples with a sampling rate of 44.1kHz. Both time series have been normalized to stay within the interval [-1, 1]. The estimation of the dimension of the underlying attractors yields a dimension of about 2-3 in both cases. \n\nWe chose the control input k(i) to be linearly increasing from -0.8 to 0.8. We found stable models for both time series using D > 5. For the parameter T in particular we observed considerable impact on the model quality. While a smaller T results in better one step ahead prediction, the iterated model often becomes unstable. This might be explained by the decrease in variation within the prediction hypersurface that has to be learned. For small T the model tends to become linear and does not capture the nonlinear characteristics of the system. Therefore the iteration of those models failed. Too large values of T result in an insufficient one step ahead prediction error, which pushes the model far away from the attractor, also producing unstable behavior. 
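The complete model of sections 3 and 4, the normalized RBF function of eq. (2) with the control input of eq. (3) and the iterated resynthesis, can be sketched as follows. This is an illustration under our own naming and array-shape conventions, not the trained networks of the experiments; the output dimension m is taken equal to T so that each prediction extends the signal by exactly the shifted amount:

```python
import numpy as np

def rbf_forward(x, centers, sigmas, weights, bias):
    """Normalized RBF network of eq. (2).

    x: input vector of length n (here: D delayed samples plus k(i)),
    centers: (J, n), sigmas: (J,), weights: (J, m), bias: (m,).
    """
    # Gaussian activations of the J hidden units
    a = np.exp(-np.sum((centers - x) ** 2, axis=1) / sigmas ** 2)
    a = a / a.sum()          # normalized hidden layer activations
    return a @ weights + bias

def resynthesize(y_init, k_seq, net, D, T):
    """Iterated prediction of section 4.

    Builds the delay vector from the signal generated so far, appends
    the control value k(i), and extends the signal by the T samples
    the network predicts.  y_init must hold at least (D-1)*T+1 samples.
    """
    y = list(y_init)
    for ki in k_seq:
        x = np.array([y[-1 - j * T] for j in range(D)] + [ki])
        y.extend(net(x))     # net returns the next T samples
    return np.array(y)
```

Copying the training control sequence into `k_seq` resynthesizes the training signal; holding a control value constant for extra steps lengthens the corresponding portion of the tone, as done in figure 3.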
(1) The name of our parallel computer. \n\nFig. 2: Synthesized saxophone signal and power spectrum estimation for the original (solid) and synthesized (dashed) signal. \n\nFig. 3: Varying the synthesized tone by varying the control input sequence. \n\n5.1 Modeling a saxophone \n\nIn the following we consider the results for the saxophone model. The model we present consists of 10 input units, 200 hidden units and 5 output units and was trained with additional Gaussian noise at the input. The standard deviation of the noise is 0.0005 and the RMS training error obtained is 0.005. The resulting saxophone model is able to resynthesize a signal which is nearly indistinguishable from the original one. The resynthesized time series is shown in figure 2. The time series follows the original one with a small phase shift, which stems from a small difference in the onset of the model. Figure 2 also shows the power spectra of the saxophone signal and of the neural model. From the spectra we see the close resemblance of the sound. 
One major demand for the practical application of the proposed musical instrument models is the possibility to control the synthesized sound. At the present state there exists only one control input to the model. Nevertheless, it is interesting to investigate the effect of varying the control input of the model. We tried different control input sequences to synthesize saxophone tones. It turns out that the model remains stable, such that we are able to control the envelope of the sound. An example of a tone with increased duration is shown in figure 3. In this example the control input first follows the trained version, then remains constant to produce a longer duration of the tone, and then increases to reproduce the decay of the tone from the trained time series. \n\nFig. 4: Original and synthesized signal of the word manna. \n\n5.2 Modeling a speech signal \n\nFor modeling the time series of the spoken word manna we used a network similar to the saxophone model. Due to the increased nonstationarity of the signal, we needed an increased number of RBF units in the network. The best results up to now have been obtained with a network of 400 hidden units, delay time T = 8, output dimension 8 and input dimension 11. \n\nIn figure 4 we show the original and the resynthesized signal. The quality of the model is not as high as in the case of the saxophone. Nevertheless, the word is quite understandable. From the figure we see that the main problems stem from the transitions between consecutive phonemes. 
These transitions are rather quick in time and, therefore, there exists only a small amount of data describing the dynamics of the transitions. We assume that more training examples of the same word will cure the problem. However, it will probably require a well trained speaker to reproduce the dynamics in speaking the same word twice. \n\n6 Further developments \n\nThere are two practical applications that directly follow from the presented results. The first one is to synthesize music signals. To meet musicians' demands, we need to enhance the control of the synthesized signals. Therefore, in the future we will try to enlarge the models, incorporating different flavors of sound into the same model and adding additional control inputs. In particular, we plan to build models for different volumes and pitches. As a second application we will further investigate the possibilities of using the neural models as a speech synthesizer. An interesting topic of further research would be the extension of the model with an intonation control input, incorporating the possibility to synthesize different intonations of the same word from one model. \n\n7 Summary \n\nThe article describes a methodology to build nonstationary models from time series of dynamical systems. We give theoretical arguments for the universality of the models and discuss some of the restrictions and open problems. As a practical test of the method we apply the models to the demanding task of the synthesis of musical and speech signals. It is demonstrated that the models are capable of resynthesizing the trained signals. At the present state the envelope and duration of the synthesized signals may be controlled. Intended further developments have been described briefly. \n\nReferences \n\n[1] C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995. \n\n[2] E. Levin. Hidden control neural architecture modelling of nonlinear time varying systems and its applications. IEEE Transactions on Neural Networks, 4(2):109-116, 1993. \n\n[3] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989. \n\n[4] J. Park and I. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3(2):246-257, 1991. \n\n[5] J. C. Principe and J.-M. Kuo. Dynamic modelling of chaotic time series with neural networks. In G. Tesauro, D. S. Touretzky, and T. Leen, editors, Neural Information Processing Systems 7 (NIPS 94), 1995. \n\n[6] A. Robel. Neural models for estimating Lyapunov exponents and embedding dimension from time series of nonlinear dynamical systems. In Proceedings of the International Conference on Artificial Neural Networks, ICANN'95, Vol. II, pages 533-538, Paris, 1995. \n\n[7] A. Robel. RBF networks for synthesis of speech and music signals. In 3. Workshop Fuzzy-Neuro-Systeme '95, Darmstadt, 1995. Deutsche Gesellschaft fur Informatik e.V. \n\n[8] A. Robel. Scaling properties of neural networks for the prediction of time series. In Proceedings of the 1996 IEEE Workshop on Neural Networks for Signal Processing VI, 1996. \n\n[9] T. Sauer, J. A. Yorke, and M. Casdagli. Embedology. Journal of Statistical Physics, 65(3/4):579-616, 1991. \n\n[10] F. Takens. Detecting strange attractors in turbulence. In D. A. Rand and L. S. Young, editors, Dynamical Systems and Turbulence, Warwick 1980, volume 898 of Lecture Notes in Mathematics, pages 366-381. Springer, Berlin, 1981. \n\n[11] M. Verleysen and K. Hlavackova. An optimized RBF network for approximation of functions. In Proceedings of the European Symposium on Artificial Neural Networks, ESANN'94, 1994. \n\n[12] A. S. Weigend and N. A. Gershenfeld. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1993. \n", "award": [], "sourceid": 1276, "authors": [{"given_name": "Alex", "family_name": "R\u00f6bel", "institution": null}]}