{"title": "Analysis of Drifting Dynamics with Neural Network Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 735, "page_last": 741, "abstract": "", "full_text": "Analysis of Drifting Dynamics with Neural Network Hidden Markov Models\n\nJ. Kohlmorgen\nGMD FIRST\nRudower Chaussee 5\n12489 Berlin, Germany\n\nK.-R. M\u00fcller\nGMD FIRST\nRudower Chaussee 5\n12489 Berlin, Germany\n\nK. Pawelzik\nMPI f. Str\u00f6mungsforschung\nBunsenstr. 10\n37073 G\u00f6ttingen, Germany\n\nAbstract\n\nWe present a method for the analysis of nonstationary time series with multiple operating modes. In particular, it is possible to detect and to model both a switching of the dynamics and a less abrupt, time-consuming drift from one mode to another. This is achieved in two steps. First, an unsupervised training method provides prediction experts for the inherent dynamical modes. Then, the trained experts are used in a hidden Markov model that allows drifts to be modeled. An application to physiological wake/sleep data demonstrates that analysis and modeling of real-world time series can be improved when the drift paradigm is taken into account.\n\n1 Introduction\n\nModeling dynamical systems through a measured time series is commonly done by reconstructing the state space with time-delay coordinates [10]. The prediction of the time series can then be accomplished by training neural networks [11]. If, however, a system operates in multiple modes and the dynamics is drifting or switching, standard approaches like multi-layer perceptrons are likely to fail to represent the underlying input-output relations. Moreover, they do not reveal the dynamical structure of the system. Time series from alternating dynamics of this type can originate from many kinds of systems in physics, biology and engineering. 
\n\nIn [2, 6, 8], we have described a framework for time series from switching dynamics, in which an ensemble of neural network predictors specializes on the respective operating modes. We now extend this framework so that a mode change can be described not only as a switch but, where appropriate, also as a drift from one predictor to another. Our results indicate that physiological signals contain drifting dynamics, which underlines the potential relevance of our method in time series analysis.\n\n2 Detection of Drifts\n\nThe detection and analysis of drifts is performed in two steps. First, an unsupervised (hard-)segmentation method is applied. In this approach, an ensemble of competing prediction experts $f_i$, $i = 1, \\ldots, N$, is trained on a given time series. The optimal choice of function approximators $f_i$ depends on the specific application. In general, however, neural networks are a good choice for the prediction of time series [11]. In this paper, we use radial basis function (RBF) networks of the Moody-Darken type [5] as predictors, because they offer a fast and robust learning method.\n\nUnder a Gaussian assumption, the probability that a particular predictor $i$ would have produced the observed data $y$ is given by\n\n$$p(y | i) = K e^{-\\beta (y - f_i)^2}, \\qquad (1)$$\n\nwhere $K$ is the normalization term for the Gaussian distribution. If we assume that the experts are mutually exclusive and exhaustive, we have $p(y) = \\sum_i p(y | i) \\, p(i)$. We further assume that the experts are a priori equally probable,\n\n$$p(i) = 1/N. \\qquad (2)$$\n\nIn order to train the experts, we want to maximize the likelihood that the ensemble would have generated the time series. This can be done by a gradient method. 
For the derivative of the log-likelihood $\\log L = \\log(p(y))$ with respect to the output of an expert, we get\n\n$$\\frac{\\partial \\log L}{\\partial f_i} \\propto \\left[ \\frac{e^{-\\beta (y - f_i)^2}}{\\sum_j e^{-\\beta (y - f_j)^2}} \\right] (y - f_i). \\qquad (3)$$\n\nThis learning rule can be interpreted as a weighting of the learning rate of each expert by the expert's relative prediction performance. It is a special case of the Mixtures of Experts [1] learning rule, with the gating network being omitted. Note that according to Bayes' rule the term in brackets is the posterior probability that expert $i$ is the correct choice for the given data $y$, i.e. $p(i | y)$. Therefore, we can simply write\n\n$$\\frac{\\partial \\log L}{\\partial f_i} \\propto p(i | y) \\, (y - f_i). \\qquad (4)$$\n\nFurthermore, we imposed a low-pass filter on the prediction errors $\\varepsilon_i = (y - f_i)^2$ and used deterministic annealing of $\\beta$ in the training process (see [2, 8] for details). We found that these modifications can be essential for a successful segmentation and prediction of time series from switching dynamics.\n\nAs a prerequisite of this method, mode changes should occur infrequently, i.e. between two mode changes the dynamics should remain stationary in one mode for a certain number of time steps. Applying this method to a time series yields a (hard) segmentation of the series into different operating modes together with prediction experts for each mode. In case of a drift between two modes, the respective segment tends to be subdivided into several parts, because a single predictor is not able to handle the nonstationarity.\n\nThe second step takes the drift into account. A segmentation algorithm is applied that models drifts between two stationary modes by combining the two respective predictors, $f_i$ and $f_j$. The drift is modeled by a weighted superposition\n\n$$f(\\mathbf{x}_t) = a(t) \\, f_i(\\mathbf{x}_t) + (1 - a(t)) \\, f_j(\\mathbf{x}_t), \\qquad (5)$$\n\nwhere $a(t)$ is a mixing coefficient and $\\mathbf{x}_t = (x_t, x_{t-\\tau}, \\ldots, x_{t-(m-1)\\tau})^T$ is the vector of time-delay coordinates of a (scalar) time series $\\{x_t\\}$. 
Furthermore, $m$ is the embedding dimension and $\\tau$ is the delay parameter of the embedding. Note that the use of multivariate time series is straightforward.\n\n3 A Hidden Markov Model for Drift Segmentation\n\nIn the following, we will set up a hidden Markov model (HMM) that allows us to use the Viterbi algorithm for the analysis of drifting dynamics. For a detailed description of HMMs, see [9] and the references therein. An HMM consists of (1) a set $S$ of states, (2) a matrix $A = \\{p_{s',s}\\}$ of state transition probabilities, (3) an observation probability distribution $p(y | s)$ for each state $s$, which is a continuous density in our case, and (4) the initial state distribution $\\pi = \\{\\pi_s\\}$.\n\nLet us first consider the construction of $S$, the set of states, which is the crucial point of this approach. Consider a set $P$ of 'pure' states (dynamical modes). Each state $s \\in P$ represents one of the neural network predictors $f_{k(s)}$ trained in the first step. The predictor of each state performs the predictions autonomously. Next, consider a set $M$ of mixture states, where each state $s \\in M$ represents a linear mixture of two nets $f_{i(s)}$ and $f_{j(s)}$. Then, given a state $s \\in S$, $S = P \\cup M$, the prediction of the overall system is performed by\n\n$$g_s(\\mathbf{x}_t) = \\begin{cases} f_{k(s)}(\\mathbf{x}_t) & \\text{if } s \\in P \\\\ a(s) \\, f_{i(s)}(\\mathbf{x}_t) + b(s) \\, f_{j(s)}(\\mathbf{x}_t) & \\text{if } s \\in M \\end{cases} \\qquad (6)$$\n\nFor each mixture state $s \\in M$, the coefficients $a(s)$ and $b(s)$ have to be set together with the respective network indices $i(s)$ and $j(s)$. For computational feasibility, the number of mixture states has to be restricted. Our intention is to allow for drifts between any two network outputs of the previously trained ensemble. We choose $a(s)$ and $b(s)$ such that $0 < a(s) < 1$ and $b(s) = 1 - a(s)$. Moreover, a discrete set of $a(s)$ values has to be defined. For simplicity, we use equally distant steps,\n\n$$a_r = \\frac{r}{R + 1}, \\quad r = 1, \\ldots, R. \\qquad (7)$$\n\n$R$ is the number of intermediate mixture levels. 
A given resolution $R$ between any two out of $N$ nets yields a total number of mixture states $|M| = R \\cdot N \\cdot (N - 1)/2$. If, for example, the resolution $R = 32$ is used and we assume $N = 8$, then there are $|M| = 896$ mixture states, plus $|P| = N = 8$ pure states.\n\nNext, the transition matrix $A = \\{p_{s',s}\\}$ has to be chosen. It determines the transition probability for each pair of states. In principle, this matrix can be found using a training procedure, e.g. the Baum-Welch method [9]. However, this is hardly feasible in this case because of the immense size of the matrix. In the above example, the matrix $A$ has $(896 + 8)^2 = 817216$ elements that would have to be estimated. Such an excessive number of free parameters is prohibitive for any adaptive method. Therefore, we use a fixed matrix. In this way, prior knowledge about the dynamical system can be incorporated. In our applications, either switches or smooth drifts between two nets are allowed, in such a way that a (monotonic) drift from one net to another is a priori as likely as a switch. All other transitions are disabled by setting $p_{s',s} = 0$. Defining $p(y | s)$ and $\\pi$ is straightforward. Following eq. (1) and eq. (2), we assume Gaussian noise\n\n$$p(y | s) = K e^{-\\beta (y - g_s)^2}, \\qquad (8)$$\n\nand equally probable initial states, $\\pi_s = |S|^{-1}$.\n\nThe Viterbi algorithm [9] can then be applied to the above HMM without any further training of the HMM parameters. It yields the drift segmentation of a given time series, i.e. the most likely state sequence (the sequence of predictors or linear mixtures of two predictors) that could have generated the time series, in our case under the assumption that mode changes occur either as (smooth) drifts or as infrequent switches. 
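\n\nAs a sketch of this decoding step, the standard Viterbi recursion [9] can be written in the notation of this section. The auxiliary quantities $\\delta_t(s)$ and $\\psi_t(s)$ and the series length $t_{\\max}$ are introduced here only for illustration, $p_{s',s}$ is taken to denote the probability of a transition from state $s'$ to state $s$, and $y_t$ is the observation at time step $t$:\n\n$$\\delta_1(s) = \\pi_s \\, p(y_1 | s), \\qquad \\delta_t(s) = \\max_{s' \\in S} \\left[ \\delta_{t-1}(s') \\, p_{s',s} \\right] p(y_t | s), \\qquad \\psi_t(s) = \\arg\\max_{s' \\in S} \\delta_{t-1}(s') \\, p_{s',s}, \\quad t = 2, \\ldots, t_{\\max}.$$\n\nThe most likely state sequence would then be recovered by backtracking, $s^*_{t_{\\max}} = \\arg\\max_{s \\in S} \\delta_{t_{\\max}}(s)$ and $s^*_t = \\psi_{t+1}(s^*_{t+1})$ for $t = t_{\\max} - 1, \\ldots, 1$. Since the fixed matrix $A$ sets most $p_{s',s}$ to zero, the maximization presumably only needs to run over the few permitted predecessors of each state, and working with $\\log \\delta_t(s)$ avoids numerical underflow on long series.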
\n\n4 Drifting Mackey-Glass Dynamics\n\nAs an example, consider a high-dimensional chaotic system generated by the Mackey-Glass delay differential equation\n\n$$\\frac{dx(t)}{dt} = -0.1 \\, x(t) + \\frac{0.2 \\, x(t - t_d)}{1 + x(t - t_d)^{10}}. \\qquad (9)$$\n\nIt was originally introduced as a model of blood cell regulation [4]. Two stationary operating modes, A and B, are established by using different delays, $t_d = 17$ and $23$, respectively. After operating 100 time steps in mode A (with respect to a subsampling step size $\\tau = 6$), the dynamics is drifting to mode B. The drift takes another 100 time steps. It is performed by mixing the equations for $t_d = 17$ and $23$ during the integration of eq. (9). The mixture is generated according to eq. (5), using an exponential drift\n\n$$a(t) = \\exp\\left( \\frac{-4t}{100} \\right), \\quad t = 1, \\ldots, 100. \\qquad (10)$$\n\nThen, the system runs stationary in mode B for the following 100 time steps, whereupon it is switching back to mode A at $t = 300$, and the loop starts again (Fig. 1(a)).\n\nThe competing experts algorithm is applied to the first 1500 data points of the generated time series, using an ensemble of 6 predictors $f_i(\\mathbf{x}_t)$, $i = 1, \\ldots, 6$. The input to each predictor is a vector $\\mathbf{x}_t$ of time-delay coordinates of the scalar time series $\\{x_t\\}$. The embedding dimension is $m = 6$ and the delay parameter is $\\tau = 1$ on the subsampled data. The RBF predictors consist of 40 basis functions each.\n\nAfter training, nets 2 and 3 have specialized on mode A, nets 5 and 6 on mode B. This is depicted in the drift segmentation in Fig. 1(b). Moreover, the removal of four nets does not increase the root mean squared error (RMSE) of the prediction significantly (Fig. 1(c)), which correctly indicates that two predictors completely describe the dynamical system. 
The sequence of nets to be removed is obtained by repeatedly computing the RMSE of all $n$ subsets with $n - 1$ nets each, and then selecting the subset with the lowest RMSE of the respective drift segmentation. The segmentation of the remaining nets, 2 and 5, nicely reproduces the evolution of the dynamics, as seen in Fig. 1(d).\n\nFigure 1: (a) One 'loop' of the drifting Mackey-Glass time series (see text). (b) The resulting drift segmentation invokes four nets. The dotted line indicates the evolution of the mixing coefficient $a(t)$ of the respective nets. For example, between $t = 100$ and $200$ it denotes a drift from net 3 to net 5, which appears to be exponential. (c) Increase of the prediction error when predictors are successively removed. (d) The two remaining predictors model the dynamics of the time series properly.\n\n5 Wake/Sleep EEG\n\nIn [7], we analyzed physiological data recorded from the wake/sleep transition of a human. The objective was to provide an unsupervised method to detect the sleep onset and to give a detailed approximation of the signal dynamics with a high time resolution, ultimately to be used in diagnosis and treatment of sleep disorders. 
The application of the drift segmentation algorithm now yields a more detailed modeling of the dynamical system.\n\nAs an example, Fig. 2 shows a comparison of the drift segmentation ($R = 32$) with a manual segmentation by a medical expert. The experimental data was measured during an afternoon nap of a healthy human. The computer-based analysis is performed on a single-channel EEG recording (occipital-1), whereas the manual segmentation was worked out using several physiological recordings (EEG, EOG, ECG, heart rate, blood pressure, respiration).\n\nFigure 2: Comparison of the drift segmentation obtained by the algorithm (upper plot), and a manual segmentation by a medical expert (middle). Only a single-channel EEG recording (occipital-1, time resolution 0.1 s) of an afternoon nap is given for the algorithmic approach, while the manual segmentation is based on all available measurements. In the manual analysis, W1 and W2 indicate two wake states (eyes open/closed), and S1 and S2 indicate sleep stage I and II, respectively. (n.a.: no assessment, art.: artifacts)\n\nThe two-step drift segmentation method was applied using 8 RBF networks. However, as shown in Fig. 2, three nets (4, 6, and 8) are finally found by the Viterbi algorithm to be sufficient to represent the most likely state sequence. Before the sleep onset, at $t \\approx 3500$ (350 s) in the manual analysis, a mixture of two wake-state nets, 6 and 8, performs the best reconstruction of the EEG dynamics. Then, at $t = 3000$ (300 s), there starts a drift to net 4, which apparently represents the dynamics of sleep stage II (S2). 
Interestingly, sleep stage I (S1) is not represented by a separate net but by a linear mixture of net 4 and net 6, with much more weight on net 4. Thus, the process of falling asleep is represented as a drift from the state of being awake directly to sleep stage II.\n\nDuring sleep there are several wake-up spikes indicated in the manual segmentation. At least the last four are also clearly indicated in the drift segmentation, as drifts back to net 6. Furthermore, the detection of the final arousal after $t = 12000$ (1200 s) is in good accordance with the manual segmentation: there is a fast drift back to net 6 at that point.\n\nConsidering the fact that our method is based only on the recording of a single EEG channel and does not use any medical expert knowledge, the drift algorithm is in remarkable accordance with the assessment of the medical expert. Moreover, it resolves the dynamical structure of the signal in more detail. For a more comprehensive analysis of wake/sleep data, we refer to our forthcoming publication [3].\n\n6 Summary and Discussion\n\nWe presented a method for the unsupervised segmentation and identification of nonstationary drifting dynamics. It applies to time series where the dynamics is drifting or switching between different operating modes. An application to physiological wake/sleep data (EEG) demonstrates that drift can be found in natural systems. It is therefore important to consider this aspect of data description.\n\nIn the case of wake/sleep data, where the physiological state transitions are far from being understood, we can extract the shape of the dynamical drift from wake to sleep in an unsupervised manner. By applying this new data analysis method, we hope to gain more insights into the underlying physiological processes. 
Our future work is therefore dedicated to a comprehensive analysis of large sets of physiological wake/sleep recordings. We expect, however, that our method will also be applicable in many other fields.\n\nAcknowledgements: We acknowledge support of the DFG (grant Ja379/51) and we would like to thank J. Rittweger for the EEG data and for fruitful discussions.\n\nReferences\n\n[1] Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E. (1991). Adaptive Mixtures of Local Experts, Neural Computation 3, 79-87.\n\n[2] Kohlmorgen, J., M\u00fcller, K.-R., Pawelzik, K. (1995). Improving short-term prediction with competing experts. ICANN'95, EC2 & Cie, Paris, 2:215-220.\n\n[3] Kohlmorgen, J., M\u00fcller, K.-R., Rittweger, J., Pawelzik, K., in preparation.\n\n[4] Mackey, M., Glass, L. (1977). Oscillation and Chaos in a Physiological Control System, Science 197, 287.\n\n[5] Moody, J., Darken, C. (1989). Fast Learning in Networks of Locally-Tuned Processing Units. Neural Computation 1, 281-294.\n\n[6] M\u00fcller, K.-R., Kohlmorgen, J., Pawelzik, K. (1995). Analysis of Switching Dynamics with Competing Neural Networks, IEICE Trans. on Fundamentals of Electronics, Communications and Computer Sc., E78-A, No. 10, 1306-1315.\n\n[7] M\u00fcller, K.-R., Kohlmorgen, J., Rittweger, J., Pawelzik, K. (1995). Analysing Physiological Data from the Wake-Sleep State Transition with Competing Predictors, NOLTA'95: Symposium on Nonlinear Theory and its Appl., 223-226.\n\n[8] Pawelzik, K., Kohlmorgen, J., M\u00fcller, K.-R. (1996). Annealed Competition of Experts for a Segmentation and Classification of Switching Dynamics, Neural Computation, 8:2, 342-358.\n\n[9] Rabiner, L.R. (1988). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Readings in Speech Recognition, ed. A. Waibel, K. Lee, 267-296. San Mateo: Morgan Kaufmann, 1990.\n\n[10] Takens, F. (1981). 
Detecting Strange Attractors in Turbulence. In: Rand, D., Young, L.-S. (Eds.), Dynamical Systems and Turbulence, Springer Lecture Notes in Mathematics, 898, 366.\n\n[11] Weigend, A.S., Gershenfeld, N.A. (Eds.) (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley.\n", "award": [], "sourceid": 1350, "authors": [{"given_name": "Jens", "family_name": "Kohlmorgen", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Klaus", "family_name": "Pawelzik", "institution": null}]}