{"title": "A Dynamic HMM for On-line Segmentation of Sequential Data", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 800, "abstract": null, "full_text": "ADynamic HMM for On-line \nSegmentation of Sequential Data \n\nJens Kohlmorgen* \nFraunhofer FIRST.IDA \n\nKekulestr. 7 \n\n12489 Berlin, Germany \njek@first\u00b7fraunhofer.de \n\nSteven Lemm \n\nFraunhofer FIRST.IDA \n\nKekulestr. 7 \n\n12489 Berlin, Germany \nlemm @first\u00b7fraunhofer.de \n\nAbstract \n\nWe propose a novel method for the analysis of sequential data \nthat exhibits an inherent mode switching. In particular, the data \nmight be a non-stationary time series from a dynamical system \nthat switches between multiple operating modes. Unlike other ap(cid:173)\nproaches, our method processes the data incrementally and without \nany training of internal parameters. We use an HMM with a dy(cid:173)\nnamically changing number of states and an on-line variant of the \nViterbi algorithm that performs an unsupervised segmentation and \nclassification of the data on-the-fly, i.e. the method is able to pro(cid:173)\ncess incoming data in real-time. The main idea of the approach is \nto track and segment changes of the probability density of the data \nin a sliding window on the incoming data stream. The usefulness \nof the algorithm is demonstrated by an application to a switching \ndynamical system. \n\n1 \n\nIntroduction \n\nAbrupt changes can occur in many different real-world systems like, for example, \nin speech, in climatological or industrial processes, in financial markets, and also \nin physiological signals (EEG/MEG). Methods for the analysis of time-varying dy(cid:173)\nnamical systems are therefore an important issue in many application areas. In [12], \nwe introduced the annealed competition of experts method for time series from non(cid:173)\nlinear switching dynamics, related approaches were presented, e.g., in [2, 6, 9, 14]. 
\nFor a brief review of some of these models see [5], a good introduction is given in \n[3]. \nWe here present a different approach in two respects. First, the segmentation does \nnot depend on the predictability of the system. Instead, we merely estimate the \ndensity distribution of the data and track its changes. This is particularly an im(cid:173)\nprovement for systems where data is hard to predict, like, for example, EEG record(cid:173)\nings [7] or financial data. Second, it is an on-line method. An incoming data stream \nis processed incrementally while keeping the computational effort limited by a fixed \n\n\u2022 http://www.first.fraunhofer.de/..-.jek \n\n\fupper bound, i.e. the algorithm is able to perpetually segment and classify data \nstreams with a fixed amount of memory and CPU resources. It is even possible to \ncontinuously monitor measured data in real-time, as long as the sampling rate is \nnot too high.l The main reason for achieving a high on-line processing speed is the \nfact that the method, in contrast to the approaches above, does not involve any \ntraining, i.e. iterative adaptation of parameters. Instead, it optimizes the segmen(cid:173)\ntation on-the-fly by means of dynamic programming [1], which thereby results in an \nautomatic correction or fine-tuning of previously estimated segmentation bounds. \n\n2 The segmentation algorithm \n\nWe consider the problem of continuously segmenting a data stream on-line and \nsimultaneously labeling the segments. The data stream is supposed to have a se(cid:173)\nquential or temporal structure as follows: it is supposed to consist of consecutive \nblocks of data in such a way that the data points in each block originate from \nthe same underlying distribution. The segmentation task is to be performed in an \nunsupervised fashion, i.e. without any a-priori given labels or segmentation bounds. \n\n2.1 Using pdfs as features for segmentation \n\nConsider Yl, Y2 , Y3, ... 
, with y_t ∈ R^n, an incoming data stream to be analyzed. The sequence might already have passed a pre-processing step like filtering or sub-sampling, as long as this can be done on the fly in an on-line scenario. As a first step of further processing, it might then be useful to exploit an idea from dynamical systems theory and embed the data into a higher-dimensional space, which aims to reconstruct the state space of the underlying system,

    x_t = (y_t, y_{t-τ}, ..., y_{t-(m-1)τ}).    (1)

The parameter m is called the embedding dimension and τ is called the delay parameter of the embedding. The dimension of the vectors x_t thus is d = m·n. The idea behind embedding is that the measured data might be a potentially non-linear projection of the system's state or phase space. In any case, an embedding in a higher-dimensional space might help to resolve structure in the data, a property which is exploited, e.g., in scatter plots. After the embedding step one might perform a sub-sampling of the embedded data in order to reduce the amount of data for real-time processing.² Next, we want to track the density distribution of the embedded data and therefore estimate the probability density function (pdf) in a sliding window of length W. We use a standard density estimator with multivariate Gaussian kernels [4] for this purpose, centered on the data points³ in the window, {x̃_{t-w}}_{w=0}^{W-1},

    p_t(x) = (1/W) Σ_{w=0}^{W-1} (2πσ²)^{-d/2} exp( -(x - x̃_{t-w})² / (2σ²) ).    (2)

The kernel width σ is a smoothing parameter and its value is important to obtain a good representation of the underlying distribution. We propose to choose σ proportional to the mean distance of each x̃_t to its first d nearest neighbors, averaged over a sample set {x̃_t}. 
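The embedding of eq. (1) and the sliding-window density estimate of eq. (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation (which was MATLAB/C); the function names `embed` and `window_pdf` are our own.

```python
import numpy as np

def embed(y, m, tau):
    """Time-delay embedding, eq. (1): x_t = (y_t, y_{t-tau}, ..., y_{t-(m-1)tau}).
    y has shape (T, n); the result has shape (T - (m-1)*tau, m*n)."""
    T = len(y)
    start = (m - 1) * tau
    return np.stack([np.concatenate([y[t - j * tau] for j in range(m)])
                     for t in range(start, T)])

def window_pdf(window, sigma):
    """Gaussian kernel density estimate of eq. (2) for one window of
    embedded points (shape (W, d)); returns the pdf as a callable."""
    W, d = window.shape
    norm = (2 * np.pi * sigma**2) ** (d / 2)
    def p(x):
        sq = np.sum((window - x) ** 2, axis=1)   # squared distances to kernels
        return np.mean(np.exp(-sq / (2 * sigma**2))) / norm
    return p
```

In an on-line setting, each new embedded vector shifts the window by one, yielding a new pdf per time step.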
\n\n1 In our reported application we can process data at 1000 Hz (450 Hz including display) \non a 1.33 GHz PC in MATLAB/C under Linux, which we expect is sufficient for a large \nnumber of applications. \n\n2In that case, our further notation of time indices would refer to the subsampled data. \n3We use if to denote a specific vector-valued point and x to denote a vector-valued \n\nvariable. \n\n\f2.2 Similarity of two pdfs \n\nOnce we have sampled enough data points to compute the first pdf according to \neq. (2), we can compute a new pdf with each new incoming data point. In order \nto quantify the difference between two such functions, f and g, we use the squared \nL2-Norm, also called integrated squared error (ISE) , d(f, g) = J(f - g)2 dx , which \ncan be calculated analytically if f and 9 are mixtures of Gaussians as in our case \nof pdfs estimated from data windows, \n\n(3) \n\n2.3 The HMM in the off-line case \n\nBefore we can discuss the on-line variant, it is necessary to introduce the HMM and \nthe respective off-line algorithm first. For a given a data sequence, {X'dT=l' we can \nobtain the corresponding sequence of pdfs {Pt(X)}tES, S = {W, ... , T}, according \nto eq. (2). We now construct a hidden Markov model (HMM) where each of these \npdfs is represented by a state s E S, with S being the set of states in the HMM. \nFor each state s, we define a continuous observation probability distribution, \n\n( ( ) I ) -\nPPt X s-~ exp\n\n1 \n\nV 21f <; \n\n-\n\n( d(Ps(X),Pt(x))) \n\n22 \n<; \n\n' \n\n(4) \n\nfor observing a pdf Pt(x) in state s. Next, the initial state distribution {1fsLES \nof the HMM is given by the uniform distribution, 1fs = liN, with N = lSI being \nthe number of states. Thus, each state is a-priori equally probable. The HMM \ntransition matrix, A = (PijkjES, determines each probability to switch from a \nstate Si to a state Sj. 
Our aim is to find a representation of the given sequence of pdfs in terms of a sequence of a small number of representative pdfs, which we call prototypes, and which moreover exhibits only a small number of prototype changes. We therefore define A in such a way that transitions to the same state are k times more likely than transitions to any of the other states,

    p_{ij} = { k / (k + N - 1),  if i = j;
               1 / (k + N - 1),  if i ≠ j. }    (5)

This completes the definition of our HMM. Note that this HMM has only two free parameters, k and ς. The well-known Viterbi algorithm [13] can now be applied to the above HMM in order to compute the optimal, i.e. the most likely, state sequence of prototype pdfs that might have generated the given sequence of pdfs. This state sequence represents the segmentation we are aiming at. We can compute the most likely state sequence more efficiently in terms of costs, c = -log(p), instead of probabilities p, i.e. instead of computing the maximum of the likelihood function L, we compute the minimum of the cost function -log(L), which yields the optimal state sequence as well. In this way we can replace products by sums and avoid numerical problems [13]. In addition, we can further simplify the computation for the special case of our particular HMM architecture, which finally results in the following algorithm: For each time step, t = W, ..., T, we compute for all s ∈ S the cost c_s(t) of the optimal state sequence from W to t, subject to the constraint that it ends in state s at time t. We call these constrained optimal sequences c-paths and the unconstrained optimum the o*-path. 
The iteration can be formulated as follows, with d_{s,t} being shorthand for d(p_s(x), p_t(x)) and δ_{s,s'} denoting the Kronecker delta function:

Initialization, ∀s ∈ S:
    c_s(W) := d_{s,W}.    (6)

Induction, ∀s ∈ S:
    c_s(t) := d_{s,t} + min_{s'∈S} { c_{s'}(t-1) + C (1 - δ_{s,s'}) },  for t = W+1, ..., T.    (7)

Termination:
    o* := min_{s∈S} { c_s(T) }.    (8)

The regularization constant C, which is given by C = 2ς² log(k) and thus subsumes our two free HMM parameters, can be interpreted as the transition cost for switching to a new state in the path.⁴ The optimal prototype sequence with minimal cost o* for the complete series of pdfs, which is determined in the last step, is obtained by logging and updating the c-paths for all states s during the iteration and finally choosing the one with minimal cost according to eq. (8).

2.4 The on-line algorithm

In order to turn the above segmentation algorithm into an on-line algorithm, we must restrict the incremental update in eq. (7) such that it only uses pdfs (and therewith states) from past data points. We neglect at this stage that memory and CPU resources are limited.

Suppose that we have already processed data up to T-1. When a new data point y_T arrives at time T, we can generate a new embedded vector x_T (once we have sampled enough initial data points for the embedding), we obtain a new pdf p_T(x) (once we have sampled enough embedded vectors x_t for the first pdf window), and thus we are given a new HMM state. We can also readily compute the distances between the new pdf and all previous pdfs, d_{T,t}, t < T, according to eq. (3). A similarly simple and straightforward update of the costs, the c-paths, and the optimal state sequence is only possible, however, if we do not consider potential c-paths that would have contained the new pdf as a prototype in previous segments. In that case we can simply reuse the c-paths from T-1. 
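Before turning to the on-line update, the off-line recursion of eqs. (6)-(8) can be sketched in a few lines. The key simplification of eq. (7) is that min over s' is either "stay in s" (no transition cost) or "switch from the globally cheapest path" (+C). This is an illustrative sketch under our own naming; `D[s, t]` holds the pairwise pdf distances d_{s,t}.

```python
import numpy as np

def offline_segment(D, C):
    """Off-line segmentation, eqs. (6)-(8). D is the (N x N) matrix of pdf
    distances d_{s,t} over states/times, C the transition cost.
    Returns (optimal cost o*, prototype state sequence)."""
    N, T = D.shape
    c = D[:, 0].copy()                        # eq. (6): c_s(W) = d_{s,W}
    back = []                                 # backpointers for path logging
    for t in range(1, T):
        j = int(np.argmin(c))
        cmin = c[j]
        # eq. (7): either stay in s, or switch from the cheapest path (+C)
        back.append(np.where(c <= cmin + C, np.arange(N), j))
        c = D[:, t] + np.minimum(c, cmin + C)
    s = int(np.argmin(c))                     # eq. (8): unconstrained optimum
    path = [s]
    for prev in reversed(back):
        s = int(prev[s])
        path.append(s)
    return float(np.min(c)), path[::-1]
```

On a toy distance matrix with two well-separated modes, the recovered path stays on one prototype per mode and pays C exactly once at the switch.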
The on-line update at time T for these restricted paths, which we henceforth denote with a tilde, can be performed as follows. For T = W, we have c̃_W(W) := o*(W) := d_{W,W} = 0. For T > W:

1. Compute the cost c̃_T(T-1) for the new state s = T at time T-1: For t = W, ..., T-1, compute

    c̃_T(t) := d_{T,t} + { 0,  if t = W;  min { c̃_T(t-1); o*(t-1) + C },  else }    (9)

and update

    o*(t) := c̃_T(t),  if c̃_T(t) < o*(t).    (10)

Here we use all previous optimal segmentations o*(t), so we do not need to keep the complete matrix (c̃_s(t))_{s,t∈S} and repeatedly compute the minimum over all states. However, we must store and update the history of optimal segmentations o*(t).

2. Update from T-1 to T and compute c̃_s(T) for all states s ∈ S obtained so far, and also get o*(T): For s = W, ..., T, compute

    c̃_s(T) := d_{s,T} + min { c̃_s(T-1); o*(T-1) + C }    (11)

and finally get the cost of the optimal path

    o*(T) := min_{s∈S} { c̃_s(T) }.    (12)

⁴ We developed an algorithm that computes an appropriate value for the hyperparameter C from a sample set {x̃_t}. Due to limited space we will present that algorithm in a forthcoming publication [8].

As in the off-line case, the above algorithm only shows the update equations for the costs of the c̃- and o*-paths. The associated state sequences must be logged simultaneously during the computation. Note that this can be done by just storing the sequence of switching points for each path. Moreover, we do not need to keep the full matrix (c̃_s(t))_{s,t∈S} for the update; the most recent column is sufficient.

So far we have presented the incremental version of the segmentation algorithm. This algorithm still needs an amount of memory and CPU time that increases with each new data point. In order to limit both resources to a fixed amount, we must remove old pdfs, i.e. old HMM states, at some point. 
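The two-step update of eqs. (9)-(12) can be sketched as follows. This is our own illustrative sketch: it omits the cut-off (state removal) described next and logs only costs, not the associated switching points; the name `online_step` and the argument layout are assumptions.

```python
import numpy as np

def online_step(c, o, d_col, d_row, C):
    """One on-line update at time T, eqs. (9)-(12), without state removal.
    c:     costs c~_s(T-1) for the existing states s = W..T-1,
    o:     list of optimal costs o*(W)..o*(T-1),
    d_col: distances d_{T,t} for t = W..T-1,
    d_row: distances d_{s,T} for s = W..T (last entry d_{T,T} = 0),
    C:     transition cost. Returns the updated (c, o)."""
    # Step 1, eqs. (9)-(10): run the new state s = T over all past times.
    cT = 0.0
    for i, d in enumerate(d_col):
        cT = d + (0.0 if i == 0 else min(cT, o[i - 1] + C))
        o[i] = min(o[i], cT)                  # eq. (10)
    # Step 2, eqs. (11)-(12): advance every state from T-1 to T.
    c = np.append(c, cT)                      # include c~_T(T-1)
    c = np.asarray(d_row) + np.minimum(c, o[-1] + C)
    o.append(float(np.min(c)))                # eq. (12)
    return c, o
```

Starting from c̃_W(W) = o*(W) = 0, repeated calls reproduce the off-line costs whenever no c-path would have used the newest pdf as a prototype in an earlier segment.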
We propose to do this by discarding all states with time indices smaller than or equal to ŝ each time the path associated with c̃_ŝ(T) in eq. (11) exhibits a switch back from a more recent state/pdf to the currently considered state ŝ as a result of the min-operation in eq. (11). In the above algorithm this can simply be done by setting W := ŝ + 1 in that case, which also allows us to discard the corresponding old c̃_s(t)- and o*(t)-paths, for all s ≤ ŝ and t < ŝ. In addition, the "if t = W" initialization clause in eq. (9) must be ignored after the first such cut, and the o*(W-1)-path must therefore still be kept in order to compute the else-part also for t = W. Moreover, we do not have c̃_T(W-1), and we therefore assume min { c̃_T(W-1); o*(W-1) + C } = o*(W-1) + C in eq. (9).

The explanation for this is as follows: a switch back in eq. (11) indicates that a new data distribution has established itself, such that the c-path ending in a pdf state ŝ from an old distribution routes its path through one of the more recent states that represent the new distribution, which means that this has lower costs despite the incurred additional transition. Vice versa, a newly obtained pdf is then unlikely to properly represent the previous mode, which justifies our above assumption about c̃_T(W-1). The effect of the proposed cut-off strategy is that we discard paths that end in pdfs from old modes but still allow the optimal pdf prototype to be found within the current segment.

Cut-off conditions occur shortly after mode changes in the data and cause the removal of HMM states with pdfs from old modes. However, if no mode change takes place in the incoming data sequence, no states will be discarded. We therefore still need to set a fixed upper limit κ for the number of candidate paths/pdfs that are simultaneously under consideration if we only have limited resources available. 
\nWhen this limit is reached because no switches are detected, we must successively \ndiscard the oldest path/pdf stored, which finally might result in choosing a sub(cid:173)\noptimal prototype for that segment however. Ultimately, a continuous discarding \neven enforces a change of prototypes after 2\", time steps if no switching is induced \nby the data until then. The buffer size\", should therefore be as large as possible. In \nany case, the buffer overflow condition can be recorded along with the segmentation, \nwhich allows us to identify such artificial switchings. \n\n\f2.5 The labeling algorithm \n\nA labeling algorithm is required to identify segments that represent the same un(cid:173)\nderlying distribution and thus have similar pdf prototypes. The labeling algorithm \ngenerates labels for the segments and assigns identical labels to segments that are \nsimilar in this respect. To this end, we propose a relatively simple on-line clustering \nscheme for the prototypes, since we expect the prototypes obtained from the same \nunderlying distribution to be already well-separated from the other prototypes as \na result of the segmentation algorithm. We assign a new label to a segment if the \ndistance of its associated prototype to all preceding prototypes exceeds a certain \nthreshold e, and we assign the existing label of the closest preceding prototype \notherwise. This can be written as \n\nl(R) = { ne.wlabel ,. if min1:'Sr e \n\n(13) \n\n1 (mdexmml:'Sr