{"title": "Multiple Alignment of Continuous Time Series", "book": "Advances in Neural Information Processing Systems", "page_first": 817, "page_last": 824, "abstract": null, "full_text": " Multiple Alignment of Continuous Time Series\n\n\n\n Jennifer Listgarten, Radford M. Neal, Sam T. Roweis and Andrew Emili\n Department of Computer Science, Banting and Best Department of Medical Research\n and Program in Proteomics and Bioinformatics\n University of Toronto, Toronto, Ontario, M5S 3G4\n {jenn,radford,roweis}@cs.toronto.edu, andrew.emili@utoronto.ca\n\n\n\n Abstract\n\n\n Multiple realizations of continuous-valued time series from a stochastic\n process often contain systematic variations in rate and amplitude. To\n leverage the information contained in such noisy replicate sets, we need\n to align them in an appropriate way (for example, to allow the data to be\n properly combined by adaptive averaging). We present the Continuous\n Profile Model (CPM), a generative model in which each observed time\n series is a non-uniformly subsampled version of a single latent trace, to\n which local rescaling and additive noise are applied. After unsupervised\n training, the learned trace represents a canonical, high resolution fusion\n of all the replicates. As well, an alignment in time and scale of each\n observation to this trace can be found by inference in the model. We\n apply CPM to successfully align speech signals from multiple speakers\n and sets of Liquid Chromatography-Mass Spectrometry proteomic data.\n\n\n\n1 A Profile Model for Continuous Data\n\nWhen observing multiple time series generated by a noisy, stochastic process, large sys-\ntematic sources of variability are often present. For example, within a set of nominally\nreplicate time series, the time axes can be variously shifted, compressed and expanded,\nin complex, non-linear ways. 
Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate.

We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution 'fusion' of the experimental replicates. Figure 1 illustrates the model in action.

[Figure 1: three stacked amplitude-versus-time panels under the title "Unaligned, Linear Warp Alignment and CPM Alignment".]

Figure 1: a) Top: ten replicated speech energy signals as described in Section 4, Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). 
b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM.

2 Defining the Continuous Profile Model (CPM)

The CPM is a generative model for a set of K time series, $x^k = (x^k_1, x^k_2, \ldots, x^k_{N^k})$. The temporal sampling rate within each $x^k$ need not be uniform, nor must it be the same across the different $x^k$. Constraints on the variability of the sampling rate are discussed at the end of this section. For notational convenience, we henceforth assume $N^k = N$ for all k, but this is not a requirement of the model.

The CPM is set up as follows: We assume that there is a latent trace, $z = (z_1, z_2, \ldots, z_M)$, a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N, so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used $M = (2 + \epsilon)N$ (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with $\epsilon < 0.2$. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps.

The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be $\pi^k$. Each state in the state sequence maps to a time state/scale state pair: $\pi^k_i \to \{\tau^k_i, \phi^k_i\}$. Time states belong to the integer set $(1..M)$; scale states belong to an ordered set $(1..Q)$. 
(In our experiments we have used Q = 7 scales, evenly spaced in logarithmic space.) States, $\pi^k_i$, and observation values, $x^k_i$, are related by the emission probability distribution:

$$A_{\pi^k_i}(x^k_i|z) \equiv p(x^k_i|\pi^k_i, z, \sigma, u^k) \equiv \mathcal{N}(x^k_i;\; z_{\tau^k_i}\phi^k_i u^k,\; \sigma),$$

where $\sigma$ is the noise level of the observed data, and $\mathcal{N}(a; b, c)$ denotes a Gaussian probability density for a with mean b and standard deviation c. The $u^k$ are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace.

To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that $T^k_{\pi_{i-1},\pi_i} \equiv p(\pi_i|\pi_{i-1}) = p(\phi_i|\phi_{i-1})\, p^k(\tau_i|\tau_{i-1})$. The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than J time states is enforced. (In our experiments we used J = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials:

$$p^k(\tau_i = a \,|\, \tau_{i-1} = b) = \begin{cases} d^k_1, & \text{if } a - b = 1 \\ d^k_2, & \text{if } a - b = 2 \\ \;\;\vdots \\ d^k_J, & \text{if } a - b = J \\ 0, & \text{otherwise} \end{cases}$$

$$p(\phi_i = a \,|\, \phi_{i-1} = b) = \begin{cases} s^0, & \text{if } D(a, b) = 0 \\ s^1, & \text{if } D(a, b) = 1 \\ s^1, & \text{if } D(a, b) = -1 \\ 0, & \text{otherwise} \end{cases}$$

where D(a, b) = 1 means that a is one scale state larger than b, D(a, b) = -1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. 
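The generative process just specified, advancing the time state by at most J positions, moving the scale state at most one neighbouring scale, and emitting a Gaussian value around the scaled latent trace, can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the latent trace, the dimensions, and the log-spaced scale values below are all toy assumptions, and for brevity the jumps are drawn uniformly rather than from learned multinomials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions mirroring the paper's setup: latent trace of length M,
# an observed series of length N, Q scale states, maximum time jump J.
M, N, Q, J = 40, 20, 7, 3
z = np.sin(np.linspace(0, 3 * np.pi, M)) + 2.0    # a made-up latent trace
phi = np.exp(np.linspace(-0.5, 0.5, Q))           # Q scales, even in log space
sigma, u_k = 0.05, 1.0                            # noise level, per-series scale

# Sample one observed series: the time state advances by 1..J positions,
# the scale state moves to a neighbouring scale (or stays), and each step
# emits x_i ~ N(z[tau] * phi[s] * u_k, sigma).
tau, s = 0, Q // 2
x = np.empty(N)
for i in range(N):
    x[i] = rng.normal(z[tau] * phi[s] * u_k, sigma)
    tau = min(tau + int(rng.integers(1, J + 1)), M - 1)   # time must move forward
    s = int(np.clip(s + rng.integers(-1, 2), 0, Q - 1))   # neighbouring scales only
```

A full implementation would draw the time jumps from the per-series multinomial $d^k_v$ and the scale moves from $(s^0, s^1)$, which is what the model learns during training.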
The distributions are constrained by $\sum_{v=1}^{J} d^k_v = 1$ and $2s^1 + s^0 = 1$.

J determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by $\min(J, \frac{M}{N})$.

After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar.

3 Training with the Expectation-Maximization (EM) Algorithm

As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]) is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over states for each time point of every observed time series, $\gamma^k_s(i) \equiv p(\pi_i = s|x^k)$, and also the pairwise state posteriors, $\xi^k_{s,t}(i) \equiv p(\pi_{i-1} = s, \pi_i = t|x^k)$. The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters as detailed below.

Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of K observed time series, $x^k$, is given by $L_P \equiv L + P$. L is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. 
It is composed of the emission and state transition terms. P is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are:

$$L \equiv \sum_{k=1}^{K}\left[\log p(\pi_1) + \sum_{i=1}^{N} \log A_{\pi_i}(x^k_i|z) + \sum_{i=2}^{N} \log T^k_{\pi_{i-1},\pi_i}\right] \qquad (1)$$

$$P \equiv -\lambda \sum_{j=1}^{M-1} (z_{j+1} - z_j)^2 + \sum_{k=1}^{K}\left[\log \mathcal{D}(d^k_v|\{\zeta^k_v\}) + \log \mathcal{D}(s^v|\{\eta^v\})\right] \qquad (2)$$

where $p(\pi_1)$ are priors over the initial states. The first term in Equation 2 is a smoothing penalty on the latent trace, with $\lambda$ controlling the amount of smoothing. $\zeta^k_v$ and $\eta^v$ are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. For the time state transitions, $v \in \{1, \ldots, J\}$ and $\zeta^k_v$ corresponds to the pseudo-count data for the parameters $d^k_1, d^k_2, \ldots, d^k_J$. For the scale state transitions, $v \in \{0, 1\}$ and $\eta^v$ corresponds to the pseudo-count data for the parameters $s^0$ and $s^1$.

Letting S be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is:

$$\langle L_P \rangle = P + \sum_{k=1}^{K}\sum_{s=1}^{S} \gamma^k_s(1) \log T^k_{0,s} + \sum_{k=1}^{K}\sum_{s=1}^{S}\sum_{i=1}^{N} \gamma^k_s(i) \log A_s(x^k_i|z) + \sum_{k=1}^{K}\sum_{s=1}^{S}\sum_{s'=1}^{S}\sum_{i=2}^{N} \xi^k_{s,s'}(i) \log T^k_{s,s'}$$

using the notation $T^k_{0,s} \equiv p(\pi_1 = s)$, and where $\gamma^k_s(i)$ and $\xi^k_{s,s'}(i)$ are the posteriors over states as defined above. Taking derivatives of this quantity with respect to each of the parameters and finding the critical points provides us with the M-Step update equations. In updating the latent trace z we obtain a system of M simultaneous equations, for $j = 1..M$:

$$0 = \frac{\partial \langle L_P \rangle}{\partial z_j} = \sum_{k=1}^{K} \sum_{\{s|\tau_s = j\}} \sum_{i=1}^{N} \gamma^k_s(i)\, \frac{(x^k_i - z_j \phi_s u^k)\,\phi_s u^k}{\sigma^2} - \lambda(4z_j - 2z_{j-1} - 2z_{j+1})$$

For the boundary cases j = 1 and j = M, the terms $2z_{j-1}$ and $2z_{j+1}$, respectively, drop out. Considering all such equations we obtain a system of M equations in M unknowns. 
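Each of these M equations couples $z_j$ only to its neighbours $z_{j-1}$ and $z_{j+1}$, so the stationarity conditions form a tridiagonal linear system, solvable in O(M) by the classic Thomas algorithm. The following is a minimal sketch; the toy coefficients stand in for the posterior-weighted data terms and smoothing terms of the actual update and are assumptions, not values from the paper.

```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system A z = rhs by forward elimination and
    back substitution; `lower`, `diag`, `upper` hold the sub-, main- and
    super-diagonals of A."""
    n = len(diag)
    c = np.array(upper, dtype=float)   # working copies; inputs are not mutated
    d = np.array(diag, dtype=float)
    b = np.array(rhs, dtype=float)
    for j in range(1, n):
        w = lower[j - 1] / d[j - 1]
        d[j] -= w * c[j - 1]
        b[j] -= w * b[j - 1]
    z = np.empty(n)
    z[-1] = b[-1] / d[-1]
    for j in range(n - 2, -1, -1):
        z[j] = (b[j] - c[j] * z[j + 1]) / d[j]
    return z

# Toy system with the same neighbour structure as the latent-trace update:
# a unit data term plus a second-difference smoothing term with weight lam.
M, lam = 6, 1.0
diag = np.full(M, 1.0 + 4.0 * lam)
diag[0] -= 2.0 * lam      # boundary rows lose one neighbour term
diag[-1] -= 2.0 * lam
lower = np.full(M - 1, -2.0 * lam)
upper = np.full(M - 1, -2.0 * lam)
rhs = np.linspace(0.0, 1.0, M)
z = thomas_solve(lower, diag, upper, rhs)
```

A library banded solver (e.g., `scipy.linalg.solve_banded`, which wraps LAPACK) serves equally well; the point is only that the update is linear-time in M.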
Each equation depends only linearly on three variables from the latent trace. Thus the solution is easily obtained numerically by solving a tridiagonal linear system.

Analytic updates for $\sigma^2$ and $u^k$ are given by:

$$\sigma^2 = \frac{\sum_{k=1}^{K}\sum_{s=1}^{S}\sum_{i=1}^{N} \gamma^k_s(i)\,(x^k_i - z_{\tau_s}\phi_s u^k)^2}{KN}, \qquad u^k = \frac{\sum_{s=1}^{S} z_{\tau_s}\phi_s \sum_{i=1}^{N} \gamma^k_s(i)\,x^k_i}{\sum_{s=1}^{S} (z_{\tau_s}\phi_s)^2 \sum_{i=1}^{N} \gamma^k_s(i)}$$

Lastly, updates for the time and scale state transition probability distributions are given by:

$$d^k_v = \frac{\zeta^k_v + \sum_{s=1}^{S}\sum_{\{s'|\tau_{s'}-\tau_s = v\}}\sum_{i=2}^{N} \xi^k_{s,s'}(i)}{\sum_{j=1}^{J}\zeta^k_j + \sum_{j=1}^{J}\sum_{s=1}^{S}\sum_{\{s'|\tau_{s'}-\tau_s = j\}}\sum_{i=2}^{N} \xi^k_{s,s'}(i)}$$

$$s^v = \frac{\eta^v + \sum_{k=1}^{K}\sum_{s=1}^{S}\sum_{\{s' \in H(s,v)\}}\sum_{i=2}^{N} \xi^k_{s,s'}(i)}{\sum_{j=0}^{1}\eta^j + \sum_{k=1}^{K}\sum_{s=1}^{S}\sum_{\{s' \in H(s,1) \cup H(s,0)\}}\sum_{i=2}^{N} \xi^k_{s,s'}(i)}$$

where $H(s, j) \equiv \{s' \,|\, s'$ is exactly j scale states away from $s\}$. Note that we do not normalize the Dirichlets, and omit the traditional minus one in the exponent: $\mathcal{D}(d^k_v|\{\zeta^k_v\}) = \prod_{v=1}^{J} (d^k_v)^{\zeta^k_v}$ and $\mathcal{D}(s^v|\{\eta^v\}) = \prod_{v=0}^{1} (s^v)^{\eta^v}$.

The M-Step updates for $u^k$, $\sigma$, and z are coupled. Thus we arbitrarily pick an order to update them, and as one is updated, its new values are used in the updates for the coupled parameters that follow it. In our experiments we updated in the following order: $\sigma$, z, $u^k$. The other two parameters, $d^k_v$ and $s^v$, are completely decoupled.

4 Experiments with Laboratory and Speech Data

We have applied the CPM model to analyze several Liquid Chromatography - Mass Spectrometry (LC-MS) data sets from an experimental biology laboratory. Mass spectrometry technology is currently being developed to advance the field of proteomics [2, 3]. A mass spectrometer takes a sample as input, for example, human blood serum, and produces a measure of the abundance of molecules that have particular mass/charge ratios. In proteomics the molecules in question are small protein fragments. From the pattern of abundance values one can hope to infer which proteins are present and in what quantity. 
For protein mixtures that are very complex, such as blood serum, a sample preparation step is used to physically separate parts of the sample on the basis of some property of the molecules, for example, hydrophobicity. This separation spreads out the parts over time so that at each unique time point a less complex mixture is fed into the mass spectrometer. The result is a two-dimensional time series spectrum with mass/charge on one axis and time of input to the mass spectrometer on the other. In our experiments we collapsed the data at each time point to one dimension by summing together abundance values over all mass/charge values. This one-dimensional data is referred to as the Total Ion Count (TIC). We discuss alternatives to this in the last section. After alignment of the TICs, we assessed the alignment of the LC-MS data by looking at both the TIC alignments, and also the corresponding two-dimensional alignments of the non-collapsed data, which is where the true information lies.

The first data set was a set of 13 replicates, each using protein extracted from lysed E. coli cells. Proteins were digested and subjected to capillary-scale LC-MS coupled on-line to an ion trap mass spectrometer. First we trained the model with no smoothing (i.e., $\lambda = 0$) on the 13 replicates. This provided nice alignments when viewed in both the TIC space and the full two-dimensional space. Next we used leave-one-out cross-validation on six of the replicates in order to choose a suitable value for $\lambda$. Because the $u^k$ and $d^k_v$ are time series specific, we ran a restricted EM on the hold-out case to learn these parameters, holding the other parameters fixed at the values found from learning on the training set. Sixteen values of $\lambda$ over five orders of magnitude, and also zero, were used. Note that we did not include the regularization likelihood term in the calculations of hold-out likelihood. 
One of the non-zero values of $\lambda$ was found to be optimal (statistically significant at a p = 0.05 level using a paired sample t-test to compare it to no smoothing). Visually, there did not appear to be a difference between no regularization and the optimal value of $\lambda$, in either the TIC space or the full two-dimensional space. Figure 2 shows the alignments applied to the TICs and also the two-dimensional data, using the optimal value of $\lambda$.

[Figure 2: four panels of alignment plots, including "Unaligned and Aligned Time Series" and "Replicate 5" with strips for the residual, time jumps and scale states; see caption.]

Figure 2: a) Top: 13 replicate pre-processed TICs as described in Section 4, Bottom: same as top, but aligned with CPM (the learned latent trace is also shown). b) The fifth TIC replicate aligned to the learned latent trace (inset shows the original, unaligned). Below are three strips showing, from top-to-bottom, i) the error residual, ii) the number of time states moved between every two states in the Viterbi alignment, and iii) the local scaling applied at each point in the alignment. c) A portion of the two-dimensional LC-MS data from replicates two (in red) and four (in green). d) Same as c), but after alignment (the same one-dimensional alignment was applied to every mass/charge value). Marker lines labeled A to F show how time in c) was mapped to latent time using the Viterbi alignment.

We also trained our model on five different sets of LC-MS data, each consisting of human blood serum. 
We used no smoothing and found the results visually similar in quality to the first data set.

To ensure convergence to a good local optimum and to speed up training, we pre-processed the LC-MS data set by coarsely aligning and scaling each time series as follows: we 1) translated each time series so that the center of mass of each time series was aligned to the median center of mass over all time series, and 2) scaled the abundance values such that the sum of abundance values in each time series was equal to the median sum of abundance values over all time series.

We also used our model to align 10 speech signals, each an utterance of the same sentence spoken by a different speaker. The short-time energy (using a 30ms Hanning window) was computed every 8ms for each utterance and the resulting vectors were used as the input to CPM for alignment. The smoothing parameter $\lambda$ was set to zero. For comparison, we also performed a linear warping of time with an offset (i.e., each signal was translated so as to start at the same time, and the length of each signal was stretched or compressed so as to occupy the same amount of time). Figure 1 shows the successful alignment of the speech signals by CPM and also the (unsuccessful) linear warp. Audio for this example can be heard at http://www.cs.toronto.edu/~jenn/alignmentStudy, which also contains some supplemental figures for the paper.

Initialization for EM training was performed as follows: $\sigma$ was set to 15% of the difference between the maximum and minimum values of the first time series. The latent trace was initialized to be the first observed time series, with Gaussian, zero-mean noise added, with standard deviation equal to $\sigma$. This was then upsampled by a factor of two by repeating every value twice in a row. The additional slack at either end of the latent trace was set to be the minimum value seen in the given time series. 
The $u^k$ were each set to one, and the multinomial scale and time state transition probabilities were set to be uniform.

5 Related Algorithms and Models

Our proposed CPM has many similarities to Input/Output HMMs (IOHMMs), also called Conditional HMMs [4]. IOHMMs extend standard HMMs [1] by conditioning the emission and transition probabilities on an observed input sequence. Each component of the output sequence corresponds to a particular component of the input. Training of an IOHMM is supervised: a mapping from an observed input sequence to an output target sequence is learned. Our CPM also requires input and thus is also a type of conditional HMM. However, the input is unobserved (but crucially it is shared between all replicates) and hence learning is unsupervised in the CPM. One could also take the alternative view that the CPM is simply an HMM with an extra set of parameters, the latent trace, that affect the emission probabilities and which are learned by the model.

The CPM is similar in spirit to Profile HMMs, which have been used with great success for discrete, multiple sequence alignment, modeling of protein families and their conserved structures, and gene finding [5], among other applications. Profile HMMs are HMMs augmented by constrained-transition 'Delete' and 'Insert' states, with the former emitting no observations. Multiple sequences are provided to the Profile HMM during training and a summary of their shared statistical properties is contained in the resulting model. The development of Profile HMMs has provided a robust, statistical framework for reasoning about sets of related discrete sequence data. 
We put forth the CPM as a continuous-data, conditional analogue.

Many algorithms currently used for aligning continuous time series data are variations of Dynamic Time Warping (DTW) [6], a dynamic programming based approach which originated in the speech recognition community as a robust distance measure between two time series. DTW works on pairs of time series, aligning one time series to a specified reference time series. DTW does not take into account systematic variations in the amplitude of the signal. Our CPM can be viewed as a rich and robust extension of DTW that can be applied to many time series in parallel and which automatically uncovers the underlying template of the data.

6 Discussion and Conclusion

We have introduced a generative model for sets of continuous time series data. By training this model one can leverage information contained in noisy, replicated experimental data, and obtain a single, superior resolution 'fusion' of the data. We demonstrated successful use of this model on real data, but note that it could be applied to a wide range of problems involving time signals, for example, alignment of gene expression time profiles, alignment of temporal physiological signals, and alignment of motion capture data, to name but a few.

Certain assumptions of the model presented here may be violated under different experimental conditions. For example, the Gaussian emission probabilities treat errors in large amplitudes in the same absolute terms as in smaller amplitudes, whereas in reality, it may be that the error scales with signal amplitude. Similarly, the penalty term $-\lambda\sum_{j=1}^{M-1}(z_{j+1} - z_j)^2$ does not scale with the amplitude; this might result in the model arbitrarily preferring a lower amplitude latent trace. 
(However, in practice, we did not find this to be a problem.)

One immediate and straightforward extension to the model would be to allow the data at each time point to be a multi-dimensional feature vector rather than a scalar value. This could easily be realized by allowing the emission probabilities to be multi-dimensional. In this way a richer set of information could be used: either the raw, multi-dimensional feature vectors, or some transformation of them, for example, by Principal Components Analysis. The rest of the model would be unchanged and each feature vector would move as a coherent piece. However, it might also be useful to allow different dimensions of the feature vector to be aligned differently. For example, with the LC-MS data, this might mean allowing different mass/charge peptides to be aligned differently at each time point. However, in its full generality, such a task would be extremely computationally intensive.

A perhaps more interesting extension is to allow the model to work with non-replicate data. For example, suppose one had a set of LC-MS experiments from a set of cancer patients, and also a set from normal persons. It would be desirable to align the whole set of time series and also to have the model tease out the differences between them. One approach is to consider the model to be semi-supervised: the model is told the class membership of each training example. Then each class is assigned its own latent trace, and a penalty is introduced for any disagreements between the latent traces. Care needs to be taken to ensure that the penalty plateaus after a certain amount of disagreement between latent trace points, so that parts of the latent trace which are truly different are able to whole-heartedly disagree. 
Assuming that the time resolution in the observed time series is sufficiently high, one might also want to encourage the amount of disagreement over time to be Markovian. That is, if the previous time point disagreed with the other latent traces, then the current point should be more likely to disagree.

References

[1] Alan B. Poritz. Hidden Markov models: A guided tour. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7-13. Morgan Kaufmann, 1988.

[2] Ruedi Aebersold and Matthias Mann. Mass spectrometry-based proteomics. Nature, 422:198-207, 2003.

[3] P. Kearney and P. Thibault. Bioinformatics meets proteomics - bridging the gap between mass spectrometry data analysis and cell biology. Journal of Bioinformatics and Computational Biology, 1:183-200, 2003.

[4] Yoshua Bengio and Paolo Frasconi. An input output HMM architecture. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 427-434. The MIT Press, 1995.

[5] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, 2000.

[6] H. Sakoe and S. Chiba. Dynamic programming algorithm for spoken word recognition. Readings in Speech Recognition, pages 159-165, 1990.
", "award": [], "sourceid": 2721, "authors": [{"given_name": "Jennifer", "family_name": "Listgarten", "institution": null}, {"given_name": "Radford", "family_name": "Neal", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}, {"given_name": "Andrew", "family_name": "Emili", "institution": null}]}