{"title": "Joint Probabilistic Curve Clustering and Alignment", "book": "Advances in Neural Information Processing Systems", "page_first": 473, "page_last": 480, "abstract": null, "full_text": " Joint Probabilistic Curve Clustering and\n Alignment\n\n\n\n Scott Gaffney and Padhraic Smyth\n School of Information and Computer Science\n University of California, Irvine, CA 92697-3425\n {sgaffney,smyth}@ics.uci.edu\n\n\n Abstract\n\n Clustering and prediction of sets of curves is an important problem in\n many areas of science and engineering. It is often the case that curves\n tend to be misaligned from each other in a continuous manner, either in\n space (across the measurements) or in time. We develop a probabilistic\n framework that allows for joint clustering and continuous alignment of\n sets of curves in curve space (as opposed to a fixed-dimensional feature-\n vector space). The proposed methodology integrates new probabilistic\n alignment models with model-based curve clustering algorithms. The\n probabilistic approach allows for the derivation of consistent EM learn-\n ing algorithms for the joint clustering-alignment problem. Experimental\n results are shown for alignment of human growth data, and joint cluster-\n ing and alignment of gene expression time-course data.\n\n\n1 Introduction\n\nWe introduce a novel methodology for the clustering and prediction of sets of smoothly\nvarying curves while jointly allowing for the learning of sets of continuous curve trans-\nformations. Our approach is to formulate models for both the clustering and alignment\nsub-problems and integrate them into a unified probabilistic framework that allows for the\nderivation of consistent learning algorithms. 
The alignment sub-problem is handled with the introduction of a novel curve alignment procedure employing model priors over the set of possible alignments, leading to the derivation of EM learning algorithms that formalize the so-called Procrustes approach for curve data [1]. These alignment models are then integrated into a finite mixture model setting in which the clustering is carried out. We make use of both polynomial and spline regression mixture models to complete the joint clustering-alignment framework.

The following simple illustrative example demonstrates the importance of jointly handling the clustering-alignment problem as opposed to treating alignment and clustering separately. Figure 1(a) shows a simulated set of curves which have been subjected to random translations in time. The underlying generative model contains three clusters, each described by a cubic polynomial (not shown). Figure 1(b) shows the output of the proposed joint EM algorithm introduced in this paper, where curves have been simultaneously aligned and clustered. The algorithm recovers the hidden labels and alignments near-perfectly in this case. On the other hand, Figure 1(c) shows the result of first clustering the unaligned data in Figure 1(a), while Figure 1(d) shows the final result of aligning each of the found clusters individually.

[Figure 1: plots omitted; four panels with Time on the x-axis and Y-axis values on the y-axis.]

Figure 1: Comparison of joint EM and sequential clustering-alignment: (a, top-left) unlabelled simulated data with hidden alignments; (b, top-right) solution recovered by joint EM; (c, bottom-left) partial solution after clustering first, and (d, bottom-right) final solution after aligning clustered data in (c).
The sequential approach results in significant misclassification and incorrect alignment, demonstrating that a two-stage approach can be quite suboptimal when compared to a joint clustering-alignment methodology. (Similar results, not shown, are obtained when the curves are first aligned and then clustered--see [2] for full details.)

There has been little prior work on the specific problem of joint curve clustering and alignment, but there is related work in other areas. For example, clustering of gene-expression time profiles with mixtures of splines was addressed in [3]. However, alignment was only considered as a post-processing step to compare cluster results among related datasets. In image analysis, the transformed mixture of Gaussians (TMG) model uses a probabilistic framework and an EM algorithm to jointly learn clustering and alignment of image patches subject to various forms of linear transformations [4]. However, this model only considers sets of transformations in discrete pixel space, whereas we are focused on curve modelling that allows for arbitrary continuous alignment in time and space.
Another branch of work in image analysis focuses on the problem of estimating correspondences of points across images [5] (or vertices across graphs [6]), using EM or deterministic annealing algorithms. The results we describe here differ primarily in that (a) we focus specifically on sets of curves rather than image data (generally making the problem more tractable), (b) we focus on clustering and alignment rather than just alignment, (c) we allow continuous affine transformations in time and measurement space, and (d) we have a fully generative probabilistic framework allowing for (for example) the incorporation of informative priors on transformations if such prior information exists.

In earlier related work we developed general techniques for curve clustering (e.g., [7]) and also proposed techniques for transformation-invariant curve clustering with discrete time alignment and Gaussian mixture models for curves [8, 9]. In this paper we provide a much more general framework that allows for continuous alignment in both time and measurement space for a general class of "cluster shape" models, including polynomials and splines.

2 Joint clustering and alignment

It is useful to represent curves as variable-length vectors. In this case, y_i is a curve that consists of a sequence of n_i observations or measurements. The j-th measurement of y_i is denoted by y_ij and is usually taken to be univariate (the generalization to multivariate observations is straightforward). The associated covariate of y_i is written as x_i in the same manner.
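As a concrete sketch of this representation (a hypothetical layout, not code from the paper), each curve keeps its own length-n_i arrays, and a per-curve polynomial regression matrix can be built directly from x_i:

```python
import numpy as np

# Hypothetical curve set: each curve i has its own length n_i.
# x[i] holds the covariate values x_ij (e.g., times) and y[i] the
# matching measurements y_ij, as in the notation above.
rng = np.random.default_rng(0)
x = [np.sort(rng.uniform(0, 10, size=n)) for n in (12, 20, 17)]
y = [np.sin(xi) + 0.1 * rng.standard_normal(xi.size) for xi in x]

def regression_matrix(xi, p=3):
    """Vandermonde-style regression matrix X_i for a degree-p
    polynomial: column j holds xi**j, so X_i is n_i x (p + 1)."""
    return np.vander(xi, N=p + 1, increasing=True)

X = [regression_matrix(xi) for xi in x]
assert X[0].shape == (12, 4) and X[1].shape == (20, 4)
```

A spline model would simply swap the Vandermonde columns for a B-spline basis matrix evaluated at the same x_i.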
x_i is often thought of as time, so that x_ij gives the time at which y_ij was observed.

Regression mixture models can be effectively used to cluster this type of curve data [10]. In the standard setup, y_i is modelled using a normal (Gaussian) regression model in which y_i = X_i \beta + \epsilon_i, where \beta is a (p+1) x 1 coefficient vector, \epsilon_i is a zero-mean Gaussian noise variable, and X_i is the regression matrix. The form of X_i depends on the type of regression model employed. For polynomial regression, X_i is often associated with the standard Vandermonde matrix; and for spline regression, X_i takes the form of a spline-basis matrix (see, e.g., [7] for more details). The mixture model is completed by repeating this model over K clusters and indexing the parameters by k so that, for example, y_i = X_i \beta_k + \epsilon_i gives the regression model for y_i under the k-th cluster.

B-splines [11] are particularly efficient for computational purposes due to the block-diagonal basis matrices that result. Using B-splines, the curve point y_ij can be represented as the linear combination y_ij = B_ij^T c, in which the vector B_ij gives the vector of B-spline basis functions evaluated at x_ij, and c gives the spline coefficient vector [2]. The full curve y_i can then be written compactly as y_i = B_i c, in which the spline basis matrix takes the form B_i = [B_i1 ... B_in_i]^T. Spline regression models can be easily integrated into the regression mixture model framework by equating the regression matrix X_i with the spline basis matrix B_i. In what follows, we use the more general notation X_i in favor of the more specific B_i.

2.1 Joint model definition

The joint clustering-alignment model definition is based on a regression mixture model that has been augmented with up to four individual random transformation parameters or variables (a_i, b_i, c_i, d_i).
The a_i and b_i allow for scaling and translation in time, while the c_i and d_i allow for scaling and translation in measurement space. The model definition takes the form

    y_i = c_i X(a_i x_i - b_i) \beta_k + d_i + \epsilon_i,    (1)

in which X(a_i x_i - b_i) represents the regression matrix X_i (either spline or polynomial) evaluated at the transformed time a_i x_i - b_i. Below we use the matrix X_i to denote X(a_i x_i - b_i) when parsimony is required. It is assumed that \epsilon_i is a zero-mean Gaussian vector with covariance \sigma_k^2 I.

The conditional density

    p_k(y_i | a_i, b_i, c_i, d_i) = N(y_i | c_i X(a_i x_i - b_i) \beta_k + d_i, \sigma_k^2 I)    (2)

gives the probability density of y_i when all the transformation parameters (as well as cluster membership) are known. (Note that the density on the left is implicitly conditioned on an appropriate set of parameters--this is always assumed in what follows.) In general, the values for the transformation parameters are unknown. Treating this as a standard hidden-data problem, it is useful to think of each of the transformation parameters as random variables that are curve-specific but with "population-level" prior probability distributions. In this way, the transformation parameters and the model parameters can be learned simultaneously in an efficient manner using EM.

2.2 Transformation priors

Priors are attached to each of the transformation variables in such a way that the identity transformation is the most likely transformation. A useful prior for this is the Gaussian density N(\mu, \sigma^2) with mean \mu and variance \sigma^2. The time transformation priors are specified as

    a_i ~ N(1, r_k^2),    b_i ~ N(0, s_k^2),    (3)

and the measurement space priors are given as

    c_i ~ N(1, u_k^2),    d_i ~ N(0, v_k^2).    (4)

Note that the identity transformation is indeed the most likely. All of the variance parameters are cluster-specific in general; however, any subset of these parameters can be "tied" across clusters if desired in a specific application.
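A minimal generative sketch of Eqs. (1)-(4) for a single cluster k follows; all parameter values are illustrative placeholders, not values from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative cluster-k parameters (hypothetical values):
beta_k = np.array([0.0, 2.0, -0.3])            # regression coefficients
sigma_k = 0.5                                   # noise std. dev.
r_k, s_k, u_k, v_k = 0.1, 1.0, 0.1, 0.5         # prior std. devs.

def sample_curve(xi):
    """Draw one curve from cluster k under Eq. (1), with the four
    transformation variables drawn from the priors of Eqs. (3)-(4)."""
    a = rng.normal(1.0, r_k)   # time scaling,       a_i ~ N(1, r_k^2)
    b = rng.normal(0.0, s_k)   # time translation,   b_i ~ N(0, s_k^2)
    c = rng.normal(1.0, u_k)   # measurement scale,  c_i ~ N(1, u_k^2)
    d = rng.normal(0.0, v_k)   # measurement shift,  d_i ~ N(0, v_k^2)
    X_ab = np.vander(a * xi - b, N=beta_k.size, increasing=True)
    eps = rng.normal(0.0, sigma_k, size=xi.size)
    return c * (X_ab @ beta_k) + d + eps        # Eq. (1)

yi = sample_curve(np.linspace(0, 5, 15))
assert yi.shape == (15,)
```

Here the regression matrix is a polynomial Vandermonde matrix; a spline basis would be substituted without changing the transformation structure.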
Note that these priors technically allow for negative scaling in time and in measurement space. In practice this is typically not a problem, though one can easily specify other priors (e.g., log-normal) to strictly disallow this possibility. It should be noted that each of the prior variance parameters is learned from the data in the ensuing EM algorithm. We do not make use of hyperpriors for these prior parameters; however, it is straightforward to extend the method to allow hyperpriors if desired.

2.3 Full probability model

The joint density of y_i and the set of transformation variables \phi_i = {a_i, b_i, c_i, d_i} can be written succinctly as

    p_k(y_i, \phi_i) = p_k(y_i | \phi_i) p_k(\phi_i),    (5)

where p_k(\phi_i) = N(a_i | 1, r_k^2) N(b_i | 0, s_k^2) N(c_i | 1, u_k^2) N(d_i | 0, v_k^2). The space transformation parameters can be integrated-out of (5), resulting in the marginal of y_i conditioned only on the time transformation parameters. This conditional marginal takes the form

    p_k(y_i | a_i, b_i) = \int p_k(y_i, c_i, d_i | a_i, b_i) dc_i dd_i
                        = N(y_i | X_i \beta_k, U_ik + V_k + \sigma_k^2 I),    (6)

with U_ik = u_k^2 X_i \beta_k \beta_k^T X_i^T and V_k = v_k^2 1 1^T. The unconditional (though still cluster-dependent) marginal for y_i cannot be computed analytically since a_i, b_i cannot be analytically integrated-out. Instead, we use numerical Monte Carlo integration for this task. The resulting unconditional marginal for y_i can be approximated by

    p_k(y_i) = \int p_k(y_i | a_i, b_i) p_k(a_i) p_k(b_i) da_i db_i
             \approx (1/M) \sum_m p_k(y_i | a_i^(m), b_i^(m)),    (7)

where the M Monte Carlo samples are taken according to

    a_i^(m) ~ N(1, r_k^2)  and  b_i^(m) ~ N(0, s_k^2),  for m = 1, ..., M.    (8)

A mixture results when cluster membership is unknown:

    p(y_i) = \sum_k \alpha_k p_k(y_i).    (9)

The log-likelihood of all n curves Y = {y_i} follows directly from this approximation and takes the form

    log p(Y) \approx \sum_i log \sum_{m,k} \alpha_k p_k(y_i | a_i^(m), b_i^(m)) - n log M.    (10)
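The Monte Carlo marginal of Eqs. (6)-(8) can be sketched as follows; this is an illustrative reconstruction (the function name and all toy values are ours), with c_i, d_i integrated out in closed form and (a_i, b_i) sampled from their priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

def log_marginal(yi, xi, beta_k, sigma_k, r_k, s_k, u_k, v_k, M=200):
    """Monte Carlo estimate of log p_k(y_i), Eq. (7): average the
    conditional marginal of Eq. (6) over M prior samples of (a, b)."""
    ones = np.ones(xi.size)
    logs = np.empty(M)
    for m in range(M):
        a = rng.normal(1.0, r_k)                # Eq. (8)
        b = rng.normal(0.0, s_k)
        mean = np.vander(a * xi - b, N=beta_k.size, increasing=True) @ beta_k
        cov = (sigma_k**2 * np.eye(xi.size)
               + u_k**2 * np.outer(mean, mean)   # U_ik term of Eq. (6)
               + v_k**2 * np.outer(ones, ones))  # V_k term of Eq. (6)
        logs[m] = multivariate_normal.logpdf(yi, mean, cov)
    return np.logaddexp.reduce(logs) - np.log(M)  # log of the average

beta = np.array([0.0, 1.0, -0.1])
xi = np.linspace(0, 5, 10)
yi = np.vander(xi, 3, increasing=True) @ beta + rng.normal(0, 0.5, 10)
lp = log_marginal(yi, xi, beta, 0.5, 0.1, 1.0, 0.1, 0.5)
assert np.isfinite(lp)
```

Averaging in the log domain with logaddexp avoids underflow when the per-sample densities are small.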
2.4 EM algorithm

We derive an EM algorithm that simultaneously allows the learning of both the model parameters and the transformation variables with time-complexity that is linear in the total number of data points N = \sum_i n_i. First, let z_i give the cluster membership for curve y_i. Now, regard the transformation variables {\phi_i} as well as the cluster memberships {z_i} as being hidden. The complete-data log-likelihood function is defined as the joint log-likelihood of Y and the hidden data {\phi_i, z_i}. This can be written as the sum over all n curves of the log of the product of \alpha_{z_i} and the cluster-dependent joint density in (5). This function takes the form

    L_c = \sum_i log [ \alpha_{z_i} p_{z_i}(y_i | \phi_i) p_{z_i}(\phi_i) ].    (11)

In the E-step, the posterior p(\phi_i, z_i | y_i) is calculated and then used to take the posterior expectation of Equation (11). This expectation is then used in the M-step to calculate the re-estimation equations for updating the model parameters {\beta_k, \sigma_k^2, r_k^2, s_k^2, u_k^2, v_k^2}.

2.5 E-step

The posterior p(\phi_i, z_i | y_i) can be factorized as p_{z_i}(\phi_i | y_i) p(z_i | y_i). The second factor is the membership probability w_ik that y_i was generated by cluster k. It can be rewritten as p(z_i = k | y_i) \propto \alpha_k p_k(y_i) and evaluated using Equation (7). The first factor requires a bit more work. Further factoring reveals that p_{z_i}(\phi_i | y_i) = p_{z_i}(c_i, d_i | a_i, b_i, y_i) p_{z_i}(a_i, b_i | y_i). The new first factor p_{z_i}(c_i, d_i | a_i, b_i, y_i) can be solved for exactly by noting that it is proportional to a bivariate normal distribution for each z_i [2].
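The membership probabilities w_ik above follow from Bayes' rule; a small numerically stable sketch (names illustrative, inputs in the log domain):

```python
import numpy as np

def memberships(log_p, log_alpha):
    """w_ik = p(z_i = k | y_i) from per-cluster log-marginals
    log p_k(y_i) (rows: curves i, columns: clusters k) and log
    mixture weights log alpha_k, normalized stably."""
    s = log_p + log_alpha                  # log of alpha_k p_k(y_i)
    s = s - s.max(axis=1, keepdims=True)   # guard against underflow
    w = np.exp(s)
    return w / w.sum(axis=1, keepdims=True)

w = memberships(np.array([[-10.0, -12.0], [-3.0, -3.0]]),
                np.log(np.array([0.5, 0.5])))
assert np.allclose(w.sum(axis=1), 1.0)
```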
The new second factor p_{z_i}(a_i, b_i | y_i) cannot, in general, be solved for analytically, so instead we use an approximation.

The fact that posterior densities tend towards highly peaked Gaussian densities has been widely noted (e.g., [12]) and leads to the normal approximation of posterior densities. To make the approximation here, the vector (\hat{a}_ik, \hat{b}_ik) representing the multi-dimensional mode of p_k(a_i, b_i | y_i), the covariance matrix V_{a_i b_i}^(k) for (\hat{a}_ik, \hat{b}_ik), and the separate variances V_aik, V_bik must be found. These can readily be estimated using a Nelder-Mead optimization method. Experiments have shown this approximation works well across a variety of experimental and real-world data sets [2].

The above calculations of the posterior p(\phi_i, z_i | y_i) allow the posterior expectation of the complete-data log-likelihood in Equation (11) to be solved for. This expectation results in the so-called Q-function which is maximized in the M-step. Although the derivation is quite complex, the Q-function can be calculated exactly for polynomial regression [2]; for spline regression, the basis functions do not afford an exact formula for the solution of the Q-function. However, in the spline case, removal of a few problematic variance terms gives an efficient approximation (the interested reader is referred to [2] for more details).

2.6 M-step

The M-step is straightforward since most of the hard work is done in the E-step.
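The E-step mode search just described, used to center the normal approximation, can be sketched with SciPy's Nelder-Mead method; this is an illustrative reconstruction (not the authors' code) that reuses the conditional marginal of Eq. (6) as the likelihood of (a, b):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

def ab_mode(yi, xi, beta_k, sigma_k, r_k, s_k, u_k, v_k):
    """Mode (a_hat, b_hat) of p_k(a_i, b_i | y_i), found by minimizing
    the negative log of Eq. (6) times the (a, b) priors of Eq. (3)."""
    ones = np.ones(xi.size)

    def neg_log_post(ab):
        a, b = ab
        mean = np.vander(a * xi - b, N=beta_k.size, increasing=True) @ beta_k
        cov = (sigma_k**2 * np.eye(xi.size)
               + u_k**2 * np.outer(mean, mean)
               + v_k**2 * np.outer(ones, ones))
        return -(multivariate_normal.logpdf(yi, mean, cov)
                 + norm.logpdf(a, 1.0, r_k)
                 + norm.logpdf(b, 0.0, s_k))

    # Start from the identity transformation (a, b) = (1, 0).
    return minimize(neg_log_post, x0=[1.0, 0.0], method="Nelder-Mead").x

rng = np.random.default_rng(3)
xi = np.linspace(0, 5, 12)
beta = np.array([0.0, 1.0, -0.1])
yi = np.vander(0.9 * xi - 0.5, 3, increasing=True) @ beta + rng.normal(0, 0.3, 12)
a_hat, b_hat = ab_mode(yi, xi, beta, 0.3, 0.1, 1.0, 0.1, 0.5)
assert np.isfinite(a_hat) and np.isfinite(b_hat)
```

The local curvature at the recovered mode would then supply the covariance of the normal approximation.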
The Q-function is maximized over the set of parameters {\beta_k, \sigma_k^2, r_k^2, s_k^2, u_k^2, v_k^2} for 1 <= k <= K. The derived solutions are as follows:

    \hat{r}_k^2 = (1 / \sum_i w_ik) \sum_i w_ik (\hat{a}_ik^2 + V_aik),    \hat{s}_k^2 = (1 / \sum_i w_ik) \sum_i w_ik (\hat{b}_ik^2 + V_bik),

    \hat{u}_k^2 = (1 / \sum_i w_ik) \sum_i w_ik (\hat{c}_ik^2 + V_cik),    \hat{v}_k^2 = (1 / \sum_i w_ik) \sum_i w_ik (\hat{d}_ik^2 + V_dik),

    \hat{\beta}_k = [ \sum_i w_ik ( \hat{c}_ik^2 \hat{X}_ik^T \hat{X}_ik + V_xxi ) ]^{-1} \sum_i w_ik [ \hat{c}_ik \hat{X}_ik^T (y_i - \hat{d}_ik 1) + V_xi y_i - V_xcd 1 ],

and

    \hat{\sigma}_k^2 = (1 / \sum_i w_ik n_i) \sum_i w_ik [ ||y_i - \hat{c}_ik \hat{X}_ik \hat{\beta}_k - \hat{d}_ik 1||^2 - 2 y_i^T V_xi \hat{\beta}_k + \hat{\beta}_k^T V_xxi \hat{\beta}_k + 2 \hat{\beta}_k^T V_xcd 1 + n_i V_dik ],

where \hat{X}_ik = X(\hat{a}_ik x_i - \hat{b}_ik), and V_xxi, V_xi, V_xcd are special "variance" matrices whose components are functions of the posterior expectations of \phi_i calculated in the E-step (the exact forms of these matrices can be found in [2]).

3 Experimental results and conclusions

The results of a simple demonstration of EM-based alignment (using splines and the learning algorithm of the previous section, but with no clustering) are shown in Figure 2. In the left plot are a set of smoothed curves representing the acceleration of height for each of 39 boys whose heights were measured at 29 observation times over the ages of 1 to 18 [1]. Notice that the curves share a similar shape but seem to be misaligned in time due to individual growth dynamics. The right plot shows the same acceleration curves after processing from our spline alignment model using quartic splines with 8 uniformly spaced knots, allowing for a maximum time translation of 2 units. The x-axis in this plot can be seen as canonical (or "average") age.

[Figure 2: plots omitted; two panels with Age on the x-axis and height acceleration on the y-axis.]

Figure 2: Curves measuring the height acceleration for 39 boys; (left) smoothed versions of raw observations, (right) automatically aligned curves.
The aligned curves in the right plot of Figure 2 represent the average behavior in a much clearer way. For example, it appears there is an interval of 2.5 years from peak (age 12.5) to trough (age 15) that describes the average cycle that all boys go through. The results demonstrate that it is common for important features of curves to be randomly translated in time and that it is possible to use the data to recover these underlying hidden transformations using our alignment models.

Next we briefly present an application of the joint clustering-alignment model to the problem of gene expression clustering. We analyze the alpha arrest data described in [13] that captures gene expression levels at 7 minute intervals for two consecutive cell cycles (totaling 17 measurements per gene). Clustering is often used in gene expression analysis to reveal groups of genes with similar profiles that may be physically related to the same underlying biological process (e.g., [13]). It is well-known that time-delays play an important role in gene regulation, and thus, curves measured over time which represent the same process may often be misaligned from each other [14].

[Figure 3: plots omitted; three rows of two panels, Expression on the y-axis versus Canonical time (left) or Time (right).]

Figure 3: Three clusters for the time translation alignment model (left) and the non-alignment model (right).

Since these gene expression data are already normalized, we did not allow for transformations in measurement space. We only allowed for translations in time since experts do not expect scaling in time to be a factor in these data.
For the curve model, cubic splines with 6 uniformly spaced knots across the interval from -4 to 21 were chosen, allowing for a maximum time translation of 4 units. Due to limited space, we present a single case of comparison between a standard spline regression mixture model (SRM) and an SRM that jointly allows for time translations. Ten random starts of EM were allowed for each algorithm, with the highest-likelihood model selected for comparison for each algorithm. It is common to assume that there are five distinct clusters of genes in these data; as such we set K = 5 for each algorithm [13].

Three of the resulting clusters from the two methods are shown in Figure 3. The left column of the figure shows the output from the joint clustering-alignment model, while the right column shows the output from the standard cluster model. It is apparent that the time-aligned clusters represent the mean behavior more accurately. The overall cluster variance is much lower than in the non-aligned clustering. The results also demonstrate the appearance of cluster-dependent alignment effects. Out-of-sample experiments (not shown here) show that the joint model produces better predictive models than the standard clustering method. Experimental results on a variety of other data sets are provided in [2], including applications to clustering of cyclone trajectories.

4 Conclusions

We proposed a general probabilistic framework for joint clustering and alignment of sets of curves. The experimental results indicate that the approach provides a new and useful tool for curve analysis in the face of underlying hidden transformations. The resulting EM-based learning algorithms have time-complexity that is linear in the number of measurements--in contrast, many existing curve alignment algorithms themselves are O(n^2) (e.g., dynamic time warping) without regard to clustering.
The incorporation of splines gives the method an overall non-parametric freedom which leads to general applicability.

Acknowledgements

This material is based upon work supported by the National Science Foundation under grants No. SCI-0225642 and IIS-0431085.

References

[1] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer-Verlag, New York, NY, 1997.

[2] S. J. Gaffney. Probabilistic Curve-Aligned Clustering and Prediction with Regression Mixture Models. Ph.D. Dissertation, University of California, Irvine, 2004.

[3] Z. Bar-Joseph et al. A new approach to analyzing gene expression time series data. Journal of Computational Biology, 10(3):341-356, 2003.

[4] B. J. Frey and N. Jojic. Transformation-invariant clustering using the EM algorithm. IEEE Trans. PAMI, 25(1):1-17, January 2003.

[5] H. Chui, J. Zhang, and A. Rangarajan. Unsupervised learning of an atlas from unlabeled point-sets. IEEE Trans. PAMI, 26(2):160-172, February 2004.

[6] A. D. J. Cross and E. R. Hancock. Graph matching with a dual-step EM algorithm. IEEE Trans. PAMI, 20(11):1236-1253, November 1998.

[7] S. J. Gaffney and P. Smyth. Curve clustering with random effects regression mixtures. In C. M. Bishop and B. J. Frey, editors, Proc. Ninth Inter. Workshop on Artificial Intelligence and Stats, Key West, FL, January 3-6, 2003.

[8] D. Chudova, S. J. Gaffney, and P. J. Smyth. Probabilistic models for joint clustering and time-warping of multi-dimensional curves. In Proc. of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI-2003), Acapulco, Mexico, August 7-10, 2003.

[9] D. Chudova, S. J. Gaffney, E. Mjolsness, and P. J. Smyth. Translation-invariant mixture models for curve clustering. In Proc. Ninth ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining, Washington D.C., August 24-27, New York, 2003. ACM Press.

[10] S. Gaffney and P. Smyth.
Trajectory clustering with mixtures of regression models. In Surajit Chaudhuri and David Madigan, editors, Proc. Fifth ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining, August 15-18, pages 63-72, N.Y., 1999. ACM Press.

[11] P. H. C. Eilers and B. D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11(2):89-121, 1996.

[12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, NY, 1995.

[13] P. T. Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molec. Bio. Cell, 9(12):3273-3297, December 1998.

[14] J. Aach and G. M. Church. Aligning gene expression time series with time warping algorithms. Bioinformatics, 17(6):495-508, 2001.
", "award": [], "sourceid": 2627, "authors": [{"given_name": "Scott", "family_name": "Gaffney", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}