{"title": "Canonical Time Warping for Alignment of Human Behavior", "book": "Advances in Neural Information Processing Systems", "page_first": 2286, "page_last": 2294, "abstract": "Alignment of time series is an important problem to solve in many scientific disciplines. In particular, temporal alignment of two or more subjects performing similar activities is a challenging problem due to the large temporal scale difference between human actions as well as the inter/intra subject variability. In this paper we present canonical time warping (CTW), an extension of canonical correlation analysis (CCA) for spatio-temporal alignment of the behavior between two subjects. CTW extends previous work on CCA in two ways: (i) it combines CCA with dynamic time warping for temporal alignment; and (ii) it extends CCA to allow local spatial deformations. We show CTWs effectiveness in three experiments: alignment of synthetic data, alignment of motion capture data of two subjects performing similar actions, and alignment of two people with similar facial expressions. Our results demonstrate that CTW provides both visually and qualitatively better alignment than state-of-the-art techniques based on dynamic time warping.", "full_text": "Canonical Time Warping\n\nfor Alignment of Human Behavior\n\nFeng Zhou\n\nRobotics Institute\n\nCarnegie Mellon University\n\nwww.f-zhou.com\n\nFernando de la Torre\n\nRobotics Institute\n\nCarnegie Mellon University\nftorre@cs.cmu.edu\n\nAbstract\n\nAlignment of time series is an important problem to solve in many scienti\ufb01c dis-\nciplines. In particular, temporal alignment of two or more subjects performing\nsimilar activities is a challenging problem due to the large temporal scale differ-\nence between human actions as well as the inter/intra subject variability. In this\npaper we present canonical time warping (CTW), an extension of canonical cor-\nrelation analysis (CCA) for spatio-temporal alignment of human motion between\ntwo subjects. 
CTW extends previous work on CCA in two ways: (i) it combines CCA with dynamic time warping (DTW), and (ii) it extends CCA by allowing local spatial deformations. We show CTW's effectiveness in three experiments: alignment of synthetic data, alignment of motion capture data of two subjects performing similar actions, and alignment of similar facial expressions made by two people. Our results demonstrate that CTW provides both visually and qualitatively better alignment than state-of-the-art techniques based on DTW.\n\n1 Introduction\n\nTemporal alignment of time series has been an active research topic in many scientific disciplines such as bioinformatics, text analysis, computer graphics, and computer vision. In particular, temporal alignment of human behavior is a fundamental step in many applications such as recognition [1], temporal segmentation [2] and synthesis of human motion [3]. For instance, consider Fig. 1a, which shows one subject walking with varying speed and different styles, and Fig. 1b, which shows two subjects reading the same text.\n\nPrevious work on alignment of human motion has been addressed mostly in the context of recognizing human activities and synthesizing realistic motion. Typically, some models such as hidden Markov models [4, 5, 6], weighted principal component analysis [7], independent component analysis [8, 9] or multi-linear models [10] are learned from training data, and in the testing phase the time series is aligned w.r.t. the learned dynamic model. In the context of computer vision, a key aspect for successful recognition of activities is building view-invariant representations. Junejo et al. [1] proposed a view-invariant descriptor for actions making use of the affinity matrix between time instances. Caspi and Irani [11] temporally aligned videos from two closely attached cameras. Rao et al. [12, 13] aligned trajectories of two moving points using constraints from the fundamental matrix. 
In the literature of computer graphics, Hsu et al. [3] proposed the iterative motion warping, a method that finds a spatio-temporal warping between two instances of motion capture data. In the context of data mining there have been several extensions of DTW [14] to align time series. Keogh and Pazzani [15] used derivatives of the original signal to improve alignment with DTW. Listgarten et al. [16] proposed continuous profile models, a probabilistic method for simultaneously aligning and normalizing sets of time series.\n\nA relatively unexplored problem in behavioral analysis is the alignment between the motion of the body or face in two or more subjects (e.g., Fig. 1).\n\nFigure 1: Temporal alignment of human behavior. (a) One person walking in normal pose, slow speed, another viewpoint and exaggerated steps (clockwise). (b) Two people reading the same text.\n\nThe major challenges in human motion alignment are: (i) allowing alignment between different sets of multidimensional features (e.g., audio/video), (ii) introducing a feature selection or feature weighting mechanism to compensate for subject variability or irrelevant features, and (iii) execution rate [17]. To solve these problems, this paper proposes canonical time warping (CTW) for accurate spatio-temporal alignment between two behavioral time series. We pose the problem as finding the temporal alignment that maximizes the spatial correlation between two behavioral samples coming from two subjects. To accommodate for subject variability and take into account the difference in the dimensionality of the signals, CTW uses CCA as a measure of spatial alignment. To allow temporal changes, CTW incorporates DTW. CTW extends DTW by adding a feature weighting mechanism that is able to align signals of different dimensionality. 
CTW also extends CCA by incorporating time warping and allowing local spatial transformations.\n\nThe remainder of the paper is organized as follows. Section 2 reviews related work on dynamic time warping and canonical correlation analysis. Section 3 describes the new CTW algorithm. Section 4 extends CTW to take into account local transformations. Section 5 provides experimental results.\n\n2 Previous work\n\nThis section describes previous work on canonical correlation analysis and dynamic time warping.\n\n2.1 Canonical correlation analysis\n\nCanonical correlation analysis (CCA) [18] is a technique to extract common features from a pair of multivariate data sets. CCA identifies relationships between two sets of variables by finding the linear combinations of the variables in the first set1 (X ∈ R^{dx×n}) that are most correlated with the linear combinations of the variables in the second set (Y ∈ R^{dy×n}). Assuming zero-mean data, CCA finds a combination of the original variables that minimizes:\n\nJ_cca(V_x, V_y) = ||V_x^T X − V_y^T Y||_F^2   s.t.  V_x^T X X^T V_x = V_y^T Y Y^T V_y = I_b,   (1)\n\nwhere V_x ∈ R^{dx×b} is the projection matrix for X (similarly for V_y). Each pair of canonical variates (v_x^T X, v_y^T Y) is uncorrelated with the canonical variates of lower order, and each successive canonical variate pair achieves the maximum correlation orthogonal to the preceding pairs. Eq. 1 has a closed-form solution in terms of a generalized eigenvalue problem. See [19] for a unification of several component analysis methods and a review of numerical techniques to efficiently solve generalized eigenvalue problems.\n\nIn computer vision, CCA has been used for matching sets of images in problems such as activity recognition from video [20] and activity correlation from cameras [21]. Recently, Fischer et al. [22]\n\n1 Bold capital letters denote a matrix X, bold lower-case letters a column vector x. 
x_i represents the ith column of the matrix X. x_{ij} denotes the scalar in the ith row and jth column of the matrix X. All non-bold letters represent scalars. 1_{m×n}, 0_{m×n} ∈ R^{m×n} are matrices of ones and zeros. I_n ∈ R^{n×n} is an identity matrix. ||x|| = √(x^T x) denotes the Euclidean norm. ||X||_F^2 = Tr(X^T X) designates the Frobenius norm. X ∘ Y and X ⊗ Y are the Hadamard and Kronecker products of matrices. vec(X) denotes the vectorization of matrix X. {i : j} lists the integers {i, i + 1, ..., j − 1, j}.\n\nproposed an extension of CCA with parameterized warping functions to align protein expressions. The learned warping function is a linear combination of hyperbolic tangent functions with non-negative coefficients, ensuring monotonicity. Unlike our method, the warping function is unable to deal with feature weighting.\n\nFigure 2: Dynamic time warping. (a) 1-D time series (nx = 7 and ny = 9). (b) DTW alignment. (c) Binary distance matrix. (d) Policy function at each node, where ↑, ↖, ← denote the policy π(p_t) = [1, 0]^T, [1, 1]^T, [0, 1]^T, respectively. The optimal alignment path is denoted in bold.\n\n2.2 Dynamic time warping\n\nGiven two time series, X = [x_1, x_2, ..., x_{nx}] ∈ R^{d×nx} and Y = [y_1, y_2, ..., y_{ny}] ∈ R^{d×ny}, dynamic time warping [14] is a technique to optimally align the samples of X and Y such that the following sum-of-squares cost is minimized:\n\nJ_dtw(P) = Σ_{t=1}^{m} ||x_{p_t^x} − y_{p_t^y}||^2,   (2)\n\nwhere m is the number of indices (or steps) needed to align both signals. The correspondence matrix P can be parameterized by a pair of path vectors, P = [p^x, p^y]^T ∈ R^{2×m}, in which p^x ∈ {1 : nx}^{m×1} and p^y ∈ {1 : ny}^{m×1} denote the composition of alignment in frames. 
For instance, the ith frame in X and the jth frame in Y are aligned iff there exists p_t = [p_t^x, p_t^y]^T = [i, j]^T for some t. P has to satisfy three additional constraints: the boundary conditions (p_1 ≡ [1, 1]^T and p_m ≡ [nx, ny]^T), continuity (0 ≤ p_t − p_{t−1} ≤ 1) and monotonicity (t_1 ≥ t_2 ⇒ p_{t1} − p_{t2} ≥ 0). Although the number of possible ways to align X and Y is exponential in nx and ny, dynamic programming [23] offers an efficient O(nx ny) approach to minimize J_dtw using Bellman's equation:\n\nL*(p_t) = min_{π(p_t)} ( ||x_{p_t^x} − y_{p_t^y}||^2 + L*(p_{t+1}) ),   (3)\n\nwhere the cost-to-go value function L*(p_t) represents the remaining cost, starting at the tth step, incurred by following the optimum policy π*. The policy function, π : {1 : nx} × {1 : ny} → {[1, 0]^T, [0, 1]^T, [1, 1]^T}, defines the deterministic transition between consecutive steps, p_{t+1} = p_t + π(p_t). Once the policy is known, the alignment steps can be recursively constructed from the starting point, p_1 = [1, 1]^T. Fig. 2 shows an example of DTW aligning two 1-D time series.\n\n3 Canonical time warping (CTW)\n\nThis section describes the energy function and optimization strategies for CTW.\n\n3.1 Energy function for CTW\n\nIn order to have a compact and compressible energy function for CTW, it is important to notice that Eq. 2 can be rewritten as:\n\nJ_dtw(W_x, W_y) = Σ_{i=1}^{nx} Σ_{j=1}^{ny} (w_i^x)^T w_j^y ||x_i − y_j||^2 = ||X W_x^T − Y W_y^T||_F^2,   (4)\n\nwhere W_x ∈ {0, 1}^{m×nx} and W_y ∈ {0, 1}^{m×ny} are binary selection matrices that need to be inferred to align X and Y. In Eq. 4 the matrices W_x and W_y encode the alignment path. 
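The Bellman recursion of Eq. 3 and the selection matrices of Eq. 4 can be made concrete with a short dynamic program. The following is a minimal NumPy sketch (illustrative code, not the authors' implementation): the forward pass fills an accumulated-cost table and the backtracking recovers the path P under the boundary, continuity and monotonicity constraints.

```python
import numpy as np

def dtw(X, Y):
    """Align columns of X (d x nx) and Y (d x ny); return cost and 1-based path."""
    nx, ny = X.shape[1], Y.shape[1]
    D = np.full((nx + 1, ny + 1), np.inf)   # accumulated-cost table
    D[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost = np.sum((X[:, i - 1] - Y[:, j - 1]) ** 2)
            # the three legal moves of the policy pi: [1,1], [1,0], [0,1]
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path, i, j = [], nx, ny                  # backtrack from (nx, ny)
    while i > 0 and j > 0:
        path.append((i, j))
        k = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if k == 0 else (i - 1, j) if k == 1 else (i, j - 1)
    return D[nx, ny], path[::-1]

# Two 1-D series that differ only in execution rate align with zero cost.
X = np.array([[0., 1., 2., 1., 0.]])
Y = np.array([[0., 0., 1., 2., 2., 1., 0.]])
cost, path = dtw(X, Y)
```

The returned path starts at (1, 1) and ends at (nx, ny), matching the boundary conditions above.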
For instance, the entry w_{t,p_t^x}^x = w_{t,p_t^y}^y = 1 assigns correspondence between the p_t^x-th frame in X and the p_t^y-th frame in Y. For convenience, we denote D_x = W_x^T W_x, D_y = W_y^T W_y and W = W_x^T W_y. Observe that Eq. 4 is very similar to CCA's objective (Eq. 1): CCA applies a linear transformation to the rows (features), while DTW applies binary transformations to the columns (time).\n\nIn order to accommodate for differences in style and subject variability, add a feature selection mechanism, and reduce the dimensionality of the signals, CTW adds a linear transformation (V_x^T, V_y^T) (as CCA) to the least-squares form of DTW (Eq. 4). Moreover, this transformation allows aligning temporal signals of different dimensionality (e.g., video and motion capture). CTW combines DTW and CCA by minimizing:\n\nJ_ctw(W_x, W_y, V_x, V_y) = ||V_x^T X W_x^T − V_y^T Y W_y^T||_F^2,   (5)\n\nwhere V_x ∈ R^{dx×b} and V_y ∈ R^{dy×b}, with b ≤ min(dx, dy), parameterize the spatial warping by projecting the sequences into the same coordinate system, and W_x, W_y warp the signals in time to achieve optimum temporal alignment. Similar to CCA, to make CTW invariant to translation, rotation and scaling, we impose the following constraints: (i) X W_x^T 1_m = 0_{dx} and Y W_y^T 1_m = 0_{dy}, (ii) V_x^T X D_x X^T V_x = V_y^T Y D_y Y^T V_y = I_b, and (iii) V_x^T X W Y^T V_y is a diagonal matrix. Eq. 5 is the main contribution of this paper. CTW is a direct and clean extension of CCA and DTW to align two signals X and Y in space and time. 
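The equivalence between the path sum of Eq. 2 and the matrix form entering Eqs. 4-5 is easy to check numerically by building the binary selection matrices from a path. A toy NumPy sketch (assumed sizes, not the authors' code):

```python
import numpy as np

# A legal alignment path for nx = 3, ny = 4 (1-based frame indices).
path = [(1, 1), (2, 2), (2, 3), (3, 4)]
m, nx, ny = len(path), 3, 4

# Binary selection matrices Wx (m x nx) and Wy (m x ny) encoding the path.
Wx = np.zeros((m, nx))
Wy = np.zeros((m, ny))
for t, (i, j) in enumerate(path):
    Wx[t, i - 1] = Wy[t, j - 1] = 1.0

rng = np.random.default_rng(1)
X = rng.standard_normal((2, nx))
Y = rng.standard_normal((2, ny))

# Eq. 2: sum of squared distances along the path.
J_path = sum(np.sum((X[:, i - 1] - Y[:, j - 1]) ** 2) for i, j in path)
# Eq. 4: Frobenius form with the selection matrices.
J_mat = np.linalg.norm(X @ Wx.T - Y @ Wy.T, "fro") ** 2
# Dx = Wx^T Wx is diagonal and counts how often each frame of X is used.
Dx = Wx.T @ Wx
```

Since every row of Wx and Wy contains exactly one 1, the two costs coincide and Dx is diagonal.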
It extends previous work on CCA by adding temporal alignment, and on DTW by allowing a feature selection and dimensionality reduction mechanism for aligning signals of different dimensions.\n\n3.2 Optimization for CTW\n\nAlgorithm 1: Canonical Time Warping\ninput: X, Y\noutput: Vx, Vy, Wx, Wy\nbegin\n  Initialize Vx = I_dx, Vy = I_dy\n  repeat\n    Use dynamic programming to compute Wx, Wy aligning the sequences Vx^T X and Vy^T Y\n    Set the columns of V^T = [Vx^T, Vy^T] to the leading b generalized eigenvectors of:\n      [ 0, X W Y^T ; Y W^T X^T, 0 ] V = [ X Dx X^T, 0 ; 0, Y Dy Y^T ] V Λ\n  until J_ctw converges\nend\n\nOptimizing J_ctw is a non-convex optimization problem with respect to the alignment matrices (Wx, Wy) and the projection matrices (Vx, Vy). We alternate between solving for Wx, Wy using DTW and optimally computing the spatial projections using CCA. These steps monotonically decrease J_ctw, and since the function is bounded below, it converges to a critical point. Alg. 1 illustrates the optimization process (e.g., Fig. 3e). The algorithm starts by initializing Vx and Vy with identity matrices. Alternatively, if dx ≠ dy, PCA can be applied independently to each set and used as the initial estimate of Vx and Vy. In the case of high-dimensional data, the generalized eigenvalue problem is solved by regularizing the covariance matrices with a scaled identity matrix. The dimension b is selected to preserve 90% of the total correlation. We consider the algorithm to have converged when the difference between two consecutive values of J_ctw is small.\n\n4 Local canonical time warping (LCTW)\n\nIn the previous section we have illustrated how CTW can align in space and time two time series of different dimensionality. 
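Alg. 1 can be sketched end-to-end in a few dozen lines: alternate a DTW step on the projected sequences with a CCA step solved as a generalized eigenproblem. The code below is an illustrative NumPy/SciPy reimplementation under the paper's notation, not the authors' code; the zero-mean constraint is omitted and a small ridge regularizes the covariances, as the text suggests for ill-conditioned data.

```python
import numpy as np
from scipy.linalg import eigh

def dtw_path(X, Y):
    """Standard DTW dynamic program over columns; returns a 0-based path."""
    nx, ny = X.shape[1], Y.shape[1]
    D = np.full((nx + 1, ny + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            c = np.sum((X[:, i - 1] - Y[:, j - 1]) ** 2)
            D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path, i, j = [], nx, ny
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if k == 0 else (i - 1, j) if k == 1 else (i, j - 1)
    return path[::-1]

def ctw(X, Y, b=1, iters=5, ridge=1e-8):
    """Illustrative CTW sketch: alternate DTW (time) and CCA (space)."""
    dx, dy = X.shape[0], Y.shape[0]
    Vx, Vy = np.eye(dx)[:, :b], np.eye(dy)[:, :b]      # init as in Alg. 1
    for _ in range(iters):
        path = dtw_path(Vx.T @ X, Vy.T @ Y)             # temporal step
        m = len(path)
        Wx = np.zeros((m, X.shape[1]))
        Wy = np.zeros((m, Y.shape[1]))
        for t, (i, j) in enumerate(path):
            Wx[t, i] = Wy[t, j] = 1.0
        Dx, Dy, W = Wx.T @ Wx, Wy.T @ Wy, Wx.T @ Wy     # spatial step (CCA)
        A = np.block([[np.zeros((dx, dx)), X @ W @ Y.T],
                      [Y @ W.T @ X.T, np.zeros((dy, dy))]])
        B = np.block([[X @ Dx @ X.T, np.zeros((dx, dy))],
                      [np.zeros((dy, dx)), Y @ Dy @ Y.T]])
        _, V = eigh(A, B + ridge * np.eye(dx + dy))     # generalized eigenproblem
        V = V[:, ::-1][:, :b]                            # leading b eigenvectors
        Vx, Vy = V[:dx], V[dx:]
    return Vx, Vy, Wx, Wy

# Same 2-D latent sinusoid, mixed differently and sampled at different rates.
tx, ty = np.linspace(0, 2 * np.pi, 40), np.linspace(0, 2 * np.pi, 60)
X = np.array([[1.0, 0.5], [0.2, 1.0]]) @ np.vstack([np.sin(tx), np.cos(tx)])
Y = np.array([[0.7, -0.3], [0.4, 0.9]]) @ np.vstack([np.sin(ty), np.cos(ty)])
Vx, Vy, Wx, Wy = ctw(X, Y)
r = np.corrcoef((Vx.T @ X @ Wx.T)[0], (Vy.T @ Y @ Wy.T)[0])[0, 1]
```

On this toy pair, which is exactly alignable in space and time, the canonical variates of the warped sequences end up strongly correlated.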
However, there are many situations (e.g., aligning long sequences) where a single global transformation of the whole time series is not accurate. For these cases, local models have been shown to provide better performance [3, 24, 25]. This section extends CTW by allowing multiple local spatial deformations.\n\n4.1 Energy function for LCTW\n\nLet us assume that the spatial transformation for each frame in X and Y can be modeled as a linear combination of kx or ky bases. Let Vx = [Vx_1^T, ..., Vx_kx^T]^T ∈ R^{kx dx×b}, Vy = [Vy_1^T, ..., Vy_ky^T]^T ∈ R^{ky dy×b} and b ≤ min(kx dx, ky dy). LCTW allows for a more flexible spatial warping by minimizing:\n\nJ_lctw(Wx, Wy, Vx, Vy, Rx, Ry) = Σ_{i=1}^{nx} Σ_{j=1}^{ny} (w_i^x)^T w_j^y ||(Σ_{cx=1}^{kx} r_{i,cx}^x Vx_cx^T) x_i − (Σ_{cy=1}^{ky} r_{j,cy}^y Vy_cy^T) y_j||^2 + Σ_{cx=1}^{kx} ||Fx r_cx^x||^2 + Σ_{cy=1}^{ky} ||Fy r_cy^y||^2\n= ||Vx^T Zx Wx^T − Vy^T Zy Wy^T||_F^2 + ||Fx Rx||_F^2 + ||Fy Ry||_F^2,   (6)\n\nwhere Zx = (1_kx ⊗ X) ∘ (Rx^T ⊗ 1_dx), Zy = (1_ky ⊗ Y) ∘ (Ry^T ⊗ 1_dy), and Rx ∈ R^{nx×kx}, Ry ∈ R^{ny×ky} are the weighting matrices. r_{i,cx}^x denotes the coefficient (or weight) of the cx-th basis for the ith frame of X (similarly for r_{j,cy}^y). We further constrain the weights to be positive (i.e., Rx, Ry ≥ 0) and the sum of the weights to be one (i.e., Rx 1_kx = 1_nx, Ry 1_ky = 1_ny) for each frame. 
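The compact form in Eq. 6 relies on the stacked matrix Zx = (1_kx ⊗ X) ∘ (Rx^T ⊗ 1_dx). A small NumPy sketch (illustrative sizes) verifies that Vx^T Zx reproduces, column by column, the per-frame weighted combination of bases:

```python
import numpy as np

rng = np.random.default_rng(2)
dx, nx, kx, b = 3, 5, 2, 2
X = rng.standard_normal((dx, nx))
Rx = rng.random((nx, kx))
Rx /= Rx.sum(axis=1, keepdims=True)       # per-frame weights sum to one
Vx = rng.standard_normal((kx * dx, b))    # stacked bases [Vx_1; Vx_2]

# Compact form: Zx = (1_kx (x) X) o (Rx^T (x) 1_dx), using Kronecker/Hadamard.
Zx = np.kron(np.ones((kx, 1)), X) * np.kron(Rx.T, np.ones((dx, 1)))
compact = Vx.T @ Zx

# Per-frame form: (sum_cx r_{i,cx} Vx_cx^T) x_i for each frame i.
blocks = [Vx[c * dx:(c + 1) * dx] for c in range(kx)]
perframe = np.column_stack([
    sum(Rx[i, c] * (blocks[c].T @ X[:, i]) for c in range(kx))
    for i in range(nx)
])
```

Both forms agree, which is what lets Eq. 6 reuse the machinery of Eq. 5 with Zx, Zy in place of X, Y.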
The last two terms are regularizers, in which Fx ∈ R^{nx×nx} and Fy ∈ R^{ny×ny} are first-order differential operators acting on r_cx^x ∈ R^{nx×1} and r_cy^y ∈ R^{ny×1}, encouraging solutions that are smooth over time. Observe that J_ctw is a special case of J_lctw when kx = ky = 1.\n\n4.2 Optimization for LCTW\n\nAlgorithm 2: Local Canonical Time Warping\ninput: X, Y\noutput: Wx, Wy, Vx, Vy, Rx, Ry\nbegin\n  Initialize Vx = 1_kx ⊗ I_dx, Vy = 1_ky ⊗ I_dy,\n    r_{i,cx}^x = 1 for ⌊(cx − 1) nx / kx⌋ < i ≤ ⌊cx nx / kx⌋,\n    r_{j,cy}^y = 1 for ⌊(cy − 1) ny / ky⌋ < j ≤ ⌊cy ny / ky⌋\n  repeat\n    Denote Zx = (1_kx ⊗ X) ∘ (Rx^T ⊗ 1_dx), Zy = (1_ky ⊗ Y) ∘ (Ry^T ⊗ 1_dy),\n      Qx = Vx^T (I_kx ⊗ X), Qy = Vy^T (I_ky ⊗ Y)\n    Use dynamic programming to compute Wx, Wy between the sequences Vx^T Zx and Vy^T Zy\n    Set the columns of V^T = [Vx^T, Vy^T] to the leading b generalized eigenvectors of:\n      [ 0, Zx W Zy^T ; Zy W^T Zx^T, 0 ] V = [ Zx Dx Zx^T, 0 ; 0, Zy Dy Zy^T ] V Λ\n    Set r = vec([Rx, Ry]) to the solution of the quadratic programming problem:\n      min_r r^T [ 1_{kx×kx} ⊗ Dx ∘ Qx^T Qx + I_kx ⊗ Fx^T Fx , −1_{kx×ky} ⊗ W ∘ Qx^T Qy ; −1_{ky×kx} ⊗ W^T ∘ Qy^T Qx , 1_{ky×ky} ⊗ Dy ∘ Qy^T Qy + I_ky ⊗ Fy^T Fy ] r\n      s.t. [ 1_kx^T ⊗ I_nx , 0 ; 0 , 1_ky^T ⊗ I_ny ] r = 1_{nx+ny},  r ≥ 0_{nx kx + ny ky}\n  until J_lctw converges\nend\n\nAs in the case of CTW, we use an alternating scheme for optimizing J_lctw, which is summarized in Alg. 2. 
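The r-step of Alg. 2 is a quadratic program over the stacked weights with per-frame simplex constraints (rows of Rx sum to one, entries non-negative). A generic sketch of that step, simplified to a single sequence's weights and with a random convex quadratic standing in for the Kronecker-structured Hessian (illustrative only, not the paper's exact matrices):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
nx, kx = 4, 2                                 # toy sizes: Rx is nx x kx

# A random convex quadratic H standing in for the Hessian of Alg. 2.
M = rng.standard_normal((nx * kx, nx * kx))
H = M.T @ M + 1e-3 * np.eye(nx * kx)

# Equality constraint [1_kx^T (x) I_nx] r = 1_nx: per-frame weights sum to one.
A = np.kron(np.ones((1, kx)), np.eye(nx))
r0 = np.full(nx * kx, 1.0 / kx)               # feasible starting point

res = minimize(lambda r: r @ H @ r, r0,
               jac=lambda r: 2 * H @ r,
               constraints=[{"type": "eq", "fun": lambda r: A @ r - 1.0}],
               bounds=[(0.0, None)] * (nx * kx),
               method="SLSQP")
Rx = res.x.reshape((nx, kx), order="F")       # unstack r = vec(Rx)
```

The solver returns weights that stay on the per-frame simplex, mirroring the constraints Rx 1_kx = 1_nx and Rx ≥ 0 of the text.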
In the initialization, we assume that each time series is divided into kx or ky equal parts, with identity matrices as the starting values for Vx_cx, Vy_cy and block-structured matrices for Rx, Ry.\n\nThe main difference between the alternating schemes of Alg. 1 and Alg. 2 is that the alternation step is no longer unique. For instance, when fixing Vx, Vy, one can optimize either Wx, Wy or Rx, Ry. Consider a simple example of warping sin(t1) towards sin(t2): one could shift the first sequence along the time axis by δt = t2 − t1, or apply the linear transformation a_{t1} sin(t1) + b_{t1}, where a_{t1} = cos(t2 − t1) and b_{t1} = cos(t1) sin(t2 − t1). In order to better control the trade-off between time warping and spatial transformation, we propose a stochastic selection process. Let p_{w|v} denote the conditional probability of optimizing W when fixing V. Given the prior probabilities [pw, pv, pr], we can derive the conditional probabilities, [p_{v|w}, p_{w|v}, p_{w|r}]^T = A^{-1} b, using Bayes' theorem and the fact that [p_{r|w}, p_{r|v}, p_{v|r}] = 1 − [p_{v|w}, p_{w|v}, p_{w|r}], where A and b are assembled from the priors. Fig. 3f (right-lower corner) shows the optimization strategy for pw = .5, pv = .3, pr = .2, where the time warping process is optimized most often.\n\n5 Experiments\n\nThis section demonstrates the benefits of CTW and LCTW against state-of-the-art DTW approaches to align synthetic data, motion capture data of two subjects performing similar actions, and similar facial expressions made by two people.\n\n5.1 Synthetic data\n\nIn the first experiment we synthetically generated two spatio-temporal signals (3-D in space and 1-D in time) to evaluate the performance of CTW and LCTW. 
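The stochastic selection of which parameter block to update next can be sketched by sampling from the priors used in the experiments (a toy illustration; deriving the exact conditionals follows the Bayes relations in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
priors = {"W": 0.5, "V": 0.3, "R": 0.2}   # pw, pv, pr from the experiments

# Draw, at each alternation step, which parameter block to optimize, so
# the time warping (W) is updated more often than V or R on average.
steps = rng.choice(list(priors), size=2000, p=list(priors.values()))
freq = {k: float(np.mean(steps == k)) for k in priors}
```

With these priors the empirical update frequencies concentrate around 0.5, 0.3 and 0.2, i.e., the time warping step dominates the schedule.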
The first two spatial dimensions and the time dimension are generated as follows: X = Ux^T Z Mx^T and Y = Uy^T Z My^T, where Z ∈ R^{2×m} is a curve in two dimensions (Fig. 3a). Ux, Uy ∈ R^{2×2} are randomly generated affine transformation matrices for the spatial warping, and Mx ∈ R^{nx×m}, My ∈ R^{ny×m}, with m ≥ max(nx, ny), are randomly generated matrices for the time warping2. The third spatial dimension is generated by adding a (1×nx) or (1×ny) extra row, with zero-mean Gaussian noise, to X and Y respectively (see Fig. 3a-b).\n\nWe compared the performance of CTW and LCTW against three other methods: (i) dynamic time warping (DTW) [14], (ii) derivative dynamic time warping (DDTW) [15] and (iii) iterative motion warping (IMW) [3]. Recall that in the case of synthetic data we know the ground truth alignment matrix, W_truth = Mx My^T. The error between the ground truth and a given alignment W_alg is computed as the area enclosed between both paths (see Fig. 3g).\n\nFig. 3c-f shows the spatial warping estimated by each algorithm. DDTW (Fig. 3c) cannot deal with this example because the feature derivatives do not capture well the structure of the sequence. IMW (Fig. 3d) warps one sequence towards the other by translating and re-scaling each frame in each dimension. Fig. 3h shows the testing error (in space and time) for 100 newly generated time series. As can be observed, CTW and LCTW obtain the best performance. IMW has more parameters (O(dn)) than CTW (O(db)) and LCTW (O(kdb + kn)), and hence IMW is more prone to overfitting. IMW tries to fit the noisy dimension (the 3rd spatial component), biasing the alignment in time (Fig. 
3g), whereas CTW and LCTW have a feature selection mechanism which effectively cancels the third dimension. Observe that the null space of the projection matrices in CTW is v_x = [.002, .001, −.067]^T, v_y = [−.002, −.001, −.071]^T.\n\n5.2 Motion capture data\n\nIn the second experiment we apply CTW and LCTW to align human motion with similar behavior. The motion capture data is taken from the CMU Multimodal Activity Database [26]. We selected a pair of sub-sequences of subject 1 and subject 3 cooking brownies. Typically, each sequence contains 500-1000 frames. For each instance we computed the quaternions for the 20 joints, resulting in a 60-dimensional feature vector that describes the body configuration. CTW and LCTW are initialized as described in the previous sections and optimized until convergence. The parameters of LCTW are manually set to kx = 3, ky = 3 and pw = .5, pv = .3, pr = .2.\n\n2 The generation of the time transformation matrix Mx (similarly for My) is initialized by setting Mx = I_nx. Then, m − nx columns of Mx are randomly picked and replicated. We normalize each row (Mx 1_m = 1_nx) so that each new frame is an interpolation of the z_i.\n\nFigure 3: Example with synthetic data. Time series are generated by (a) a spatio-temporal transformation of a 2-D latent sequence and (b) adding Gaussian noise in the 3rd dimension. The result of the space warping is computed by (c) derivative dynamic time warping (DDTW), (d) iterative motion warping (IMW), (e) canonical time warping (CTW) and (f) local canonical time warping (LCTW). The energy function and the order of optimizing the parameters for CTW and LCTW are shown in the top-right and lower-right corners of the graphs. (g) Comparison of the alignment results for several methods. (h) Mean and variance of the alignment error.\n\nFigure 4: Example of motion capture data alignment. (a) PCA. (b) CTW. (c) LCTW. (d) Alignment path. (e) Motion capture data. 
The 1st row shows the first subject, and the remaining rows the aligned second subject.\n\nFig. 4 shows the alignment results for the action of opening a cabinet. The projection onto the principal components for both sequences can be seen in Fig. 4a. CTW and LCTW project the sequences into a low-dimensional space that maximizes the correlation (Fig. 4b-c). Fig. 4d shows the alignment path. In this case we do not have ground truth data, and we evaluated the results visually. The first row of Fig. 4e shows a few instances of the first subject, and the last three rows the alignment of the third subject for DTW, CTW and LCTW. Observe that CTW and LCTW achieve better temporal alignment.\n\n5.3 Facial expression data\n\nIn this experiment we tested the ability of CTW and LCTW to align facial expressions. We took 29 subjects from the RU-FACS database [27], which consists of interviews with men and women of varying ethnicity. The action units (AUs) in this database have been manually coded, and we selected AU12 (smiling) to run our experiments. Each event of AU12 is coded with an onset (start), peak and offset (end). 
We used a person-specific AAM [28] to track 66 landmark points on the face. For the alignment of AU12 we only used the 18 landmarks corresponding to the outline of the mouth, so for each frame we have a vector (R^{36×1}) of (x, y) coordinates.\n\nWe took subjects 14 and 30 and ran CTW and LCTW on the segments where AU12 was coded. The parameters of LCTW are manually set to kx = 3, ky = 3 and pw = .5, pv = .3, pr = .2. Fig. 5 shows the results of the alignment. Fig. 5b-c shows that the low-dimensional projections obtained with CTW and LCTW are better aligned than those of DTW in Fig. 5a. Fig. 5d shows the position of the peak frame as the intersection of the two dotted lines. As we can observe from Fig. 5d, the alignment paths found by CTW and LCTW are closer to the manually labeled peaks than the ones found by DTW. This shows that CTW and LCTW provide better alignment, because the manually labeled peaks in both sequences should be aligned. Fig. 5e shows several frames illustrating the alignment.\n\nFigure 5: Example of facial expression alignment. (a) PCA. (b) CTW. (c) LCTW. (d) Alignment path. (e) Frames from an AU12 event. The AU peaks are indicated by arrows.\n\n6 Conclusions\n\nIn this paper we proposed CTW and LCTW for the spatio-temporal alignment of time series. CTW integrates the benefits of DTW and CCA into a clean and simple formulation. CTW extends DTW by adding a feature selection mechanism and enables alignment of signals with different dimensionality. CTW extends CCA by adding temporal alignment and allowing local spatial projections. We illustrated the benefits of CTW for the alignment of motion capture data and facial expressions.\n\n7 Acknowledgements\n\nThis material is based upon work partially supported by the National Science Foundation under Grant No. 
EEC-0540865.\n\nReferences\n\n[1] I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez. Cross-view action recognition from temporal self-similarities. In ECCV, pages 293–306, 2008.\n[2] F. Zhou, F. de la Torre, and J. K. Hodgins. Aligned cluster analysis for temporal segmentation of human motion. In FGR, pages 1–7, 2008.\n[3] E. Hsu, K. Pulli, and J. Popovic. Style translation for human motion. In SIGGRAPH, 2005.\n[4] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In CVPR, pages 994–999, 1997.\n[5] M. Brand and A. Hertzmann. Style machines. In SIGGRAPH, pages 183–192, 2000.\n[6] G. W. Taylor, G. E. Hinton, and S. T. Roweis. Modeling human motion using binary latent variables. In NIPS, volume 19, page 1345, 2007.\n[7] A. Heloir, N. Courty, S. Gibet, and F. Multon. Temporal alignment of communicative gesture sequences. J. Visual. Comp. Animat., 17(3-4):347–357, 2006.\n[8] A. Shapiro, Y. Cao, and P. Faloutsos. Style components. In Graphics Interface, pages 33–39, 2006.\n[9] G. Liu, Z. Pan, and Z. Lin. Style subspaces for character animation. J. Visual. Comp. Animat., 19(3-4):199–209, 2008.\n[10] A. M. Elgammal and C.-S. Lee. Separating style and content on a nonlinear manifold. In CVPR, 2004.\n[11] Y. Caspi and M. Irani. Aligning non-overlapping sequences. Int. J. Comput. Vis., 48(1):39–51, 2002.\n[12] C. Rao, A. Gritai, M. Shah, and T. Fathima Syeda-Mahmood. View-invariant alignment and matching of video sequences. In ICCV, pages 939–945, 2003.\n[13] A. Gritai, Y. Sheikh, C. Rao, and M. Shah. Matching trajectories of anatomical landmarks under viewpoint, anthropometric and temporal transforms. Int. J. Comput. 
Vis., 2009.\n[14] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.\n[15] E. J. Keogh and M. J. Pazzani. Derivative dynamic time warping. In SIAM ICDM, 2001.\n[16] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili. Multiple alignment of continuous time series. In NIPS, pages 817–824, 2005.\n[17] Y. Sheikh, M. Sheikh, and M. Shah. Exploring the space of a human action. In ICCV, 2005.\n[18] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley-Interscience, 2003.\n[19] F. de la Torre. A unification of component analysis methods. Handbook of Pattern Recognition and Computer Vision, 2009.\n[20] T. K. Kim and R. Cipolla. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Trans. Pattern Anal. Mach. Intell., 31:1415–1428, 2009.\n[21] C. C. Loy, T. Xiang, and S. Gong. Multi-camera activity correlation analysis. In CVPR, 2009.\n[22] B. Fischer, V. Roth, and J. Buhmann. Time-series alignment by non-negative multiple generalized canonical correlation analysis. BMC Bioinformatics, 8(10), 2007.\n[23] D. P. Bertsekas. Dynamic Programming and Optimal Control. 1995.\n[24] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. University of Toronto Tech. Rep., 1997.\n[25] J. J. Verbeek, S. T. Roweis, and N. A. Vlassis. Non-linear CCA and PCA by alignment of local models. In NIPS, 2003.\n[26] F. de la Torre, J. K. Hodgins, J. Montano, S. Valcarcel, A. Bargteil, X. Martin, J. Macey, A. Collado, and P. Beltran. Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database. Carnegie Mellon University Tech. Rep., 2009.\n[27] M. S. Bartlett, G. C. Littlewort, M. G. Frank, C. Lainscsek, I. Fasel, and J. R. Movellan. Automatic recognition of facial actions in spontaneous expressions. J. Multimed., 1(6):22–35, 2006.\n[28] I. Matthews and S. Baker. 
Active appearance models revisited. Int. J. Comput. Vis., 60(2):135–164, 2004.", "award": [], "sourceid": 760, "authors": [{"given_name": "Feng", "family_name": "Zhou", "institution": null}, {"given_name": "Fernando", "family_name": "Torre", "institution": null}]}