{"title": "Metric Learning for Temporal Sequence Alignment", "book": "Advances in Neural Information Processing Systems", "page_first": 1817, "page_last": 1825, "abstract": "In this paper, we propose to learn a Mahalanobis distance to perform alignment of multivariate time series. The learning examples for this task are time series for which the true alignment is known. We cast the alignment problem as a structured prediction task, and propose realistic losses between alignments for which the optimization is tractable. We provide experiments on real data in the audio-to-audio context, where we show that the learning of a similarity measure leads to improvements in the performance of the alignment task. We also propose to use this metric learning framework to perform feature selection and, from basic audio features, build a combination of these with better alignment performance.", "full_text": "Metric Learning for Temporal Sequence Alignment\n\nDamien Garreau \u2217\u2020\n\nENS\n\ndamien.garreau@ens.fr\n\nR\u00b4emi Lajugie \u2217\u2020\n\nINRIA\n\nremi.lajugie@inria.fr\n\nSylvain Arlot \u2020\n\nCNRS\n\nsylvain.arlot@ens.fr\n\nFrancis Bach \u2020\n\nINRIA\n\nfrancis.bach@inria.fr\n\nAbstract\n\nIn this paper, we propose to learn a Mahalanobis distance to perform alignment\nof multivariate time series. The learning examples for this task are time series for\nwhich the true alignment is known. We cast the alignment problem as a structured\nprediction task, and propose realistic losses between alignments for which the\noptimization is tractable. We provide experiments on real data in the audio-to-\naudio context, where we show that the learning of a similarity measure leads to\nimprovements in the performance of the alignment task. 
We also propose to use this metric learning framework to perform feature selection and, from basic audio features, build a combination of these with better alignment performance.\n\n1 Introduction\n\nThe problem of aligning temporal sequences is ubiquitous in applications ranging from bioinformatics [5, 1, 23] to audio processing [4, 6]. The goal is to align two similar time series that have the same global structure, but local temporal differences. Most alignment algorithms rely on similarity measures, and having a good metric is crucial, especially in the high-dimensional setting where some features of the signals can be irrelevant to the alignment task. The goal of this paper is to show how to learn this similarity measure from annotated examples in order to improve the relevance of the alignments.\nFor example, in the context of music information retrieval, alignment is used in two different cases: (1) audio-to-audio alignment and (2) audio-to-score alignment. In the first case, the goal is to match two audio interpretations of the same piece that are potentially different in rhythm, whereas audio-to-score alignment focuses on matching an audio signal to a symbolic representation of the score. In the second case, there are some attempts to learn from annotated data a measure for performing the alignment. Joder et al. [12] propose to fit a generative model in that context, and Keshet et al. [13] learn this measure in a discriminative setting.\nSimilarly to Keshet et al. [13], we use a discriminative loss to learn the measure, but our work focuses on audio-to-audio alignment. In that context, the set of authorized alignments is much larger, and we explicitly cast the problem as a structured prediction task, which we solve using off-the-shelf stochastic optimization techniques [15] but with proper and significant adjustments, in particular in terms of losses. 
The ideas of alignment are also very relevant to the speech recognition community since the pioneering work of Sakoe and Chiba [19].\n\n\u2217Contributed equally\n\u2020SIERRA project-team, Département d\u2019Informatique de l\u2019École Normale Supérieure (CNRS, INRIA, ENS)\n\nThe need for metric learning goes far beyond unsupervised partitioning problems. Weinberger and Saul [26] proposed a large-margin framework for learning a metric in nearest-neighbour algorithms based on sets of must-link/must-not-link constraints. Lajugie et al. [16] proposed to use a large-margin framework to learn a Mahalanobis metric in the context of partitioning problems. Since structured SVMs were proposed by Tsochantaridis et al. [25] and Taskar et al. [22], they have successfully been used to solve many learning problems, for instance to learn weights for graph matching [3] or a metric for ranking tasks [17]. They have also been used to learn graph structures using graph cuts [21].\nWe make the following five contributions:\n\u2013 We cast the learning of a Mahalanobis metric in the context of alignment as a structured prediction problem.\n\u2013 We show that on real musical datasets this metric improves the performance of alignment algorithms using high-level features.\n\u2013 We propose to use the metric learning framework to learn combinations of basic audio features and get good alignment performance.\n\u2013 We show experimentally that the standard Hamming loss, although computationally tractable, does not allow learning a relevant similarity measure in some real-world settings.\n\u2013 We propose a new loss, closer to the true evaluation loss for alignments, leading to a tractable learning task, and derive an efficient Frank-Wolfe-based algorithm to deal with this new loss. That loss solves some issues encountered with the Hamming loss.\n\n2 Matrix formulation of alignment problems\n\n2.1 Notations\n\nIn this 
paper, we consider the alignment problem between two multivariate time series sharing the same dimension p, but possibly of different lengths $T_A$ and $T_B$, namely $A \in \mathbb{R}^{T_A \times p}$ and $B \in \mathbb{R}^{T_B \times p}$. We refer to the rows of A as $a_1, \dots, a_{T_A} \in \mathbb{R}^p$ and to those of B as $b_1, \dots, b_{T_B} \in \mathbb{R}^p$, seen as column vectors. From now on, we denote by X the pair of signals (A, B).\nLet $C(X) \in \mathbb{R}^{T_A \times T_B}$ be an arbitrary pairwise affinity matrix associated to the pair X, that is, $C(X)_{i,j}$ encodes the affinity between $a_i$ and $b_j$. Note that our framework can be extended to the case where A and B are multivariate signals of different dimensions, as long as C(X) is well-defined. The goal of the alignment task is to find two non-decreasing sequences of indices $\alpha$ and $\beta$ of the same length $u \geq \max(T_A, T_B)$, matching each time index $\alpha(i)$ in the time series A to the time index $\beta(i)$ in the time series B, in such a way that $\sum_{i=1}^{u} C(X)_{\alpha(i),\beta(i)}$ is maximal and $(\alpha, \beta)$ satisfies\n\n$\alpha(1) = \beta(1) = 1$ (matching beginnings),\n$\alpha(u) = T_A, \ \beta(u) = T_B$ (matching endings),\n$\forall i, \ (\alpha(i+1), \beta(i+1)) - (\alpha(i), \beta(i)) \in \{(1,0), (0,1), (1,1)\}$ (three types of moves). (1)\n\nFor a given $(\alpha, \beta)$, we define the binary matrix $Y \in \{0,1\}^{T_A \times T_B}$ such that $Y_{\alpha(i),\beta(i)} = 1$ for every $i \in \{1, \dots, u\}$, and 0 otherwise. We denote by $\mathcal{Y}(X)$ the set of such matrices, which is uniquely determined by $T_A$ and $T_B$. An example is given in Fig. 1. A vertical move in the Y matrix means that the signal B is waiting for A, whereas a horizontal one means that A is waiting for B, and a diagonal move means that they move together. 
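The constraints of Eq. (1) and the matrix encoding of an alignment are easy to make concrete. Below is a minimal pure-Python sketch (the function name and layout are illustrative, not taken from the paper's implementation) that builds Y from the index sequences (alpha, beta) and checks the three constraints along the way:

```python
def indices_to_matrix(alpha, beta, TA, TB):
    """Build the binary alignment matrix Y from 1-based index sequences.

    The constraints of Eq. (1) are checked explicitly: matching beginnings,
    matching endings, and the three allowed types of moves.
    """
    assert len(alpha) == len(beta) >= max(TA, TB)
    assert alpha[0] == beta[0] == 1                    # matching beginnings
    assert alpha[-1] == TA and beta[-1] == TB          # matching endings
    for i in range(len(alpha) - 1):
        move = (alpha[i + 1] - alpha[i], beta[i + 1] - beta[i])
        assert move in {(1, 0), (0, 1), (1, 1)}        # three types of moves
    Y = [[0] * TB for _ in range(TA)]
    for a, b in zip(alpha, beta):
        Y[a - 1][b - 1] = 1
    return Y

# one horizontal move (A waits), one diagonal move, one vertical move (B waits)
Y = indices_to_matrix([1, 1, 2, 3], [1, 2, 3, 3], TA=3, TB=3)
```

Any matrix of the set Y(X) can be produced this way; the dynamic time warping decoder of Sec. 2.1 returns exactly such matrices.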
In this sense the time reference is \u201cwarped\u201d.\nWhen C(X) is known, the alignment task can be cast as the following linear program (LP) over the set $\mathcal{Y}(X)$:\n\n$\max_{Y \in \mathcal{Y}(X)} \operatorname{Tr}(C(X)^\top Y)$. (2)\n\nOur goal is to learn how to form the affinity matrix: once we have learned C(X), the alignment is obtained from Eq. (2). The optimization problem in Eq. (2) will be referred to as the decoding of our model.\n\nFigure 1: Example of two valid alignments encoded by matrices $Y^1$ and $Y^2$. Red upper triangles show the (i, j) such that $Y^1_{i,j} = 1$, and the blue lower ones show the (i, j) such that $Y^2_{i,j} = 1$. The grey zone corresponds to the area loss $\delta_{\mathrm{abs}}$ between $Y^1$ and $Y^2$.\n\nDynamic time warping. Given the affinity matrix C(X) associated with the pair of signals X = (A, B), finding the alignment that solves the LP of Eq. (2) can be done efficiently in $O(T_A T_B)$ using a dynamic programming algorithm. It is often referred to as dynamic time warping [5, 18]. This algorithm is described in Alg. 1 of the supplementary material. Various additional constraints may be used in the dynamic time warping algorithm [18], which we could easily add to Alg. 1.\nThe cardinality of the set $\mathcal{Y}(X)$ is huge: it corresponds to the number of paths on a rectangular grid from the southwest corner (1, 1) to the northeast corner $(T_A, T_B)$ with vertical, horizontal and diagonal moves allowed. These are the Delannoy numbers [2]. As noted in [24], when $t = T_A = T_B$ goes to infinity, one can show that $\#\mathcal{Y}_{t,t} \sim \frac{(3+2\sqrt{2})^t}{2\sqrt{(3\sqrt{2}-4)\pi t}}$.\n\n2.2 The Mahalanobis metric\n\nIn many applications (see, e.g., [6]), for a pair X = (A, B), the affinity matrix is computed by $C(A, B)_{i,j} = -\|a_i - b_j\|^2$. In this paper we propose to learn the metric used to compare $a_i$ and $b_j$ instead of using the plain Euclidean metric. 
That is, C(X) is parametrized by a matrix $W \in \mathcal{W} \subset \mathbb{R}^{p \times p}$, where $\mathcal{W}$ is the set of positive semidefinite matrices, and we use the corresponding Mahalanobis metric to compute the pairwise affinity between $a_i$ and $b_j$:\n\n$C(X; W)_{i,j} = -(a_i - b_j)^\top W (a_i - b_j)$. (3)\n\nNote that the decoding of Eq. (2) is the maximization of a linear function in the parameter W:\n\n$\max_{Y \in \mathcal{Y}(X)} \operatorname{Tr}(C(X; W)^\top Y) \Leftrightarrow \max_{Y \in \mathcal{Y}(X)} \operatorname{Tr}(W^\top \phi(X, Y))$, (4)\n\nif we define the joint feature map\n\n$\phi(X, Y) = -\sum_{i=1}^{T_A} \sum_{j=1}^{T_B} Y_{i,j} (a_i - b_j)(a_i - b_j)^\top \in \mathbb{R}^{p \times p}$. (5)\n\n3 Learning the metric\n\nFrom now on, we assume that we are given n pairs of training instances\u00b9 $(X^i, Y^i) = ((A^i, B^i), Y^i) \in \mathbb{R}^{T_A^i \times p} \times \mathbb{R}^{T_B^i \times p} \times \{0, 1\}^{T_A^i \times T_B^i}$, $i = 1, \dots, n$. Our goal is to find a matrix W such that the predicted alignments are close to the groundtruth on these examples, as well as on unseen examples. We first define a loss between alignments, in order to quantify the proximity between alignments.\n\n\u00b9We will see that it is necessary to have fully labelled instances, which means that for each pair $X^i$ we need an exact alignment $Y^i$ between $A^i$ and $B^i$. Partial alignment might be dealt with by alternating between metric learning and constrained alignment.\n\n3.1 Losses between alignments\n\nIn our framework, the alignments are encoded by matrices in $\mathcal{Y}(X)$, thus we are interested in functions $\ell : \mathcal{Y}(X) \times \mathcal{Y}(X) \to \mathbb{R}_+$. The Frobenius norm is defined by $\|M\|_F^2 = \sum_{i,j} M_{i,j}^2$.\nHamming loss. A simple loss between matrices is the Frobenius norm of their difference, which turns out to be the unnormalized Hamming loss [9] for 0/1-valued matrices. 
For two matrices $Y_1, Y_2 \in \mathcal{Y}(X)$, it is defined as\n\n$\ell_H(Y_1, Y_2) = \|Y_1 - Y_2\|_F^2 = \operatorname{Tr}(Y_1^\top Y_1) + \operatorname{Tr}(Y_2^\top Y_2) - 2\operatorname{Tr}(Y_1^\top Y_2) = \operatorname{Tr}(Y_1 1_{T_B} 1_{T_A}^\top) + \operatorname{Tr}(Y_2 1_{T_B} 1_{T_A}^\top) - 2\operatorname{Tr}(Y_1^\top Y_2)$, (6)\n\nwhere $1_T$ is the vector of $\mathbb{R}^T$ with all coordinates equal to 1. The last equality in Eq. (6) comes from the fact that the $Y_i$ have 0/1 values; this makes the Hamming loss affine in $Y_1$ and $Y_2$. This loss is often used in other structured prediction tasks [15]; in the audio-to-score setting, Keshet et al. [13] use a modified version of this loss, namely the average number of times the difference between the two alignments is greater than a fixed threshold.\nThis loss is easy to optimize, since it is linear in our parametrization of the alignment problem, but it is not optimal for audio-to-audio alignment. Indeed, a major drawback of the Hamming loss is that, for alignments of fixed length, it depends only on the number of \u201ccrossings\u201d between alignment paths: one can easily find $Y_1, Y_2, Y_3$ such that $\ell_H(Y_2, Y_1) = \ell_H(Y_3, Y_1)$ but $Y_2$ is much closer to $Y_1$ than $Y_3$ (see Fig. 2). It is important to notice that this is often the case when the length of the signals grows.\nArea loss. A more natural loss can be computed as the mean distance between the paths depicted by two matrices $Y^1, Y^2 \in \mathcal{Y}(X)$. This loss corresponds to the area between the paths of the two matrices, as represented by the grey zone in Fig. 1. Formally, as in Fig. 1, for each $t \in \{1, \dots, T_A\}$ we put $\delta_t = |\min\{k : Y^1_{t,k} = 1\} - \min\{k : Y^2_{t,k} = 1\}|$. Then the area loss is the mean of the $\delta_t$. In the audio literature [14], this loss is sometimes called the \u201cmean absolute deviation\u201d loss and is denoted $\delta_{\mathrm{abs}}(Y^1, Y^2)$.\nUnfortunately, for the general alignment problem, $\delta_{\mathrm{abs}}$ is not linear in the matrices Y. 
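The degeneracy of the Hamming loss described above is easy to reproduce on a toy example. The sketch below (pure Python; the helper names are illustrative, not the paper's code) builds a diagonal groundtruth path Y1 and two competing alignments on an 8-by-8 grid: Y2 makes two shallow detours, Y3 one deep detour; their Hamming losses to Y1 coincide, while their area losses clearly differ.

```python
def path_matrix(cells, TA, TB):
    """Binary alignment matrix from a list of 1-based (row, col) path cells."""
    Y = [[0] * TB for _ in range(TA)]
    for i, j in cells:
        Y[i - 1][j - 1] = 1
    return Y

def hamming(Y1, Y2):
    """Unnormalized Hamming loss of Eq. (6): squared Frobenius norm of Y1 - Y2."""
    return sum((a - b) ** 2 for r1, r2 in zip(Y1, Y2) for a, b in zip(r1, r2))

def area(Y1, Y2):
    """Total area between paths: per-row gap between first matched columns."""
    return sum(abs(r1.index(1) - r2.index(1)) for r1, r2 in zip(Y1, Y2))

T = 8
Y1 = path_matrix([(i, i) for i in range(1, T + 1)], T, T)   # groundtruth diagonal
Y2 = path_matrix([(1, 1), (2, 1), (3, 2), (4, 3), (4, 4), (5, 5),
                  (6, 5), (7, 6), (8, 7), (8, 8)], T, T)    # two shallow detours
Y3 = path_matrix([(1, 1), (2, 1), (3, 1), (4, 2), (5, 3), (6, 4),
                  (6, 5), (6, 6), (7, 7), (8, 8)], T, T)    # one deep detour

assert hamming(Y1, Y2) == hamming(Y1, Y3)   # Hamming cannot tell them apart
assert area(Y1, Y2) < area(Y1, Y3)          # the area loss can
```

Here both Hamming losses equal 10, while the (total) area losses are 6 and 9 respectively.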
But in the context of alignment of sequences of two different natures, one of the signals is a reference, and thus the index sequence $\alpha$ defined in Eq. (1) is increasing, e.g., for the audio-to-score alignment problem [12]. The loss is then linear in each of its arguments. More precisely, if we introduce the matrix $L_{T_A} \in \mathbb{R}^{T_A \times T_A}$, lower triangular with ones (including on the diagonal), we can write the loss as\n\n$\ell_O(Y_1, Y_2) = \|L_{T_A}(Y_1 - Y_2)\|_F^2 = \operatorname{Tr}(L_{T_A} Y_1 1_{T_B} 1_{T_A}^\top) + \operatorname{Tr}(L_{T_A} Y_2 1_{T_B} 1_{T_A}^\top) - 2\operatorname{Tr}(L_{T_A} Y_1 Y_2^\top L_{T_A}^\top)$. (7)\n\nWe now prove that this loss corresponds to the area loss in this special case. Let Y be an alignment; then $(L_{T_A} Y)_{i,j} = \sum_k (L_{T_A})_{i,k} Y_{k,j} = \sum_{k=1}^{i} Y_{k,j}$. If Y does not have vertical moves, i.e., if for each j there is a unique $k_j$ such that $Y_{k_j,j} = 1$, we have that $(L_{T_A} Y)_{i,j} = 1$ if and only if $i \geq k_j$. So $\sum_{i,j} (L_{T_A} Y)_{i,j} = \#\{(i,j) : i \geq k_j\}$, which is exactly the area under the curve determined by the path of Y. In all our experiments, we use $\delta_{\mathrm{abs}}$ for evaluation but not for training.\nApproximation of the area loss: the symmetrized area loss. In many real-world applications [14], a meaningful loss to assess the quality of an alignment is the area loss. As shown by our experiments, while the Hamming loss is sufficient in some simple situations and allows learning a metric that leads to good alignment performance in terms of area loss, on more challenging datasets it does not work at all (see Sec. 5). This is due to the fact that two alignments that are very close in terms of area loss can incur a large Hamming loss (cf. Fig. 2). Thus it is natural to extend the formulation of Eq. (7) to matrices in $\mathcal{Y}(X)$. We start by symmetrizing the formulation of Eq. (7) to overcome the overpenalization of vertical vs. horizontal moves. 
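The counting argument just given — that the entries of L_{T_A}Y sum to the area under the path when Y has no vertical move — can be checked numerically. A small pure-Python sketch (illustrative names; a 3-by-5 pair is assumed so that horizontal moves occur):

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

TA, TB = 3, 5
# lower-triangular matrix of ones, including the diagonal
L = [[1 if j <= i else 0 for j in range(TA)] for i in range(TA)]

# a path with only horizontal and diagonal moves: column j is matched to row k_j
k = [1, 1, 2, 3, 3]                  # 1-based row index for each column
Y = [[1 if k[j] == i + 1 else 0 for j in range(TB)] for i in range(TA)]

LY = matmul(L, Y)                    # (L Y)_{i,j} = 1 iff i >= k_j
total = sum(sum(row) for row in LY)
assert total == sum(TA - kj + 1 for kj in k)   # area under the path: 10 cells
```

The sum of the entries of LY thus counts exactly the cells (i, j) with i >= k_j, as in the proof.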
We define, for any couple of binary matrices $(Y_1, Y_2)$, the symmetrized area loss\n\n$\ell_S(Y_1, Y_2) = \frac{1}{2}\big( \|L_{T_A}(Y_1 - Y_2)\|_F^2 + \|(Y_1 - Y_2) L_{T_B}\|_F^2 \big) = \frac{1}{2}\big[ \operatorname{Tr}(Y_1^\top L_{T_A}^\top L_{T_A} Y_1) + \operatorname{Tr}(L_{T_A} Y_2 1_{T_B} 1_{T_A}^\top) - 2\operatorname{Tr}(Y_2^\top L_{T_A}^\top L_{T_A} Y_1) + \operatorname{Tr}(Y_1 L_{T_B} L_{T_B}^\top Y_1^\top) + \operatorname{Tr}(Y_2^\top 1_{T_A} 1_{T_B}^\top L_{T_B} L_{T_B}^\top) - 2\operatorname{Tr}(Y_2 L_{T_B} L_{T_B}^\top Y_1^\top) \big]$. (8)\n\nFigure 2: On the real-world Bach chorales dataset, we represent a groundtruth alignment together with two others. In terms of the Hamming loss, both alignments are equally far from the groundtruth, whereas for the area loss they are not. In the structured prediction setting described in Sec. 4, the depicted alignments are the so-called \u201cmost violated constraints\u201d, namely the outputs of the loss-augmented decoding step (see Sec. 4).\n\nWe now propose to make this loss concave over the convex hull of $\mathcal{Y}(X)$, which we denote from now on by $\overline{\mathcal{Y}}(X)$. Let us introduce $D_T = \lambda_{\max}(L_T^\top L_T) I_{T \times T}$, with $\lambda_{\max}(U)$ the largest eigenvalue of U.\u00b2 For any binary matrices $Y_1, Y_2$, we have\n\n$\ell_S(Y_1, Y_2) = \frac{1}{2}\big[ \operatorname{Tr}(Y_1^\top (L_{T_A}^\top L_{T_A} - D_{T_A}) Y_1) + \operatorname{Tr}(D_{T_A} Y_1 1_{T_B} 1_{T_A}^\top) - 2\operatorname{Tr}(Y_2^\top (L_{T_A}^\top L_{T_A} - D_{T_A}) Y_1) + \operatorname{Tr}(Y_1 (L_{T_B} L_{T_B}^\top - D_{T_B}) Y_1^\top) + \operatorname{Tr}(Y_1 D_{T_B} 1_{T_B} 1_{T_A}^\top) + \operatorname{Tr}(L_{T_A} Y_2 1_{T_B} 1_{T_A}^\top) + \operatorname{Tr}(Y_2^\top L_{T_B} L_{T_B}^\top Y_2) - 2\operatorname{Tr}(Y_2 L_{T_B} L_{T_B}^\top Y_1^\top) \big]$,\n\nand we get a concave function over $\overline{\mathcal{Y}}(X)$ that coincides with $\ell_S$ on $\mathcal{Y}(X)$.\n\n3.2 Empirical loss minimization\n\nRecall that we are given n alignment examples $(X^i, Y^i)_{1 \leq i \leq n}$. 
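Before setting up the learning problem, note that the symmetrized area loss and the shift by D_T are both straightforward to evaluate. The sketch below (pure Python; power iteration stands in for a proper eigensolver, and all names are illustrative) computes the loss from its definition and spot-checks on random vectors that the largest eigenvalue of L^T L dominates the quadratic form, which is what makes the shifted loss concave:

```python
import random

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def sub(A, B):
    return [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(A, B)]

def fro2(A):
    """Squared Frobenius norm."""
    return sum(x * x for row in A for x in row)

def sym_area_loss(Y1, Y2, LA, LB):
    """Symmetrized area loss of Eq. (8), computed directly from its definition."""
    D = sub(Y1, Y2)
    return 0.5 * (fro2(matmul(LA, D)) + fro2(matmul(D, LB)))

T = 4
L = [[1.0 if j <= i else 0.0 for j in range(T)] for i in range(T)]
G = matmul([list(c) for c in zip(*L)], L)        # G = L^T L, symmetric PSD

# power iteration for lambda_max(G)
v = [1.0] * T
for _ in range(200):
    w = [sum(G[i][j] * v[j] for j in range(T)) for i in range(T)]
    n = max(abs(x) for x in w)
    v = [x / n for x in w]
lam = n                                          # approx. largest eigenvalue

# spot-check: x^T G x <= lam * x^T x, i.e. G - lam * I is negative semidefinite,
# so subtracting D_T = lam * I makes the quadratic terms concave
random.seed(0)
for _ in range(100):
    x = [random.uniform(-1.0, 1.0) for _ in range(T)]
    q = sum(x[i] * G[i][j] * x[j] for i in range(T) for j in range(T))
    assert q <= lam * sum(xi * xi for xi in x) + 1e-6
```

The loss vanishes for identical alignments and grows with the area-like discrepancy between the two paths, as intended.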
For a fixed loss $\ell$, our goal is now to solve the following minimization problem in W:\n\n$\min_{W \in \mathcal{W}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell\Big(Y^i, \operatorname{argmax}_{Y \in \mathcal{Y}(X^i)} \operatorname{Tr}(C(X^i; W)^\top Y)\Big) + \lambda \Omega(W) \right\}$, (9)\n\nwhere $\Omega(W) = \frac{1}{2}\|W\|_F^2$ is a convex regularizer preventing overfitting, with $\lambda \geq 0$.\n\n4 Large margin approach\n\nIn this section we describe a large margin approach to solve a surrogate of the problem in Eq. (9), which is intractable. As shown in Eq. (4), the decoding task is the maximization of a linear function in the parameter W and aims at predicting an output over a large and discrete space (the space of potential alignments satisfying the constraints of Eq. (1)). Learning W thus falls into the structured prediction framework [25, 22]. We define the hinge loss, a convex surrogate, by\n\n$L(X, Y; W) = \max_{Y' \in \mathcal{Y}(X)} \left\{ \ell(Y, Y') - \operatorname{Tr}\big(W^\top [\phi(X, Y) - \phi(X, Y')]\big) \right\}$. (10)\n\n\u00b2For completeness, in our experiments, we also try to set $D_T$ to the matrix with minimal trace that dominates $L_T^\top L_T$, obtained by solving a semidefinite program (SDP). We report the associated results in Fig. 4. Note also that other matrices could have been chosen; in particular, since our matrices $L_T$ are pointwise positive, the diagonal matrix of row sums $D_T = \operatorname{Diag}(L_T^\top L_T 1_T)$ dominates $L_T^\top L_T$ and is such that the loss is concave.\n\nThe evaluation of L is usually referred to as \u201closs-augmented decoding\u201d, see [25]. If we define $\widehat{Y}^i$ as the argmax in Eq. 
(10) when $(X, Y) = (X^i, Y^i)$, then elementary computations show that\n\n$\widehat{Y}^i = \operatorname{argmin}_{Y \in \mathcal{Y}(X^i)} \operatorname{Tr}\big((2 Y^{i\top} - U^\top - C(X^i; W)^\top) Y\big)$,\n\nwhere $U = 1_{T_A} 1_{T_B}^\top \in \mathbb{R}^{T_A \times T_B}$. We now aim at solving the following problem, sometimes called the margin-rescaled problem:\n\n$\min_{W \in \mathcal{W}} \frac{\lambda}{2}\|W\|_F^2 + \frac{1}{n} \sum_{i=1}^{n} \max_{Y \in \mathcal{Y}(X^i)} \left\{ \ell(Y, Y^i) - \operatorname{Tr}\big(W^\top [\phi(X^i, Y^i) - \phi(X^i, Y)]\big) \right\}$. (11)\n\nHamming loss case. From Eq. (4), one can notice that our joint feature map is linear in Y. Thus, if we take a loss that is linear in the first argument of $\ell$, for instance the Hamming loss, the loss-augmented decoding is the maximization of a linear function over the space $\mathcal{Y}(X)$, which we can solve efficiently using dynamic programming (see Sec. 2.1 and the supplementary material). That way, plugging the Hamming loss (Eq. (6)) into Eq. (11) leads to a convex structured prediction problem. This problem can be solved using standard techniques such as cutting-plane methods [11], stochastic gradient descent [20], or block-coordinate Frank-Wolfe in the dual [15]. Note that we adapted the standard unconstrained optimization methods to our setting, where $W \succeq 0$.\n\nOptimization using the symmetrized area loss. The symmetrized area loss is concave in its first argument, thus the problem of Eq. (11) is in a min/max form and deriving a dual is straightforward. Details can be found in the supplementary material. If we plug the symmetrized area loss $\ell_S$ (SAL) defined in Eq. 
(8) into our problem (11), we can show that the dual of (11) has the following form:\n\n$\min_{(Z^1, \dots, Z^n) \in \overline{\mathcal{Y}}} \frac{1}{2\lambda n^2} \Big\| \sum_{i=1}^{n} \sum_{j,k} (Y^i - Z^i)_{j,k} (a^i_j - b^i_k)(a^i_j - b^i_k)^\top \Big\|_F^2 - \frac{1}{n} \sum_{i=1}^{n} \ell_S(Y^i, Z^i)$, (12)\n\nwhere we denote by $\overline{\mathcal{Y}}(X^i)$ the convex hull of the set $\mathcal{Y}(X^i)$, and by $\overline{\mathcal{Y}}$ the Cartesian product of these sets over all training examples i. Note that we recover a result similar to [15]. Since the SAL is concave, the aforementioned problem is convex.\nProblem (12) is a quadratic program over the compact set $\overline{\mathcal{Y}}$. Thus we can use a Frank-Wolfe [7] algorithm. Note that it is similar to the one proposed by Lacoste-Julien et al. [15], but with an additional term due to the concavity of the loss.\n\n5 Experiments\n\nWe applied our method to the task of learning a good similarity measure for aligning audio signals. In this field researchers have spent a lot of effort designing well-suited and meaningful features [12, 4], but the problem of combining these features for aligning temporal sequences is still challenging. For simplicity, we took W diagonal in our experiments.\n\n5.1 Dataset of Kirchhoff and Lerch [14]\n\nDataset description. First, we applied our method on the dataset of Kirchhoff and Lerch [14]. In this dataset, pairs of aligned examples $(A^i, B^i)$ are artificially created by stretching an original audio signal. That way, the groundtruth alignment $Y^i$ is known, and thus the data falls into our setting. A more precise description of the dataset can be found in [14].\nThe N = 60 pairs are stretched along two different tempo curves. Each signal is made of 30 s of music divided into frames of 46 ms with a hopsize of 23 ms, leading to a typical signal length of T \u2248 1300 in our setting. We keep p = 11 features that are simple to implement and known to perform well for alignment tasks [14]. Those were: five MFCCs [8] (labeled M1, . . . , M5 in Fig. 
3), the spectral flatness (SF), the spectral centroid (SC), the spectral spread (SS), the maximum of the envelope (Max), and the power level of each frame (Pow); see [14] for more details on the computation of the features. We normalize each feature by subtracting the median value and dividing by the standard deviation to the median, as audio data are subject to outliers.\n\nFigure 3: Comparison of performance between individual features and the learned metric. Error bars for the performance of the learned metric were determined with the best and the worst performance over 5 different experiments. W denotes the combination learned using our method, and M the best MFCC combination.\n\nExperiments. We conducted the following experiment: for each individual feature, we perform alignment using the dynamic time warping algorithm and evaluate the performance of this single feature in terms of the losses typically used to assess performance in this setting [14]. In Fig. 3, we report the results of these experiments.\nThen, we plug these data into our method, using the Hamming loss to learn a linear positive combination of these features. The result is reported in Fig. 3. Combining these features on this dataset thus yields better performance than considering any single feature.\nFor completeness, we also conducted the experiments using the standard first 13 MFCC coefficients and their first- and second-order derivatives as features. These results competed with the best learned combination of the handcrafted features; namely, in terms of the $\delta_{\mathrm{abs}}$ loss, they perform at 0.046 seconds. Note that these results are slightly worse than the best single handcrafted feature, but better than the best single MFCC coefficient used as a feature.\nAs a baseline, we also compared against the uniform combination of handcrafted features (the metric being the identity matrix). The results are off the charts on Fig. 
3, with $\delta_{\mathrm{abs}}$ at 4.1 seconds (individual values ranging from 1.4 seconds to 7.4 seconds).\n\n5.2 Chorales dataset\n\nDataset. The Bach 10 dataset\u00b3 consists of ten J. S. Bach Chorales (small quadriphonic pieces). For each Chorale, a MIDI reference file corresponding to the score (a symbolic representation of the piece) is provided. The alignments between the MIDI files and the audio files are given, and we converted these MIDI files into audio, following what is classically done for alignment (see, e.g., [10]). That way we fall into the audio-to-audio framework in which our technique applies. Each piece of music is approximately 25 s long, leading to similar signal lengths (T \u2248 1300).\nExperiments. We use the same features as in Sec. 5.1. As depicted in Fig. 4, the optimization with the Hamming loss performs poorly on this dataset. In fact, the best individual feature performs far better than the learned W: metric learning with the \u201cpractical\u201d Hamming loss does much worse than the best single feature.\nThen, we conducted the same learning experiment with the symmetrized area loss $\ell_S$. The resulting learned parameter is far better than the one learned using the Hamming loss, and we get performance similar to that of the best feature. Note that these features were handcrafted, and reaching their performance on this hard task with only a few training instances is already challenging.\n\n\u00b3http://music.cs.northwestern.edu/data/Bach10.html\n\nFigure 4: Performance of our algorithms on the Chorales dataset. 
From left to right: (1) best single feature; (2) best learned combination of features using the symmetrized area loss $\ell_S$; (3) best combination of MFCCs using the SAL with $D_T$ obtained via an SDP (see footnote in Sec. 3); (4) best combination of MFCCs and derivatives learned with $\ell_S$; (5) best combination of MFCCs and derivatives learned with the Hamming loss; (6) best combination of the features of [14] using the Hamming loss.\n\nIn Fig. 2, we depict, for a learned parameter W, the result of the loss-augmented decoding performed using either the Hamming loss or the symmetrized area loss. As is known for structured SVMs, this represents the most violated constraint [25]. We can see that the most violated constraint for the Hamming loss leads to an alignment that is totally unrelated to the groundtruth alignment, whereas the one for the symmetrized area loss is far closer and much more discriminative.\n\n5.3 Feature selection\n\nLast, we conducted feature selection experiments on the same datasets. Starting from low-level features, namely the 13 leading MFCC coefficients and their first two derivatives, we learn a linear combination of these that achieves good alignment performance in terms of the area loss. Note that very little musical prior knowledge is put into these features. Moreover, we either improve on the best handcrafted feature on the dataset of [14] or perform similarly. On both datasets, the learned combination of handcrafted features performed similarly to the combination of these 39 MFCC coefficients.\n\n6 Conclusion\n\nIn this paper, we have presented a structured prediction framework for learning the metric in temporal alignment problems. 
We are able to combine hand-crafted features, as well as to automatically build new state-of-the-art features from basic low-level information with little expert knowledge. Technically, this is made possible by considering a loss beyond the usual Hamming loss, which is typically used because it is \u201cpractical\u201d within a structured prediction framework (linear in the output representation).\nThe present work may be extended in several ways, the main one being to consider cases where only partial information about the alignments is available. This is often the case in music [4] or bioinformatics applications. Note that, similarly to Lajugie et al. [16], a simple alternating optimization between metric learning and constrained alignment provides a first solution, which could probably be improved upon.\n\nAcknowledgements. The authors acknowledge the support of the European Research Council (SIERRA project 239993), the GARGANTUA project funded by the Mastodons program of CNRS, and the Airbus foundation through a PhD fellowship. Thanks to Piotr Bojanowski for helpful discussions. Warm thanks go to Arshia Cont and Philippe Cuvillier for sharing their knowledge about audio processing, and to Holger Kirchhoff and Alexander Lerch for their dataset.\n\nReferences\n\n[1] J. Aach and G. M. Church. Aligning gene expression time series with time warping algorithms. Bioinformatics, 17(6):495\u2013508, 2001.\n\n[2] C. Banderier and S. Schwer. Why Delannoy numbers? Journal of Statistical Planning and Inference, 135(1):40\u201354, 2005.\n\n[3] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph matching. IEEE Trans. on PAMI, 31(6):1048\u20131058, 2009.\n\n[4] A. Cont, D. Schwarz, N. Schnell, C. Raphael, et al. Evaluation of real-time audio-to-score alignment. In Proc. ISMIR, 2007.\n\n[5] M. Cuturi, J.-P. Vert, O. Birkenes, and T. Matsui. 
A kernel for time series based on global alignments. In Proc. ICASSP, volume 2, pages II-413. IEEE, 2007.\n\n[6] S. Dixon and G. Widmer. MATCH: A music alignment tool chest. In Proc. ISMIR, pages 492\u2013497, 2005.\n\n[7] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95\u2013110, 1956.\n\n[8] B. Gold, N. Morgan, and D. Ellis. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons, 2011.\n\n[9] R. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 1950.\n\n[10] N. Hu, R. B. Dannenberg, and G. Tzanetakis. Polyphonic audio matching and alignment for music retrieval. Computer Science Department, page 521, 2003.\n\n[11] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27\u201359, 2009.\n\n[12] C. Joder, S. Essid, and G. Richard. Learning optimal features for polyphonic audio-to-score alignment. IEEE Trans. on Audio, Speech, and Language Processing, 21(10):2118\u20132128, 2013.\n\n[13] J. Keshet, S. Shalev-Shwartz, Y. Singer, and D. Chazan. A large margin algorithm for speech-to-phoneme and music-to-score alignment. IEEE Transactions on Audio, Speech, and Language Processing, 15(8):2373\u20132382, 2007.\n\n[14] H. Kirchhoff and A. Lerch. Evaluation of features for audio-to-audio alignment. Journal of New Music Research, 40(1):27\u201341, 2011.\n\n[15] S. Lacoste-Julien, M. Jaggi, M. Schmidt, P. Pletscher, et al. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proc. ICML, 2013.\n\n[16] R. Lajugie, F. Bach, and S. Arlot. Large-margin metric learning for constrained partitioning problems. In Proc. ICML, 2014.\n\n[17] B. McFee and G. R. Lanckriet. Metric learning to rank. In Proc. ICML, pages 775\u2013782, 2010.\n\n[18] M. Müller. Information Retrieval for Music and Motion. Springer, 2007.\n\n[19] H. Sakoe and S. 
Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):43\u201349, 1978.\n\n[20] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3\u201330, 2011.\n\n[21] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In Proc. CVPR, 2008.\n\n[22] B. Taskar, D. Koller, and C. Guestrin. Max-margin Markov networks. Adv. NIPS, 2003.\n\n[23] J. D. Thompson, F. Plewniak, and O. Poch. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15(1):87\u201388, 1999.\n\n[24] A. Torres, A. Cabada, and J. J. Nieto. An exact formula for the number of alignments between two DNA sequences. Mitochondrial DNA, 14(6):427\u2013430, 2003.\n\n[25] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, and Y. Singer. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(9):1453\u20131484, 2005.\n\n[26] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207\u2013244, 2009.\n", "award": [], "sourceid": 976, "authors": [{"given_name": "Damien", "family_name": "Garreau", "institution": "ENS/INRIA"}, {"given_name": "R\u00e9mi", "family_name": "Lajugie", "institution": "Inria/ENS"}, {"given_name": "Sylvain", "family_name": "Arlot", "institution": "CNRS"}, {"given_name": "Francis", "family_name": "Bach", "institution": "INRIA & ENS"}]}