{"title": "Tempo tracking and rhythm quantization by sequential Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 1361, "page_last": 1368, "abstract": "", "full_text": "Tempo Tracking and Rhythm Quantization\nby Sequential Monte Carlo\n\nAli Taylan Cemgil and Bert Kappen\n\nSNN, University of Nijmegen\n\nNL 6525 EZ Nijmegen\n\nThe Netherlands\n\n{cemgil,bert}@mbfys.kun.nl\n\nAbstract\n\nWe present a probabilistic generative model for timing deviations\nin expressive music performance. The structure of the proposed\nmodel is equivalent to a switching state space model. We formulate\ntwo well-known music recognition problems, namely tempo\ntracking and automatic transcription (rhythm quantization), as\nfiltering and maximum a posteriori (MAP) state estimation tasks.\nThe inferences are carried out using sequential Monte Carlo\nintegration (particle filtering) techniques. For this purpose, we have\nderived a novel Viterbi algorithm for Rao-Blackwellized particle\nfilters, where a subset of the hidden variables is integrated out. The\nresulting model is suitable for real-time tempo tracking and\ntranscription and hence useful in a number of music applications such\nas adaptive automatic accompaniment and score typesetting.\n\n1 Introduction\n\nAutomatic music transcription refers to the extraction of a high-level description from\na musical performance, for example in the form of music notation. Music notation can\nbe viewed as a list of pitch levels and corresponding timestamps.\n\nIdeally, one would like to recover a score directly from sound. 
Such a representation\nof the surface structure of music would be very useful in music information retrieval\n(Music-IR) and content description of musical material in large audio databases.\nHowever, when operating on sampled audio data from polyphonic acoustical signals,\nextraction of a score-like description is a very challenging auditory scene analysis\ntask [13].\n\nIn this paper, we focus on a subproblem in Music-IR, where we assume that exact\ntiming information of notes is available, for example as a stream of MIDI$^1$ events\nfrom a digital keyboard.\n\n$^1$ Musical Instrument Digital Interface. A standard communication protocol especially\ndesigned for digital instruments such as keyboards. Each time a key is pressed, a MIDI\nkeyboard generates a short message containing pitch and key velocity. A computer can tag\neach received message with a timestamp for real-time processing and/or \"recording\" into a\nfile.\n\nA model for tempo tracking and transcription is useful in a broad spectrum of\napplications. One example is automatic score typesetting, the musical analog of word\nprocessing. Almost all score typesetting applications provide a means of automatic\ngeneration of conventional music notation from MIDI data.\n\nIn conventional music notation, the onset time of each note is implicitly represented\nby the cumulative sum of the durations of previous notes. Durations are encoded by\nsimple rational numbers (e.g. quarter note, eighth note); consequently all events in\nmusic are placed on a discrete grid. So the basic task in MIDI transcription is to\nassociate discrete grid locations with onsets, i.e. quantization.\n\nHowever, unless the music is performed with mechanical precision, identification of\nthe correct association becomes difficult. Consequently, the resulting scores are often\nof very poor quality. This is due to the fact that musicians introduce intentional (and\nunintentional) deviations from a mechanical prescription. 
For example, the timing of\nevents can be deliberately delayed or pushed. Moreover, the tempo can fluctuate\nby slowing down or accelerating.\nIn fact, such deviations are natural aspects of\nexpressive performance; in the absence of these, music tends to sound rather dull.\n\nRobust and fast quantization and tempo tracking is also an important requirement\nin interactive performance systems. These are emerging applications that \"listen\"\nto the performance for generating an accompaniment or improvisation in real time\n[10, 12]. Finally, such models are also useful in musicology for systematic study and\ncharacterization of expressive timing by principled analysis of existing performance\ndata.\n\n2 Model\n\nConsider the following generative model for timing deviations in music:\n\n$$c_k = c_{k-1} + \\gamma_{k-1} \\tag{1}$$\n\n$$\\omega_k = \\omega_{k-1} + \\zeta_k \\tag{2}$$\n\n$$\\tau_k = \\tau_{k-1} + 2^{\\omega_k}(c_k - c_{k-1}) \\tag{3}$$\n\n$$y_k = \\tau_k + \\epsilon_k \\tag{4}$$\n\nIn Eq. 1, $c_k$ denotes the grid location of the $k$'th onset in a score. The interval between\ntwo consecutive onsets in the score is denoted by $\\gamma_{k-1}$.\nFor example, consider a quarter note followed by two eighth notes,\nwhich encodes $\\gamma_{1:3} = [1 \\; 0.5 \\; 0.5]$, hence $c_{1:4} = [0 \\; 1 \\; 1.5 \\; 2]$. We\nassign a prior of the form $p(c_k) \\propto \\exp(-d(c_k))$, where $d(c_k)$ is the number of significant\ndigits in the binary expansion of the fractional part of $c_k$ [1]. One can check that such a\nprior prefers simpler notations, i.e. a rhythmically more complex figure receives a lower\nprior probability than the figure above. We note that the $c_k$ are\ndrawn from an infinite (but discrete) set and are increasing in $k$, i.e. $c_k \\geq c_{k-1}$. To\nallow for different time signatures and alternative rhythmic subdivisions, one can\nintroduce additional hidden variables [1], but this is not addressed in this paper.\n\nEq. 2 defines a prior over possible tempo deviations. We denote the logarithm of\nthe period (inverse tempo) by $\\omega$. For example, if the tempo is 60 beats per minute\n(bpm), $\\omega = \\log_2 1\\,\\mathrm{sec} = 0$. 
Since tempo appears as a scale variable in mapping grid\nlocations on a score to the actual performance time, we have chosen to represent it\non a logarithmic scale (alternatively, a gamma distribution could also be used). This\nrepresentation is both perceptually plausible and mathematically convenient, since a\nsymmetric noise model on $\\omega$ assigns equal probabilities to equal relative changes in\ntempo. We take $\\zeta_k$ to be a Gaussian random variable distributed as $\\mathcal{N}(0, \\lambda^2 \\gamma_k Q)$. Depending\nupon the interval between consecutive onsets, the model scales the noise covariance;\nlonger jumps in the score allow for more freedom in fluctuating the tempo. Given\nthe $\\omega$ sequence, Eq. 3 defines a model of noiseless onsets with variable tempo. We\nwill denote the pair of hidden continuous variables by $z_k = (\\tau_k, \\omega_k)$.\n\nEq. 4 defines the observation model. Here $y_k$ is the observed onset time of the\n$k$'th onset in the performance. The noise term $\\epsilon_k$ models small-scale expressive\ndeviations in the timing of individual notes and has a Gaussian distribution\nparameterized by $\\mathcal{N}(\\mu(\\gamma_{k-1}), \\Sigma(\\gamma_{k-1}))$. Such a parameterization is useful for appropriate\nquantization of phrases (short sequences of notes) that are shifted or delayed as a\nwhole [1].\n\nIn reality, a random walk model for tempo such as in Eq. 2 is not very realistic.\nTempo deviations are usually smoother. In the dynamical model framework\nsuch smooth deviations can be allowed by increasing the dimensionality of $\\omega$ to\ninclude higher-order \"inertia\" variables [2]. In this case we simply rewrite Eq. 2 as\n\n$$\\omega_k = A\\omega_{k-1} + \\zeta_k$$\n\nand take a diagonal $Q$. Accordingly, the observation model (Eq. 4) is changed such\nthat $\\omega_k$ is replaced by $C\\omega_k$, where $C = [1 \\; 0 \\; \\dots \\; 0]$.\n\nThe graphical model is shown in Figure 1. The model is similar to a switching\nstate space model, which has recently been applied in the context of music\ntranscription [11]. 
The differences are in the parameterization and, more importantly, in the\ninference method.\n\nFigure 1: Graphical Model. The pair of continuous hidden variables $(\\tau_k, \\omega_k)$ is\ndenoted by $z_k$. Both $c$ and $z$ are hidden; only the onsets $y$ are observed.\n\nWe define tempo tracking as a filtering problem\n\n$$z_k^* = \\arg\\max_{z_k} \\sum_{c_k} p(c_k, z_k | y_{1:k}) \\tag{5}$$\n\nand rhythm transcription as a MAP state estimation problem\n\n$$c_{1:K}^* = \\arg\\max_{c_{1:K}} p(c_{1:K} | y_{1:K}) \\tag{6}$$\n\n$$p(c_{1:K} | y_{1:K}) = \\int dz_{1:K}\\, p(c_{1:K}, z_{1:K} | y_{1:K}) \\tag{7}$$\n\nThe exact computation of the quantities in Eq. 5 and Eq. 6 is intractable due to\nthe explosion in the number of mixture components required to represent the exact\nposterior at each step $k$. Consequently we will use Monte Carlo approximation\ntechniques.\n\n3 Sequential Monte Carlo Sampling\n\nSequential Monte Carlo sampling (a.k.a. particle filtering) is an integration method\nespecially powerful for inference in dynamical systems. See [4] for a detailed review\nof the state of the art. At each step $k$, the exact marginal posterior over hidden states\n$x_k$ is approximated by an empirical distribution of the form\n\n$$p(x_k | y_{1:k}) \\approx \\sum_{i=1}^{N} w_k^{(i)} \\delta(x_k - x_k^{(i)}) \\tag{8}$$\n\nwhere $x_k^{(i)}$ are a set of points obtained by sampling from a proposal distribution\nand $w_k^{(i)}$ are the associated importance weights such that $\\sum_{i=1}^{N} w_k^{(i)} = 1$. Particles\nat step $k$ are evolved to $k + 1$ by sequential importance sampling and resampling\nmethods [6]. Once a set of discrete sample points is obtained during the forward\nphase by sampling, particle approximations to quantities such as the smoothed\nmarginal posterior $p(x_k | y_{1:K})$ or the maximum a posteriori state sequence (Viterbi\npath) $x_{1:K}^*$ can be obtained efficiently. Due to the discrete nature of the approximate\nrepresentation, the resulting algorithms are closely related to standard smoothing and\nViterbi algorithms in hidden Markov models [9, 7, 6].\n\nUnfortunately, if the hidden state space is of high dimensionality, sampling can be\ninefficient. 
Hence increasingly many particles are needed to accurately represent the\nposterior. Consequently, the estimation of \"off-line\" quantities such as $p(x_k | y_{1:K})$\nand $x_{1:K}^*$ becomes very costly, since one has to store all past trajectories.\n\nFor some models, including the one proposed here, one can identify substructures\nwhere integrations, conditioned on certain nodes, can be computed analytically [5].\nConditioned on $c_{1:k}$, the model reduces to the (extended)$^2$ Kalman filter. In this\ncase the joint marginal posterior is represented as a mixture\n\n$$p(c_k, z_k | y_{1:k}) \\approx \\sum_{i=1}^{N} w_k^{(i)} p(z_k | c_k^{(i)}, y_{1:k}) \\delta(c_k - c_k^{(i)}) \\tag{9}$$\n\nThe particular case of Gaussian $p(z_k | c_k^{(i)}, y_{1:k})$ is extensively used in diverse\napplications [8] and reported to give superior results when compared to standard particle\nfiltering [3, 6].\n\n3.1 Particle Filtering\n\nWe assume that we have obtained a set of particles from the filtered posterior $p(c_k | y_{1:k})$.\nDue to lack of space we do not give the details of the particle filtering algorithm\nbut refer the reader to [6]. One important point to note is that we have to use the\noptimal proposal distribution, given as\n\n$$p(c_k | c_{k-1}^{(i)}, y_{1:k}) \\propto \\int dz_{k-1:k}\\, p(y_k | z_k, c_k, c_{k-1}^{(i)})\\, p(z_k, c_k | z_{k-1}, c_{k-1}^{(i)})\\, p(z_{k-1} | c_{k-1}^{(i)}, y_{1:k-1}) \\tag{10}$$\n\nSince the state space of $c_k$ is effectively infinite, this step is crucial for efficiency.\nEvaluation of the proposal distribution amounts to looking forward and selecting\na set of high-probability candidate grid locations for quantization. Once the $c_k^{(i)}$ are\nobtained, we can use standard Kalman filtering algorithms to update the Gaussian\npotentials $p(z_k | c_k^{(i)}, y_{1:k})$. Thus the tempo tracking problem as stated in Eq. 5 is readily\nsolved.\n\n$^2$ We linearize the nonlinear observation model $2^{\\omega_k}(c_k - c_{k-1})$ around the expectation\n$\\langle \\omega_k \\rangle$.\n\n3.2 Modified Viterbi algorithm\n\nThe quantization problem in Eq. 6 can only be solved approximately. 
Since $z$\nis integrated over, in general all $c_k$ become coupled and the Markov property is\nlost, i.e. $p(c_{1:K} | y_{1:K})$ is in general not a chain. One possible approximation, which\nwe also adopt here, is to assume that smoothed estimates are not much different from\nfiltered estimates [8], i.e.\n\n$$p(c_k, z_k | c_{k-1}, z_{k-1}, y_{1:K}) \\approx p(c_k, z_k | c_{k-1}, z_{k-1}, y_{1:k}) \\tag{11}$$\n\nand to write\n\np(Cl:KIYl:K)\n\nK\n\nR:j f dZ1:KP(CIZIIYl)\n