{"title": "Beat Tracking the Graphical Model Way", "book": "Advances in Neural Information Processing Systems", "page_first": 745, "page_last": 752, "abstract": null, "full_text": " Beat Tracking the Graphical Model Way\n\n\n\n Dustin Lang Nando de Freitas\n Department of Computer Science\n University of British Columbia\n Vancouver, BC\n {dalang, nando}@cs.ubc.ca\n\n\n\n Abstract\n\n We present a graphical model for beat tracking in recorded music. Using\n a probabilistic graphical model allows us to incorporate local information\n and global smoothness constraints in a principled manner. We evaluate\n our model on a set of varied and difficult examples, and achieve impres-\n sive results. By using a fast dual-tree algorithm for graphical model in-\n ference, our system runs in less time than the duration of the music being\n processed.\n\n\n\n1 Introduction\n\nThis paper describes our approach to the beat tracking problem. Dixon describes beats as\nfollows: \"much music has as its rhythmic basis a series of pulses, spaced approximately\nequally in time, relative to which the timing of all musical events can be described. This\nphenomenon is called the beat, and the individual pulses are also called beats\"[1]. Given a\npiece of recorded music (an MP3 file, for example), we wish to produce a set of beats that\ncorrespond to the beats perceived by human listeners.\n\nThe set of beats of a song can be characterised by the trajectories through time of the tempo\nand phase offset. Tempo is typically measured in beats per minute (BPM), and describes\nthe frequency of beats. The phase offset determines the time offset of the beat. When\ntapping a foot in time to music, tempo is the rate of foot tapping and phase offset is the\ntime at which the tap occurs.\n\nThe beat tracking problem, in its general form, is quite difficult. Music is often ambiguous;\ndifferent human listeners can perceive the beat differently. 
There are often several beat tracks that could be considered correct. Human perception of the beat is influenced by both 'local' and contextual information; the beat can continue through several seconds of silence in the middle of a song.

We see the beat tracking problem not only as an interesting problem in its own right, but as one aspect of the larger problem of machine analysis of music. Given beat tracks for a number of songs, we could extract descriptions of the rhythm and use these features for clustering or searching in music collections. We could also use the rhythm information to do structural analysis of songs - for example, to find repeating sections. In addition, we note that beat tracking produces a description of the time scale of a song; knowledge of the tempo of a song would be one way to achieve time-invariance in a symbolic description. Finally, we note that beat tracking tells us where the important parts of a song are; the beats (and the major divisions of the beats) are good sampling points for other music-analysis problems such as note detection.

2 Related Work

Many researchers have investigated the beat tracking problem; we present only a brief overview here. Scheirer [2] presents a system, based on psychoacoustical observations, in which a bank of resonators compete to explain the processed audio input. The system is tested on a difficult set of examples, and considerable success is reported. The most common problem is a lack of global consistency in the results - the system switches between locally optimal solutions.

Goto [3] has described several systems for beat tracking.
He takes a very pragmatic view of the problem, and introduces a number of assumptions that allow good results in a limited domain - pop music in 4/4 time with roughly constant tempo, where bass or snare drums keep the beat according to drum patterns known a priori, or where chord changes occur at particular times within the measure.

Cemgil and Kappen [4] phrase the beat tracking problem in probabilistic terms, and we adapt their model as our local observation model. They use MIDI-like (event-based) input rather than audio, so their results are not easily comparable to ours.

3 Graphical Model

In formulating our model for beat tracking, we assume that the tempo is nearly constant over short periods of time, and usually varies smoothly. We expect the phase to be continuous. This allows us to use the simple graphical model shown in Figure 1. We break the song into a set of frames of two seconds; each frame is a node in the graphical model. We expect the tempo to be constant within each frame, and the tempo and phase offset parameters to vary smoothly between frames.

[Figure 1 shows a chain of hidden states X1, X2, X3, . . . , XF, each connected to its neighbours and to its observation Y1, Y2, Y3, . . . , YF.]

Figure 1: Our graphical model for beat tracking. The hidden state X is composed of the state variables tempo and phase offset. The observations Y are the features extracted by our audio signal processing. The potential function φ describes the compatibility of the observations with the state, while the potential function ψ describes the smoothness between neighbouring states.

In this undirected probabilistic graphical model, the potential function φ describes the compatibility of the state variables X = {T, P}, composed of tempo T and phase offset P, with the local observations Y. The potential function ψ describes the smoothness constraints between frames. The observation Y comes from processing the audio signal, which is described in Section 5.
The potential function φ comes from domain knowledge and is described in Section 4. This model allows us to trade off local fit and global smoothness in a principled manner. By using an undirected model, we allow contextual information to flow both forward and backward in time.

In such models, belief propagation (BP) [5] allows us to compute the marginal probabilities of the state variables in each frame. Alternatively, maximum belief propagation (max-BP) allows a joint maximum a posteriori (MAP) set of state variables to be determined. That is, given a song, we generate the observations Y_i, i = 1 . . . F (where F is the number of frames in the song) and seek a set of states X_i that maximize the joint product

P(X, Y) = (1/Z) ∏_{i=1}^{F} φ(Y_i, X_i) ∏_{i=1}^{F-1} ψ(X_i, X_{i+1}).

Our smoothness function ψ is the product of tempo and phase smoothness components ψ_T and ψ_P. For the tempo component, we use a Gaussian on the log of tempo. For the phase offset component, we want the phases to agree at a particular point in time: the boundary between the two frames (nodes), t_b. We find the phase of t_b predicted by the parameters in each frame, and place a Gaussian prior on the distance between points on the unit circle with these phases:

ψ(X_1, X_2 | t_b) = ψ_T(T_1, T_2) ψ_P(T_1, P_1, T_2, P_2 | t_b)
                  = N(log T_1 - log T_2, σ_T²) N(‖(cos θ_1 - cos θ_2, sin θ_1 - sin θ_2)‖, σ_P²),

where θ_i = 2π T_i t_b - P_i and N(x, σ²) is a zero-mean Gaussian with variance σ², evaluated at x. We set σ_T = 0.1 and σ_P = 0.1. The qualitative results seem to be fairly stable as a function of these smoothness parameters.

4 Domain Knowledge

In this section, we describe the derivation of our local potential function (also known as the observation model) φ(Y_i, X_i).

Our model is an adaptation of the work of [4], which was developed for use with MIDI input. Their model is designed so that it "prefers simpler [musical] notations".
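As a concrete illustration, the smoothness potential between neighbouring frames can be sketched in a few lines. This is a minimal sketch, not the authors' implementation; it assumes tempo T in beats per second and phase offset P in radians, and drops the Gaussian normalizing constants since ψ is unnormalized:

```python
import numpy as np

def smoothness_potential(T1, P1, T2, P2, t_b, sigma_T=0.1, sigma_P=0.1):
    """psi(X1, X2 | t_b): compatibility of two neighbouring frame states."""
    # Tempo component: Gaussian on the difference of log-tempos.
    psi_T = np.exp(-(np.log(T1) - np.log(T2)) ** 2 / (2 * sigma_T ** 2))
    # Phase component: each frame predicts a phase angle at the boundary t_b;
    # penalize the chord distance between the two points on the unit circle.
    th1 = 2 * np.pi * T1 * t_b - P1
    th2 = 2 * np.pi * T2 * t_b - P2
    d = np.hypot(np.cos(th1) - np.cos(th2), np.sin(th1) - np.sin(th2))
    psi_P = np.exp(-d ** 2 / (2 * sigma_P ** 2))
    return psi_T * psi_P
```

Identical states agree perfectly at the boundary and score 1; any tempo or phase disagreement lowers the potential smoothly.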
In their model, the beat is divided into a fixed number of bins (some power of two), and each note is assigned to the nearest bin. The probability of observing a note at a coarse subdivision of the beat is greater than at a finer subdivision. More precisely, a note that is quantized to the bin at beat position k has probability p(k) ∝ exp(-λ d(k)), where d(k) is the number of digits in the binary representation of the number k mod 1.

Since we use recorded music rather than MIDI, we must perform signal processing to extract features from the raw data. This process produces a signal that has considerably more uncertainty than the discrete events of MIDI data, so we adjust the model. We add the constraint that features should be observed near some quantization point, which we express by centering a Gaussian around each of the quantization points. The variance of this Gaussian, σ_Q², is in units of beats, so we arrive at the periodic template function b(t) shown in Figure 2. We have set the number of bins to 8, λ to one, and σ_Q = 0.025.

The template function b(t) expresses our belief about the distribution of musical events within the beat. By shifting and scaling b(t), we can describe the expected distribution of notes in time for different tempos and phase offsets:

b(t | T, P) = b(T t - P/2π).

Our signal processing (described below) yields a discrete set of events that are meant to correspond to musical events. Events occur at a particular time t and have a 'strength' or 'energy' E. Given a set of discrete events Y = {t_i, E_i}, i = 1 . . .
M, and state variables X = {T, P}, we take the probability that the events were drawn from the expected distribution b(t | T, P):

φ(Y, X) = φ({t, E}, {T, P}) = ∏_{i=1}^{M} b(t_i | T, P)^{E_i}.

[Figure 2 shows probability versus time within the beat, with peaks at 0, 1/8, 1/4, 3/8, 1/2, 5/8, 3/4, 7/8, and 1.]

Figure 2: One period of our template function b(t), which gives the expected distribution of notes within a beat. Given tempo and phase offset values, we stretch and shift this function to get the expected distribution of notes in time.

This is a multinomial probability function in the continuous limit (as the bin size becomes zero). Note that φ is a positive, unnormalized potential function.

5 Signal Processing

Our signal processing stage is meant to extract features that approximate musical events (drum beats, piano notes, guitar strums, etc.) from the raw audio signal. As discussed above, we produce a set of events composed of time and 'strength' values, where the strength describes our certainty that an event occurred. We assume that musical events are characterised by brief, rapid increases in energy in the audio signal. This is certainly the case for percussive instruments such as drums and piano, and will often be the case for string and woodwind instruments and for voices. This assumption breaks down for sounds that fade in smoothly rather than 'spikily'.

We begin by taking the short-time Fourier transform (STFT) of the signal: we slide a 50 millisecond Hann window over the signal in steps of 10 milliseconds, take the Fourier transform, and extract the energy spectrum. Following a suggestion by [2], we pass the energy spectrum through a bank of five filters that sum the energy in different portions of the spectrum. We take the logarithm of the summed energies to get a 'loudness' signal. Next, we convolve each of the five resulting energy signals with a filter that detects positive-going edges.
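For concreteness, the template function b(t) and observation potential φ of Section 4 can be sketched as follows. This is a sketch under the stated settings (eight bins, λ = 1, σ_Q = 0.025); wrapping the Gaussians across the beat boundary and the small floor inside the logarithm are our own implementation choices:

```python
import numpy as np

LAM, SIGMA_Q = 1.0, 0.025
# Quantization points k/8 and their binary-digit counts d(k):
# 0 -> 0 digits, 1/2 -> 1, 1/4 and 3/4 -> 2, odd eighths -> 3.
_DIGITS = {0: 0, 4: 1, 2: 2, 6: 2, 1: 3, 3: 3, 5: 3, 7: 3}

def template(t):
    """One period of b(t); t is a position within the beat (any real number)."""
    t = np.asarray(t, dtype=float)
    val = np.zeros_like(t)
    for k, d in _DIGITS.items():
        # Wrap the deviation so the template is periodic with period one beat.
        delta = (t - k / 8.0 + 0.5) % 1.0 - 0.5
        val += np.exp(-LAM * d) * np.exp(-delta ** 2 / (2 * SIGMA_Q ** 2))
    return val

def log_obs_potential(times, energies, T, P):
    """log phi(Y, X) = sum_i E_i log b(t_i | T, P), with b(t|T,P) = b(Tt - P/2pi)."""
    b = template(np.asarray(times) * T - P / (2 * np.pi))
    return float(np.sum(np.asarray(energies) * np.log(b + 1e-300)))
```

Events that fall on coarse subdivisions of the beat score higher than events on fine subdivisions, and events away from every quantization point are penalized sharply.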
The result of this convolution can be considered a 'loudness gain' signal. Finally, we find the maxima within 50 ms neighbourhoods. The result is a set of points that describe the energy gain signal in each band, with emphasis on the maxima. These are the features Y that we use in our local probability model φ.

6 Fast Inference

To find a maximum a posteriori (MAP) set of state variables that best explain a set of observations, we need to optimize a 2F-dimensional, continuous, non-linear, non-Gaussian function that has many local extrema. F is the number of frames in the song, so it is on the order of the length of the song in seconds - typically in the hundreds. This is clearly difficult. We present two approximation strategies. In the first strategy, we convert the continuous state space into a uniform discrete grid and run discrete belief propagation. In the second strategy, we run a particle filter in the forward direction, then use the particles as 'grid' points and run discrete belief propagation as per [6].

Since the landscape we are optimizing has many local maxima, we must use a fine discretization grid (for the first strategy) or a large number of particles (for the second strategy). The message-passing stage in discrete belief propagation takes O(N²) if performed naively, where N is the number of discretized states (or particles) per frame. We use a dual-tree recursion strategy as proposed in [7] and extended to maximum a posteriori inference in [8]. With this approach, the computation becomes feasible.

As an aside, we note that if we wish to compute the smoothed marginal probabilities rather than the MAP set of parameters, then we can use standard discrete belief propagation or particle smoothing. In both cases, the naive cost is O(N²), but by using the Fast Gauss Transform [9] the cost becomes O(N).
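The naive O(N²) message-passing step that the dual-tree recursion accelerates can be sketched as a standard max-product (Viterbi) pass over the discretized grid. This is a sketch only: the log_phi and log_psi arrays stand in for the potentials of Sections 3 and 4 evaluated on the grid, and a real implementation would replace the dense maximization with the dual-tree method of [7, 8]:

```python
import numpy as np

def map_chain(log_phi, log_psi):
    """Naive max-product (Viterbi) MAP inference on a chain.

    log_phi: (F, N) array of log observation potentials per frame and state.
    log_psi: (N, N) array of log smoothness potentials between neighbours.
    Returns the MAP state index for each of the F frames.
    """
    F, N = log_phi.shape
    msg = log_phi[0].copy()  # max log-prob of a sequence ending in each state
    back = np.zeros((F, N), dtype=int)
    for i in range(1, F):
        # O(N^2) step: for each next state, maximize over previous states.
        scores = msg[:, None] + log_psi            # (prev, next)
        back[i] = np.argmax(scores, axis=0)
        msg = scores[back[i], np.arange(N)] + log_phi[i]
    # Backtrack from the best final state (the 'last max-message' of Section 7).
    states = np.empty(F, dtype=int)
    states[-1] = int(np.argmax(msg))
    for i in range(F - 1, 0, -1):
        states[i - 1] = back[i, states[i]]
    return states
```

With strongly diagonal log_psi the decoder prefers smooth state sequences; with a flat log_psi it reduces to a per-frame argmax of the observation potential.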
This O(N) cost is achievable because our smoothness potential ψ is a low-dimensional Gaussian.

For the results presented here, we discretize the state space into N_T = 90 tempo values and N_P = 50 phase offset values for the belief propagation version. We distribute the tempo values uniformly on a log scale between 40 and 150 BPM, and distribute the phase offsets uniformly. For the particle filter version, we use N_T N_P = 4500 particles. With these values, our Matlab and C implementation runs faster than real time (the duration of the song) on a standard desktop computer.

7 Results

A standard corpus of labelled ground truth data for the beat-tracking problem does not exist. Therefore, we labelled a relatively small number of songs for evaluation of our algorithm, by listening to the songs and pressing a key at each perceived beat. We sought out examples that we thought would be difficult, and we attempted to avoid the methods of [10]. Ideally, we would have several human listeners label each song, since this would help to capture the ambiguity inherent in the problem. However, this would be quite time-consuming.

One can imagine several methods for speeding up the process of generating ground truth labellings and of cleaning up the noisy results generated by humans. For example, a human labelling of a short segment of the song could be automatically extrapolated to the remainder of the song, using energy spikes in the audio signal to fine-tune the placement of beats. However, by generating ground truth using assumptions similar to those embodied in the models we intend to test, we risk invalidating the results. We instead opted to use 'raw' human-labelled songs.

There is no standard evaluation metric for beat tracking.
We use the function presented by Cemgil et al. [11] and used by Dixon [1] in his analysis:

ρ(S, T) = 100 / ((N_S + N_T)/2) · Σ_{i=1}^{N_S} max_j exp(-(S_i - T_j)² / (2σ²)),

where S and T are the ground-truth and proposed beat times, and σ is set to 40 milliseconds. A value near 100 means that each predicted beat is close to a true beat, while a value near zero means that each predicted beat is far from a true beat.

We have focused on finding a globally-optimum beat track rather than precisely locating each beat. We could likely improve the ρ values of our results by fine-tuning each predicted beat, for example by finding nearby energy peaks, though we have not done this in the results presented here.

Table 1 shows a summary of our results. Note the wide range of genres and the choice of songs with features that we thought would make beat tracking difficult. This includes all our results (not just the ones that look good).

The first columns list the name of the song and the reason we included it. The third column lists the qualitative performance of the fixed grid version: double means our algorithm produced a beat track twice as fast as ground truth, half means we tracked at half speed, and sync means we produced a syncopated (π phase error) beat track. A blank entry means our algorithm produced the correct beat track. A star (*) means that our result incorrectly switches phase or tempo. The ρ values are after compensating for the qualitative error (if any). The fifth column shows a histogram of the absolute phase error (0 to π); this is also after correcting for qualitative error. The remaining columns contain the same items for the particle filter version.

[Table 1: Results for the 25 test songs, spanning classical piano, string quartet, orchestra, jazz, solo voice, solo guitar, Newfoundland folk, Cuban, rock, reggae, punk, pop-punk, electronica, ambient, sitar, and Indonesian gamelan (artists include Bach / Glenn Gould, Kronos Quartet, Miles Davis, Buena Vista Social Club, The Beatles, U2, Cake, Sublime, Rancid, Green Day, and Ravi Shankar, among others). For each song the table lists the qualitative performance and ρ value of the fixed-grid (BP) version (ρ: 88 77 75 44 61 57 78 40 70 59 70 79 72 42 82 81 79 82 75 79 71 79 71 86 89) and of the particle filter (PF) version (ρ: 86 77 71 50 59 59 77 42 69 61 68 78 72 41 82 80 79 79 74 79 71 79 67 89 88).]

Figure 3: Tempo tracks for Cake / I Will Survive. Center: 'raw' ground-truth tempo (instantaneous tempo estimate based on the time between adjacent beats) and smoothed ground truth (by averaging). Left: fixed-grid version result. Right: particle filter result.

Out of 25 examples, the fixed grid version produces the correct answer in 17 cases, tracks at double speed in two cases, half speed in two cases, syncopated in one case, and in three cases produces a track that (incorrectly) switches tempo or phase. The particle filter version produces 16 correct answers, two double-speed, two half-speed, two syncopated, and the same three 'switching' tracks.

An example of a successful tempo track is shown in Figure 3.

The result for Lucy In The Sky With Diamonds (one of the 'switching' results) is worth examination. The song switches time signature between 3/4 and 4/4 a total of five times; see Figure 4. Our results follow the time signature change the first three times. On the fourth change (from 4/4 to 3/4), we track at 2/3 the ground truth rate instead. We note an interesting effect when we examine the final message that is passed during belief propagation. This message tells us the maximum probability of a sequence that ends with each state. The global maximum corresponds to the beat track shown in the left plot.
The local maximum near 50 BPM corresponds to an alternate solution in which, rather than tracking the quarter notes, we produce one beat per measure; this track is quite plausible. Indeed, the 'true' track is difficult for human listeners. Note also that there is a local maximum near 100 BPM but phase-shifted by half a beat. This is the solution in which the beats are syncopated relative to the true result.

Figure 4: Left: Tempo tracks for Lucy In The Sky With Diamonds. The vertical lines mark times at which the time signature changes between 3/4 and 4/4. Right: the last max-message computed during belief propagation. Bright means high probability. The global maximum corresponds to the tempo track shown. Note the local maximum around 50 BPM, which corresponds to an alternate feasible result. See the text for discussion.

8 Conclusions and Further Work

We present a graphical model for beat tracking and evaluate it on a set of varied and difficult examples. We achieve good results that are comparable with those reported by other researchers, although direct comparisons are impossible without a shared data set.

There are several advantages to formulating the problem in a probabilistic setting. The beat tracking problem has inherent ambiguity, and multiple interpretations are often plausible. With a probabilistic model, we can produce several candidate solutions with different probabilities. This is particularly useful for situations in which beat tracking is one element in a larger machine listening application. Probabilistic graphical models allow flexible and powerful handling of uncertainty, and allow local and contextual information to interact in a principled manner. Additional domain knowledge and constraints can be added in a clean and principled way. The adoption of an efficient dual-tree recursion for graphical models [7, 8] enables us to carry out inference in real time.

We would like to investigate several modifications of our model and inference methods. Longer-range tempo smoothness constraints, as suggested by [11], could be useful. The extraction of MAP sets of parameters for several qualitatively different solutions would help to express the ambiguity of the problem. The particle filter could also be changed. At present, we first perform a full particle filtering sweep and then run max-BP. Taking into account the quality of the partial MAP solutions during particle filtering might allow superior results by directing more particles toward regions of the state space that are likely to contain the final MAP solution. Since we know that our probability terrain is multi-modal, a mixture particle filter would be useful [12].

References

[1] S Dixon. An empirical comparison of tempo trackers. Technical Report TR-2001-21, Austrian Research Institute for Artificial Intelligence, Vienna, Austria, 2001.

[2] E D Scheirer. Tempo and beat analysis of acoustic musical signals. J. Acoust. Soc. Am., 103(1):588–601, Jan 1998.

[3] M Goto. An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research, 30(2):159–171, 2001.

[4] A T Cemgil and H J Kappen. Monte Carlo methods for tempo tracking and rhythm quantization. Journal of Artificial Intelligence Research, 18(1):45–81, 2003.

[5] J Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[6] S J Godsill, A Doucet, and M West. Maximum a posteriori sequence estimation using Monte Carlo particle filters. Ann. Inst. Stat. Math., 53(1):82–96, March 2001.

[7] A G Gray and A W Moore. 'N-Body' problems in statistical learning.
In Advances in Neural Information Processing Systems 13, pages 521–527, 2000.

[8] M Klaas, D Lang, and N de Freitas. Fast maximum a posteriori inference in Monte Carlo state spaces. In AI-STATS, 2005.

[9] L Greengard and J Strain. The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991.

[10] D LaLoudouana and M B Tarare. Data set selection. Presented at NIPS Workshop, 2002.

[11] A T Cemgil, B Kappen, P Desain, and H Honing. On tempo tracking: Tempogram representation and Kalman filtering. Journal of New Music Research, 28(4):259–273, 2001.

[12] J Vermaak, A Doucet, and P Perez. Maintaining multi-modality through mixture tracking. In ICCV, 2003.
", "award": [], "sourceid": 2745, "authors": [{"given_name": "Dustin", "family_name": "Lang", "institution": null}, {"given_name": "Nando", "family_name": "Freitas", "institution": null}]}