{"title": "A Bayesian Network for Real-Time Musical Accompaniment", "book": "Advances in Neural Information Processing Systems", "page_first": 1433, "page_last": 1439, "abstract": "", "full_text": "A Bayesian Network for Real-Time \n\nMusical Accompaniment \n\nChristopher Raphael \n\nDepartment of Mathematics and Statistics, \n\nUniversity of Massachusetts at Amherst, \n\nAmherst, MA 01003-4515, \nraphael~math.umass.edu \n\nAbstract \n\nWe describe a computer system that provides a real-time musi(cid:173)\ncal accompaniment for a live soloist in a piece of non-improvised \nmusic for soloist and accompaniment. A Bayesian network is devel(cid:173)\noped that represents the joint distribution on the times at which \nthe solo and accompaniment notes are played, relating the two \nparts through a layer of hidden variables. The network is first con(cid:173)\nstructed using the rhythmic information contained in the musical \nscore. The network is then trained to capture the musical interpre(cid:173)\ntations of the soloist and accompanist in an off-line rehearsal phase. \nDuring live accompaniment the learned distribution of the network \nis combined with a real-time analysis of the soloist's acoustic sig(cid:173)\nnal, performed with a hidden Markov model, to generate a musi(cid:173)\ncally principled accompaniment that respects all available sources \nof knowledge. A live demonstration will be provided. \n\n1 \n\nIntroduction \n\nWe discuss our continuing work in developing a computer system that plays the \nrole of a musical accompanist in a piece of non-improvisatory music for soloist \nand accompaniment. The system begins with the musical score to a given piece \nof music. Then, using training for the accompaniment part as well as a series of \nrehearsals, we learn a performer-specific model for the rhythmic interpretation of the \ncomposition. 
In performance, the system takes the acoustic signal of the live player and generates the accompaniment around this signal, in real time, while respecting the learned model and the constraints imposed by the score. The accompaniment played by our system responds both flexibly and expressively to the soloist's musical interpretation. \n\nOur system is composed of two high-level tasks we call \"Listen\" and \"Play.\" Listen takes as input the acoustic signal of the soloist and, using a hidden Markov model, performs a real-time analysis of the signal. The output of Listen is essentially a running commentary on the acoustic input which identifies note boundaries in the solo part and communicates these events with variable latency. The HMM framework is well suited to the listening task and has several attributes we regard as indispensable to any workable solution: \n\n1. The HMM allows unsupervised training using the Baum-Welch algorithm. Thus we can automatically adapt to changes in solo instrument, microphone placement, ambient noise, room acoustics, and the sound of the accompaniment instrument. \n\n2. Musical accompaniment is inherently a real-time problem. Fast dynamic programming algorithms provide the computational efficiency necessary to process the soloist's acoustic signal at a rate consistent with the real-time demands of our application. \n\n3. Musical signals are occasionally ambiguous locally in time, but become easier to parse when more context is considered. Our system owes much of its accuracy to the probabilistic formulation of the HMM. This formulation allows one to compute the probability that an event is in the past. We delay the estimation of the precise location of an event until we are reasonably confident that it is, in fact, past. In this way our system achieves accuracy while retaining the lowest latency possible in the identification of musical events. 
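The delayed-decision idea in point 3 can be sketched with a toy online forward filter. The two-state left-to-right chain, the transition matrix, the per-frame likelihoods, and the 0.95 confidence threshold below are illustrative assumptions of ours, not the system's actual parameters:

```python
import numpy as np

def forward_step(alpha, A, lik):
    # One online forward-filtering step: propagate the state posterior
    # through the transition matrix A, reweight by the per-frame
    # observation likelihoods, and renormalize.
    alpha = (A.T @ alpha) * lik
    return alpha / alpha.sum()

def detect_onset(frame_liks, A, alpha0, boundary, threshold=0.95):
    # Report the first frame at which the filtered probability that the
    # chain has reached state `boundary` (or beyond) exceeds `threshold`,
    # i.e. the note boundary is declared only once it is confidently past.
    alpha = alpha0
    for frame, lik in enumerate(frame_liks):
        alpha = forward_step(alpha, A, lik)
        if alpha[boundary:].sum() > threshold:
            return frame
    return None

# Toy left-to-right chain: state 0 = first note sounding, state 1 = next note.
A = np.array([[0.9, 0.1],
              [0.0, 1.0]])
alpha0 = np.array([1.0, 0.0])
# Hypothetical per-frame likelihoods p(y_t | state); the signal starts to
# look like the second note at frame 3, but the filter commits only later,
# once the posterior is confident the boundary is in the past.
frame_liks = [np.array([0.8, 0.2])] * 3 + [np.array([0.1, 0.9])] * 5
onset_frame = detect_onset(frame_liks, A, alpha0, boundary=1)
```

The gap between the first note-2-like frame and the reported frame is the variable latency mentioned above: ambiguous evidence delays the commitment, unambiguous evidence shortens it.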
\n\nOur work on the Listen component is documented thoroughly in [1] and we omit a more detailed discussion here. \n\nThe heart of our system, the Play component, develops a Bayesian network consisting of hundreds of Gaussian random variables, including both observable quantities, such as note onset times, and unobservable quantities, such as local tempo. The network can be trained during a rehearsal phase to model both the soloist's and accompanist's interpretations of a specific piece of music. This model then forms the backbone of a principled real-time decision-making engine used in performance. We focus here on the Play component, which is the most challenging part of our system. A more detailed treatment of various aspects of this work is given in [2-4]. \n\n2 Knowledge Sources \n\nA musical accompaniment requires the synthesis of a number of different knowledge sources. From a modeling perspective, the fundamental challenge of musical accompaniment is to express these disparate knowledge sources in terms of a common denominator. We describe here the three knowledge sources we use. \n\n1. We work with non-improvisatory music, so naturally the musical score, which gives the pitches and relative durations of the various notes, as well as points of synchronization between the soloist and accompaniment, must figure prominently in our model. The score should not be viewed as a rigid grid prescribing the precise times at which musical events will occur; rather, the score gives the basic elastic material which will be stretched in various ways to produce the actual performance. The score simply does not address most interpretive aspects of performance. \n\n2. Since our accompanist must follow the soloist, the output of the Listen component, which identifies note boundaries in the solo part, constitutes our second knowledge source. 
While most musical events, such as changes between neighboring diatonic pitches, can be detected very shortly after the change of note, some events, such as rearticulations and octave slurs, are much less obvious and can only be precisely located with the benefit of longer-term hindsight. With this in mind, we feel that any successful accompaniment system cannot synchronize in a purely responsive manner. Rather, it must be able to predict the future using the past and base its synchronization on these predictions, as human musicians do. \n\n3. While the same player's performance of a particular piece will vary from rendition to rendition, many aspects of musical interpretation are clearly established with only a few repeated examples. These examples, both of solo performances and human (MIDI) performances of the accompaniment part, constitute the third knowledge source for our system. The solo data is used primarily to teach the system how to predict the future evolution of the solo part. The accompaniment data is used to learn the musicality necessary to bring the accompaniment to life. \n\nWe have developed a probabilistic model, a Bayesian network, that represents all of these knowledge sources through a jointly Gaussian distribution containing hundreds of random variables. The observable variables in this model are the estimated soloist note onset times produced by Listen and the directly observable times for the accompaniment notes. Between these observable variables lies a layer of hidden variables that describe unobservable quantities such as local tempo, change in tempo, and rhythmic stress. \n\n3 A Model for Rhythmic Interpretation \n\nWe begin by describing a model for the sequence of note onset times generated by a monophonic (single voice) musical instrument playing a known piece of music. For each of the notes, indexed by n = 0, ..., N, we define a random vector representing the time, t_n (in seconds), at which the note begins, and the local \"tempo,\" s_n (in seconds per measure), for the note. We model this sequence of random vectors through a random difference equation: \n\nt_{n+1} = t_n + ℓ_n s_n + τ_n,   s_{n+1} = s_n + σ_n,   (1) \n\nn = 0, ..., N - 1, where ℓ_n is the musical length of the nth note, in measures, and the {(τ_n, σ_n)^t} and (t_0, s_0)^t are mutually independent Gaussian random vectors. \n\nThe distributions of the {σ_n} will tend to concentrate around 0, expressing the notion that tempo changes are gradual. The means and variances of the {σ_n} show where the soloist is speeding up (negative mean) or slowing down (positive mean), and tell us if these tempo changes are nearly deterministic (low variance) or quite variable (high variance). The {τ_n} variables describe stretches (positive mean) or compressions (negative mean) in the music that occur without any actual change in tempo, as in a tenuto or agogic accent. The addition of the {τ_n} variables leads to a more musically plausible model, since not all variation in note lengths can be explained through tempo variation. Equally important, however, the {τ_n} variables stabilize the model by not forcing it to explain, and hence respond to, all note length variation as tempo variation. \n\nCollectively, the distributions of the (τ_n, σ_n)^t vectors characterize the solo player's rhythmic interpretation. Both overall tendencies (means) and the repeatability of these tendencies (covariances) are captured by these distributions. \n\n3.1 Joint Model of Solo and Accompaniment \n\nIn modeling the situation of musical accompaniment we begin with our basic rhythm model of Eqn. 1, now applied to the composite rhythm. More precisely, \n\nFigure 1: A graphical description of the dependency structure of our model. The top layer of the graph corresponds to the solo note onset times detected by Listen. 
The 2nd layer of the graph describes the (τ_n, σ_n) variables that characterize the rhythmic interpretation. The 3rd layer of the graph is the time-tempo process {(s_n, t_n)}. The bottom layer is the observed accompaniment event times. \n\nlet m_0^s, ..., m_{N_s}^s and m_0^a, ..., m_{N_a}^a denote the positions, in measures, of the various solo and accompaniment events. For example, a sequence of quarter notes in 3/4 time would lie at measure positions 0, 1/3, 2/3, etc. We then let m_0, ..., m_N be the sorted union of these two sets of positions with duplicate times removed; thus m_0 < m_1 < ... < m_N. We then use the model of Eqn. 1 with ℓ_n = m_{n+1} - m_n, n = 0, ..., N - 1. A graphical description of this model is given in the middle two layers of Figure 1. In this figure, the layer labeled \"Composite\" corresponds to the time-tempo variables, (t_n, s_n)^t, for the composite rhythm, while the layer labeled \"Update\" corresponds to the interpretation variables (τ_n, σ_n)^t. The directed arrows of this graph indicate the conditional dependency structure of our model. Thus, given all variables \"upstream\" of a variable, x, in the graph, the conditional distribution of x depends only on the parent variables. \n\nRecall that the Listen component estimates the times at which solo notes begin. How do these estimates figure into our model? We model the note onset times estimated by Listen as noisy observations of the true positions {t_n}. Thus if m_n is a measure position at which a solo note occurs, then the corresponding estimate from Listen is modeled as \n\na_n = t_n + α_n \n\nwhere α_n ~ N(0, ν²). Similarly, if m_n is the measure position of an accompaniment event, then we model the observed time at which the event occurs as \n\nb_n = t_n + β_n \n\nwhere β_n ~ N(0, κ²). 
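As a concrete illustration, the generative story of Eqn. 1 together with this observation layer can be simulated directly. The tempo, note lengths, and noise scales below are invented for the example and are not values estimated by the system:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_performance(lengths, t0=0.0, s0=2.0,
                       tau_sd=0.01, sigma_sd=0.05, obs_sd=0.02):
    # Sample onset times from the time-tempo difference equation (Eqn. 1)
    # and return both the true times t_n and noisy observations of them.
    t, s = t0, s0
    true_times = [t]
    observed = [t + rng.normal(0.0, obs_sd)]     # a_n = t_n + alpha_n
    for l in lengths:                            # l = note length in measures
        t = t + l * s + rng.normal(0.0, tau_sd)  # local stretch tau_n
        s = s + rng.normal(0.0, sigma_sd)        # tempo increment sigma_n
        true_times.append(t)
        observed.append(t + rng.normal(0.0, obs_sd))
    return np.array(true_times), np.array(observed)

# Four quarter notes in 3/4 time at roughly 2 seconds per measure,
# so each note is 1/3 of a measure long.
true_t, obs_t = sample_performance([1/3] * 4)
```

With small σ_n variance the tempo drifts slowly, so successive inter-onset intervals stay near ℓ_n s_0 while the τ_n terms add per-note stretch that does not propagate into the tempo.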
These two collections of observable variables constitute the top layer of our figure, labeled \"Listen,\" and the bottom layer, labeled \"Accomp.\" There are, of course, measure positions at which both solo and accompaniment events should occur. If n indexes such a time then a_n and b_n will both be noisy observations of the true time t_n. The vectors and variables {(t_0, s_0)^t, (τ_n, σ_n)^t, α_n, β_n} are assumed to be mutually independent. \n\n4 Training the Model \n\nOur system learns its rhythmic interpretation by estimating the parameters of the (τ_n, σ_n) variables. We begin with a collection of J performances of the accompaniment part played in isolation. We refer to the model learned from this accompaniment data as the \"practice room\" distribution since it reflects the way the accompanist plays when the constraint of following the soloist is absent. \n\nFigure 2: Conditioning on the observed accompaniment performance (darkened circles), we use the message passing algorithm to compute the conditional distributions on the unobservable {(τ_n, σ_n)} variables. \n\nFor each such performance, we treat the sequence of times at which accompaniment events occur as observed variables in our model. These variables are shown with darkened circles in Figure 2. Given an initial assignment of means and covariances to the (τ_n, σ_n) variables, we use the \"message passing\" algorithm of Bayesian networks [8,9] to compute the conditional distributions (given the observed performance) of the (τ_n, σ_n) variables. Several such performances lead to several such estimates, enabling us to improve our initial estimates by reestimating the (τ_n, σ_n) parameters from these conditional distributions. \n\nMore specifically, we estimate the (τ_n, σ_n) parameters using the EM algorithm, as in [7], as follows. 
We let μ_n^i, Σ_n^i be our initial mean and covariance matrix for the vector (τ_n, σ_n). The conditional distribution of (τ_n, σ_n) given the jth accompaniment performance, and using {μ_n^i, Σ_n^i}, has a N(m_{j,n}^i, S_n^i) distribution, where the m_{j,n}^i and S_n^i parameters are computed using the message passing algorithm. We then update our mean estimates by \n\nμ_n^{i+1} = (1/J) Σ_{j=1}^{J} m_{j,n}^i \n\nwith an analogous update for the covariance Σ_n^{i+1}. \n\nThe conventional wisdom of musicians is that the accompaniment should follow the soloist. In past versions of our system we have explicitly modeled the asymmetric roles of soloist and accompaniment through a rather complicated graph structure [2-4]. At present we deal with this asymmetry in a more ad hoc, though perhaps more effective, manner, as follows. \n\nTraining using the accompaniment performances allows our model to learn some of the musicality these performances demonstrate. Since the soloist's interpretation must take precedence, we want to use this accompaniment interpretation only to the extent that it does not conflict with that of the soloist. We accomplish this by first beginning with the result of the accompaniment training described above. We use the practice room distributions (the distributions on the {(τ_n, σ_n)} learned from the accompaniment data) as the initial distributions {μ_n^0, Σ_n^0}. We then run the EM algorithm as described above, now treating the currently available collection of solo performances as the observed data. During this phase, only those parameters relevant to the soloist's rhythmic interpretation will be modified significantly. Parameters describing the interpretation of a musical segment in which the soloist is mostly absent will be largely unaffected by the second training pass. 
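A minimal sketch of this reestimation step, assuming the message-passing pass has already produced per-performance conditional means and covariances for one (τ_n, σ_n) vector. The function name and toy numbers are ours; the covariance update shown is the standard EM formula for a Gaussian prior, which the paper's update implies but does not display:

```python
import numpy as np

def reestimate(cond_means, cond_covs):
    # M-step for the (tau_n, sigma_n) prior at one note n: the new mean
    # averages the per-performance conditional means m_{j,n}, and the new
    # covariance folds in both the conditional covariances S_{j,n} and the
    # spread of the conditional means around the new mean.
    m = np.asarray(cond_means, float)      # shape (J, 2)
    S = np.asarray(cond_covs, float)       # shape (J, 2, 2)
    mu_new = m.mean(axis=0)
    d = m - mu_new
    Sigma_new = (S + np.einsum('ji,jk->jik', d, d)).mean(axis=0)
    return mu_new, Sigma_new

# Two rehearsals whose conditional means disagree slightly:
mu, Sigma = reestimate([[0.1, 0.0], [0.3, 0.0]],
                       [0.01 * np.eye(2)] * 2)
```

Iterating this over all notes and performances, then recomputing the conditional distributions under the new {μ_n, Σ_n}, gives the EM loop described above.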
\n\nFigure 3: At any given point in the performance we will have observed a collection of solo note times estimated by Listen, and the accompaniment event times (the darkened circles). We compute the conditional distribution on the next unplayed accompaniment event, given these observations. \n\nThis solo training actually happens over the course of a series of rehearsals. We first initialize our model to the practice room distribution by training with the accompaniment data. Then we iterate the process of creating a performance with our system (described in the next section), extracting the sequence of solo note onset times in an off-line estimation process, and then retraining the model using all currently available solo performances. In our experience, only a few such rehearsals, generally fewer than 10, are necessary to train a system that responds gracefully and anticipates the soloist's rhythmic nuance where appropriate. \n\n5 Real Time Accompaniment \n\nThe methodological key to our real-time accompaniment algorithm is the computation of (conditional) marginal distributions facilitated by the message-passing machinery of Bayesian networks. At any point during the performance some collection of solo notes and accompaniment notes will have been observed, as in Fig. 3. Conditioned on this information we can compute the distribution on the next unplayed accompaniment note. The real-time computational requirement is limited by passing only the messages necessary to compute the marginal distribution on the pending accompaniment note. \n\nOnce the conditional marginal distribution of the pending accompaniment note is calculated, we schedule the note accordingly. Currently we schedule the note to be played at the conditional mean time, given all observed information; however, other reasonable choices are possible. 
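Because the model is jointly Gaussian, this scheduling rule is a Gaussian conditional mean. As a sketch, with a toy dense covariance standing in for the network's message passing, and all numbers invented for the example:

```python
import numpy as np

def schedule_time(mu, Sigma, obs_idx, pend_idx, obs_vals):
    # Conditional mean of the pending note time in a joint Gaussian:
    #   mu_p + Sigma_po Sigma_oo^{-1} (x_o - mu_o)
    # where o indexes the observed note times and p the pending note.
    mu = np.asarray(mu, float)
    Sigma = np.asarray(Sigma, float)
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_po = Sigma[pend_idx, obs_idx]            # cross-covariance row
    resid = np.asarray(obs_vals, float) - mu[obs_idx]
    return mu[pend_idx] + S_po @ np.linalg.solve(S_oo, resid)

# Three note times with prior means [1, 2, 3] and a random-walk covariance;
# the first two notes were played late, so the pending third note is
# scheduled later than its prior mean.
mu = [1.0, 2.0, 3.0]
Sigma = [[1.0, 1.0, 1.0],
         [1.0, 2.0, 2.0],
         [1.0, 2.0, 3.0]]
when = schedule_time(mu, Sigma, obs_idx=[0, 1], pend_idx=2,
                     obs_vals=[1.2, 2.4])
```

The real system never forms these dense matrices: message passing yields the same conditional marginal by propagating only along the graph, which is what makes the computation cheap enough for real time.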
Note that this conditional distribution depends on all of the sources of information included in our model: the score information, all currently observed solo and accompaniment note times, and the rhythmic interpretations demonstrated by both the soloist and accompanist captured during the training phase. \n\nThe initial scheduling of each accompaniment note takes place immediately after the previous accompaniment note is played. It is possible that a solo note will be detected before the pending accompaniment is played; in this event the pending accompaniment event is rescheduled by recomputing its conditional distribution using the newly available information. The pending accompaniment note is rescheduled each time an additional solo note is detected until its currently scheduled time arrives, at which time it is finally played. In this way our accompaniment makes use of all currently available information. \n\nDoes our system pass the musical equivalent of the Turing Test? We presume no more objectivity in answering this question than we would have in judging the merits of our other children. However, we believe that the level of musicality attained by our system is truly surprising, while the reliability is sufficient for live demonstration. We hope that the interested reader will form an independent opinion, even if different from ours, and to this end we have made musical examples demonstrating our progress available on the web page: http://fafner.math.umass.edu/musicplus_one. \n\nAcknowledgments \n\nThis work was supported by NSF grants IIS-998789 and IIS-0113496. \n\nReferences \n\n[1] Raphael C. (1999), \"Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models,\" IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 4, pp. 360-370. \n\n[2] Raphael C. 
(2001), \"A Probabilistic Expert System for Automatic Musical Accompa(cid:173)\nniment,\" Journal of Computational and Graphical Statistics, vol. 10 no. 3, 487-512. \n\n[3] Raphael C. (2001), \"Can the Computer Learn to Play Expressively?\" Proceedings of \nEighth International Workshop on Artificial Intelligence and Statistics, 113-120, Morgan \nKauffman. \n\n[4] Raphael C. (2001), \"Synthesizing Musical Accompaniments with Bayesian Belief Net(cid:173)\nworks,\" Journal of New Music Research, vol. 30, no. 1, 59-67. \n\n[5] Spiegelhalter D., Dawid A. P., Lauritzen S., Cowell R. (1993), \"Bayesian Analysis in \nExpert Systems,\" Statistical Science, Vol. 8, No.3, pp. 219-283. \n\n[6] Cowell R., Dawid A. P., Lauritzen S., Spiegelhalter D. (1999), \"Probabilistic Networks \nand Expert Systems,\" Springer, New York. \n\n[7] Lauritzen S. L. (1995), \"The EM Algorithm for Graphical Association Models with \nMissing Data,\" Computational Statistics and Data Analysis, Vol. 19, pp. 191-20l. \n\n[8] Lauritzen S. L. (1992), \"Propagation of Probabilities, Means, and Variances in Mixed \nGraphical Association Models,\" Journal of the American Statistical Association, Vol. 87, \nNo. 420, (Theory and Methods), pp. 1098-1108. \n\n[9] Lauritzen S. L. and F. Jensen (1999), \"Stable Local Computation with Conditional \nGaussian Distributions,\" Technical Report R-99-2014, Department of Mathematic Sci(cid:173)\nences, Aalborg University. \n\n\f", "award": [], "sourceid": 2035, "authors": [{"given_name": "Christopher", "family_name": "Raphael", "institution": null}]}