{"title": "\"Name That Song!\" A Probabilistic Approach to Querying on Music and Text", "book": "Advances in Neural Information Processing Systems", "page_first": 1529, "page_last": 1536, "abstract": "", "full_text": "\u201cName That Song!\u201d: A Probabilistic Approach\n\nto Querying on Music and Text\n\nEric Brochu\n\nDepartment of Computer Science\nUniversity of British Columbia\n\nVancouver, BC, Canada\nebrochu@cs.ubc.ca\n\nNando de Freitas\n\nDepartment of Computer Science\nUniversity of British Columbia\n\nVancouver, BC, Canada\nnando@cs.ubc.ca\n\nAbstract\n\nWe present a novel, \ufb02exible statistical approach for modelling music and\ntext jointly. The approach is based on multi-modal mixture models and\nmaximum a posteriori estimation using EM. The learned models can be\nused to browse databases with documents containing music and text, to\nsearch for music using queries consisting of music and text (lyrics and\nother contextual information), to annotate text documents with music,\nand to automatically recommend or identify similar songs.\n\n1 Introduction\n\nVariations on \u201cname that song\u201d-types of games are popular on radio programs. DJs play a\nshort excerpt from a song and listeners phone in to guess the name of the song. Of course,\ncallers often get it right when DJs provide extra contextual clues (such as lyrics, or a piece\nof trivia about the song or band). We are attempting to reproduce this ability in the context\nof information retrieval (IR). In this paper, we present a method for querying with words\nand/or music.\n\nWe focus on monophonic and polyphonic musical pieces of known structure (MIDI \ufb01les,\nfull music notation, etc.). Retrieving these pieces in multimedia databases, such as the\nWeb, is a problem of growing interest [1, 2]. A signi\ufb01cant step was taken by Downie [3],\nwho applied standard text IR techniques to retrieve music by, initially, converting music to\ntext format. 
Most research (including [3]) has, however, focused on plain music retrieval.\nTo the best of our knowledge, there has been no attempt to model text and music jointly.\n\nWe propose a joint probabilistic model for documents with music and/or text. This model\nis simple, easily extensible, \ufb02exible and powerful. It allows users to query multimedia\ndatabases using text and/or music as input. It is well-suited for browsing applications as\nit organizes the documents into \u201csoft\u201d clusters. The document of highest probability in\neach cluster serves as a music thumbnail for automated music summarisation. The model\nallows one to query with an entire text document to automatically annotate the document\nwith musical pieces. It can be used to automatically recommend or identify similar songs.\nFinally, it allows for the inclusion of different types of text, including website content,\nlyrics, and meta-data such as hyper-text links. The interested reader may further wish to\nconsult [4], in which we discuss an application of our model to the problem of jointly\n\n\fmodelling music, as well as text and images.\n\n2 Model speci\ufb01cation\n\nThe training data consists of documents with text (lyrics or information about the song) and\nmusical scores in GUIDO notation [5]. (GUIDO is a powerful language for representing\nmusical scores in an HTML-like notation. MIDI \ufb01les, plentiful on the World Wide Web,\ncan be easily converted to this format.) We model the data with a Bayesian multi-modal\nmixture model. Words and scores are assumed to be conditionally independent given the\nmixture component label.\n\nWe model musical scores with \ufb01rst-order Markov chains, in which each state corresponds\nto a note, rest, or the start of a new voice. 
Notes' pitches are represented by the interval change (in semitones) from the previous note, rather than by absolute pitch, so that a score or query transposed to a different key will still have the same Markov chain. Rhythm is similarly represented as a scalar multiple of the previous note's duration. Rest states are represented in the same way, save that pitch is not represented. See Figure 1 for an example.

Polyphonic scores are represented by chaining the beginning of a new voice to the end of the previous one. In order to ensure that the first note in each voice appears in both the row and column of the Markov transition matrix, a special "new voice" state with no interval or rhythm serves as a dummy state marking the beginning of a new voice. The first note of a voice has a distinguishing "first note" interval value, and the first note or rest has a duration value of one.

[ *3/4 b&1*3/16 b1/16 c#2*11/16 b&1/16 a&1*3/16 b&1/16 f#1/2 ]

[Figure 1 state table: states 0-8, with an INTERVAL column reading newvoice, rest, firstnote, +1, +2, -2, -2, +3, -5, and a DURATION column of relative duration values.]

Figure 1: Sample melody – the opening notes to "Yellow Submarine" by The Beatles – in different notations. From top: GUIDO notation, standard musical notation (generated automatically from GUIDO notation), and as a series of states in a first-order Markov chain (also generated automatically from GUIDO notation).

The Markov chain representation of a piece of music $y_i$ is then mapped to a sparse transition frequency table $m_i$, where $m_{i,ss'}$ denotes the number of times we observe the transition from state $s$ to state $s'$ in document $i$. We use $y_{i,1}$ to denote the initial state of the Markov chain. 
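As a concrete illustration of this representation, the following sketch converts a toy melody into interval/duration-ratio states and counts transitions. It is a minimal sketch under stated assumptions, not the authors' implementation: input is assumed to be (MIDI pitch, duration) pairs rather than parsed GUIDO, and all names are illustrative.

```python
from collections import Counter

def melody_to_states(notes):
    """Convert a list of (midi_pitch, duration) pairs into transposition-
    invariant states: a dummy 'newvoice' state, a 'firstnote' state with
    duration 1, and then (semitone interval, duration ratio) pairs relative
    to the previous note."""
    states = [("newvoice", 0)]
    prev_pitch, prev_dur = None, None
    for pitch, dur in notes:
        if prev_pitch is None:
            states.append(("firstnote", 1))
        else:
            interval = pitch - prev_pitch   # interval change in semitones
            ratio = dur / prev_dur          # rhythm as a scalar of the previous value
            states.append((interval, ratio))
        prev_pitch, prev_dur = pitch, dur
    return states

def transition_counts(states):
    """Sparse first-order transition frequency table m[(s, s')] -> count."""
    return Counter(zip(states[:-1], states[1:]))

# Toy melody: seven notes, durations in sixteenths (illustrative values only).
melody = [(70, 3), (71, 1), (73, 11), (70, 1), (68, 3), (70, 1), (66, 8)]
states = melody_to_states(melody)
m = transition_counts(states)
```

Because only interval changes and duration ratios are stored, shifting every pitch in the melody by the same offset yields exactly the same state sequence and transition table.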
The associated text is modeled using a standard sparse term frequency vector $t_i$, where $t_{iw}$ denotes the number of times word $w$ appears in document $i$. For notational simplicity, we group the music and text variables as follows: $x_i = (m_i, t_i)$. In essence, this Markovian approach is akin to a text bigram model, save that the states are transitions between musical notes and rests rather than words.

Our multi-modal mixture model is as follows:

$$p(x_i \mid \theta) = \sum_{c=1}^{n_c} p(c) \prod_{s=1}^{n_s} p(y_{i,1} = s \mid c)^{I_s(y_{i,1})} \prod_{s=1}^{n_s} \prod_{s'=1}^{n_s} p(s' \mid s, c)^{m_{i,ss'}} \prod_{w=1}^{n_w} p(w \mid c)^{t_{iw}} \qquad (1)$$

where $\theta$ encompasses all the model parameters and $I_s(y_{i,1}) = 1$ if the first entry of $y_i$ belongs to state $s$, and is $0$ otherwise. The three-dimensional matrix $p(s' \mid s, c)$ denotes the estimated probability of transitioning from state $s$ to state $s'$ in cluster $c$, the matrix $p(y_{i,1} = s \mid c)$ denotes the initial probabilities of being in state $s$, given membership in cluster $c$. The vector $p(c)$ denotes the probability of each cluster. The matrix $p(w \mid c)$ denotes the probability of the word $w$ in cluster $c$. The mixture model is defined on the standard probability simplex $\{p(c) \ge 0 \text{ for all } c \text{ and } \sum_{c=1}^{n_c} p(c) = 1\}$.

We introduce the latent allocation variables $z_i \in \{1, \ldots, n_c\}$ to indicate that a particular sequence $x_i$ belongs to a specific cluster $c$. These indicator variables $\{z_i;\, i = 1, \ldots, n\}$ correspond to an i.i.d. sample from the distribution $p(z_i = c) = p(c)$.

This simple model is easy to extend. For browsing applications, we might prefer a hierarchical structure with levels $l$:

$$p(x_i \mid \theta) = \sum_{c=1}^{n_c} p(c) \sum_{l=1}^{n_l} p(l \mid c)\, p(x_i \mid c, l) \qquad (2)$$

This is still a multinomial model, but by applying appropriate parameter constraints we can produce a tree-like browsing structure [6]. It is also easy to formulate the model in terms of aspects and clusters as suggested in [7, 8].

2.1 Prior specification

We follow a hierarchical Bayesian strategy, where the unknown parameters $\theta$ and the allocation variables $z$ are regarded as being drawn from appropriate prior distributions. We acknowledge our uncertainty about the exact form of the prior by specifying it in terms of some unknown parameters (hyperparameters). The allocation variables $z_i$ are assumed to be drawn from a multinomial distribution, $z_i \sim \mathcal{M}(1, p(c))$. We place a conjugate Dirichlet prior $\mathcal{D}(\alpha)$ on the mixing coefficients $p(c)$. Similarly, we place Dirichlet prior distributions $\mathcal{D}(\beta)$ on each $p(s' \mid s, c)$, $\mathcal{D}(\gamma)$ on each $p(y_{i,1} = s \mid c)$, and $\mathcal{D}(\delta)$ on each $p(w \mid c)$, and assume that these priors are independent.

The posterior for the allocation variables will be required. It can be obtained easily using Bayes' rule:

$$p(z_i = c \mid x_i, \theta) = \frac{p(c)\, p(y_{i,1} \mid c) \prod_{s,s'} p(s' \mid s, c)^{m_{i,ss'}} \prod_{w} p(w \mid c)^{t_{iw}}}{\sum_{c'=1}^{n_c} p(c')\, p(y_{i,1} \mid c') \prod_{s,s'} p(s' \mid s, c')^{m_{i,ss'}} \prod_{w} p(w \mid c')^{t_{iw}}} \qquad (3)$$

3 Computation

The parameters of the mixture model cannot be computed analytically unless one knows the mixture indicator variables. We have to resort to numerical methods. One can implement a Gibbs sampler to compute the parameters and allocation variables. This is done by sampling the parameters from their Dirichlet posteriors and the allocation variables from their multinomial posterior. 
However, this algorithm is too computationally intensive for the applications we have in mind. 
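Whichever estimator is used, the cluster posterior of equation (3) must be evaluated for every document. A minimal log-space sketch follows; the parameter and document layouts are hypothetical, not the authors' code:

```python
import math

def cluster_posterior(doc, params):
    """p(z = c | x, theta) for one document (cf. eq. 3), in log space.

    doc    : dict with 'first_state', 'transitions' {(s, s2): count},
             and 'words' {w: count}                     (hypothetical layout)
    params : dict with 'p_c' [K], 'p_init' [K][s],
             'p_trans' [K][(s, s2)], 'p_word' [K][w]    (hypothetical layout)
    """
    K = len(params["p_c"])
    log_post = []
    for c in range(K):
        lp = math.log(params["p_c"][c])
        lp += math.log(params["p_init"][c][doc["first_state"]])
        for (s, s2), count in doc["transitions"].items():
            lp += count * math.log(params["p_trans"][c][(s, s2)])
        for w, count in doc["words"].items():
            lp += count * math.log(params["p_word"][c][w])
        log_post.append(lp)
    # normalize with log-sum-exp for numerical stability
    mx = max(log_post)
    z = mx + math.log(sum(math.exp(l - mx) for l in log_post))
    return [math.exp(l - z) for l in log_post]

# Toy two-cluster example: two Markov states, a one-word vocabulary.
params = {
    "p_c": [0.5, 0.5],
    "p_init": [{0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}],
    "p_trans": [{(0, 0): 0.8, (0, 1): 0.2}, {(0, 0): 0.2, (0, 1): 0.8}],
    "p_word": [{"a": 1.0}, {"a": 1.0}],
}
doc = {"first_state": 0, "transitions": {(0, 0): 2}, "words": {"a": 3}}
posterior = cluster_posterior(doc, params)
```

Working in log space matters here: the per-transition and per-word probabilities are raised to count powers, so the unnormalized terms underflow quickly for long documents.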
Instead we opt for expectation maximization (EM) algorithms to compute the maximum likelihood (ML) and maximum a posteriori (MAP) point estimates of the mixture model.

3.1 Maximum likelihood estimation with the EM algorithm

After initialization, the EM algorithm for ML estimation iterates between the following two steps:

1. E step: Compute the expectation of the complete log-likelihood with respect to the distribution of the allocation variables, $Q_{ML} = \mathbb{E}[\log p(x, z \mid \theta) \mid x, \theta^{(old)}]$, where $\theta^{(old)}$ represents the value of the parameters at the previous time step.

2. M step: Maximize over the parameters: $\theta^{(new)} = \arg\max_{\theta} Q_{ML}$.

The $Q_{ML}$ function expands to

$$Q_{ML} = \sum_{i=1}^{n} \sum_{c=1}^{n_c} \xi_{ic} \Big[\log p(c) + \sum_{s=1}^{n_s} I_s(y_{i,1}) \log p(y_{i,1} = s \mid c) + \sum_{s=1}^{n_s} \sum_{s'=1}^{n_s} m_{i,ss'} \log p(s' \mid s, c) + \sum_{w=1}^{n_w} t_{iw} \log p(w \mid c)\Big] \qquad (4)$$

where $\xi_{ic} \triangleq p(z_i = c \mid x_i, \theta^{(old)})$. In the E step, we have to compute $\xi_{ic}$ using equation (3). The corresponding M step requires that we maximize $Q_{ML}$ subject to the constraints that all probabilities for the parameters sum up to 1. This constrained maximization can be carried out by introducing Lagrange multipliers. The resulting parameter estimates are:

$$p(c) = \frac{1}{n} \sum_{i=1}^{n} \xi_{ic} \qquad (5)$$

$$p(y_1 = s \mid c) = \frac{\sum_{i=1}^{n} \xi_{ic}\, I_s(y_{i,1})}{\sum_{i=1}^{n} \xi_{ic}} \qquad (6)$$

$$p(s' \mid s, c) = \frac{\sum_{i=1}^{n} \xi_{ic}\, m_{i,ss'}}{\sum_{i=1}^{n} \xi_{ic} \sum_{s''=1}^{n_s} m_{i,ss''}} \qquad (7)$$

$$p(w \mid c) = \frac{\sum_{i=1}^{n} \xi_{ic}\, t_{iw}}{\sum_{i=1}^{n} \xi_{ic} \sum_{w'=1}^{n_w} t_{iw'}} \qquad (8)$$

3.2 Maximum a posteriori estimation with the EM algorithm

The EM formulation for MAP estimation is straightforward. One simply has to augment the objective function in the M step, $Q_{ML}$, by adding to it the log prior densities. That is, the MAP objective function is

$$Q_{MAP} = Q_{ML} + \log p(\theta)$$

The MAP parameter estimates are:

$$p(c) = \frac{\alpha_c - 1 + \sum_{i=1}^{n} \xi_{ic}}{\sum_{c'=1}^{n_c} (\alpha_{c'} - 1) + n} \qquad (9)$$

$$p(y_1 = s \mid c) = \frac{\gamma_s - 1 + \sum_{i=1}^{n} \xi_{ic}\, I_s(y_{i,1})}{\sum_{s'=1}^{n_s} (\gamma_{s'} - 1) + \sum_{i=1}^{n} \xi_{ic}} \qquad (10)$$

$$p(s' \mid s, c) = \frac{\beta_{ss'} - 1 + \sum_{i=1}^{n} \xi_{ic}\, m_{i,ss'}}{\sum_{s''=1}^{n_s} (\beta_{ss''} - 1) + \sum_{i=1}^{n} \xi_{ic} \sum_{s''=1}^{n_s} m_{i,ss''}} \qquad (11)$$

$$p(w \mid c) = \frac{\delta_w - 1 + \sum_{i=1}^{n} \xi_{ic}\, t_{iw}}{\sum_{w'=1}^{n_w} (\delta_{w'} - 1) + \sum_{i=1}^{n} \xi_{ic} \sum_{w'=1}^{n_w} t_{iw'}} \qquad (12)$$

where $\alpha$, $\gamma$, $\beta$ and $\delta$ are the hyper-parameters of the Dirichlet priors on $p(c)$, $p(y_1 = s \mid c)$, $p(s' \mid s, c)$ and $p(w \mid c)$ respectively. 
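To make these updates concrete, here is a sketch of the M-step update for p(w|c) with Dirichlet pseudo-counts in the style of equation (12); setting the hyper-parameter to 1 recovers the ML update of equation (8). The data layouts and names are hypothetical, not the authors' implementation.

```python
def m_step_word_probs(resp, term_counts, vocab, delta=2.0):
    """MAP M-step for p(w|c) (cf. eq. 12).

    resp[i][c]      : E-step posteriors xi_ic for document i
    term_counts[i]  : {word: count} for document i
    vocab           : set of all words
    delta           : Dirichlet hyper-parameter; delta = 1 gives the ML
                      update (eq. 8), delta > 1 smooths toward uniform.
    """
    K = len(resp[0])
    p_word = []
    for c in range(K):
        # Dirichlet pseudo-counts (delta - 1) plus responsibility-weighted counts
        num = {w: delta - 1.0 for w in vocab}
        for i, counts in enumerate(term_counts):
            for w, t in counts.items():
                num[w] += resp[i][c] * t
        z = sum(num.values())
        p_word.append({w: v / z for w, v in num.items()})
    return p_word

# Two documents hard-assigned to two clusters, one word each.
resp = [[1.0, 0.0], [0.0, 1.0]]
term_counts = [{"x": 2}, {"y": 2}]
p_map = m_step_word_probs(resp, term_counts, vocab={"x", "y"}, delta=2.0)
p_ml = m_step_word_probs(resp, term_counts, vocab={"x", "y"}, delta=1.0)
```

The toy example shows the regularising effect described in the experiments: the ML update assigns probability 1 to the observed word, while the MAP update keeps unseen words at non-zero probability.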
These expressions can also be derived by considering the posterior modes and by replacing the cluster indicator variable with its posterior estimate $p(z_i = c \mid x_i, \theta)$. This observation opens up room for various stochastic and deterministic ways of improving EM.

4 Experiments

To test the model with text and music, we clustered a database of musical scores with associated text documents. The database is composed of various types of musical scores – jazz, classical, television theme songs, and contemporary pop music – as well as associated text files. The scores are represented in GUIDO notation. The associated text files are a song's lyrics, where applicable, or textual commentary on the score for instrumental pieces, all of which were extracted from the World Wide Web.

The experimental database contains 100 scores, each with a single associated text document. There is nothing in the model, however, that requires this one-to-one association of text documents and scores – this was done solely for testing simplicity and efficiency. In a deployment such as the World Wide Web, one would routinely expect one-to-many or many-to-many mappings between the scores and text.

We carried out ML and MAP estimation with EM. The Dirichlet hyper-parameters were set to fixed values. The MAP approach resulted in sparser (regularised), more coherent clusters.

CLUSTER  SONG                                       p(c|x)
2        Moby – Porcelain                           1
2        Nine Inch Nails – Terrible Lie             1
2        other – 'Addams Family' theme              1
...
4        J. S. Bach – Invention #1                  1
4        J. S. Bach – Invention #8                  1
4        J. S. Bach – Invention #15                 1
4        The Beatles – Yellow Submarine             0.9975
...
6        other – 'Wheel of Fortune' theme           1
...
7        The Beatles – Taxman                       1
7        The Beatles – Got to Get You Into My Life  0.7247
7        The Cure – Saturday Night                  1
...
9        R.E.M. – Man on the Moon                   1
9        Soft Cell – Tainted Love                   1
9        The Beatles – Got to Get You Into My Life  0.2753

Figure 2: Representative probabilistic cluster allocations using MAP estimation.

Figure 2 shows some representative cluster probability assignments obtained with MAP estimation.

By and large, the MAP clusters are intuitive. The 15 pieces by J. S. Bach each have very high (near 1) probabilities of membership in the same cluster. A few curious anomalies exist. The Beatles' song Yellow Submarine is included in the same cluster as the Bach pieces, though all the other Beatles songs in the database are assigned to other clusters.

4.1 Demonstrating the utility of multi-modal queries

A major intended use of the text-score model is for searching documents on a combination of text and music.

Consider a hypothetical example, using our database: a music fan is struggling to recall a dimly-remembered song with a strong repeating single-pitch, dotted-eighth-note/sixteenth-note bass line, and lyrics containing the words come on, come on, get down. A search on the text portion alone turns up four documents which contain the lyrics. A search on the notes alone returns seven documents which have matching transitions. But a combined search returns only the correct document (Figure 3).

QUERY                                       RETRIEVED SONGS
come on, come on, get down (text only)      Erskine Hawkins – Tuxedo Junction
                                            Moby – Bodyrock
                                            Nine Inch Nails – Last
                                            Sherwood Schwartz – 'The Brady Bunch' theme song

[notated bass-line query] (notes only)      The Beatles – Got to Get You Into My Life
                                            The Beatles – I'm Only Sleeping
                                            The Beatles – Yellow Submarine
                                            Moby – Bodyrock
                                            Moby – Porcelain
                                            Gary Portnoy – 'Cheers' theme song
                                            Rodgers & Hart – Blue Moon

come on, come on, get down (text & notes)   Moby – Bodyrock

Figure 3: Examples of query matches, using only text, only musical notes, and both text and music. The combined query is more precise.

4.2 Precision and recall

To perform a query, we simply sample probabilistically without replacement from the clusters. The probability of sampling from each cluster, $p(z = c \mid x_{query}, \theta)$, is computed using equation (3). If a cluster contains no items or later becomes empty, it is assigned a sampling probability of zero, and the probabilities of the remaining clusters are re-normalized.

We evaluated our retrieval system with randomly generated queries. A query is composed of a random series of 1 to 5 note transitions and 1 to 5 words. We then determine the actual number of matches in the database, where a match is defined as a song in which all elements of the query's note transitions and words have a frequency of 1 or greater. In order to avoid skewing the results unduly, we reject any query with no matches in the database.

In each iteration $r$, a cluster is selected, and the matching criteria are applied against each piece of music that has been assigned to that cluster until a match is found. If no match is found, an arbitrary piece is selected. The selected piece is returned as the rank-$r$ result. Once all the matches have been returned, we compute the standard precision-recall curve [9], as shown in Figure 4. Our querying method enjoys high precision until recall is large, and experiences a relatively modest deterioration of precision thereafter.

Figure 4: Precision-recall curve showing average results, over 1000 randomly-generated queries, combining music and text matching criteria.

By choosing clusters before matching, we overcome the polysemy problem. For example, river banks and money banks appear in separate clusters. 
We also deal with synonymy, since automobiles and cars have high probability of belonging to the same clusters.

4.3 Association

The probabilistic nature of our approach allows us the flexibility to use our techniques and database for tasks beyond traditional querying. One of the more promising avenues of exploration is associating documents with each other probabilistically. This could be used, for example, to find suitable songs for web sites or presentations (matching on text), or for recommending songs similar to one a user enjoys (matching on scores).

Given an input document $x$, we first cluster $x$ by finding the most likely cluster, $\arg\max_c p(z = c \mid x, \theta)$ (equation (3)). Input documents containing text or music only can be clustered using only those components of the database. Input documents that combine text and music are clustered using all the data. We can then find the closest association by computing the distance from the input document to the other document vectors in the cluster using a similarity metric such as Euclidean distance, or cosine measures after carrying out latent semantic indexing [10]. A few selected examples of associations we found are shown in Figure 5. The results are often reasonable, though unexpected behavior occasionally occurs.

INPUT                                               CLOSEST MATCH
J. S. Bach – Toccata and Fugue in D Minor (score)   J. S. Bach – Invention #5
Nine Inch Nails – Closer (score & lyrics)           Nine Inch Nails – I Do Not Want This
T. S. Eliot – The Waste Land (text poem)            The Cure – One Hundred Years

Figure 5: The results of associating songs in the database with other text and/or musical input. The input is clustered probabilistically and then associated with the existing song that has the least Euclidean distance in that cluster. The association of The Waste Land with The Cure's thematically similar One Hundred Years is likely due to the high co-occurrence of relatively uncommon words such as water, death, and year(s).

5 Conclusions

We feel that the probabilistic approach to querying on music and text presented here is powerful, flexible, and novel, and suggests many interesting areas of future research. In the future, we should be able to incorporate audio by extracting suitable features from the signals. This will permit querying by singing, humming, or via recorded music. There are a number of ways of combining our method with images [6, 4], opening up room for novel applications in multimedia [11].

Acknowledgments

We would like to thank Kobus Barnard, J. Stephen Downie, Holger Hoos and Peter Carbonetto for their advice and expertise in preparing this paper.

References

[1] D Huron and B Aarden. Cognitive issues and approaches in music information retrieval. In S Downie and D Byrd, editors, Music Information Retrieval. 2002.

[2] J Pickens. A comparison of language modeling and probabilistic text information retrieval approaches to monophonic music retrieval. In International Symposium on Music Information Retrieval, 2000.

[3] J S Downie. Evaluating a Simple Approach to Music Information Retrieval: Conceiving Melodic N-Grams as Text. PhD thesis, University of Western Ontario, 1999.

[4] E Brochu, N de Freitas, and K Bao. The sound of an album cover: Probabilistic multimedia and IR. In C M Bishop and B J Frey, editors, Ninth International Workshop on Artificial Intelligence and Statistics, Key West, Florida, 2003. To appear.

[5] H H Hoos, K A Hamel, K Renz, and J Kilian. Representing score-level music using the GUIDO music-notation format. Computing in Musicology, 12, 2001.

[6] K Barnard and D Forsyth. Learning the semantics of words and pictures. In International Conference on Computer Vision, volume 2, pages 408–415, 2001.

[7] T Hofmann. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence, 1999.

[8] D M Blei, A Y Ng, and M I Jordan. Latent Dirichlet allocation. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

[9] R Baeza-Yates and B Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.

[10] S Deerwester, S T Dumais, G W Furnas, T K Landauer, and R Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[11] P Duygulu, K Barnard, N de Freitas, and D Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, 2002.
", "award": [], "sourceid": 2262, "authors": [{"given_name": "Eric", "family_name": "Brochu", "institution": null}, {"given_name": "Nando", "family_name": "de Freitas", "institution": null}]}