{"title": "Learning a Gaussian Process Prior for Automatically Generating Music Playlists", "book": "Advances in Neural Information Processing Systems", "page_first": 1425, "page_last": 1432, "abstract": "", "full_text": "Learning a Gaussian Process Prior
for Automatically Generating Music Playlists

John C. Platt, Christopher J. C. Burges, Steven Swenson, Christopher Weare
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052

Alice Zheng*

{jplatt,cburges,sswenson,chriswea}@microsoft.com, alicez@cs.berkeley.edu

Abstract

This paper presents AutoDJ: a system for automatically generating music playlists based on one or more seed songs selected by a user. AutoDJ uses Gaussian Process Regression to learn a user preference function over songs. This function takes music metadata as inputs. This paper further introduces Kernel Meta-Training, which is a method of learning a Gaussian Process kernel from a distribution of functions that generates the learned function. For playlist generation, AutoDJ learns a kernel from a large set of albums. This learned kernel is shown to be more effective at predicting users' playlists than a reasonable hand-designed kernel.

1 Introduction

Digital music is becoming very widespread, as personal collections of music grow to thousands of songs. One typical way for a user to interact with a personal music collection is to specify a playlist, an ordered list of music to be played. Using existing digital music software, a user can manually construct a playlist by individually choosing each song. Alternatively, playlists can be generated by the user specifying a set of rules about songs (e.g., genre = rock), and the system randomly choosing songs that match those rules.
Constructing a playlist is a tedious process: it takes time to generate a playlist that matches a particular mood. 
It is also difficult to construct a playlist in advance, as a user may not anticipate all possible music moods and preferences he or she will have in the future.
AutoDJ is a system for automatically generating playlists at the time that a user wants to listen to music. The playlist plays with minimal user intervention: the user hears music that is suitable for his or her current mood, preferences and situation.
AutoDJ has a simple and intuitive user interface. The user selects one or more seed songs for AutoDJ to play. AutoDJ then generates a playlist with songs that are similar to the seed songs. The user may also review the playlist and add or remove certain songs, if they don't fit. Based on this modification, AutoDJ then generates a new playlist.
AutoDJ uses a machine learning system that finds a current user preference function f(x) over a feature space of music. Every time a user selects a seed song or removes a song from the playlist, a training example is generated. In general, a user can give an arbitrary preference value to any song. By default, we assume that selected songs have target values of 1, while removed songs have target values of 0. Given a training set, a full user preference function f(x) is inferred by regression. The function f is evaluated for each song owned by the user, and the songs with the highest f are placed into the playlist.

* Current address: Department of Electrical Engineering and Computer Science, University of California at Berkeley

The machine learning problem defined above is difficult to solve well. The training set often contains only one training example: a single seed song that the user wishes to listen to. Most often, AutoDJ must infer an entire function from 1–3 training points. 
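Inferring a preference function from so few points can be sketched with standard GPR machinery. The following is a minimal illustration using a made-up RBF kernel on a toy 1-D feature space (the paper instead uses a learned kernel over categorical metadata, described later):

```python
import numpy as np

def gpr_posterior_mean(K, X_train, t, X_eval, noise_var=0.1):
    """Posterior mean of a zero-mean GP: f(x) = k(x)^T (K_nn + sigma^2 I)^{-1} t."""
    K_nn = K(X_train, X_train) + noise_var * np.eye(len(X_train))
    w = np.linalg.solve(K_nn, t)      # blend weights for the training songs
    return K(X_eval, X_train) @ w     # evaluate preference on candidate songs

# Hypothetical stand-in kernel for illustration only.
def rbf(A, B):
    A, B = np.atleast_2d(A).T, np.atleast_2d(B).T
    return np.exp(-0.5 * (A - B.T) ** 2)

seeds = np.array([1.0, 1.2])          # two "seed songs" in a toy feature space
targets = np.array([1.0, 1.0])        # selected songs get target preference 1
candidates = np.array([1.1, 3.0])
prefs = gpr_posterior_mean(rbf, seeds, targets, candidates)
# A candidate near the seeds scores higher than a distant one.
```

Even with two training points, the posterior mean ranks candidates sensibly, which is the property AutoDJ relies on.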
An appropriate machine learning method for such small training sets is Gaussian Process Regression (GPR) [14], which has been shown empirically to work well on small data sets. Technical details of how to apply GPR to playlist generation are given in section 2. In broad detail, GPR starts with a similarity or kernel function K(x_i, x_j) between any two songs. We define the input space x to be descriptive metadata about the song. Given a training set of user preferences, a user preference function is generated by forming a linear blend of these kernel functions, whose weights are solved via a linear system. This user preference function is then used to evaluate all of the songs in the user's collection.
This paper introduces a new method of generating a kernel for use in GPR. We call this method Kernel Meta-Training (KMT). Technical details of KMT are described in section 3. KMT improves GPR by adding an additional phase of learning: meta-training. During meta-training, a kernel is learned before any training examples are available. The kernel is learned from a set of samples from meta-training functions. These meta-training functions are drawn from the same function distribution that will eventually generate the training function. In order to generalize the kernel beyond the meta-training data set, we fit a parameterized kernel to the meta-training data, with many fewer parameters than data points. The kernel is parameterized as a non-negative combination of base Mercer kernels. These kernel parameters are tuned to fit the samples across the meta-training functions. This constrained fit leads to a simple quadratic program. After meta-training, the kernel is ready to use in standard GPR.
To use KMT to generate playlists, we meta-train a kernel on a large number of albums. The learned kernel thus reflects the similarity of songs on professionally designed albums. The learned kernel is hardwired into AutoDJ. 
GPR is then performed using the learned kernel every time a user selects or removes songs from a playlist. The learned kernel forms a good prior, which enables AutoDJ to learn a user preference function with a very small number of user training examples.

1.1 Previous Work

There are several commercial Web sites for playing or recommending music based on one seed song. The algorithms behind these sites are still unpublished.
This work is related to Collaborative Filtering (CF) [9] and to building user profiles in textual information retrieval [11]. However, CF does not use metadata associated with a media object, hence CF will not generalize to new music that has few or no user votes. Also, no work has been published on building user profiles for music. The ideas in this work may also be applicable to text retrieval.
Previous work in GPR [14] learned kernel parameters through Bayesian methods from just the training set, not from meta-training data. When AutoDJ generates playlists, the user may select only one training example. No useful similarity metric can be derived from one training example, so AutoDJ uses meta-training to learn the kernel.
The idea of meta-training comes from the "learning to learn" or multi-task learning literature [2, 5, 10, 13]. This paper is most similar to Minka & Picard [10], who also suggested fitting a mean and covariance for a Gaussian Process based on related functions. However, in [10], in order to generalize the covariance beyond the meta-training points, a Multi-Layer Perceptron (MLP) is used to learn multiple tasks, which requires non-convex optimization. The Gaussian Process is then extracted from the MLP. 
In this work, using a quadratic program, we fit a parameterized Mercer kernel directly to a meta-training kernel matrix in order to generalize the covariance.
Meta-training is also related to algorithms that learn from both labeled and unlabeled data [3, 6]. However, meta-training has access to more data than simply unlabeled data: it has access to the values of the meta-training functions. Therefore, meta-training may perform better than these other algorithms.

2 Gaussian Process Regression for Playlist Generation

AutoDJ uses GPR to generate a playlist every time a user selects one or more songs. GPR uses a Gaussian Process (GP) as a prior over functions. A GP is a stochastic process f(x) over a multi-dimensional input space x: for any N, if N vectors x_i are chosen in the input space, and the N corresponding samples f(x_i) are drawn from the GP, then the f(x_i) are jointly Gaussian.
There are two statistics that fully describe a GP: the mean \mu(x) and the covariance K(x_i, x_j). In this paper, we assume that the GP over user preference functions is zero mean. That is, at any particular time, the user does not want to listen to most of the songs in the world, which leads to a mean preference close enough to zero to approximate as zero. Therefore, the covariance kernel simply turns into a correlation over a distribution of functions f: K(x_i, x_j) = E_f[f(x_i) f(x_j)].
In section 3, we learn a kernel K(x_i, x_j) which takes music metadata as inputs. In this paper, whenever we refer to a music metadata vector, we mean a vector consisting of 7 categorical variables: genre, subgenre, style, mood, rhythm type, rhythm description, and vocal code. This music metadata vector is assigned by editors to every track of a large corpus of music CDs. Sample values of these variables are shown in Table 1. 
Our kernel function K(x_i, x_j) thus computes the similarity between two metadata vectors corresponding to two songs. The kernel only depends on whether the same slot in the two vectors is the same or different. Specific details about the kernel function are described in section 3.2.

Metadata Field       | Example Values                              | Number of Values
Genre                | Jazz, Reggae, Hip-Hop                       | 30
Subgenre             | Heavy Metal, I'm So Sad and Spaced Out      | 572
Style                | East Coast Rap, Gangsta Rap, West Coast Rap | 890
Mood                 | Dreamy, Fun, Angry                          | 21
Rhythm Type          | Straight, Swing, Disco                      | 10
Rhythm Description   | Frenetic, Funky, Lazy                       | 13
Vocal Code           | Instrumental, Male, Female, Duet            | 6

Table 1: Music metadata fields, with some example values

Once we have defined a kernel, it is simple to perform GPR. Let x_i be the metadata vectors for the n songs for which the user has expressed a preference by selecting or removing them from the playlist. Let t_i be the expressed user preference. In general, t_i can be any real value. If the user does not express a real-valued preference, t_i is assumed 1 if the user wants to listen to the song and 0 if the user does not. Even if the values t_i are binary, we do not use Gaussian Process Classification (GPC), in order to maintain generality and because GPC requires an iterative procedure to estimate the posterior [1].
Let y_i be the underlying true user preference for the ith song, of which t_i is a noisy measurement with Gaussian noise of variance \sigma^2. Also, let x be a metadata vector of any song that will be considered to be on a playlist: f(x) is the (unknown) user preference for that song.
Before seeing the preferences t_i, the vector (f(x), y_1, \ldots, y_n) forms a joint prior Gaussian derived from the GP. After incorporating the t_i information, the posterior mean of f(x) is

    f(x) = \sum_{i=1}^{n} w_i K(x_i, x),        (1)

where the weights w_i are given by

    w_i = \sum_{j=1}^{n} [(K + \sigma^2 I)^{-1}]_{ij} t_j.        (2)

Thus, the user preference function for a song s is a linear blend of kernels K(x_i, x_s) that compare the metadata vector for song s with the metadata vectors x_i for the songs on which the user expressed a preference. The weights w_i are computed by inverting an n x n matrix. Since the number of user preferences n tends to be small, inverting this matrix is very fast.
Since the kernel is learned before GPR, and the vector t is supplied by the user, the only free hyperparameter is the noise value \sigma^2. This hyperparameter is selected via maximum likelihood on the training set. The formula for the log likelihood of the training data given \sigma^2 is

    L = -\frac{1}{2} t^T (K + \sigma^2 I)^{-1} t - \frac{1}{2} \log \det(K + \sigma^2 I) - \frac{n}{2} \log 2\pi.        (3)

Every time a playlist is generated, different values of \sigma^2 are evaluated and the \sigma^2 that generates the highest log likelihood is used.
In order to generate the playlist, the matrix (K + \sigma^2 I)^{-1} is computed, and the user preference function f(x) is computed for every song that the user owns. The songs are then ranked in descending order of f(x). The playlist consists of the top songs in the ranked list. The playlist can cut off after a fixed number of songs, e.g., 30. It can also cut off if the value of f(x) gets too low, so that the playlist only contains songs that the user will enjoy.
The order of the playlist is the order of the songs in the ranked list. This is empirically effective: the playlist typically starts with the selected seed songs, proceeds to songs very similar to the seed songs, and then gradually drifts away from the seed songs towards the end of the list, when the user is paying less attention. We explored neural networks and SVMs for determining the order of the playlist, but have not found a clearly more effective ordering algorithm than simply the order of f(x). Here, "effective" is defined as generating playlists that are pleasing to the authors.

3 Kernel Meta-Training (KMT)

This section describes Kernel Meta-Training (KMT), which creates the GP kernel K(x_i, x_j) used in the previous section. As described in the introduction, KMT operates on samples drawn from a set of M meta-training functions f_m. This set of functions should be related to the final trained function, since we derive a similarity kernel from the meta-training set of functions. In other words, we learn a Gaussian prior over the space of functions by computing Gaussian statistics on a set of functions related to a function we wish to learn.
We express the kernel K as a covariance components model [12]:

    K(x_i, x_j) = \sum_{n=1}^{N} \alpha_n \Phi_n(x_i, x_j),        (4)

where the \Phi_n are pre-defined Mercer kernels and \alpha_n \geq 0. We then fit the \alpha_n to the samples drawn from the meta-training functions. 
We use the simpler model instead of an empirical covariance matrix, in order to generalize the GPR beyond points that are in the meta-training set. The functional form of the kernels \Phi_n can be chosen via cross-validation. In our application, both the form of the \Phi_n and the number of components N are determined by the available input data (see section 3.2, below).
One possible method to fit the \alpha_n is to maximize the likelihood in (3) over all samples drawn from all meta-training functions [7]. However, solving for the optimal \alpha_n requires an iterative algorithm whose inner loop requires a Cholesky decomposition of a matrix whose dimension is the number of meta-training samples. 
For our application, this matrix would have dimension 174,577, which makes maximizing the likelihood impractical.
Instead of maximizing the likelihood, we fit a covariance components model to an empirical covariance computed on the meta-training data set, using a least-squares distance function:

    E = \sum_{i,j} \left[ C_{ij} - \sum_{n=1}^{N} \alpha_n \Phi_n(x_i, x_j) \right]^2,        (5)

where i and j index all of the samples in the meta-training data set, and where C_{ij} is the empirical covariance

    C_{ij} = \frac{1}{M} \sum_{m=1}^{M} f_m(x_i) f_m(x_j).        (6)

In order to ensure that the final kernel in (4) is Mercer, we apply \alpha_n \geq 0 as a constraint in the optimization. Solving (5) subject to non-negativity constraints results in a fast quadratic program of size N. Such a quadratic program can be solved quickly and robustly by standard optimization packages.
The cost function in equation (5) is the square of the Frobenius norm of the difference between the empirical matrix C and the fit kernel K. The use of the Frobenius norm is similar to the Ordinary Least Squares technique of fitting variogram parameters in geostatistics [7]. However, instead of summing variogram estimates within spatial bins, we form covariance estimates over all meta-training data pairs (i, j). Analogous to [8], we can prove that the Frobenius norm is consistent: as the amount of training data goes to infinity, the empirical Frobenius norm, above, approaches the Frobenius norm of the difference between the true kernel and our fit kernel. (The proof is omitted to save space.) 
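The constrained least-squares fit of the \alpha_n can be sketched as a non-negative least-squares problem over the vectorized kernel matrices. The sketch below uses tiny made-up base matrices and a simple projected-gradient solver as a stand-in for the quadratic-programming package used in the paper:

```python
import numpy as np

def fit_kernel_weights(C, Phis, iters=5000):
    """Fit alpha >= 0 minimizing ||C - sum_n alpha_n Phi_n||_F^2 by projected
    gradient (a stand-in for a full QP solver)."""
    A = np.column_stack([P.ravel() for P in Phis])   # one column per base kernel
    b = C.ravel()
    G, h = A.T @ A, A.T @ b                          # quadratic and linear parts
    lr = 1.0 / np.linalg.norm(G, 2)                  # step size below 1/Lipschitz
    alpha = np.zeros(len(Phis))
    for _ in range(iters):
        alpha = np.maximum(0.0, alpha - lr * (G @ alpha - h))  # project onto alpha >= 0
    return alpha

rng = np.random.default_rng(0)
B1 = rng.standard_normal((5, 3)); Phi1 = B1 @ B1.T   # made-up PSD base kernel matrices
B2 = rng.standard_normal((5, 3)); Phi2 = B2 @ B2.T
C = 0.7 * Phi1 + 0.2 * Phi2                          # pretend empirical covariance
alpha = fit_kernel_weights(C, [Phi1, Phi2])
# alpha recovers approximately [0.7, 0.2]; the fit kernel stays Mercer.
```

Because the \alpha_n are kept non-negative, the fitted combination of Mercer kernels is itself Mercer, exactly as the constraint in (5) requires.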
Finally, unlike the cost function presented in [8], the cost function in equation (5) produces an easy-to-solve quadratic program.

3.1 KMT for Music Playlist Generation

In this section, we consider the application of the general KMT technique to music playlist generation.
We decided to use albums to generate a prior for playlist generation, since albums can be considered to be professionally designed playlists. For the meta-training functions f_m, we use album indicator functions that are 1 for songs on an album m, and 0 otherwise. Thus, KMT learns a similarity metric that professionals use when they assemble albums. This same similarity metric empirically makes consonant playlists. Using a small N in equation (4) forces a smoother, more general similarity metric. If we had simply used the meta-training kernel matrix C_{ij} without fitting, the playlist generator would exactly reproduce one or more albums in the meta-training database. This is the meta-training equivalent of overfitting.
Because the album indicator functions are uniquely defined for songs, not for metadata vectors, we cannot simply generate a kernel matrix according to (6). Instead, we generate a meta-training kernel matrix using meta-training functions that depend on songs s:

    C_{ij} = \frac{1}{M} \sum_{m=1}^{M} f_m(s_i) f_m(s_j),        (7)

where f_m(s) is 1 if song s belongs to album m, 0 otherwise. We then fit the \alpha_n according to (5), where the \Phi_n Mercer kernels depend on the music metadata vectors that are defined in Table 1. The resulting kernel is still defined by (4), with a specific \Phi_n that will be defined in section 3.2, below.
We used 174,577 songs and 14,198 albums to make up the meta-training matrix C, which is of dimension 174,577 x 174,577. However, note that the C meta-training matrix is very sparse, since most songs only belong to 1 or 2 albums. Therefore, it can be stored as a sparse matrix. We use a quadratic programming package in Matlab that requires the constant and linear parts of the gradient of the cost function in (5):

    h_n = \sum_{i,j} C_{ij} \Phi_n(x_i, x_j),        (8)

    G_{nk} = \sum_{i,j} \Phi_n(x_i, x_j) \Phi_k(x_i, x_j),        (9)

so that \partial E / \partial \alpha_n = 2 \sum_k G_{nk} \alpha_k - 2 h_n. The first (constant) term h_n is only evaluated on those indices (i, j) in the set of non-zero C_{ij}. The second (linear) term G_{nk} requires a sum over all i and j, which is impractical. Instead, we estimate the second term by sampling a random subset of (i, j) pairs (100 random j for each i).

3.2 Kernels for Categorical Data

The kernel learned in section 3 must operate on categorical music metadata. 
Up until now, kernels have been defined to operate on continuous data. We could convert the categorical data to a vector space by allocating one dimension for every possible value of each categorical variable, using a 1-of-N sparse code. This would lead to a vector space of dimension 1542 (see Table 1) and would produce a large number of kernel parameters. Hence, we have designed a new kernel that operates directly on categorical data.
We define a family of Mercer kernels:

    \Phi_n(x, y) = 1 if x_d = y_d for every field d with b_d(n) = 1, and 0 otherwise,        (10)

where b(n) is defined to be the binary representation of the number n. The b(n) vector serves as a mask: when b_d(n) is 1, then the dth component of the two vectors must match in order for the output of the kernel to be 1. Due to space limitations, proof of the Mercer property of this kernel is omitted.
For playlist generation, the \Phi_n operate on the music metadata vectors that are defined in Table 1. These vectors have 7 fields, thus d runs from 1 to 7 and n runs from 1 to 128. Therefore, there are 128 free parameters \alpha_n in the kernel, which are fit according to (5). The sum of 128 terms in (4) can be expressed as a single look-up table, whose keys are 7-bit binary vectors, the dth bit corresponding to whether x_d = y_d. Thus, the evaluation of f(x) from equation (1) on thousands of pieces of music can be done in less than a second on a modern PC.

4 Experimental Results

We have tested the combination of GPR and KMT for the generation of playlists. We tested AutoDJ on 60 playlists manually designed by users in Microsoft Research. We compared the full GPR + KMT AutoDJ with simply using GPR with a pre-defined kernel, and with using a pre-defined kernel without GPR (using (1) with all w_i equal). 
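The mask-kernel family and its look-up-table evaluation can be sketched as follows. This is a hypothetical illustration: the field values and the two example weights are made up, and only the 7-bit masking scheme comes from the text:

```python
import numpy as np

FIELDS = 7  # genre, subgenre, style, mood, rhythm type, rhythm description, vocal code

def match_bits(x, y):
    """7-bit key: bit d is 1 iff the two songs agree on metadata field d."""
    return sum((1 << d) for d in range(FIELDS) if x[d] == y[d])

def build_lookup(alpha):
    """Collapse K(x,y) = sum_n alpha_n Phi_n(x,y) into a 128-entry table.
    Phi_n fires iff every field demanded by mask n matches, i.e. the bits
    of n are a subset of the match-bit key."""
    table = np.zeros(128)
    for key in range(128):
        table[key] = sum(a for n, a in enumerate(alpha) if (n & key) == n)
    return table

alpha = np.zeros(128)
alpha[0b0000001] = 0.5   # made-up weight: reward matching genre
alpha[0b0000011] = 0.3   # made-up weight: reward matching genre AND subgenre
table = build_lookup(alpha)

a = ("rock", "classic", "s1", "fun", "straight", "funky", "male")
b = ("rock", "classic", "s2", "sad", "swing", "lazy", "duet")
k = table[match_bits(a, b)]   # genre and subgenre match, so both masks fire: 0.5 + 0.3
```

Once the table is built, each kernel evaluation is a 7-comparison key computation plus one array lookup, which is why scoring thousands of songs is fast.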
We also compare to a playlist consisting of all of the user's songs permuted in a random order. As a baseline, we decided to use Hamming distance as the pre-defined kernel. That is, the similarity between two songs is the number of metadata fields that they have in common.
We performed tests using only positive training examples, which emulates users choosing seed songs. There were 9 experiments, each with a different number of seed songs, from 1 to 9. Let the number of seed songs for an experiment be p. Each experiment consisted of 1000 trials. Each trial chose a playlist at random (out of the playlists that consisted of at least p songs), then chose p songs at random out of the playlist as a training set. The test set of each trial consisted of all of the remaining songs in the playlist, plus all other songs owned by the designer of the playlist. This test set thus emulates the possible songs available to the playlist generator.
To score the produced playlists, we use a standard collaborative filtering metric, described in [4]. The score of the ranked list produced for trial j is defined to be

    R_j = \sum_{i} \frac{\delta_j(i)}{2^{(i-1)/(\beta-1)}},        (11)

where \delta_j(i) is the user preference of the ith element of the jth ranked list (1 if the ith element is on playlist j, 0 otherwise), and \beta is a "half-life" of user interest in the playlist (set here to be 10). This score is summed over all 1000 trials, and normalized:

    R = 100 \frac{\sum_j R_j}{\sum_j R_j^{max}},        (12)

where R_j^{max} is the score from (11) if that playlist were perfect (i.e., all of the true playlist songs were at the head of the list). Thus, an R score of 100 indicates perfect prediction.

                     Number of Seed Songs
Playlist Method      1    2    3    4    5    6    7    8    9
KMT + GPR           42.9 46.0 44.8 45.0 46.8 43.8 44.2 44.4 44.8
Hamming + GPR       32.7 39.2 39.8 40.0 41.3 39.6 39.5 38.4 39.8
Hamming + No GPR    32.7 39.0 39.6 41.4 42.6 40.2 41.5 41.7 43.2
Random Order         6.3  6.6  6.5  6.6  6.5  6.2  6.2  6.1  6.8

Table 2: Scores for different playlist methods. Boldface in the original indicates the best method at statistical significance level 0.05.

The results for the 9 different experiments are shown in Table 2. A boldface result shows the best method based on a pairwise Wilcoxon signed rank test with a significance level of 0.05 (and a Bonferroni correction for 6 tests).
There are several notable results in Table 2. First, all of the experimental systems perform much better than random, so they all capture some notion of playlist generation. This is probably due to the work that went into designing the metadata schema. Second, and most importantly, the kernel that came out of KMT is substantially better than the hand-designed kernel, especially when the number of positive examples is 1–3. This matches the hypothesis that KMT creates a good prior based on previous experience. This good prior helps when the training set is extremely small in size. Third, the performance of KMT + GPR saturates very quickly with the number of seed songs. 
This saturation is caused by the fact that exact playlists are hard to predict: there are many appropriate songs that would be valid in a test playlist, even if the user did not choose those songs. Thus, the quantitative results shown in Table 2 are actually quite conservative.

        Playlist 1                            Playlist 2
Seed    Eagles, The Sad Cafe                  Eagles, Life in the Fast Lane
1       Genesis, More Fool Me                 Eagles, Victim of Love
2       Bee Gees, Rest Your Love On Me        Rolling Stones, Ruby Tuesday
3       Chicago, If You Leave Me Now          Led Zeppelin, Communication Breakdown
4       Eagles, After The Thrill Is Gone      Creedence Clearwater, Sweet Hitch-hiker
5       Cat Stevens, Wild World               Beatles, Revolution

Table 3: Sample Playlists

To qualitatively test the playlist generator, we distributed a prototype version of it to a few individuals in Microsoft Research. The feedback from use of the prototype has been very positive. Qualitative results of the playlist generator are shown in Table 3. In that table, two different Eagles songs are selected as single seed songs, and the top 5 playlist songs are shown. The seed song is always first in the playlist and is not repeated. The seed song on the left is softer and leads to a softer playlist, while the seed song on the right is harder rock and leads to a more hard rock playlist.

5 Conclusions

We have presented an algorithm, Kernel Meta-Training, which derives a kernel from a set of meta-training functions that are related to the function that is being learned. KMT permits the learning of functions from very few training points. We have applied KMT to create AutoDJ, which is a system for automatically generating music playlists. 
However, the KMT idea may be applicable to other tasks.
Experiments with music playlist generation show that KMT leads to better results than a hand-built kernel when the number of training examples is small. The generated playlists are qualitatively very consonant and useful to play as background music.

References

[1] D. Barber and C. K. I. Williams. Gaussian processes for Bayesian classification via hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, NIPS, volume 9, pages 340–346, 1997.
[2] J. Baxter. A Bayesian/information theoretic model of bias learning. Machine Learning, 28:7–40, 1997.
[3] K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS, volume 11, pages 368–374, 1998.
[4] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Uncertainty in Artificial Intelligence, pages 43–52, 1998.
[5] R. Caruana. Learning many related tasks at the same time with backpropagation. In NIPS, volume 7, pages 657–664, 1995.
[6] V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Info. Theory, 42(6):75–85, 1996.
[7] N. A. C. Cressie. Statistics for Spatial Data. Wiley, New York, 1993.
[8] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment. Technical Report NC-TR-01-087, NeuroCOLT, 2001.
[9] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. CACM, 35(12):61–70, 1992.
[10] T. Minka and R. Picard. Learning how to learn is learning with point sets. http://wwwwhite.media.mit.edu/~tpminka/papers/learning.html, 1997.
[11] M. Pazzani and D. Billsus. 
Learning and revising user profiles: The identification of interesting web sites. Machine Learning, 27:313–331, 1997.
[12] P. S. R. S. Rao. Variance Components Estimation: Mixed Models, Methodologies and Applications. Chapman & Hall, 1997.
[13] S. Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, volume 8, pages 640–646, 1996.
[14] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In NIPS, volume 8, pages 514–520, 1996.
", "award": [], "sourceid": 1996, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}, {"given_name": "Christopher", "family_name": "Burges", "institution": null}, {"given_name": "Steven", "family_name": "Swenson", "institution": null}, {"given_name": "Christopher", "family_name": "Weare", "institution": null}, {"given_name": "Alice", "family_name": "Zheng", "institution": null}]}