{"title": "Deep content-based music recommendation", "book": "Advances in Neural Information Processing Systems", "page_first": 2643, "page_last": 2651, "abstract": "Automatic music recommendation has become an increasingly relevant problem in recent years, since a lot of music is now sold and consumed digitally. Most recommender systems rely on collaborative filtering. However, this approach suffers from the cold start problem: it fails when no usage data is available, so it is not effective for recommending new and unpopular songs. In this paper, we propose to use a latent factor model for recommendation, and predict the latent factors from music audio when they cannot be obtained from usage data. We compare a traditional approach using a bag-of-words representation of the audio signals with deep convolutional neural networks, and evaluate the predictions quantitatively and qualitatively on the Million Song Dataset. We show that using predicted latent factors produces sensible recommendations, despite the fact that there is a large semantic gap between the characteristics of a song that affect user preference and the corresponding audio signal. We also show that recent advances in deep learning translate very well to the music recommendation setting, with deep convolutional neural networks significantly outperforming the traditional approach.", "full_text": "Deep content-based music recommendation\n\nA\u00e4ron van den Oord, Sander Dieleman, Benjamin Schrauwen\n\nElectronics and Information Systems department (ELIS), Ghent University\n\n{aaron.vandenoord, sander.dieleman, benjamin.schrauwen}@ugent.be\n\nAbstract\n\nAutomatic music recommendation has become an increasingly relevant problem\nin recent years, since a lot of music is now sold and consumed digitally. Most\nrecommender systems rely on collaborative \ufb01ltering. 
However, this approach suf-\nfers from the cold start problem: it fails when no usage data is available, so it is not\neffective for recommending new and unpopular songs. In this paper, we propose\nto use a latent factor model for recommendation, and predict the latent factors\nfrom music audio when they cannot be obtained from usage data. We compare\na traditional approach using a bag-of-words representation of the audio signals\nwith deep convolutional neural networks, and evaluate the predictions quantita-\ntively and qualitatively on the Million Song Dataset. We show that using predicted\nlatent factors produces sensible recommendations, despite the fact that there is a\nlarge semantic gap between the characteristics of a song that affect user preference\nand the corresponding audio signal. We also show that recent advances in deep\nlearning translate very well to the music recommendation setting, with deep con-\nvolutional neural networks signi\ufb01cantly outperforming the traditional approach.\n\n1\n\nIntroduction\n\nIn recent years, the music industry has shifted more and more towards digital distribution through\nonline music stores and streaming services such as iTunes, Spotify, Grooveshark and Google Play.\nAs a result, automatic music recommendation has become an increasingly relevant problem: it al-\nlows listeners to discover new music that matches their tastes, and enables online music stores to\ntarget their wares to the right audience.\nAlthough recommender systems have been studied extensively, the problem of music recommenda-\ntion in particular is complicated by the sheer variety of different styles and genres, as well as social\nand geographic factors that in\ufb02uence listener preferences. The number of different items that can\nbe recommended is very large, especially when recommending individual songs. 
This number can\nbe reduced by recommending albums or artists instead, but this is not always compatible with the\nintended use of the system (e.g. automatic playlist generation), and it disregards the fact that the\nrepertoire of an artist is rarely homogenous: listeners may enjoy particular songs more than others.\nMany recommender systems rely on usage patterns: the combinations of items that users have con-\nsumed or rated provide information about the users\u2019 preferences, and how the items relate to each\nother. This is the collaborative \ufb01ltering approach. Another approach is to predict user preferences\nfrom item content and metadata.\n\nFactor | Artists with positive values | Artists with negative values\n1 | Justin Bieber, Alicia Keys, Maroon 5, John Mayer, Michael Bubl\u00e9 | The Kills, Interpol, Man Man, Beirut, the bird and the bee\n2 | Bonobo, Flying Lotus, Cut Copy, Chromeo, Boys Noize | Shinedown, Rise Against, Avenged Sevenfold, Nickelback, Flyleaf\n3 | Phoenix, Crystal Castles, Muse, R\u00f6yksopp, Paramore | Traveling Wilburys, Cat Stevens, Creedence Clearwater Revival, Van Halen, The Police\n\nTable 1: Artists whose tracks have very positive and very negative values for three latent factors. The factors\nseem to discriminate between different styles, such as indie rock, electronic music and classic rock.\n\nThe consensus is that collaborative \ufb01ltering will generally outperform content-based recommenda-\ntion [1]. However, it is only applicable when usage data is available. Collaborative \ufb01ltering suffers\nfrom the cold start problem: new items that have not been consumed before cannot be recommended.\nAdditionally, items that are only of interest to a niche audience are more dif\ufb01cult to recommend be-\ncause usage data is scarce. In many domains, and especially in music, they comprise the majority of\nthe available items, because the users\u2019 consumption patterns follow a power law [2]. 
Content-based\nrecommendation is not affected by these issues.\n\n1.1 Content-based music recommendation\n\nMusic can be recommended based on available metadata: information such as the artist, album and\nyear of release is usually known. Unfortunately this will lead to predictable recommendations. For\nexample, recommending songs by artists that the user is known to enjoy is not particularly useful.\nOne can also attempt to recommend music that is perceptually similar to what the user has previously\nlistened to, by measuring the similarity between audio signals [3, 4]. This approach requires the\nde\ufb01nition of a suitable similarity metric. Such metrics are often de\ufb01ned ad hoc, based on prior\nknowledge about music audio, and as a result they are not necessarily optimal for the task of music\nrecommendation. Because of this, some researchers have used user preference data to tune similarity\nmetrics [5, 6].\n\n1.2 Collaborative \ufb01ltering\n\nCollaborative \ufb01ltering methods can be neighborhood-based or model-based [7]. The former methods\nrely on a similarity measure between users or items: they recommend items consumed by other users\nwith similar preferences, or similar items to the ones that the user has already consumed. Model-\nbased methods on the other hand attempt to model latent characteristics of the users and items, which\nare usually represented as vectors of latent factors. Latent factor models have been very popular ever\nsince their effectiveness was demonstrated for movie recommendation in the Net\ufb02ix Prize [8].\n\n1.3 The semantic gap in music\n\nLatent factor vectors form a compact description of the different facets of users\u2019 tastes, and the\ncorresponding characteristics of the items. To demonstrate this, we computed latent factors for a\nsmall set of usage data, and listed some artists whose songs have very positive and very negative\nvalues for each factor in Table 1. 
This representation is quite versatile and can be used for other\napplications besides recommendation, as we will show later (see Section 5.1). Since usage data is\nscarce for many songs, it is often impossible to reliably estimate these factor vectors. Therefore it\nwould be useful to be able to predict them from music audio content.\nThere is a large semantic gap between the characteristics of a song that affect user preference, and the\ncorresponding audio signal. Extracting high-level properties such as genre, mood, instrumentation\nand lyrical themes from audio signals requires powerful models that are capable of capturing the\ncomplex hierarchical structure of music. Additionally, some properties are impossible to obtain\nfrom audio signals alone, such as the popularity of the artist, their reputation and their location.\nResearchers in the domain of music information retrieval (MIR) concern themselves with extracting\nthese high-level properties from music. They have grown to rely on a particular set of engineered\naudio features, such as mel-frequency cepstral coef\ufb01cients (MFCCs), which are used as input to\nsimple classi\ufb01ers or regressors, such as SVMs and linear regression [9]. Recently this traditional\napproach has been challenged by some authors who have applied deep neural networks to MIR\nproblems [10, 11, 12].\nIn this paper, we strive to bridge the semantic gap in music by training deep convolutional neural\nnetworks to predict latent factors from music audio. We evaluate our approach on an industrial-\nscale dataset with audio excerpts of over 380,000 songs, and compare it with a more conventional\napproach using a bag-of-words feature representation for each song. 
We assess to what extent it is\npossible to extract characteristics that affect user preference directly from audio signals, and evaluate\nthe predictions from our models in a music recommendation setting.\n\n2 The dataset\n\nThe Million Song Dataset (MSD) [13] is a collection of metadata and precomputed audio features\nfor one million contemporary songs. Several other datasets linked to the MSD are also available,\nfeaturing lyrics, cover songs, tags and user listening data. This makes the dataset suitable for a\nwide range of different music information retrieval tasks. Two linked datasets are of interest for our\nexperiments:\n\u2022 The Echo Nest Taste Pro\ufb01le Subset provides play counts for over 380,000 songs in the MSD,\ngathered from 1 million users. The dataset was used in the Million Song Dataset challenge [14]\nlast year.\n\n\u2022 The Last.fm dataset provides tags for over 500,000 songs.\nTraditionally, research in music information retrieval (MIR) on large-scale datasets was limited to\nindustry, because large collections of music audio cannot be published easily due to licensing issues.\nThe authors of the MSD circumvented these issues by providing precomputed features instead of raw\naudio. Unfortunately, the audio features provided with the MSD are of limited use, and the process\nby which they were obtained is not very well documented. The feature set was extended by Rauber\net al. [15], but the absence of raw audio data, or at least a mid-level representation, is still an issue.\nHowever, we were able to obtain 29-second audio clips for over 99% of the dataset from 7digital.com.\nDue to its size, the MSD allows for the music recommendation problem to be studied in a more\nrealistic setting than was previously possible. 
It is also worth noting that the Taste Pro\ufb01le Subset is\none of the largest collaborative \ufb01ltering datasets that are publicly available today.\n\n3 Weighted matrix factorization\n\nThe Taste Pro\ufb01le Subset contains play counts per song and per user, which is a form of implicit\nfeedback. We know how many times the users have listened to each of the songs in the dataset, but\nthey have not explicitly rated them. However, we can assume that users will probably listen to songs\nmore often if they enjoy them. If a user has never listened to a song, this can have many causes:\nfor example, they might not be aware of it, or they might expect not to enjoy it. This setting is not\ncompatible with traditional matrix factorization algorithms, which are aimed at predicting ratings.\nWe used the weighted matrix factorization (WMF) algorithm, proposed by Hu et al. [16], to learn\nlatent factor representations of all users and items in the Taste Pro\ufb01le Subset. This is a modi\ufb01ed\nmatrix factorization algorithm aimed at implicit feedback datasets. Let r_ui be the play count for\nuser u and song i. For each user-item pair, we de\ufb01ne a preference variable p_ui and a con\ufb01dence\nvariable c_ui (I(x) is the indicator function, \u03b1 and \u03b5 are hyperparameters):\n\n$$p_{ui} = I(r_{ui} > 0), \\tag{1}$$\n$$c_{ui} = 1 + \\alpha \\log(1 + \\epsilon^{-1} r_{ui}). \\tag{2}$$\n\nThe preference variable indicates whether user u has ever listened to song i. If it is 1, we will assume\nthe user enjoys the song. The con\ufb01dence variable measures how certain we are about this particular\npreference. It is a function of the play count, because songs with higher play counts are more likely\nto be preferred. 
If the song has never been played, the con\ufb01dence variable will have a low value,\nbecause this is the least informative case.\nThe WMF objective function is given by:\n\n$$\\min_{x_\\star, y_\\star} \\sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \\lambda \\left( \\sum_u \\|x_u\\|^2 + \\sum_i \\|y_i\\|^2 \\right), \\tag{3}$$\n\nwhere \u03bb is a regularization parameter, x_u is the latent factor vector for user u, and y_i is the latent\nfactor vector for song i. It consists of a con\ufb01dence-weighted mean squared error term and an L2\nregularization term. Note that the \ufb01rst sum ranges over all users and all songs: contrary to matrix\nfactorization for rating prediction, where terms corresponding to user-item combinations for which\nno rating is available can be discarded, we have to take all possible combinations into account. As\na result, using stochastic gradient descent for optimization is not practical for a dataset of this size.\nHu et al. propose an ef\ufb01cient alternating least squares (ALS) optimization method, which we used\ninstead.\n\n4 Predicting latent factors from music audio\n\nPredicting latent factors for a given song from the corresponding audio signal is a regression prob-\nlem. It requires learning a function that maps a time series to a vector of real numbers. We evaluate\ntwo methods to achieve this: one follows the conventional approach in MIR by extracting local\nfeatures from audio signals and aggregating them into a bag-of-words (BoW) representation. Any\ntraditional regression technique can then be used to map this feature representation to the factors.\nThe other method is to use a deep convolutional network.\nLatent factor vectors obtained by applying WMF to the available usage data are used as ground truth\nto train the prediction models. It should be noted that this approach is compatible with any type\nof latent factor model that is suitable for large implicit feedback datasets. 
We chose to use WMF\nbecause an ef\ufb01cient optimization procedure exists for it.\n\n4.1 Bag-of-words representation\n\nMany MIR systems rely on the following feature extraction pipeline to convert music audio signals\ninto a \ufb01xed-size representation that can be used as input to a classi\ufb01er or regressor [5, 17, 18, 19, 20]:\n\u2022 Extract MFCCs from the audio signals. We computed 13 MFCCs from windows of 1024\naudio frames, corresponding to 23 ms at a sampling rate of 22050 Hz, and a hop size of 512\nsamples. We also computed \ufb01rst and second order differences, yielding 39 coef\ufb01cients in total.\n\u2022 Vector quantize the MFCCs. We learned a dictionary of 4000 elements with the K-means\nalgorithm and assigned all MFCC vectors to the closest mean.\n\u2022 Aggregate them into a bag-of-words representation. For every song, we counted how many\ntimes each mean was selected. The resulting vector of counts is a bag-of-words feature repre-\nsentation of the song.\n\nWe then reduced the size of this representation using PCA (we kept enough components to retain\n95% of the variance) and used linear regression and a multilayer perceptron with 1000 hidden units\non top of this to predict latent factors. We also used it as input for the metric learning to rank (MLR)\nalgorithm [21], to learn a similarity metric for content-based recommendation. This was used as a\nbaseline for our music recommendation experiments, which are described in Section 5.2.\n\n4.2 Convolutional neural networks\n\nConvolutional neural networks (CNNs) have recently been used to improve on the state of the art in\nspeech recognition and large-scale image classi\ufb01cation by a large margin [22, 23]. 
Three ingredients\nseem to be central to the success of this approach:\n\u2022 Using recti\ufb01ed linear units (ReLUs) [24] instead of sigmoid nonlinearities leads to faster conver-\ngence and reduces the vanishing gradient problem that plagues traditional neural networks with\nmany layers.\n\u2022 Parallelization is used to speed up training, so that larger models can be trained in a reasonable\namount of time. We used the Theano library [25] to take advantage of GPU acceleration.\n\u2022 A large amount of training data is required to be able to \ufb01t large models with many parameters.\nThe MSD contains enough training data to be able to train large models effectively.\n\nWe have also evaluated the use of dropout regularization [26], but this did not yield any signi\ufb01cant\nimprovements.\nWe \ufb01rst extracted an intermediate time-frequency representation from the audio signals to use as\ninput to the network. We used log-compressed mel-spectrograms with 128 components and the same\nwindow size and hop size that we used for the MFCCs (1024 and 512 audio frames respectively).\nThe networks were trained on windows of 3 seconds sampled randomly from the audio clips. This\nwas done primarily to speed up training. To predict the latent factors for an entire clip, we averaged\nover the predictions for consecutive windows.\nConvolutional neural networks are especially suited for predicting latent factors from music audio,\nbecause they allow for intermediate features to be shared between different factors, and because their\nhierarchical structure consisting of alternating feature extraction layers and pooling layers allows\nthem to operate on multiple timescales.\n\n4.3 Objective functions\n\nLatent factor vectors are real-valued, so the most straightforward objective is to minimize the\nmean squared error (MSE) of the predictions. Alternatively, we can also continue to minimize\nthe weighted prediction error (WPE) from the WMF objective function. 
Let y_i be the latent factor\nvector for song i, obtained with WMF, and y'_i the corresponding prediction by the model. The\nobjective functions are then (\u03b8 represents the model parameters):\n\n$$\\min_{\\theta} \\sum_i \\|y_i - y'_i\\|^2, \\tag{4}$$\n\n$$\\min_{\\theta} \\sum_{u,i} c_{ui} (p_{ui} - x_u^T y'_i)^2. \\tag{5}$$\n\n5 Experiments\n\n5.1 Versatility of the latent factor representation\n\nTo demonstrate the versatility of the latent factor vectors, we compared them with audio features in\na tag prediction task. Tags can describe a wide range of different aspects of the songs, such as genre,\ninstrumentation, tempo, mood and year of release.\nWe ran WMF to obtain 50-dimensional latent factor vectors for all 9,330 songs in the subset, and\ntrained a logistic regression model to predict the 50 most popular tags from the Last.fm dataset\nfor each song. We also trained a logistic regression model on a bag-of-words representation of the\naudio signals, which was \ufb01rst reduced in size using PCA (see Section 4.1). We used 10-fold cross-\nvalidation and computed the average area under the ROC curve (AUC) across all tags. This resulted\nin an average AUC of 0.69365 for audio-based prediction, and 0.86703 for prediction based on\nthe latent factor vectors.\n\n5.2 Latent factor prediction: quantitative evaluation\n\nTo assess quantitatively how well we can predict latent factors from music audio, we used the pre-\ndictions from our models for music recommendation. For every user u and for every song i in the\ntest set, we computed the score x_u^T y_i, and recommended the songs with the highest scores \ufb01rst. As\nmentioned before, we also learned a song similarity metric on the bag-of-words representation using\nmetric learning to rank. 
In this case, scores for a given user are computed by averaging similarity\nscores across all the songs that the user has listened to.\nThe following models were used to predict latent factor vectors:\n\u2022 Linear regression trained on the bag-of-words representation described in Section 4.1.\n\u2022 A multi-layer perceptron (MLP) trained on the same bag-of-words representation.\n\u2022 A convolutional neural network trained on log-scaled mel-spectrograms to minimize the mean\nsquared error (MSE) of the predictions.\n\u2022 The same convolutional neural network, trained to minimize the weighted prediction error\n(WPE) from the WMF objective instead.\n\nModel | mAP | AUC\nMLR | 0.01801 | 0.60608\nlinear regression | 0.02389 | 0.63518\nMLP | 0.02536 | 0.64611\nCNN with MSE | 0.05016 | 0.70987\nCNN with WPE | 0.04323 | 0.70101\n\nTable 2: Results for all considered models on a subset of the dataset containing\nonly the 9,330 most popular songs, and\nlistening data for 20,000 users.\n\nFor our initial experiments, we used a subset of the\ndataset containing only the 9,330 most popular songs,\nand listening data for only 20,000 users. We used 1,881\nsongs for testing. For the other experiments, we used\nall available data: we used all songs that we have usage\ndata for and that we were able to download an audio clip\nfor (382,410 songs and 1 million users in total, 46,728\nsongs were used for testing).\nWe report the mean average precision (mAP, cut off at\n500 recommendations per user) and the area under the\nROC curve (AUC) of the predictions. We evaluated all\nmodels on the subset, using latent factor vectors with\n50 dimensions. We compared the convolutional neural\nnetwork with linear regression on the bag-of-words rep-\nresentation on the full dataset as well, using latent factor vectors with 400 dimensions. 
Results are\nshown in Tables 2 and 3 respectively.\nOn the subset, predicting the latent factors seems to outperform the metric learning approach. Using\nan MLP instead of linear regression results in a slight improvement, but the limitation here is clearly\nthe bag-of-words feature representation. Using a convolutional neural network results in another\nlarge increase in performance. Most likely this is because the bag-of-words representation does not\nre\ufb02ect any kind of temporal structure.\nInterestingly, the WPE objective does not result in improved performance. Presumably this is be-\ncause the weighting causes the importance of the songs to be proportional to their popularity. In\nother words, the model will be encouraged to predict latent factor vectors for popular songs from\nthe training set very well, at the expense of all other songs.\nOn the full dataset, the difference between the bag-of-\nwords approach and the convolutional neural network is\nmuch more pronounced. Note that we did not train an\nMLP on this dataset due to the small difference in per-\nformance with linear regression on the subset. We also\nincluded results for when the latent factor vectors are ob-\ntained from usage data. This is an upper bound to what\nis achievable when predicting them from content. 
There\nis a large gap between our best result and this theoretical\nmaximum, but this is to be expected: as we mentioned be-\nfore, many aspects of the songs that in\ufb02uence user prefer-\nence cannot possibly be extracted from audio signals only.\nIn particular, we are unable to predict the popularity of\nthe songs, which considerably affects the AUC and mAP\nscores.\n\nModel | mAP | AUC\nrandom | 0.00015 | 0.49935\nlinear regression | 0.00101 | 0.64522\nCNN with MSE | 0.00672 | 0.77192\nupper bound | 0.23278 | 0.96070\n\nTable 3: Results for linear regression on\na bag-of-words representation of the audio\nsignals, and a convolutional neural network\ntrained with the MSE objective, on the full\ndataset (382,410 songs and 1 million users).\nAlso shown are the scores achieved when\nthe latent factor vectors are randomized,\nand when they are learned from usage data\nusing WMF (upper bound).\n\n5.3 Latent factor prediction: qualitative evaluation\n\nEvaluating recommender systems is a complex matter, and\naccuracy metrics by themselves do not provide enough in-\nsight into whether the recommendations are sound. To establish this, we also performed some\nqualitative experiments on the subset. For each song, we searched for similar songs by measuring\nthe cosine similarity between the predicted usage patterns. We compared the usage patterns pre-\ndicted using the latent factors obtained with WMF (50 dimensions), with those using latent factors\npredicted with a convolutional neural network. A few songs and their closest matches according\nto both models are shown in Table 4. When the predicted latent factors are used, the matches are\nmostly different, but the results are quite reasonable in the sense that the matched songs are likely\nto appeal to the same audience. 
Furthermore, they seem to be a bit more varied, which is a useful\nproperty for recommender systems.\n\nQuery | Most similar tracks (WMF) | Most similar tracks (predicted)\nJonas Brothers - Hold On | Jonas Brothers - Games | Jonas Brothers - Video Girl\n | Miley Cyrus - G.N.O. (Girl\u2019s Night Out) | Jonas Brothers - Games\n | Miley Cyrus - Girls Just Wanna Have Fun | New Found Glory - My Friends Over You\n | Jonas Brothers - Year 3000 | My Chemical Romance - Thank You For The Venom\n | Jonas Brothers - BB Good | My Chemical Romance - Teenagers\nBeyonc\u00e9 - Speechless | Beyonc\u00e9 - Gift From Virgo | Daniel Beding\ufb01eld - If You\u2019re Not The One\n | Beyonce - Daddy | Rihanna - Haunted\n | Rihanna / J-Status - Crazy Little Thing Called Love | Alejandro Sanz - Siempre Es De Noche\n | Beyonc\u00e9 - Dangerously In Love | Madonna - Miles Away\n | Rihanna - Haunted | Lil Wayne / Shanell - American Star\nColdplay - I Ran Away | Coldplay - Careful Where You Stand | Arcade Fire - Keep The Car Running\n | Coldplay - The Goldrush | M83 - You Appearing\n | Coldplay - X & Y | Angus & Julia Stone - Hollywood\n | Coldplay - Square One | Bon Iver - Creature Fear\n | Jonas Brothers - BB Good | Coldplay - The Goldrush\nDaft Punk - Rock\u2019n Roll | Daft Punk - Short Circuit | Boys Noize - Shine Shine\n | Daft Punk - Nightvision | Boys Noize - Lava Lava\n | Daft Punk - Too Long (Gonzales Version) | Flying Lotus - Pet Monster Shotglass\n | Daft Punk - Aerodynamite | LCD Soundsystem - One Touch\n | Daft Punk - One More Time / Aerodynamic | Justice - One Minute To Midnight\n\nTable 4: A few songs and their closest matches in terms of usage patterns, using latent factors obtained with\nWMF and using latent factors predicted by a convolutional neural network.\n\nFigure 1: t-SNE visualization of the distribution of predicted usage patterns, using latent factors predicted\nfrom audio. A few close-ups show artists whose songs are projected in speci\ufb01c areas. 
We can discern hip-hop\n(red), rock (green), pop (yellow) and electronic music (blue). This \ufb01gure is best viewed in color.\n\nFollowing McFee et al. [5], we also visualized the distribution of predicted usage patterns in two\ndimensions using t-SNE [27]. A few close-ups are shown in Figure 1. Clusters of songs that appeal\nto the same audience seem to be preserved quite well, even though the latent factor vectors for all\nsongs were predicted from audio.\n\n6 Related work\n\nMany researchers have attempted to mitigate the cold start problem in collaborative \ufb01ltering by\nincorporating content-based features. We review some recent work in this area of research.\nWang et al. [28] extend probabilistic matrix factorization (PMF) [29] with a topic model prior on\nthe latent factor vectors of the items, and apply this model to scienti\ufb01c article recommendation.\nTopic proportions obtained from the content of the articles are used instead of latent factors when no\nusage data is available. The entire system is trained jointly, allowing the topic model and the latent\nspace learned by matrix factorization to adapt to each other. Our approach is sequential instead: we\n\ufb01rst obtain latent factor vectors for songs for which usage data is available, and use these to train\na regression model. Because we reduce the incorporation of content information to a regression\nproblem, we are able to use a deep convolutional network.\nMcFee et al. [5] de\ufb01ne an artist-level content-based similarity measure for music learned from a\nsample of collaborative \ufb01lter data using metric learning to rank [21]. They use a variation on the\ntypical bag-of-words approach for audio feature extraction (see Section 4.1). 
Their results corroborate\nthat relying on usage data to train the model improves content-based recommendations. For\naudio data they used the CAL10K dataset, which consists of 10,832 songs, so it is comparable in\nsize to the subset of the MSD that we used for our initial experiments.\nWeston et al. [17] investigate the problem of recommending items to a user given another item as\na query, which they call \u2018collaborative retrieval\u2019. They optimize an item scoring function using a\nranking loss and describe a variant of their method that allows for content features to be incorporated.\nThey also use the bag-of-words approach to extract audio features and evaluate this method\non a large proprietary dataset. They find that combining collaborative filtering and content-based information\ndoes not improve the accuracy of the recommendations over collaborative filtering alone.\nBoth McFee et al. and Weston et al. optimized their models using a ranking loss. We have opted to\nuse quadratic loss functions instead, because we found their optimization to be more easily scalable.\nUsing a ranking loss instead is an interesting direction of future research, although we suspect that\nthis approach may suffer from the same problems as the WPE objective (i.e. popular songs will have\nan unfair advantage).\n\n7 Conclusion\n\nIn this paper, we have investigated the use of deep convolutional neural networks to predict latent\nfactors from music audio when they cannot be obtained from usage data. We evaluated the predictions\nby using them for music recommendation on an industrial-scale dataset. Even though a lot\nof characteristics of songs that affect user preference cannot be predicted from audio signals, the\nresulting recommendations seem to be sensible. 
We can conclude that predicting latent factors from\nmusic audio is a viable method for recommending new and unpopular music.\nWe also showed that recent advances in deep learning translate very well to the music recommendation\nsetting in combination with this approach, with deep convolutional neural networks significantly\noutperforming a more traditional approach using bag-of-words representations of audio signals. This\nbag-of-words representation is used very often in MIR, and our results indicate that a lot of research\nin this domain could benefit significantly from using deep neural networks.\n\nReferences\n\n[1] M. Slaney. Web-scale multimedia analysis: Does content matter? MultiMedia, IEEE, 18(2):12\u201315, 2011.\n\n[2] \u00d2. Celma. Music Recommendation and Discovery in the Long Tail. PhD thesis, Universitat Pompeu\nFabra, Barcelona, 2008.\n\n[3] Malcolm Slaney, Kilian Q. Weinberger, and William White. Learning a metric for music similarity. In\nProceedings of the 9th International Conference on Music Information Retrieval (ISMIR), 2008.\n\n[4] Jan Schl\u00fcter and Christian Osendorfer. Music Similarity Estimation with the Mean-Covariance Restricted\nBoltzmann Machine. In Proceedings of the 10th International Conference on Machine Learning and\nApplications (ICMLA), 2011.\n\n[5] Brian McFee, Luke Barrington, and Gert R. G. Lanckriet. Learning content similarity for music recommendation.\nIEEE Transactions on Audio, Speech & Language Processing, 20(8), 2012.\n\n[6] Richard Stenzel and Thomas Kamps. Improving Content-Based Similarity Measures by Training a Collaborative\nModel. pages 264\u2013271, London, UK, September 2005. University of London.\n\n[7] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors. Recommender Systems\nHandbook. Springer, 2011.\n\n[8] James Bennett and Stan Lanning. The netflix prize. In Proceedings of KDD cup and workshop, volume\n2007, page 35, 2007.\n\n[9] Eric J. 
Humphrey, Juan P. Bello, and Yann LeCun. Moving beyond feature design: Deep architectures\nand automatic feature learning in music informatics. In Proceedings of the 13th International Conference\non Music Information Retrieval (ISMIR), 2012.\n\n[10] Philippe Hamel and Douglas Eck. Learning features from music audio with deep belief networks. In\nProceedings of the 11th International Conference on Music Information Retrieval (ISMIR), 2010.\n\n[11] Honglak Lee, Peter Pham, Yan Largman, and Andrew Ng. Unsupervised feature learning for audio\nclassification using convolutional deep belief networks. In Advances in Neural Information Processing\nSystems 22, 2009.\n\n[12] Sander Dieleman, Phil\u00e9mon Brakel, and Benjamin Schrauwen. Audio-based music classification with a\npretrained convolutional network. In Proceedings of the 12th International Conference on Music Information\nRetrieval (ISMIR), 2011.\n\n[13] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset.\nIn Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011.\n\n[14] Brian McFee, Thierry Bertin-Mahieux, Daniel P.W. Ellis, and Gert R.G. Lanckriet. The million song\ndataset challenge. In Proceedings of the 21st international conference companion on World Wide Web,\n2012.\n\n[15] Andreas Rauber, Alexander Schindler, and Rudolf Mayer. Facilitating comprehensive benchmarking\nexperiments on the million song dataset. In Proceedings of the 13th International Conference on Music\nInformation Retrieval (ISMIR), 2012.\n\n[16] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In\nProceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.\n\n[17] Jason Weston, Chong Wang, Ron Weiss, and Adam Berenzweig. Latent collaborative retrieval. 
In Proceedings\nof the 29th International Conference on Machine Learning, 2012.\n\n[18] Jason Weston, Samy Bengio, and Philippe Hamel. Large-scale music annotation and retrieval: Learning\nto rank in joint semantic spaces. Journal of New Music Research, 2011.\n\n[19] Jonathan T Foote. Content-based retrieval of music and audio. In Voice, Video, and Data Communications,\npages 138\u2013147. International Society for Optics and Photonics, 1997.\n\n[20] Matthew Hoffman, David Blei, and Perry Cook. Easy As CBA: A Simple Probabilistic Model for Tagging\nMusic. In Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR),\n2009.\n\n[21] Brian McFee and Gert R. G. Lanckriet. Metric learning to rank. In Proceedings of the 27th International\nConference on Machine Learning, 2010.\n\n[22] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew\nSenior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic\nmodeling in speech recognition: the shared views of four research groups. Signal Processing Magazine,\nIEEE, 29(6):82\u201397, 2012.\n\n[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural\nnetworks. In Advances in Neural Information Processing Systems 25, 2012.\n\n[24] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In\nProceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.\n\n[25] James Bergstra, Olivier Breuleux, Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins,\nJoseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression\ncompiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.\n\n[26] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. 
Salakhutdinov. Improving neural\nnetworks by preventing co-adaptation of feature detectors. Technical report, University of Toronto, 2012.\n\n[27] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning\nResearch, 9:2579\u20132605, 2008.\n\n[28] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles.\nIn Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data\nmining, 2011.\n\n[29] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information\nProcessing Systems, volume 20, 2008.\n", "award": [], "sourceid": 1239, "authors": [{"given_name": "Aaron", "family_name": "van den Oord", "institution": "Ghent University"}, {"given_name": "Sander", "family_name": "Dieleman", "institution": "Ghent University"}, {"given_name": "Benjamin", "family_name": "Schrauwen", "institution": "Ghent University"}]}