{"title": "Brains on Beats", "book": "Advances in Neural Information Processing Systems", "page_first": 2101, "page_last": 2109, "abstract": "We developed task-optimized deep neural networks (DNNs) that achieved state-of-the-art performance in different evaluation scenarios for automatic music tagging. These DNNs were subsequently used to probe the neural representations of music. Representational similarity analysis revealed the existence of a representational gradient across the superior temporal gyrus (STG). Anterior STG was shown to be more sensitive to low-level stimulus features encoded in shallow DNN layers whereas posterior STG was shown to be more sensitive to high-level stimulus features encoded in deep DNN layers.", "full_text": "Brains on Beats\n\nUmut G\u00fc\u00e7l\u00fc\n\nJordy Thielen\n\nRadboud University, Donders Institute for\n\nRadboud University, Donders Institute for\n\nBrain, Cognition and Behaviour\n\nNijmegen, the Netherlands\nu.guclu@donders.ru.nl\n\nBrain, Cognition and Behaviour\n\nNijmegen, the Netherlands\nj.thielen@psych.ru.nl\n\nMichael Hanke\u2217\n\nOtto-von-Guericke University Magdeburg\n\nCenter for Behavioral Brain Sciences\n\nMagdeburg, Germany\n\nmichael.hanke@ovgu.de\n\nMarcel A. J. van Gerven\u2020\n\nRadboud University, Donders Institute for\n\nBrain, Cognition and Behaviour\n\nNijmegen, the Netherlands\n\nm.vangerven@donders.ru.nl\n\nAbstract\n\nWe developed task-optimized deep neural networks (DNNs) that achieved state-of-\nthe-art performance in different evaluation scenarios for automatic music tagging.\nThese DNNs were subsequently used to probe the neural representations of music.\nRepresentational similarity analysis revealed the existence of a representational\ngradient across the superior temporal gyrus (STG). Anterior STG was shown to\nbe more sensitive to low-level stimulus features encoded in shallow DNN layers\nwhereas posterior STG was shown to be more sensitive to high-level stimulus\nfeatures encoded in deep DNN layers.\n\n1\n\nIntroduction\n\nThe human sensory system is devoted to the processing of sensory information to drive our perception\nof the environment [1]. Sensory cortices are thought to encode a hierarchy of ever more invariant\nrepresentations of the environment [2]. A research question that is at the core of sensory neuroscience\nis what sensory information is processed as one traverses the sensory pathways from the primary\nsensory areas to higher sensory areas.\nThe majority of the work on auditory cortical representations has remained limited to understanding\nthe neural representation of hand-designed low-level stimulus features such as spectro-temporal\nmodels [3], spectro-location models [4], timbre, rhythm, tonality [5\u20137] and pitch [8] or high-level\nrepresentations such as music genre [9] and sound categories [10]. For example, Santoro et al. [3]\nfound that a joint frequency-speci\ufb01c modulation transfer function predicted observed fMRI activity\nbest compared to frequency-nonspeci\ufb01c and independent models. They showed speci\ufb01city to \ufb01ne\nspectral modulations along Heschl\u2019s gyrus (HG) and anterior superior temporal gyrus (STG), whereas\ncoarse spectral modulations were mostly located posterior-laterally to HG, on the planum temporale\n(PT), and STG. Preference for slow temporal modulations was found along HG and STG, whereas fast\ntemporal modulations were observed on PT, and posterior and medially adjacent to HG. 
Also, it has been shown that activity in STG, somatosensory cortex, the default mode network, and cerebellum is sensitive to timbre, while the amygdala, hippocampus and insula are more sensitive to rhythmic and tonality features [5, 7]. However, these efforts have not yet provided a complete algorithmic account of sensory processing in the auditory system.\nSince their resurgence, deep neural networks (DNNs) coupled with functional magnetic resonance imaging (fMRI) have provided a powerful approach to form and test alternative hypotheses about what sensory information is processed in different brain regions. On the one hand, a task-optimized DNN model learns a hierarchy of nonlinear transformations in a supervised manner with the objective of solving a particular task. On the other hand, fMRI measures local changes in blood-oxygen-level dependent hemodynamic responses to sensory stimulation. Subsequently, any subset of the DNN representations that emerge from this hierarchy of nonlinear transformations can be used to probe neural representations by comparing DNN and fMRI responses to the same sensory stimuli. Considering that the sensory systems are biological neural networks that routinely perform the same tasks as their artificial counterparts, it is not inconceivable that DNN representations are suitable for probing neural representations.\nIndeed, this approach has been shown to be extremely successful in visual neuroscience. To date, several task-optimized DNN models have been used to accurately model visual areas in the dorsal and ventral streams [11\u201318], revealing representational gradients where deeper neural network layers map to more downstream areas along the visual pathways [19, 20]. Recently, it was shown that deep neural networks trained to map speech excerpts to word labels could be used to predict brain responses to natural sounds [21]. Here, deeper neural network layers were shown to map to auditory brain regions that were more distant from primary auditory cortex.\nIn the present work we expand on this line of research; our aim was to model how the human brain responds to music. We achieved this by probing neural representations of music features across the superior temporal gyrus using a deep neural network optimized for music tag prediction. We used the representations that emerged after training a DNN to predict tags of musical excerpts as candidate representations for different areas of STG in representational similarity analysis. We show that different DNN layers correspond to different locations along STG, such that anterior STG is more sensitive to low-level stimulus features encoded in shallow DNN layers whereas posterior STG is more sensitive to high-level stimulus features encoded in deep DNN layers.\n\n2 Materials and Methods\n\n2.1 MagnaTagATune Dataset\n\nWe used the MagnaTagATune dataset [22] for DNN estimation. The dataset contains 25,863 music clips. Each clip is a 29-second-long excerpt from one of 5223 songs from 445 albums by 230 artists. Each excerpt is supplied with a vector of binary annotations of 188 tags. These annotations are obtained from humans playing the two-player online TagATune game. In this game, the two players are either presented with the same or a different audio clip. Subsequently, they are asked to come up with tags for their specific audio clip. Afterward, players view each other\u2019s tags and are asked to decide whether they were presented with the same audio clip. Tags are only assigned when more than two players agree. The annotations include tags like \u2019singer\u2019, \u2019no singer\u2019, \u2019violin\u2019, \u2019drums\u2019, \u2019classical\u2019, \u2019jazz\u2019, et cetera. We restricted our analysis of this dataset to the top 50 most popular tags to ensure that there is enough training data for each tag. Parts 1-12 were used for training, part 13 was used for validation and parts 14-16 were used for testing.
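To make the tag selection and split described above concrete, the following is a minimal Python sketch rather than the authors\u2019 code. It assumes the publicly distributed annotation file annotations_final.csv (tab-separated, with one binary column per tag plus clip_id and mp3_path) and the common convention that the 16 dataset parts correspond to the leading hexadecimal directory of mp3_path; both details are assumptions about the MagnaTagATune release, not statements from the paper.

# Hypothetical MagnaTagATune preparation sketch; file name, column layout and
# part-to-directory mapping are assumptions (see the paragraph above).
import pandas as pd

ann = pd.read_csv("annotations_final.csv", sep="\t")
tag_cols = [c for c in ann.columns if c not in ("clip_id", "mp3_path")]

# Keep only the 50 most popular tags so that every tag has enough training data.
top50 = ann[tag_cols].sum().sort_values(ascending=False).head(50).index.tolist()
labels = ann[["clip_id", "mp3_path"] + top50]

# Parts 1-12 for training, part 13 for validation, parts 14-16 for testing,
# with parts assumed to map to the leading hex directory '0'-'f' of mp3_path.
part = labels["mp3_path"].str[0]
train = labels[part.isin(list("0123456789ab"))]
val = labels[part == "c"]
test = labels[part.isin(list("def"))]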
2.2 Studyforrest Dataset\n\nWe used the existing studyforrest dataset [23] for representational similarity analysis. The dataset contains fMRI data on the perception of musical genres. Twenty participants (age 21-38 years, mean age 26.6 years), with normal hearing and no known history of neurological disorders, listened to twenty-five 6-second, 44.1 kHz music clips. The stimulus set comprised five clips from each of the following five genres: Ambient, Roots Country, Heavy Metal, 50s Rock \u2018n Roll, and Symphonic. Stimuli were selected according to the procedure of [9]. The Ambient and Symphonic genres can be considered as non-vocal and the others as vocal. Participants completed eight runs, each with all twenty-five clips.\nUltra-high-field (7 Tesla) fMRI images were collected using a Siemens MAGNETOM scanner with T2*-weighted echo-planar imaging (gradient-echo, repetition time (TR) = 2000 ms, echo time (TE) = 22 ms, 0.78 ms echo spacing, 1488 Hz/Px bandwidth, generalized auto-calibrating partially parallel acquisition (GRAPPA) with acceleration factor 3, 24 Hz/Px bandwidth in phase encoding direction) and a 32-channel brain receiver coil. Thirty-six axial slices were acquired (thickness = 1.4 mm, 1.4 \u00d7 1.4 mm in-plane resolution, 224 mm field-of-view (FOV) centered on the approximate location of Heschl\u2019s gyrus, anterior-to-posterior phase encoding direction, 10% inter-slice gap). Along with the functional data, cardiac and respiratory traces and a structural MRI were collected. In our analyses, we only used the data from the 12 subjects (Subjects 1, 3, 4, 6, 7, 9, 12, 14\u201318) with no known data anomalies as reported in [23].\nThe anatomical and functional scans were preprocessed as follows: Functional scans were realigned to the first scan of the first run and next to the mean scan. Anatomical scans were coregistered to the mean functional scan. Realigned functional scans were slice-time corrected to account for the differences in image acquisition times between the slices. Realigned and slice-time corrected functional scans were normalized to MNI space. Finally, a general linear model was used to remove noise regressors derived from voxels unrelated to the experimental paradigm and to estimate BOLD response amplitudes [24]. We restricted our analyses to the superior temporal gyrus (STG).\n\n2.3 Deep Neural Networks\n\nWe developed three task-optimized DNN models for tag prediction.
Two of the models comprised five convolutional layers followed by three fully-connected layers (DNN-T model and DNN-F model). The inputs to these models were 96000-dimensional time-domain (DNN-T model) and frequency-domain (DNN-F model) representations of six-second-long audio signals, respectively. The third model comprised two streams of five convolutional layers followed by three fully-connected layers (DNN-TF model). The inputs to the streams were given by the time and frequency representations. The outputs of the convolutional streams were merged and fed into the first fully-connected layer. Figure 1 illustrates the architecture of the one-stream models.\n\nFigure 1: Architecture of the one-stream models. The first seven layers are followed by parametric softplus units [25], and the last layer is followed by sigmoid units. The architecture is similar to that of AlexNet [26] except for the following modifications: (i) The number of convolutional kernels is halved. (ii) The (convolutional and pooling) kernels and strides are flattened. That is, an n \u00d7 n kernel is changed to an n\u00b2 \u00d7 1 kernel and an m \u00d7 m stride is changed to an m\u00b2 \u00d7 1 stride. (iii) Local response normalization is replaced with batch normalization [27]. (iv) Rectified linear units are replaced with parametric softplus units with initial \u03b1 = 0.2 and initial \u03b2 = 0.5. (v) Softmax units are replaced with sigmoid units.\n[Figure 1 diagram omitted: input followed by conv1\u2013conv5 (with pool1, pool2 and pool5) and full6\u2013full8, annotated with per-layer channel counts, kernel sizes, strides and input/output sizes.]\n\nWe used Adam [28] with parameters \u03b1 = 0.0002, \u03b21 = 0.5, \u03b22 = 0.999, \u03b5 = 1e\u22128 and a mini-batch size of 36 to train the models by minimizing the binary cross-entropy loss function. Initial model parameters were drawn from a uniform distribution as described in [29]. Songs in each training mini-batch were randomly cropped to six seconds (96000 samples). The epoch in which the validation performance was highest was taken as the final model (53, 12 and 12 for the T, F and TF models, respectively). The DNN models were implemented in Keras [30].
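As an illustration of the architecture and training setup just described, below is a minimal Keras sketch of a one-stream model; it is not the authors\u2019 implementation. The exact kernel sizes and strides (121/16 for conv1, 25 for conv2, 9 for conv3-conv5, 9/4 for pooling) are assumptions derived from the stated halved-and-flattened AlexNet rule, and the placement of the batch normalization and dropout layers is likewise assumed to follow the usual AlexNet positions.

# Hedged sketch of the one-stream architecture (Figure 1); layer sizes and
# normalization/dropout placement are assumptions, not the authors' code.
import tensorflow as tf

class ParametricSoftplus(tf.keras.layers.Layer):
    """f(x) = alpha * log(1 + exp(beta * x)) with trainable alpha and beta [25]."""
    def build(self, input_shape):
        self.alpha = self.add_weight(name="alpha", shape=(1,),
                                     initializer=tf.keras.initializers.Constant(0.2))
        self.beta = self.add_weight(name="beta", shape=(1,),
                                    initializer=tf.keras.initializers.Constant(0.5))
    def call(self, x):
        return self.alpha * tf.math.softplus(self.beta * x)

def conv(filters, kernel, stride=1):
    # Glorot-uniform initialization is the Keras default, matching [29].
    return [tf.keras.layers.Conv1D(filters, kernel, strides=stride), ParametricSoftplus()]

def pool():
    return tf.keras.layers.MaxPooling1D(pool_size=9, strides=4)  # flattened 3x3 pool, 2x2 stride

model = tf.keras.Sequential(
    [tf.keras.Input(shape=(96000, 1))]                                            # six seconds of audio
    + conv(48, 121, stride=16) + [tf.keras.layers.BatchNormalization(), pool()]   # conv1, pool1
    + conv(128, 25) + [tf.keras.layers.BatchNormalization(), pool()]              # conv2, pool2
    + conv(192, 9)                                                                # conv3
    + conv(192, 9)                                                                # conv4
    + conv(128, 9) + [pool()]                                                     # conv5, pool5
    + [tf.keras.layers.Flatten(),
       tf.keras.layers.Dense(4096), ParametricSoftplus(), tf.keras.layers.Dropout(0.5),  # full6
       tf.keras.layers.Dense(4096), ParametricSoftplus(), tf.keras.layers.Dropout(0.5),  # full7
       tf.keras.layers.Dense(50, activation="sigmoid")])                                 # full8: 50 tags

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5,
                                                 beta_2=0.999, epsilon=1e-8),
              loss="binary_crossentropy")

The two-stream DNN-TF variant would apply the same convolutional stack to the time-domain and frequency-domain inputs separately and concatenate the two streams before the first fully-connected layer.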
Once trained, we first tested the tag prediction performance of the models and identified the model with the highest performance. To predict the tags of a 29-second-long song excerpt in the test split of the MagnaTagATune dataset, we first predicted the tags of 24 six-second-long overlapping segments separated by one second and averaged the predictions.\nWe then used the model with the highest performance to nonlinearly transform the stimuli into eight layers of hierarchical representations for the subsequent analyses. Note that the artificial neurons in the convolutional layers locally filtered their inputs (1D convolution), nonlinearly transformed them and returned temporal representations per stimulus. These representations were further processed by averaging them over time. In contrast, the artificial neurons in the fully-connected layers globally filtered their inputs (dot product), nonlinearly transformed them and returned scalar representations per stimulus. These representations were not further processed. These transformations resulted in n matrices of size m \u00d7 p_i, where n is the number of layers (8), m is the number of stimuli (25) and p_i is the number of artificial neurons in the ith layer (48 or 96, 128 or 256, 192 or 384, 192 or 384, 128 or 256, 4096, 4096 and 50 for i = 1, ..., 8, respectively).\n\n2.4 Representational Similarity Analysis\n\nWe used Representational Similarity Analysis (RSA) [31] to investigate how well the representational structures of the DNN model layers match that of the response patterns in STG. In RSA, models and brain regions are characterized by n \u00d7 n representational dissimilarity matrices (RDMs), whose elements represent the dissimilarity between the neural or model representations of a pair of stimuli. In turn, computing the overlap between the model and neural RDMs provides evidence about how well a particular model explains the response patterns in a particular brain region. Specifically, we performed a region of interest analysis as well as a searchlight analysis by first constructing the RDMs of STG (target RDMs) and the model layers (candidate RDMs). In the ROI analysis, this resulted in one target RDM per subject and eight candidate RDMs. For each subject, we correlated the upper triangular parts of the target RDM with the candidate RDMs (Spearman correlation). We quantified the similarity of the STG representations with the model representations as the mean correlation. For the searchlight analysis, this resulted in 27277 target RDMs (each derived from a spherical neighborhood of 100 voxels) and 8 candidate RDMs. For each subject and target RDM, we correlated the upper triangular parts of the target RDM with the candidate RDMs (Spearman correlation). Then, the layer which resulted in the highest correlation was assigned to the voxel at the center of the corresponding neighborhood. Finally, the layer assignments were averaged over the subjects and the result was taken as the final layer assignment of the voxels.\n\n2.5 Control Models\n\nTo evaluate the importance of task optimization for modeling STG representations, we compared the representational similarities of the entire STG region and the task-optimized DNN-TF model layers with the representational similarities of the entire STG region and two sets of control models.\nThe first set of control models transformed the stimuli to the following 48-dimensional model representations\u00b3 (a sketch of these four transformations follows the list):\n\u2022 Mel-frequency spectrum (mfs) representing a mel-scaled short-term power spectrum inspired by human auditory perception, where frequencies are organized by equidistant pitch locations. These representations were computed by applying (i) a short-time Fourier transform and (ii) a mel-scaled frequency-domain filterbank.\n\u2022 Mel-frequency cepstral coefficients (mfccs) representing both broad-spectrum information (timbre) and fine-scale spectral structure (pitch). These representations were computed by (i) mapping the mfs to a decibel amplitude scale and (ii) multiplying them by the discrete cosine transform matrix.\n\u2022 Low-quefrency mel-frequency spectrum (lq_mfs) representing timbre. These representations were computed by (i) zeroing the high-quefrency mfccs, (ii) multiplying them by the inverse of the discrete cosine transform matrix and (iii) mapping them back from the decibel amplitude scale.\n\u2022 High-quefrency mel-frequency spectrum (hq_mfs) representing pitch. These representations were computed by (i) zeroing the low-quefrency mfccs, (ii) multiplying them by the inverse of the discrete cosine transform matrix and (iii) mapping them back from the decibel amplitude scale.
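The four baseline representations above form a short signal-processing chain, sketched below purely for illustration: the analyses in this paper use the features shipped with the studyforrest dataset (see the footnote), so the library choice (librosa), the orthonormal DCT and the quefrency cutoff of 13 coefficients in this reconstruction are all assumptions.

# Illustrative reconstruction of the four 48-dimensional baseline features;
# library choice and quefrency cutoff are assumptions, not the dataset's own code.
import librosa
from scipy.fftpack import dct, idct

def control_features(y, sr, n_mels=48, n_low=13):
    # mfs: short-time Fourier transform followed by a mel-scaled filterbank.
    mfs = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # mfcc: decibel-scaled mfs multiplied by the discrete cosine transform matrix.
    mfs_db = librosa.power_to_db(mfs)
    mfcc = dct(mfs_db, type=2, axis=0, norm="ortho")
    # lq_mfs / hq_mfs: zero the high / low quefrencies, invert the DCT and
    # map the result back from the decibel amplitude scale.
    lq, hq = mfcc.copy(), mfcc.copy()
    lq[n_low:] = 0.0   # keep low quefrencies (timbre)
    hq[:n_low] = 0.0   # keep high quefrencies (pitch)
    lq_mfs = librosa.db_to_power(idct(lq, type=2, axis=0, norm="ortho"))
    hq_mfs = librosa.db_to_power(idct(hq, type=2, axis=0, norm="ortho"))
    return mfs, mfcc, lq_mfs, hq_mfs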
The second set of control models comprised 10 random DNN models with the same architecture as the DNN-TF model, but with parameters drawn from a zero-mean, unit-variance multivariate Gaussian distribution.\n\n\u00b3These are provided as part of the studyforrest dataset [23].\n\n3 Results\n\nIn the first set of experiments, we analyzed the task-optimized DNN models. The tag prediction performance of the models for the individual tags was defined as the area under the receiver operating characteristic (ROC) curve (AUC).\nWe first compared the mean performance of the models over all tags (Figure 2). The performance of all models was significantly above chance level (p \u226a 0.001, Student\u2019s t-test, Bonferroni correction). The highest performance was achieved by the DNN-TF model (0.8939), followed by the DNN-F model (0.8905) and the DNN-T model (0.8852). To the best of our knowledge, this is the highest tag prediction performance of an end-to-end model evaluated on the same split of the same dataset [32]. The performance was further improved by averaging the predictions of the DNN-T and DNN-F models (0.8982) as well as those of the DNN-T, DNN-F and DNN-TF models (0.9007). To the best of our knowledge, this is the highest tag prediction performance of any model (ensemble) evaluated on the same split of the same dataset [33, 32, 34]. For the remainder of the analyses, we considered only the DNN-TF model since it achieved the highest single-model performance.\n\nFigure 2: Tag prediction performance of the task-optimized DNN models. Bars show AUCs over all tags for the corresponding task-optimized DNN models. Error bars show \u00b1 SE. All pairwise differences are significant except for the pairs 1 and 2, and 2 and 3 (p < 0.05, paired-sample t-test, Bonferroni correction).\n\nWe then compared the performance of the DNN-TF model for the individual tags (Figure 3). Visual inspection did not reveal a prominent pattern in the performance distribution over tags. The performance was not significantly correlated with tag popularity (p > 0.05, Student\u2019s t-test). The only exception was that the performance for the positive tags was significantly higher than that for the negative tags (p \u226a 0.001, Student\u2019s t-test).\n\nFigure 3: Tag prediction performance of the task-optimized DNN-TF model. Bars show AUCs for the corresponding tags. Red band shows the mean \u00b1 SE for the task-optimized DNN-TF model over all tags.\n\nIn the second set of experiments, we analyzed how closely the representational geometry of STG is related to the representational geometries of the task-optimized DNN-TF model layers.\nFirst, we constructed the candidate RDMs of the layers (Figure 4). Visual inspection revealed similarity structure patterns that became increasingly prominent with increasing layer depth. The most prominent pattern was the subdivision into non-vocal and vocal genres.\n\nFigure 4: RDMs of the task-optimized DNN-TF model layers. Matrix elements show the dissimilarity (1 - Spearman\u2019s r) between the model layer representations of the corresponding trials. Matrix rows and columns are sorted according to the genres of the corresponding trials.
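Before turning to the ROI and searchlight comparisons, the following minimal sketch (not the authors\u2019 code; variable shapes are illustrative) spells out the RSA machinery of Section 2.4: an RDM holds 1 - Spearman\u2019s r between the representations of every pair of stimuli, two RDMs are compared by Spearman-correlating their upper triangular parts, and a searchlight center is assigned the layer whose RDM matches best.

# Minimal RSA sketch: RDM construction, RDM comparison and best-layer assignment.
import numpy as np
from scipy.stats import spearmanr

def rdm(rep):
    """RDM of an (n_stimuli x n_features) representation: 1 - Spearman's r per stimulus pair."""
    rho, _ = spearmanr(rep, axis=1)   # axis=1: correlate rows (stimuli) over features
    return 1.0 - rho

def rdm_similarity(target, candidate):
    """Spearman correlation between the upper triangular parts of two RDMs."""
    iu = np.triu_indices_from(target, k=1)
    rho, _ = spearmanr(target[iu], candidate[iu])
    return rho

def best_layer(voxel_responses, layer_representations):
    """Assign the layer whose candidate RDM best matches the target RDM of an ROI or searchlight."""
    target = rdm(voxel_responses)
    sims = [rdm_similarity(target, rdm(rep)) for rep in layer_representations]
    return int(np.argmax(sims)) + 1, sims   # 1-based layer index plus all similarities

In the ROI analysis the target RDM is computed from the response pattern of the entire STG of a subject; in the searchlight analysis it is computed from a 100-voxel spherical neighborhood, and the resulting per-subject layer assignments are averaged to give the maps in Figure 6.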
Second, we performed a region of interest analysis by comparing the reference RDM of the entire STG region with the candidate RDMs (Figure 5). While none of the correlations between the reference RDM and the candidate RDMs reached the noise ceiling (the expected correlation between the reference RDM and the RDM of the true model given the noise in the analyzed data [31]), they were all significantly above chance level (p < 0.05, signed-rank test with subject RFX, FDR correction). The highest correlation was found for Layer 1 (0.6811), whereas the lowest correlation was found for Layer 8 (0.4429).\n\nFigure 5: Representational similarities of the entire STG region and the task-optimized DNN-TF model layers. Bars show the mean similarity (Spearman\u2019s r) of the target RDM and the corresponding candidate RDMs over all subjects. Error bars show \u00b1 SE. Red band shows the expected representational similarity of the STG and the true model given the noise in the analyzed data (noise ceiling). All pairwise differences are significant except for the pairs 1 and 5, 2 and 6, and 3 and 4 (p < 0.05, signed-rank test with subject RFX, FDR correction).\n\nThird, we performed a searchlight analysis [35] by comparing the reference RDMs of multiple STG voxel neighborhoods with the candidate RDMs (Figure 6). Each neighborhood center was assigned the layer whose candidate RDM was maximally correlated with the corresponding target RDM. This analysis revealed a systematic change in the mean layer assignments over subjects along STG. They increased from anterior STG to posterior STG such that most voxels in the region of the transverse temporal gyrus were assigned to the shallower layers and most voxels in the region of the angular gyrus were assigned to the deeper layers. The corresponding mean correlations between the target and the candidate RDMs decreased from anterior to posterior STG.\nIn order to quantify the gradient in layer assignment, we correlated the mean layer assignment of the STG voxels in each coronal slice with the slice position, which was taken to be the slice number. Layer and position were significantly correlated for the voxels along the anterior-posterior STG direction (r = 0.7255, Pearson\u2019s r, p \u226a 0.001, Student\u2019s t-test). Furthermore, the mean correlations between the target and the candidate RDMs for the majority (85.53%) of the STG voxels were significant (p < 0.05, signed-rank test with subject RFX, FDR correction for the number of voxels followed by Bonferroni correction for the number of layers). However, the correlations of many voxels at the posterior end of STG were not highly significant in contrast to their central counterparts and ceased to be significant as the (multiple comparisons corrected) critical value was decreased from 0.05 to 0.01, which reduced the number of voxels surviving the critical value from 85.53% to 75.32%. Nevertheless, the gradient in layer assignment was maintained even when the voxels that did not survive the new critical value were ignored (r = 0.7332, Pearson\u2019s r, p \u226a 0.001, Student\u2019s t-test).\n\nFigure 6: Representational similarities of the spherical STG voxel clusters and the task-optimized DNN-TF model layers. Only the STG voxels that survived the (multiple comparisons corrected) critical value of 0.05 are shown. Those that did not survive the critical value of 0.01 are indicated with transparent white masks and black outlines. (A) Mean representational similarities over subjects. (B) Mean layer assignments over subjects.
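To make the gradient quantification concrete, the sketch below (illustrative variable names, not the authors\u2019 code) computes the mean layer assignment per coronal slice and correlates it with slice position along the anterior-posterior axis.

# Sketch of the layer-assignment gradient: mean assigned layer per coronal slice
# correlated with slice position (Pearson's r).
import numpy as np
from scipy.stats import pearsonr

def layer_gradient(slice_index, layer_assignment):
    """slice_index: coronal slice number per STG voxel; layer_assignment: mean assigned layer per voxel."""
    slices = np.unique(slice_index)
    mean_layer = np.array([layer_assignment[slice_index == s].mean() for s in slices])
    return pearsonr(slices, mean_layer)   # correlation and p-value along the anterior-posterior axis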
These results show that increasingly posterior STG voxels can be modeled with increasingly deeper DNN layers optimized for music tag prediction. This observation is in line with the visual neuroscience literature, where it was shown that increasingly deeper layers of DNNs optimized for visual object and action recognition can be used to model increasingly downstream ventral and dorsal stream voxels [19, 20]. It also agrees with previous work showing a gradient in auditory cortex with DNNs optimized for speech-to-word mapping [21]. It would be of particular interest to compare the respective gradients and use the music and speech DNNs as each other\u2019s control model so as to disentangle speech- and music-specific representations in auditory cortex.\nIn the last set of experiments, we analyzed the control models. We first constructed the RDMs of the control models (Figure 7). Visual inspection revealed considerable differences between the RDMs of the task-optimized DNN-TF model and those of the control models.\n\nFigure 7: RDMs of the random DNN model layers (top row) and the baseline models (bottom row). Matrix elements show the dissimilarity (1 - Spearman\u2019s r) between the model layer representations of the corresponding trials. Matrix rows and columns are sorted according to the genres of the corresponding trials.\n\nWe then compared the similarities of the task-optimized candidate RDMs and the target RDM versus the similarities of the control RDMs and the target RDM (Figure 8). The layers of the task-optimized DNN model significantly outperformed the corresponding layers of the random DNN model (\u2206r = 0.21, p < 0.05, signed-rank test with subject RFX, FDR correction) and the four baseline models (\u2206r = 0.42 for mfs, \u2206r = 0.21 for mfcc, \u2206r = 0.44 for lq_mfs and \u2206r = 0.34 for hq_mfs, signed-rank test with subject RFX, FDR correction). Furthermore, we performed the searchlight analysis with the random DNN model to determine whether the gradient in layer assignment is a consequence of model architecture or model representation. We found that the random DNN model failed to maintain the gradient in layer assignment (r = \u22120.2175, Pearson\u2019s r, p = 0.0771, Student\u2019s t-test), suggesting that the gradient is in the representation that emerges from task optimization.\nThese results show the importance of task optimization for modeling STG representations. This observation is also in line with the visual neuroscience literature, where similar analyses showed the importance of task optimization for modeling ventral stream representations [19, 17].\n\nFigure 8: Control analyses. (A) Representational similarities of the entire STG region and the task-optimized DNN-TF model versus the representational similarities of the entire STG region and the control models. Different colors show different control models: Random DNN model, mfs model, mfcc model, lq_mfs model and hq_mfs model. Bars show mean similarity differences over subjects. Error bars show \u00b1 SE. (B) Mean layer assignments over subjects for the random DNN model.
Voxels,\nmasks and outlines are the same as those in Figure 6.\n\n4 Conclusion\n\nWe showed that task-optimized DNNs that use time and/or frequency domain representations of\nmusic achieved state-of-the-art performance in various evaluation scenarios for automatic music\ntagging. Comparison of DNN and STG representations revealed a representational gradient in STG\nwith anterior STG being more sensitive to low-level stimulus features (shallow DNN layers) and\nposterior STG being more sensitive to high-level stimulus features (deep DNN layers). These results,\nin conjunction with previous results on the visual and auditory cortical representations, suggest\nthe existence of multiple representational gradients that process increasingly complex conceptual\ninformation as we traverse sensory pathways of the human brain.\n\nReferences\n[1] B. L. Schwartz and J. H. Krantz, Sensation and Perception. SAGE Publications, 2015.\n\n[2] J. M. Fuster, Cortex and Mind: Unifying Cognition. Oxford University Press, 2003.\n\n[3] R. Santoro, M. Moerel, F. D. Martino, R. Goebel, K. Ugurbil, E. Yacoub, and E. Formisano, \u201cEncoding\nof natural sounds at multiple spectral and temporal resolutions in the human auditory cortex,\u201d PLOS\nComputational Biology, vol. 10, p. e1003412, jan 2014.\n\n[4] M. Moerel, F. D. Martino, K. U\u02d8gurbil, E. Yacoub, and E. Formisano, \u201cProcessing of frequency and location\n\nin human subcortical auditory structures,\u201d Scienti\ufb01c Reports, vol. 5, p. 17048, nov 2015.\n\n[5] V. Alluri, P. Toiviainen, I. P. J\u00e4\u00e4skel\u00e4inen, E. Glerean, M. Sams, and E. Brattico, \u201cLarge-scale brain\nnetworks emerge from dynamic processing of musical timbre, key and rhythm,\u201d NeuroImage, vol. 59,\npp. 3677\u20133689, feb 2012.\n\n[6] V. Alluri, P. Toiviainen, T. E. Lund, M. Wallentin, P. Vuust, A. K. Nandi, T. Ristaniemi, and E. Brattico,\n\u201cFrom vivaldi to beatles and back: Predicting lateralized brain responses to music,\u201d NeuroImage, vol. 83,\npp. 627\u2013636, dec 2013.\n\n[7] P. Toiviainen, V. Alluri, E. Brattico, M. Wallentin, and P. Vuust, \u201cCapturing the musical brain with lasso:\n\nDynamic decoding of musical features from fMRI data,\u201d NeuroImage, vol. 88, pp. 170\u2013180, mar 2014.\n\n[8] R. D. Patterson, S. Uppenkamp, I. S. Johnsrude, and T. D. Grif\ufb01ths, \u201cThe processing of temporal pitch and\n\nmelody information in auditory cortex,\u201d Neuron, vol. 36, pp. 767\u2013776, nov 2002.\n\n[9] M. Casey, J. Thompson, O. Kang, R. Raizada, and T. Wheatley, \u201cPopulation codes representing musical\n\ntimbre for high-level fMRI categorization of music genres,\u201d in MLINI, 2011.\n\n[10] N. Staeren, H. Renvall, F. D. Martino, R. Goebel, and E. Formisano, \u201cSound categories are represented as\n\ndistributed patterns in the human auditory cortex,\u201d Current Biology, vol. 19, pp. 498\u2013502, mar 2009.\n\n8\n\nAB\f[11] D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo, \u201cPerformance-\noptimized hierarchical models predict neural responses in higher visual cortex,\u201d Proceedings of the National\nAcademy of Sciences, vol. 111, pp. 8619\u20138624, may 2014.\n\n[12] P. Agrawal, D. Stansbury, J. Malik, and J. L. Gallant, \u201cPixels to voxels: Modeling visual representation in\n\nthe human brain,\u201d arXiv:1407.5104, 2014.\n\n[13] S.-M. Khaligh-Razavi and N. 
Kriegeskorte, \u201cDeep supervised, but not unsupervised, models may explain\n\nIT cortical representation,\u201d PLOS Computational Biology, vol. 10, p. e1003915, nov 2014.\n\n[14] C. F. Cadieu, H. Hong, D. L. K. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and J. J. DiCarlo,\n\u201cDeep neural networks rival the representation of primate IT cortex for core visual object recognition,\u201d\nPLOS Computational Biology, vol. 10, p. e1003963, dec 2014.\n\n[15] T. Horikawa and Y. Kamitani, \u201cGeneric decoding of seen and imagined objects using hierarchical visual\n\nfeatures,\u201d arXiv:1510.06479, 2015.\n\n[16] R. M. Cichy, A. Khosla, D. Pantazis, A. Torralba, and A. Oliva, \u201cDeep neural networks predict hierarchical\n\nspatio-temporal cortical dynamics of human visual object recognition,\u201d arXiv:1601.02970, 2016.\n\n[17] D. Seibert, D. L. Yamins, D. Ardila, H. Hong, J. J. DiCarlo, and J. L. Gardner, \u201cA performance-optimized\n\nmodel of neural responses across the ventral visual stream,\u201d bioRxiv, 2016.\n\n[18] R. M. Cichy, A. Khosla, D. Pantazis, and A. Oliva, \u201cDynamics of scene representations in the human brain\n\nrevealed by magnetoencephalography and deep neural networks,\u201d NeuroImage, apr 2016.\n\n[19] U. G\u00fc\u00e7l\u00fc and M. A. J. van Gerven, \u201cDeep neural networks reveal a gradient in the complexity of neural\nrepresentations across the ventral stream,\u201d Journal of Neuroscience, vol. 35, pp. 10005\u201310014, jul 2015.\n\n[20] U. G\u00fc\u00e7l\u00fc and M. A. J. van Gerven, \u201cIncreasingly complex representations of natural movies across the\n\ndorsal stream are shared between subjects,\u201d NeuroImage, dec 2015.\n\n[21] A. Kell, D. Yamins, S. Norman-Haignere, and J. McDermott, \u201cSpeech-trained neural networks behave like\n\nhuman listeners and reveal a hierarchy in auditory cortex,\u201d in COSYNE, 2016.\n\n[22] E. Law, K. West, M. Mandel, M. Bay, and J. S. Downie, \u201cEvaluation of algorithms using games: The case\n\nof music tagging,\u201d in ISMIR, 2009.\n\n[23] M. Hanke, R. Dinga, C. H\u00e4usler, J. S. Guntupalli, M. Casey, F. R. Kaule, and J. Stadler, \u201cHigh-\nresolution 7-tesla fMRI data on the perception of musical genres \u2013 an extension to the studyforrest\ndataset,\u201d F1000Research, jun 2015.\n\n[24] K. N. Kay, A. Rokem, J. Winawer, R. F. Dougherty, and B. A. Wandell, \u201cGLMdenoise: a fast, automated\n\ntechnique for denoising task-based fMRI data,\u201d Frontiers in Neuroscience, vol. 7, 2013.\n\n[25] J. M. McFarland, Y. Cui, and D. A. Butts, \u201cInferring nonlinear neuronal computation based on physiologi-\n\ncally plausible inputs,\u201d PLOS Computational Biology, vol. 9, p. e1003143, jul 2013.\n\n[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImageNet classi\ufb01cation with deep convolutional neural\n\nnetworks,\u201d in NIPS, 2012.\n\n[27] S. Ioffe and C. Szegedy, \u201cBatch normalization: Accelerating deep network training by reducing internal\n\ncovariate shift,\u201d arXiv:1502.03167, 2015.\n\n[28] D. Kingma and J. Ba, \u201cAdam: A method for stochastic optimization,\u201d arXiv:1412.6980, 2014.\n\n[29] X. Glorot and Y. Bengio, \u201cUnderstanding the dif\ufb01culty of training deep feedforward neural networks,\u201d in\n\nAISTATS, 2010.\n\n[30] F. Chollet, \u201cKeras.\u201d https://github.com/fchollet/keras, 2015.\n\n[31] N. 
Kriegeskorte, \u201cRepresentational similarity analysis \u2013 connecting the branches of systems neuroscience,\u201d\n\nFrontiers in Systems Neuroscience, 2008.\n\n[32] S. Dieleman and B. Schrauwen, \u201cEnd-to-end learning for music audio,\u201d in ICASSP, 2014.\n\n[33] S. Dieleman and B. Schrauwen, \u201cMultiscale approaches to music audio feature learning,\u201d in ISMIR, 2013.\n\n[34] A. van den Oord, S. Dieleman, and B. Schrauwen, \u201cTransfer learning by supervised pre-training for\n\naudio-based music classi\ufb01cation,\u201d in ISMIR, 2014.\n\n[35] N. Kriegeskorte, R. Goebel, and P. Bandettini, \u201cInformation-based functional brain mapping,\u201d Proceedings\n\nof the National Academy of Sciences, vol. 103, pp. 3863\u20133868, feb 2006.\n\n9\n\n\f", "award": [], "sourceid": 1108, "authors": [{"given_name": "Umut", "family_name": "G\u00fc\u00e7l\u00fc", "institution": "Radboud University"}, {"given_name": "Jordy", "family_name": "Thielen", "institution": "Radboud University"}, {"given_name": "Michael", "family_name": "Hanke", "institution": "Otto-von-Guericke University Magdeburg"}, {"given_name": "Marcel", "family_name": "van Gerven", "institution": "Radboud University"}]}