{"title": "SoundNet: Learning Sound Representations from Unlabeled Video", "book": "Advances in Neural Information Processing Systems", "page_first": 892, "page_last": 900, "abstract": "We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.", "full_text": "SoundNet: Learning Sound\n\nRepresentations from Unlabeled Video\n\nYusuf Aytar\u2217\n\nMIT\n\nyusuf@csail.mit.edu\n\nCarl Vondrick\u2217\n\nMIT\n\nvondrick@mit.edu\n\ntorralba@mit.edu\n\nAntonio Torralba\n\nMIT\n\nAbstract\n\nWe learn rich natural sound representations by capitalizing on large amounts of\nunlabeled sound data collected in the wild. We leverage the natural synchronization\nbetween vision and sound to learn an acoustic representation using two-million\nunlabeled videos. Unlabeled video has the advantage that it can be economically\nacquired at massive scales, yet contains useful signals about natural sound. We\npropose a student-teacher training procedure which transfers discriminative visual\nknowledge from well established visual recognition models into the sound modality\nusing unlabeled video as a bridge. 
Our sound representation yields signi\ufb01cant\nperformance improvements over the state-of-the-art results on standard benchmarks\nfor acoustic scene/object classi\ufb01cation. Visualizations suggest some high-level\nsemantics automatically emerge in the sound network, even though it is trained\nwithout ground truth labels.\n\nIntroduction\n\n1\nThe \ufb01elds of object recognition, speech recognition, machine translation have been revolutionized by\nthe emergence of massive labeled datasets [31, 42, 10] and learned deep representations [17, 33, 10,\n35]. However, there has not yet been the same corresponding progress in natural sound understanding\ntasks. We attribute this partly to the lack of large labeled datasets of sound, which are often both\nexpensive and ambiguous to collect. We believe that large-scale sound data can also signi\ufb01cantly\nadvance natural sound understanding. In this paper, we leverage over one year of sounds collected\nin-the-wild to learn semantically rich sound representations.\nWe propose to scale up by capitalizing on the natural synchronization between vision and sound\nto learn an acoustic representation from unlabeled video. Unlabeled video has the advantage that\nit can be economically acquired at massive scales, yet contains useful signals about sound. Recent\nprogress in computer vision has enabled machines to recognize scenes and objects in images and\nvideos with good accuracy. We show how to transfer this discriminative visual knowledge into sound\nusing unlabeled video as a bridge.\nWe present a deep convolutional network that learns directly on raw audio waveforms, which is\ntrained by transferring knowledge from vision into sound. Although the network is trained with\nvisual supervision, the network has no dependence on vision during inference. In our experiments,\nwe show that the representation learned by our network obtains state-of-the-art accuracy on three\nstandard acoustic scene classi\ufb01cation datasets. 
Since we can leverage large amounts of unlabeled\nsound data, it is feasible to train deeper networks without signi\ufb01cant over\ufb01tting, and our experiments\nsuggest deeper models perform better. Visualizations of the representation suggest that the network is\nalso learning high-level detectors, such as recognizing bird chirps or crowds cheering, even though it\nis trained directly from audio without ground truth labels.\n\n\u2217contributed equally\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: SoundNet: We propose a deep convolutional architecture for natural sound recognition.\nWe train the network by transferring discriminative knowledge from visual recognition networks into\nsound networks. Our approach capitalizes on the synchronization of vision and sound in video.\n\nThe primary contribution of this paper is the development of a large-scale and semantically rich\nrepresentation for natural sound. We believe large-scale models of natural sounds can have a large\nimpact in many real-world applications, such as robotics and cross-modal understanding. The\nremainder of this paper describes our method and experiments in detail. We \ufb01rst review related work.\nIn section 2, we describe our unlabeled video dataset and in section 3 we present our network and\ntraining procedure. Finally in section 4 we conclude with experiments on standard benchmarks and\nshow several visualizations of the learned representation. Code, data, and models will be released.\n\n1.1 Related Work\nSound Recognition: Although large-scale audio understanding has been extensively studied in the\ncontext of music [5, 37] and speech recognition [10], we focus on understanding natural, in-the-wild\nsounds. 
Acoustic scene classification, classifying sound excerpts into existing acoustic scene/object\ncategories, is predominantly based on applying a variety of general classifiers (SVMs, GMMs, etc.)\nto manually crafted sound features (MFCC, spectrograms, etc.) [4, 29, 21, 30, 34, 32]. Even\nthough there are unsupervised [20] and supervised [27, 23, 6, 12] deep learning methods applied to\nsound classification, the models are limited by the amount of available labeled natural sound data.\nWe distinguish ourselves from the existing literature by training a deep fully convolutional network\non a large scale dataset (2M videos). This allows us to train much deeper networks. Another key\nadvantage of our approach is that we supervise our sound recognition network through semantically\nrich visual discriminative models [33, 17] which proved their robustness on a variety of large scale\nobject/scene categorization challenges [31, 42]. [26] also investigates the relation between vision and\nsound modalities, but focuses on producing sound from image sequences. Concurrent work [11] also\nexplores video as a form of weak labeling for audio event classification.\nTransfer Learning: Transfer learning is widely studied within computer vision, such as transferring\nknowledge for object detection [1, 2] and segmentation [18]; however, transferring from vision to\nother modalities has only recently become possible with the emergence of high performance visual models\n[33, 17]. Our method builds upon teacher-student models [3, 9] and dark knowledge transfer [13].\nIn [3, 13] the basic idea is to compress (i.e. transfer) discriminative knowledge from a well-trained\ncomplex model to a simpler model without losing considerable accuracy. In [3] and [13] both the\nteacher and the student are in the same modality, whereas in our approach the teacher operates on\nvision to train the student model in sound. 
[9] also transfers visual supervision into depth models.\nCross-Modal Learning and Unlabeled Video: Our approach is broadly inspired by efforts to\nmodel cross-modal relations [24, 14, 7, 26] and works that leverage large amounts of unlabeled video\n[25, 41, 8, 40, 39]. In this work, we leverage the natural synchronization between vision and sound\nto learn a deep representation of natural sounds without ground truth sound labels.\n\n[Figure 2 panel labels: Beach, Classroom, Construction, River, Club, Forest, Hockey, Playroom, Engine, Vegetation]\n\nFigure 2: Unlabeled Video Dataset: Sample frames from our 2+ million video dataset. For visualization\npurposes, each frame is automatically categorized by object and scene vision networks.\n\n2 Large Unlabeled Video Dataset\nWe seek to learn a representation for sound by leveraging massive amounts of unlabeled videos.\nWhile there are a variety of sources available on the web (e.g., YouTube, Flickr), we chose to use\nvideos from Flickr because they are natural, not professionally edited, short clips that capture various\nsounds in everyday, in-the-wild situations. We downloaded over two million videos from Flickr by\nquerying for popular tags [36] and dictionary words, which resulted in over one year of continuous\nnatural sound and video, which we use for training. The length of each video varies from a few\nseconds to several minutes. We show a small sample of frames from the video dataset in Figure 2.\nWe wish to process sound waves in the raw. 
Hence, the only post-processing we did on the videos\nwas to convert sound to MP3s, reduce the sampling rate to 22 kHz, and convert to single channel\naudio. Although this slightly degrades the quality of the sound, it allows us to more ef\ufb01ciently operate\non large datasets. We also scaled the waveform to be in the range [\u2212256, 256]. We did not need to\nsubtract the mean because it was naturally near zero already.\n\n3 Learning Sound Representations\n3.1 Deep Convolutional Sound Network\n\nConvolutional Network: We present a deep convolutional architecture for learning sound represen-\ntations. We propose to use a series of one-dimensional convolutions followed by nonlinearities (i.e.\nReLU layer) in order to process sound. Convolutional networks are well-suited for audio signals for\na couple of reasons. Firstly, like images [19], we desire our network to be invariant to translations, a\nproperty that reduces the number of parameters we need to learn and increases ef\ufb01ciency. Secondly,\nconvolutional networks allow us to stack layers, which enables us to detect higher-level concepts\nthrough a series of lower-level detectors.\nVariable Length Input/Output: Since sound can vary in temporal length, we desire our network to\nhandle variable-length inputs. To do this, we use a fully convolutional network. As convolutional\nlayers are invariant to location, we can convolve each layer depending on the length of the input.\nConsequently, in our architecture, we only use convolutional and pooling layers. Since the represen-\ntation adapts to the input length, we must design the output layers to work with variable length inputs\nas well. While we could have used a global pooling strategy [37] to down-sample variable length\ninputs to a \ufb01xed dimensional vector, such a strategy may unnecessarily discard information useful\nfor high-level representations. 
Since we ultimately aim to train this network with video, which is\nalso variable length, we instead use a convolutional output layer to produce an output over multiple\ntimesteps in video. This strategy is similar to a spatial loss in images [22], but instead temporally.\nNetwork Depth: Since we will use a large amount of video to train, it is feasible to use deep\narchitectures without significant over-fitting. We experiment with both five-layer and eight-layer networks.\n\nLayer         conv1    pool1   conv2   pool2  conv3  conv4  conv5  pool5  conv6  conv7  conv8\nDim.          220,050  27,506  13,782  1,722  862    432    217    54     28     15     4\n# of Filters  16       16      32      32     64     128    256    256    512    1024   1401\nFilter Size   64       8       32      8      16     8      4      4      4      4      8\nStride        2        1       2       1      2      2      2      1      2      2      2\nPadding       32       0       16      0      8      4      2      0      2      2      0\n\nTable 1: SoundNet (8 Layer): The configuration of the layers for the 8-layer SoundNet.\n\nLayer         conv1    pool1   conv2   pool2  conv3  pool3  conv4  conv5\nDim.          220,050  27,506  13,782  1,722  862    432    217    54\n# of Filters  32       32      64      64     128    128    256    1401\nFilter Size   64       8       32      8      16     8      8      16\nStride        2        8       2       8      2      8      2      12\nPadding       32       0       16      0      8      0      4      4\n\nTable 2: SoundNet (5 Layer): The configuration for the 5-layer SoundNet.\n\nWe visualize the eight-layer network architecture in Figure 1, which consists of 8 convolutional layers\nand 3 max-pooling layers. We show the layer configuration in Table 1 and Table 2.\n\n3.2 Visual Transfer into Sound\nThe main idea in this paper is to leverage the natural synchronization between vision and sound in\nunlabeled video in order to learn a representation for sound. We model the learning problem from a\nstudent-teacher perspective. 
In our case, state-of-the-art networks for vision will teach our network\nfor sound to recognize scenes and objects.\nLet x_i \u2208 R^D be a waveform and y_i \u2208 R^{3\u00d7T\u00d7W\u00d7H} be its corresponding video for 1 \u2264 i \u2264 N, where\nW, H, T are width, height and number of sampled frames in the video, respectively. During learning,\nwe aim to use the posterior probabilities from a teacher vision network g_k(y_i) in order to train our\nstudent network f_k(x_i) to recognize concepts given sound. As we wish to transfer knowledge from\nboth object and scene networks, k enumerates the concepts we are transferring. During learning, we\noptimize\n\nmin_\u03b8 \u2211_{k=1}^{K} \u2211_{i=1}^{N} D_KL(g_k(y_i) || f_k(x_i; \u03b8)),\n\nwhere D_KL(P || Q) = \u2211_j P_j log(P_j / Q_j) is the KL-divergence. While there are a variety of distance metrics we could have used, we chose KL-divergence\nbecause the outputs from the vision network g_k can be interpreted as a distribution of categories. As\nKL-divergence is differentiable, we optimize it using back-propagation [19] and stochastic gradient\ndescent. We transfer from both scene and object visual networks (K = 2).\n\n3.3 Sound Classification\nAlthough we train SoundNet to classify visual categories, the categories we wish to recognize may not\nappear in visual models (e.g., sneezing). Consequently, we use a different strategy to attach semantic\nmeaning to sounds. We ignore the output layer of our network and use the internal representation as\nfeatures for training classifiers, using a small amount of labeled sound data for the concepts of interest.\nWe pick a layer in the network to use as features and train a linear SVM. For multi-class classification,\nwe use a one-vs-all strategy. We perform cross-validation to pick the margin regularization\nhyperparameter. 
For robustness, we follow a standard data augmentation procedure where each training\nsample is split into overlapping fixed length sound excerpts, which we compute features on and use\nfor training. During inference, we average predictions across all windows.\n\n3.4 Implementation\nOur approach is implemented in Torch7. We use the Adam [16] optimizer and a fixed learning rate of\n0.001 and momentum term of 0.9 throughout our experiments. We experimented with several batch\nsizes, and found 64 to produce good results. We initialized all the weights to zero mean Gaussian\nnoise with a standard deviation of 0.01. After every convolution, we use batch normalization [15]\nand rectified linear activation units [17]. We train the network for 100,000 iterations. Optimization\ntypically took 1 day on a GPU.\n\n4 Experiments\nExperimental Setup: We split the unlabeled video dataset into a training set and a held-out validation\nset. We use 2,000,000 videos for training, and the remaining 140,000 videos for validation. After\ntraining the network, we use the hidden representation as a feature extractor for learning on smaller,\n\nMethod         Accuracy\nRG [29]        69%\nLTT [21]       72%\nRNH [30]       77%\nEnsemble [34]  78%\nSoundNet       88%\n\nTable 3: Acoustic Scene Classification on DCASE: We evaluate classification\naccuracy on the DCASE dataset. By leveraging large amounts of unlabeled\nvideo, SoundNet generally outperforms hand-crafted features by 10%.\n\n                           Accuracy on\nMethod                     ESC-50  ESC-10\nSVM-MFCC [28]              39.6%   67.5%\nConvolutional Autoencoder  39.9%   74.3%\nRandom Forest [28]         44.3%   72.7%\nPiczak ConvNet [27]        64.5%   81.0%\nSoundNet                   74.2%   92.2%\nHuman Performance [28]     81.3%   95.7%\n\nTable 4: Acoustic Scene Classification on ESC-50\nand ESC-10: We evaluate classification accuracy on\nthe ESC datasets. 
Results suggest that deep convolutional\nsound networks trained with visual supervision\non unlabeled data outperform baselines.\n\nlabeled sound only datasets. We extract features for a given layer, and train an SVM on the task of\ninterest. For training the SVM, we use the standard training/test splits of the datasets. We report\nclassification accuracy.\nBaselines: In addition to published baselines on standard datasets, we explored an additional baseline\ntrained on our unlabeled videos. We experimented using a convolutional autoencoder for sound,\ntrained over our video dataset. We use an autoencoder with 4 encoder layers and 4 decoder layers. For\nthe encoder layers, we used the same first four convolutional layers as SoundNet. For the decoders,\nwe used fractionally strided convolutional layers (in order to upsample instead of downsample).\nNote that we experimented with deeper autoencoders, but they performed worse. We used mean\nsquared error for the reconstruction loss, and trained the autoencoders for several days.\n\n4.1 Acoustic Scene Classification\nWe evaluate the SoundNet representation for acoustic scene classification. The aim in this task is to\ncategorize sound clips into one of the many acoustic scene categories. We use three standard, publicly\navailable datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28].\nDCASE [34]: One of the tasks in the Detection and Classification of Acoustic Scenes and Events\nChallenge (DCASE) [34] is to recognize scenes from natural sounds. In the challenge, there are 10\nacoustic scene categories, 10 training examples per category, and 100 held-out testing examples.\nEach example is a 30-second audio recording. The task is to categorize natural sounds into the\n10 existing acoustic scene categories. 
Multi-class classification accuracy is used as the performance metric.\nESC-50 and ESC-10 [28]: The ESC-50 dataset is a collection of 2000 short (5 seconds) environmental\nsound recordings in 50 equally balanced categories selected from 5 major groups (animals, natural\nsoundscapes, human non-speech sounds, interior/domestic sounds, and exterior/urban noises). Each\ncategory has 40 samples. The data is prearranged into 5 folds and the accuracy results are reported\nas the mean of 5 leave-one-fold-out evaluations. The performance of untrained human participants on\nthis dataset is 81.3% [28]. ESC-10 is a subset of ESC-50 which consists of 10 classes (dog bark, rain,\nsea waves, baby cry, clock tick, person sneeze, helicopter, chainsaw, rooster, and fire crackling). The\nhuman performance on this dataset is 95.7%.\nWe have two major evaluations in this section: (a) comparison with the existing state-of-the-art\nresults, (b) diagnostic performance evaluation of inner layers of SoundNet as generic features\nfor this task. In DCASE we used 5 second excerpts, and in ESC datasets we used 1 second windows.\nIn both evaluations a multi-class SVM (multiple one-vs-all classifiers) is trained over extracted\n\nFigure 3: SoundNet confusions on ESC-50\n\nComparison of              SoundNet Model            ESC-50  ESC-10\nLoss                       8 Layer, \u21132 Loss          47.8%   81.5%\nLoss                       8 Layer, KL Loss          72.9%   92.2%\nTeacher Net                8 Layer, ImageNet Only    69.5%   89.8%\nTeacher Net                8 Layer, Places Only      71.1%   89.5%\nTeacher Net                8 Layer, Both             72.9%   92.2%\nDepth and Visual Transfer  5 Layer, Scratch Init     65.0%   82.3%\nDepth and Visual Transfer  8 Layer, Scratch Init     51.1%   75.5%\nDepth and Visual Transfer  5 Layer, Unlabeled Video  66.1%   86.8%\nDepth and Visual Transfer  8 Layer, Unlabeled Video  72.9%   92.2%\n\nTable 5: Ablation Analysis: We break down accuracy (on ESC-50 and ESC-10) of various configurations using pool5 from SoundNet trained\nwith VGG. 
Results suggest that deeper convolutional sound networks trained with visual supervision on unlabeled data\nhelp recognition.\n\nDataset     Model             conv4  conv5  pool5  conv6  conv7  conv8\nDCASE [34]  8 Layer, AlexNet  84%    85%    84%    83%    78%    68%\nDCASE [34]  8 Layer, VGG      77%    88%    88%    87%    84%    74%\nESC50 [28]  8 Layer, AlexNet  66.0%  71.2%  74.2%  74%    63.8%  45.7%\nESC50 [28]  8 Layer, VGG      66.0%  69.3%  72.9%  73.3%  59.8%  43.7%\n\nTable 6: Which layer and teacher network give better features? The performance comparison\nof extracting features at different SoundNet layers on acoustic scene/object classification tasks.\n\nSoundNet features. The same data augmentation procedure is also applied during testing and the mean\nscore of all sound excerpts is used as the final score of a test recording for any particular category.\nComparison to State-of-the-Art: Tables 3 and 4 compare recognition performance of SoundNet\nfeatures versus previous state-of-the-art features on three datasets. In all cases SoundNet features\noutperformed the existing results by around 10%. Interestingly, SoundNet features approach human\nperformance on the ESC-10 dataset; however, we stress that this dataset may be easy. We report the\nconfusion matrix across all folds on ESC-50 in Figure 3. The results suggest our approach obtains\nvery good performance on categories such as toilet flush (97% accuracy) or door knocks (95%\naccuracy). Common confusions are laughing confused as hens, footsteps confused as door knocks,\nand insects confused as washing machines.\n\n4.2 Ablation Analysis\n\nTo better understand our approach, we perform an ablation analysis in Table 5 and Table 6.\nComparison of Loss and Teacher Net (Table 5): We tried training with different subsets of target\ncategories. In general, performance improves with increasing visual supervision. As\nexpected, our results suggest that using both ImageNet and Places networks as supervision performs\nbetter than a single one. 
This indicates that progress in sound understanding may be furthered by\nbuilding stronger vision models. We also experimented with using (cid:96)2 loss on the target outputs\ninstead of KL loss, which performed signi\ufb01cantly worse.\nComparison of Network Depth (Table 5): We quanti\ufb01ed the impact of network depth. We use \ufb01ve\nlayer version of SoundNet (instead of the full eight) as a feature extractor instead. The \ufb01ve-layer\nSoundNet architecture performed 8% worse than the eight-layer architecture, suggesting depth is\nhelpful for sound understanding. Interestingly, the \ufb01ve-layer network still generally outperforms\nprevious state-of-the-art baselines, but the margin is less. We hypothesize even deeper networks may\nperform better, which can be trained without signi\ufb01cant over-\ufb01tting by leveraging large amounts of\nunlabeled video.\nComparison of Supervision (Table 5): We also experimented with training the network without\nvideo by using only the labeled target training set, which is relatively small (thousands of examples).\nWe simply change the network to output the class probabilities, and train it from random initialization\nwith a cross entropy loss. Hence, the only change is that this baseline does not use any unlabeled\nvideo, allowing us to quantify the contribution of unlabeled video. The \ufb01ve layer SoundNet achieves\nslightly better results than [27] which is also a convolutional network trained with same data but with\na different architecture, suggesting our \ufb01ve layer architecture is similar. Increasing the depth from\n\ufb01ve layers to eight layers decreases the performance from 65% to 51%, probably because it over\ufb01ts\nto the small training set. However, when trained with visual transfer from unlabeled video, the eight\nlayer SoundNet achieves a signi\ufb01cant gain of around 20% compared to the \ufb01ve layer version. 
This suggests that unlabeled video is a powerful signal for sound understanding, and it can be acquired at\nlarge enough scales to support training high-capacity deep networks.\n\n(a) t-SNE embedding of visual features\n(b) t-SNE embedding of sound features\n\nFigure 4: t-SNE embeddings using visual features and sound features (SoundNet conv7). The visual\nfeatures are concatenated fc7 features from the VGG networks for ImageNet and Places2. Note that\nt-SNE embeddings do not use the class labels. Labels are only used during final visualization.\n\nFeature         sound  vision  vision+sound\n8 Layer, conv7  32.4%  49.4%   51.4%\n8 Layer, conv8  32.3%  49.4%   50.5%\n\nTable 7: Multi-Modal Recognition: We report classification accuracy on \u223c4K labeled test videos over 44 categories.\n\nComparison of Layer and Teacher Network (Table 6): We analyze the discriminative performance\nof each SoundNet layer. Generally, features from the pool5 layer give the best performance. We\nalso compared different teacher networks for visual supervision (either VGGNet or AlexNet). The\nresults are inconclusive on which teacher network to use: VGG is a better teacher network for DCASE\nwhile AlexNet is a better teacher network for ESC50.\n\n4.3 Multi-Modal Recognition\nIn order to compare sound features with visual features on scene/object categorization, we annotated\nan additional 9,478 videos (vision+sound) which had not been seen by the trained networks. This new\ndataset consists of 44 categories from 6 major groups of concepts (i.e. urban, nature, work/home,\nmusic/entertainment, sports, and vehicles). It is annotated by Amazon Mechanical Turk workers. The\nfrequency of categories depends on natural occurrences on the web, hence the dataset is unbalanced.\nVision vs. Sound Embeddings: In order to show the semantic relevance of the features, we\nperformed a two-dimensional t-SNE [38] embedding and visualized our dataset in Figure 4. 
The visual\nfeatures are concatenated fc7 features of the two VGG networks trained using ImageNet and Places2\ndatasets. We computed the visual features from uniformly selected 4 frames for each video and\ncomputed the mean feature as the \ufb01nal visual representation. The sound features are the conv7\nfeatures extracted using SoundNet trained with VGG supervision. This visualizations suggests that\nsound features alone also contain considerable amount of semantic information.\nObject and Scene Classi\ufb01cation: We also performed a quantitative comparison between sound\nfeatures and visual features. We used 60% of our dataset for training and the rest for the testing.\nThe chance level of the task is 2.2% and choosing always the most common category (i.e. music\nperformance) yields 14% accuracy. Similar to acoustic scene classi\ufb01cation methods, we trained a\nmulti-class SVM over both sound and visual features individually and then jointly. The results are\ndisplayed in Table 7. Visual features alone obtained an accuracy of 49.4%. The SoundNet features\nobtained 32.4% accuracy. This suggests that even though sound is not as informative as vision, it still\ncontains considerable amount of discriminative information. Furthermore, sound and vision together\nresulted in a modest improvement of 2% over vision only models.\n\n4.4 Visualizations\nIn order to have a better insight on what network learned, we visualize its representation. Figure 5\ndisplays the \ufb01rst 16 convolutional \ufb01lters applied to the raw input audio. 
The learned \ufb01lters are diverse,\nincluding low and high frequencies, wavelet-like patterns, increasing and decreasing amplitude \ufb01lters.\nWe also visualize some of the hidden units in the last hidden layer (conv7) of our sound representation\n\n7\n\n\fFigure 5: Learned \ufb01lters in conv1: We visualize the \ufb01lters for raw audio in the \ufb01rst layer of the\ndeep convolutional network.\n\nBaby Talk\n\nBubbles\n\nCheering\n\nBird Chirps\n\nFigure 6: What emerges in sound hidden units? We visualize some of the hidden units in the last\nhidden layer of our sound representation by \ufb01nding inputs that maximally activate a hidden unit.\nAbove, we illustrate what these units capture by showing the corresponding video frames. No vision\nis used in this experiment; we only show frames for visualization purposes only.\n\nby \ufb01nding inputs that maximally activate a hidden unit. These visualization are displayed on Figure\n6. Note that visual frames are not used during computation of activations; they are only included in\nthe \ufb01gure for visualization purposes.\n\n5 Conclusion\nWe propose to train deep sound networks (SoundNet) by transferring knowledge from established\nvision networks and large amounts of unlabeled video. The synchronous nature of videos (sound +\nvision) allow us to perform such a transfer which resulted in semantically rich audio representations\nfor natural sounds. Our results show that transfer with unlabeled video is a powerful paradigm for\nlearning sound representations. All of our experiments suggest that one may obtain better performance\nsimply by downloading more videos, creating deeper networks, and leveraging richer vision models.\nAcknowledgements: We thank MIT TIG, especially Garrett Wollman, for helping store 26 TB of\nvideo. We are grateful for the GPUs donated by NVidia. This work was supported by NSF grant\n#1524817 to AT and the Google PhD fellowship to CV.\n\nReferences\n[1] Yusuf Aytar and Andrew Zisserman. 
Tabula rasa: Model transfer for object category detection. In ICCV,\n\n[2] Yusuf Aytar and Andrew Zisserman. Part level transfer regularization for enhancing exemplar svms. CVIU,\n\n[3] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014.\n[4] Daniele Barchiesi, Dimitrios Giannoulis, Dan Stowell, and Mark D Plumbley. Acoustic scene classi\ufb01cation:\n\nClassifying environments from the sounds they produce. SPM, 2015.\n\n[5] Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In\n\n2011.\n\n2015.\n\nISMIR, 2011.\n\n[6] Emre Cakir, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen. Polyphonic sound event detection\n\nusing multi label deep neural networks. In IJCNN, 2015.\n\n[7] Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Learning aligned\n\ncross-modal representations from weakly aligned data. In CVPR, 2016.\n\n[8] Chao-Yeh Chen and Kristen Grauman. Watching unlabeled video helps learn new human actions from\n\n[9] Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. arXiv\n\nvery few labeled snapshots. In CVPR, 2013.\n\npreprint arXiv:1507.00448, 2015.\n\n[10] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger,\nSanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech\nrecognition. arXiv preprint arXiv:1412.5567, 2014.\n\n[11] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore,\nManoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio\nclassi\ufb01cation. arXiv, 2016.\n\n8\n\n\f[12] Lars Hertel, Huy Phan, and Alfred Mertins. Comparing time and frequency domain for audio event\n\nrecognition using deep learning. arXiv, 2016.\n\n[13] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 
arXiv, 2015.\n[14] Jing Huang and Brian Kingsbury. Audio-visual deep learning for noise robust speech recognition. In\n\n[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing\n\n[16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\n[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep convolutional\n\n[18] Daniel Kuettel and Vittorio Ferrari. Figure-ground segmentation by transferring window masks. In CVPR,\n\n[19] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to\n\nICASSP, 2013.\n\ninternal covariate shift. arXiv, 2015.\n\narXiv:1412.6980, 2014.\n\nneural networks. In NIPS, 2012.\n\n2012.\n\ndocument recognition. IEEE, 1998.\n\n[20] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng. Unsupervised feature learning for audio\n\nclassi\ufb01cation using convolutional deep belief networks. In NIPS, 2009.\n\n[21] David Li, Jason Tam, and Derek Toub. Auditory scene classi\ufb01cation using machine learning techniques.\n\n[22] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmenta-\n\nAASP Challenge, 2013.\n\ntion. In CVPR, 2015.\n\nusing deep neural networks. ASL, 2015.\n\ndeep learning. In ICML, 2011.\n\nmicro-videos. arXiv, 2016.\n\n[23] Ian McLoughlin, Haomin Zhang, Zhipeng Xie, Yan Song, and Wei Xiao. Robust sound event classi\ufb01cation\n\n[24] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal\n\n[25] Phuc Xuan Nguyen, Gregory Rogez, Charless Fowlkes, and Deva Ramamnan. The open world of\n\n[26] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T\n\nFreeman. Visually indicated sounds. arXiv preprint arXiv:1512.08512, 2015.\n\n[27] Karol J Piczak. Environmental sound classi\ufb01cation with convolutional neural networks. 
In MLSP, 2015.\n[28] Karol J Piczak. ESC: Dataset for environmental sound classification. In ACM Multimedia, 2015.\n[29] Alain Rakotomamonjy and Gilles Gasso. Histogram of gradients of time-frequency representations for audio scene classification. TASLP, 2015.\n[30] Guido Roma, Waldo Nogueira, and Perfecto Herrera. Recurrence quantification analysis features for environmental sound recognition. In WASPAA, 2013.\n[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.\n[32] Justin Salamon and Juan Pablo Bello. Unsupervised feature learning for urban sound classification. In ICASSP, 2015.\n[33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n[34] Dan Stowell, Dimitrios Giannoulis, Emmanouil Benetos, Mathieu Lagrange, and Mark D Plumbley. Detection and classification of acoustic scenes and events. TM, 2015.\n[35] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.\n[36] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.\n[37] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In NIPS, 2013.\n[38] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.\n[39] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.\n[40] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS, 2016.\n[41] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In ICCV, 2015.\n[42] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using Places database. In NIPS, 2014.\n", "award": [], "sourceid": 542, "authors": [{"given_name": "Yusuf", "family_name": "Aytar", "institution": "MIT"}, {"given_name": "Carl", "family_name": "Vondrick", "institution": "MIT"}, {"given_name": "Antonio", "family_name": "Torralba", "institution": "MIT"}]}