{"title": "Learning Deep Features for Scene Recognition using Places Database", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 495, "abstract": "Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.", "full_text": "Learning Deep Features for Scene Recognition using Places Database\n\nBolei Zhou (1), Agata Lapedriza (1,3), Jianxiong Xiao (2), Antonio Torralba (1), and Aude Oliva (1)\n\n(1) Massachusetts Institute of Technology\n\n(2) Princeton University\n\n(3) Universitat Oberta de Catalunya\n\nAbstract\n\nScene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. 
Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.\n\n1 Introduction\n\nUnderstanding the world in a single glance is one of the most accomplished feats of the human brain: it takes only a few tens of milliseconds to recognize the category of an object or environment, emphasizing an important role of feedforward processing in visual recognition. One of the mechanisms subtending efficient human visual recognition is our capacity to learn and remember a diverse set of places and exemplars [11]; by sampling the world several times per second, our neural architecture constantly registers new inputs even for a very short time, reaching an exposure to millions of natural images within just a year. 
How much would an artificial system have to learn before reaching the scene recognition abilities of a human being?\nBesides the exposure to a dense and rich variety of natural images, one important property of the primate brain is its hierarchical organization in layers of increasing processing complexity, an architecture that has inspired Convolutional Neural Networks or CNNs [2, 14]. These architectures together with recent large databases (e.g., ImageNet [3]) have obtained astonishing performance on object classification tasks [12, 5, 20]. However, the baseline performance reached by these networks on scene classification tasks is within the range of performance based on hand-designed features and sophisticated classifiers [24, 21, 4]. Here, we show that one of the reasons for this discrepancy is that the higher-level features learned by object-centric versus scene-centric CNNs are different: iconic images of objects do not contain the richness and diversity of visual information that pictures of scenes and environments provide for learning to recognize them.\nHere we introduce Places, a scene-centric image dataset 60 times larger than the SUN database [24]. With this database and a standard CNN architecture, we establish new baselines of accuracies on various scene datasets (Scene15 [17, 13], MIT Indoor67 [19], SUN database [24], and SUN Attribute Database [18]), significantly outperforming the results obtained by the deep features from the same network architecture trained with ImageNet (see footnote 1).\nThe paper is organized as follows: in Section 2 we introduce the Places database and describe the collection procedure. In Section 3 we compare Places with the other two large image datasets: SUN [24] and ImageNet [3]. We perform experiments on Amazon Mechanical Turk (AMT) to compare these 3 datasets in terms of density and diversity. 
In Section 4 we show new scene classification performance when training deep features from millions of labeled scene images. Finally, we visualize the units' responses at different layers of the CNNs, demonstrating that an object-centric network (using ImageNet [12]) and a scene-centric network (using Places) learn different features.\n\n2 Places Database\n\nThe first benchmark for scene classification was the Scene15 database [13] based on [17]. This dataset contains only 15 scene categories with a few hundred images per class, and current classifiers are saturating it, nearing human performance at 95%. The MIT Indoor67 database [19] has 67 categories of indoor places. The SUN database [24] was introduced to provide a wide coverage of scene categories. It is composed of 397 categories containing more than 100 images per category.\nDespite those efforts, all these scene-centric datasets are small in comparison with current object datasets such as ImageNet (note that ImageNet also contains scene categories but in a very small proportion, as shown in Fig. 2). Complementary to ImageNet (mostly object-centric), we present here a scene-centric database that we term the Places database. As of now, Places contains more than 7 million images from 476 place categories, making it the largest image database of scenes and places so far and the first scene-centric database competitive enough to train algorithms that require huge amounts of data, such as CNNs.\n\n2.1 Building the Places Database\n\nSince the SUN database [24] has a rich scene taxonomy, the Places database has inherited the same list of scene categories. To generate the image URL queries, 696 common adjectives (messy, spare, sunny, desolate, etc.), manually selected from a list of popular adjectives in English, are combined with each scene category name and are sent to three image search engines (Google Images, Bing Images, and Flickr). 
Adding adjectives to the queries allows us to download a larger number of images than what is available in ImageNet and to increase the diversity of visual appearances. We then remove duplicated URLs and download the raw images with unique URLs. To date, more than 40 million images have been downloaded. Only color images of 200\u00d7200 pixels or larger are kept. PCA-based duplicate removal is conducted within each scene category in the Places database and across the same scene category in the SUN database, which ensures that Places and SUN do not contain the same images, allowing us to combine the two datasets.\nThe images that survive this initial selection are sent to Amazon Mechanical Turk for two rounds of individual image annotation. For a given category name, its definition as in [24] is shown at the top of a screen, with a question such as: is this a living room scene? A single image at a time is shown centered in a large window, and workers are asked to press a Yes or No key. For the first round of labeling, the default answer is set to No, requiring the worker to actively pick out the positive images. The positive images resulting from the first round of annotation are then sent for a second round of annotation, in which the default answer is set to Yes (to pick out the remaining negative images). In each HIT (one assignment for each worker), 750 downloaded images are included for annotation, and an additional 30 positive samples and 30 negative samples with ground truth from the SUN database are also randomly injected as controls. Valid HITs kept for further analyses require an accuracy of 90% or higher on these control images. After the two rounds of annotation, and as of the publication of this paper, 7,076,580 images from 476 scene categories are included in the Places database.\nFig. 
1 shows image samples obtained with some of the adjectives used in the queries.\n\nFootnote 1: The database and pre-trained networks are available at http://places.csail.mit.edu\n\nFigure 1: Image samples from the scene categories grouped by their queried adjectives.\n\nFigure 2: Comparison of the number of images per scene category in three databases.\n\nWe made 2 subsets of Places that will be used across the paper as benchmarks. The first one is Places 205, with the 205 categories with more than 5000 images. Fig. 2 compares the number of images in Places 205 with ImageNet and SUN. Note that ImageNet only has 128 of the 205 categories, while SUN contains all of them (we will call this set SUN 205, and it has at least 50 images per category). The second subset of Places used in this paper is Places 88. It contains the 88 categories shared with ImageNet for which there are at least 1000 images in ImageNet. We call the corresponding subsets SUN 88 and ImageNet 88.\n\n3 Comparing Scene-centric Databases\n\nDespite the importance of benchmarks and training datasets in computer vision, comparing datasets is still an open problem. Even datasets covering the same visual classes have notable differences, providing different generalization performance when used to train a classifier [23]. Beyond the number of images and categories, there are aspects that are important but difficult to quantify, like the variability in camera poses, in decoration styles or in the objects that appear in the scene.\nAlthough the quality of a database will be task dependent, it is reasonable to assume that a good database should be dense (with a high degree of data concentration), and diverse (it should include a high variability of appearances and viewpoints). Both quantities, density and diversity, are hard to estimate in image sets, as they assume some notion of similarity between images which, in general, is not well defined. 
Two images of scenes can be considered similar if they contain similar objects, and the objects are in similar spatial configurations and pose, and have similar decoration styles. However, this notion is loose and subjective so it is hard to answer the question: are these two images similar? For this reason, we define relative measures for comparing datasets in terms of density and diversity that only require ranking similarities. In this section we will compare the densities and diversities of SUN, ImageNet and Places using these relative measures.\n\n[Figure 1 panel labels: stylish kitchen, messy kitchen, wooded kitchen; sunny coast, rocky coast, misty coast; teenage bedroom, romantic bedroom, spare bedroom; wintering forest path, greener forest path, darkest forest path.]\n\n[Figure 2 plot: number of images per scene category (logarithmic axis, 100 to 100,000) for Places, ImageNet and SUN; per-category axis labels not reproduced.]\n\n3.1 Relative Density and Diversity\n\nDensity is a measure of data concentration. We assume that, in an image set, high density is equivalent to the fact that images have, in general, similar neighbors. Given two databases A and B, relative density aims to measure which one of the two sets has the most similar nearest neighbors. Let $a_1$ be a random image from set A and $b_1$ from set B and let us take their respective nearest neighbors in each set, $a_2$ from A and $b_2$ from B. If A is denser than B, then it would be more likely that $a_1$ and $a_2$ are closer to each other than $b_1$ and $b_2$. From this idea we define the relative density as $\\mathrm{Den}_B(A) = p(d(a_1, a_2) < d(b_1, b_2))$, where $d(a_1, a_2)$ is a distance measure between two images (small distance implies high similarity). With this definition of relative density we have that A is denser than B if, and only if, $\\mathrm{Den}_B(A) > \\mathrm{Den}_A(B)$. 
This definition can be extended to an arbitrary number of datasets, $A_1, \\ldots, A_N$:\n\n$$\\mathrm{Den}_{A_2,\\ldots,A_N}(A_1) = p\\left(d(a_{11}, a_{12}) < \\min_{i=2:N} d(a_{i1}, a_{i2})\\right) \\qquad (1)$$\n\nwhere $a_{i1} \\in A_i$ are randomly selected and $a_{i2} \\in A_i$ are near neighbors of their respective $a_{i1}$.\nThe quality of a dataset cannot be measured just by its density. Imagine, for instance, a dataset composed of 100,000 images all taken within the same bedroom. This dataset would have a very high density but a very low diversity as all the images would look very similar. An ideal dataset, expected to generalize well, should have high diversity as well.\nThere are several measures of diversity, most of them frequently used in biology to characterize the richness of an ecosystem (see [9] for a review). In this section, we will use a measure inspired by the Simpson index of diversity [22]. The Simpson index measures the probability that two random individuals from an ecosystem belong to the same species. It is a measure of how well distributed the individuals are across different species in an ecosystem and it is related to the entropy of the distribution. Extending this measure for evaluating the diversity of images within a category is non-trivial if there are no annotations of sub-categories. For this reason, we propose to measure relative diversity of image datasets A and B based on this idea: if set A is more diverse than set B, then two random images from set B are more likely to be visually similar than two random samples from A. Then, the diversity of A with respect to B can be defined as $\\mathrm{Div}_B(A) = 1 - p(d(a_1, a_2) < d(b_1, b_2))$, where $a_1, a_2 \\in A$ and $b_1, b_2 \\in B$ are randomly selected. With this definition of relative diversity we have that A is more diverse than B if, and only if, $\\mathrm{Div}_B(A) > \\mathrm{Div}_A(B)$. 
For an arbitrary number of datasets, $A_1, \\ldots, A_N$:\n\n$$\\mathrm{Div}_{A_2,\\ldots,A_N}(A_1) = 1 - p\\left(d(a_{11}, a_{12}) < \\min_{i=2:N} d(a_{i1}, a_{i2})\\right) \\qquad (2)$$\n\nwhere $a_{i1}, a_{i2} \\in A_i$ are randomly selected.\n\n3.2 Experimental Results\n\nWe measured the relative densities and diversities between SUN, ImageNet and Places using AMT. Both measures used the same experimental interface: workers were presented with different pairs of images and they had to select the pair that contained the most similar images. We observed that different annotators are consistent in deciding whether a pair of images is more similar than another pair of images.\nIn these experiments, the only difference when estimating density and diversity is how the pairs are generated. For the diversity experiment, the pairs are randomly sampled from each database. Each trial is composed of 4 pairs from each database, giving a total of 12 pairs to choose from. We used 4 pairs per database to increase the chances of finding a similar pair and to avoid users having to skip trials. AMT workers had to select the most similar pair on each trial. We ran 40 trials per category and two observers per trial, for the 88 categories in common between ImageNet, SUN and Places databases. Fig. 3a shows some examples of pairs from one of the diversity experiments. The pair selected by AMT workers as being more similar is highlighted.\nFor the density experiments, we selected pairs that were more likely to be visually similar. This would require first finding the true nearest neighbor of each image, which would be experimentally costly. Instead we used visual similarity as measured by the Euclidean distance between the Gist descriptors [17] of two images. Each pair of images was composed of one randomly selected image and its 5-th nearest neighbor using Gist (we ignored the first 4 neighbors to avoid near duplicates, which would give a wrong sense of high density). In this case we also show 12 pairs of images at each trial, but run 25 trials per category instead of 40 to avoid duplicate queries. Fig. 3b shows some examples of pairs from one of the density experiments, with the selected pair again highlighted. Notice that in the density experiment (where we computed neighbors) the pairs look, in general, more similar than in the diversity experiment.\n\nFigure 3: a) Examples of pairs for the diversity experiment. b) Examples of pairs for the density experiment. c) Scatter plot of relative diversity vs. relative density per each category and dataset.\n\nFigure 4: Cross dataset generalization of training on the 88 common scenes between Places, SUN and ImageNet then testing on the 88 common scenes from: a) SUN, b) ImageNet and c) Places database.\n\nFig. 3c shows a scatter plot of relative diversity vs. relative density for all the 88 categories and the three databases. The point of crossing between the two black lines indicates the point where all the results should fall if all the datasets were identical in terms of diversity and density. The figure also shows the average of the density and diversity over all categories for each dataset.\nIn terms of density, the three datasets are, on average, very similar. However, there is a larger variation in terms of diversity, showing Places to be the most diverse of the three datasets. The average relative diversity on each dataset is 0.83 for Places, 0.67 for ImageNet and 0.50 for SUN. In the experiment, users selected pairs from the SUN database to be the closest to each other 50% of the time, while the pairs from the Places database were judged to be the most similar only on 17% of the trials. 
The categories with the largest variation in diversity across the three datasets are playground, veranda and waiting room.\n\n3.3 Cross Dataset Generalization\n\nAs discussed in [23], training and testing across different datasets generally results in a drop of performance due to the dataset bias problem. In this case, the bias between datasets is due, among other factors, to the differences in the density and diversity between the three datasets. Fig. 4 shows the classification results obtained from the training and testing on different permutations of the 3 datasets. For these results we use the features extracted from a pre-trained ImageNet-CNN and a linear SVM. In all three cases training and testing on the same dataset provides the best performance for a fixed number of training examples. As the Places database is very large, it achieves the best performance on two of the test sets when all the training data is used. In the next section we will show that a CNN trained using the Places database achieves a significant improvement on scene-centered benchmarks in comparison with a network trained using ImageNet.\n\n[Figure 3 plot: panels a) and b) show image pairs labeled Places, ImageNet and SUN; panel c) plots Diversity against Density, both on axes from 0 to 1, for Places, ImageNet and SUN.]\n\n[Figure 4 plots: classification accuracy vs. number of training samples per category (logarithmic axis), legend values in brackets. a) Test on SUN 88: train on SUN 88 [63.3], ImageNet 88 [62.8], Places 88 [69.5]. b) Test on ImageNet Scene 88: train on ImageNet 88 [65.6], Places 88 [60.3], SUN 88 [49.2]. c) Test on Places 88: train on Places 88 [54.2], ImageNet 88 [44.6], SUN 88 [37.0].]\n\nTable 1: Classification accuracy on the test set of Places 205 and the test set of SUN 205.\n\nMethod | Places 205 | SUN 205\nPlaces-CNN | 50.0% | 66.2%\nImageNet CNN feature+SVM | 40.8% | 49.6%\n\n4 Training Neural Network for Scene Recognition and Deep Features\n\nDeep convolutional neural networks have obtained impressive classification performance on the ImageNet benchmark [12]. 
For the training of Places-CNN, we randomly select 2,448,873 images from 205 categories of Places (referred to as Places 205) as the train set, with minimum 5,000 and maximum 15,000 images per category. The validation set contains 100 images per category and the test set contains 200 images per category (a total of 41,000 images). Places-CNN is trained using the Caffe package on an NVIDIA Tesla K40 GPU. It took about 6 days to finish 300,000 iterations of training. The network architecture of Places-CNN is the same as the one used in the Caffe reference network [10]. The Caffe reference network, which is trained on 1.2 million images of ImageNet (ILSVRC 2012), has approximately the same architecture as the network proposed by [12]. We refer to the Caffe reference network as ImageNet-CNN in the following comparison experiments.\n\n4.1 Visualization of the Deep Features\n\nThrough the visualization of the responses of the units for various levels of network layers, we can have a better understanding of the differences between the ImageNet-CNN and Places-CNN given that they share the same architecture. Fig. 5 visualizes the learned representation of the units at the Conv 1, Pool 2, Pool 5, and FC 7 layers of the two networks. Whereas Conv 1 units can be directly visualized (they capture the oriented edges and opponent colors from both networks), we use the mean image method to visualize the units of the higher layers: we first combine the test set of ImageNet ILSVRC 2012 (100,000 images) and SUN397 (108,754 images) as the input for both networks; then we sort all these images based on the activation response of each unit at each layer; finally we average the top 100 images with the largest responses for each unit as a kind of receptive field (RF) visualization of each unit. To compare the units from the two networks, Fig. 5 displays mean images sorted by their first principal component. 
Despite the simplicity of the method, the units in both networks exhibit many differences starting from Pool 2. From Pool 2 to Pool 5 and FC 7, gradually the units in ImageNet-CNN have RFs that look like object-blobs, while units in Places-CNN have more RFs that look like landscapes with more spatial structures. These learned unit structures are closely related to the differences in the training data. In future work, it will be fascinating to relate the similarities and differences of the RFs at different layers of the object-centric network and scene-centric network with the known object-centered and scene-centered neural cortical pathways identified in the human brain (for a review, see [16]). In the next section we will show that these two networks (only differing in the training sets) yield very different performances on a variety of recognition benchmarks.\n\n4.2 Results on Places 205 and SUN 205\n\nAfter the Places-CNN is trained, we use the final layer output (Soft-max) of the network to classify images in the test set of Places 205 and SUN 205. The classification result is listed in Table 1. As a baseline comparison, we show the results of a linear SVM trained on ImageNet-CNN features of 5000 images per category in Places 205 and 50 images per category in SUN 205 respectively. Places-CNN performs much better. We further compute the performance of the Places-CNN in terms of the top-5 error rate (one test sample is counted as misclassified if the ground-truth label is not among the top 5 predicted labels of the model). The top-5 error rate for the test set of the Places 205 is 18.9%, while the top-5 error rate for the test set of SUN 205 is 8.1%.\n\n4.3 Generic Deep Features for Visual Recognition\n\nWe use the responses from the trained CNN as generic features for visual recognition tasks. 
Responses from the higher-level layers of CNN have proven to be effective generic features with state-of-the-art performance on various image datasets [5, 20]. Thus we evaluate performance of the deep features from the Places-CNN on the following scene and object benchmarks: SUN397 [24], MIT Indoor67 [19], Scene15 [13], SUN Attribute [18], Caltech101 [7], Caltech256 [8], Stanford Action40 [25], and UIUC Event8 [15]. All the experiments follow the standards in those papers (see footnote 2).\n\nFigure 5: Visualization of the units' receptive fields at different layers for the ImageNet-CNN and Places-CNN. Conv 1 contains 96 filters. The Pool 2 feature map is 13\u00d713\u00d7256; the Pool 5 feature map is 6\u00d76\u00d7256; the FC 7 feature map is 4096\u00d71. A subset of units at each layer is shown.\n\nTable 2: Classification accuracy/precision on scene-centric databases and object-centric databases for the Places-CNN feature and ImageNet-CNN feature. The classifier in all the experiments is a linear SVM with the same parameters for the two features.\n\nFeature | SUN397 | MIT Indoor67 | Scene15 | SUN Attribute\nPlaces-CNN feature | 54.32\u00b10.14 | 68.24 | 90.19\u00b10.34 | 91.29\nImageNet-CNN feature | 42.61\u00b10.16 | 56.79 | 84.23\u00b10.37 | 89.85\n\nFeature | Caltech101 | Caltech256 | Action40 | Event8\nPlaces-CNN feature | 65.18\u00b10.88 | 45.59\u00b10.31 | 42.86\u00b10.25 | 94.12\u00b10.99\nImageNet-CNN feature | 87.22\u00b10.92 | 67.23\u00b10.27 | 54.92\u00b10.33 | 94.42\u00b10.76\n\nAs a comparison, we evaluate the deep feature's performance from the ImageNet-CNN on those same benchmarks. Places-CNN and ImageNet-CNN have exactly the same network architecture, but they are trained on scene-centric data and object-centric data respectively. We use the deep features from the response of the Fully Connected Layer (FC) 7 of the CNNs, which is the final fully connected layer before producing the class predictions. 
There is only a minor difference between the feature of FC 7 and the feature of FC 6 layer [5]. The deep feature for each image is a 4096-dimensional vector.\nTable 2 summarizes the classification accuracy on various datasets for the ImageNet-CNN feature and the Places-CNN feature. Fig. 6 plots the classification accuracy for different visual features on SUN397 database and SUN Attribute dataset. The classifier is a linear SVM with the same default parameters for the two deep features (C=1) [6]. The Places-CNN feature shows impressive performance on scene classification benchmarks, outperforming the current state-of-the-art methods for SUN397 (47.20% [21]) and for MIT Indoor67 (66.87% [4]). On the other hand, the ImageNet-CNN feature shows better performance on object-related databases. Importantly, our comparison shows that Places-CNN and ImageNet-CNN have complementary strengths on scene-centric tasks and object-centric tasks, as expected from the benchmark datasets used to train these networks.\n\nFootnote 2: Detailed experimental setups are included in the supplementary materials.\n\nFigure 6: Classification accuracy on the SUN397 Dataset and average precision on the SUN Attribute Dataset with increasing size of training samples for the ImageNet-CNN feature and the Places-CNN feature. Results of other hand-designed features/kernels are fetched from [24] and [18] respectively.\n\nTable 3: Classification accuracy/precision on various databases for Hybrid-CNN feature. The numbers in bold indicate the results outperform the ImageNet-CNN feature or Places-CNN feature.\n\nFeature | SUN397 | MIT Indoor67 | Scene15 | SUN Attribute\nHybrid-CNN feature | 53.86\u00b10.21 | 70.80 | 91.59\u00b10.48 | 91.56\n\nFeature | Caltech101 | Caltech256 | Action40 | Event8\nHybrid-CNN feature | 84.79\u00b10.66 | 65.06\u00b10.25 | 55.28\u00b10.64 | 94.22\u00b10.78\n\nFurthermore, we follow the same experimental setting of train and test split in [1] to fine tune Places-CNN on SUN397: the fine-tuned Places-CNN achieves the accuracy of 56.2%, compared to the accuracy of 52.2% achieved by the fine-tuned ImageNet-CNN in [1]. Note that the final output of the fine-tuned CNN is directly used to predict scene category.\nAdditionally, we train a Hybrid-CNN, by combining the training set of Places-CNN and training set of ImageNet-CNN. We remove the overlapping scene categories from the training set of ImageNet, and then the training set of Hybrid-CNN has 3.5 million images from 1183 categories. Hybrid-CNN is trained over 700,000 iterations, under the same network architecture of Places-CNN and ImageNet-CNN. The accuracy on the validation set is 52.3%. We evaluate the deep feature (FC 7) from Hybrid-CNN on benchmarks shown in Table 3. Combining the two datasets yields an additional increase in performance for a few benchmarks.\n\n5 Conclusion\n\nDeep convolutional neural networks are designed to benefit and learn from massive amounts of data. We introduce a new benchmark with millions of labeled images, the Places database, designed to represent places and scenes found in the real world. We introduce a novel measure of density and diversity, and show the usefulness of these quantitative measures for estimating dataset biases and comparing different datasets. 
We demonstrate that object-centric and scene-centric neural networks differ in their internal representations, by introducing a simple visualization of the receptive fields of CNN units. Finally, we provide state-of-the-art performance using our deep features on all the current scene benchmarks.\nAcknowledgement. Thanks to Aditya Khosla for valuable discussions. 
This work is supported by the National Science Foundation under Grant No. 1016862 to A.O., ONR MURI N000141010933 to A.T., as well as MIT Big Data Initiative at CSAIL, Google and Xerox Awards, a hardware donation from NVIDIA Corporation, to A.O. and A.T., Intel and Google awards to J.X., and grant TIN2012-38187-C03-02 to A.L. This work is also supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory, contract FA8650-12-C-7211 to A.T. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.\n\n[Figure 6 plots. Panel: Benchmark on SUN397 Dataset; x-axis: Number of training samples per category; y-axis: Classification accuracy; legend: Combined kernel [37.5], HoG2x2 [26.3], DenseSIFT [23.5], Texton [21.6], Gist [16.3], LBP [14.7], ImageNet-CNN [42.6], Places-CNN [54.3]. Panel: Benchmark on SUN Attribute Dataset; x-axis: Number of training samples per attribute (positive/negative); y-axis: Average Precision; legend: Places-CNN [0.912], ImageNet-CNN [0.898], Combined kernel [0.879], HoG2x2 [0.848], Self-similarity [0.820], Geometric Color Hist [0.783], Gist [0.799].]\n\nReferences\n[1] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In Proc. ECCV, 2014.\n[2] Y. Bengio. 
Learning deep architectures for AI. Foundations and Trends\u00ae in Machine Learning, 2009.\n[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.\n[4] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In Advances in Neural Information Processing Systems, 2013.\n[5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. 2014.\n[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. 2008.\n[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 2007.\n[8] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.\n[9] C. Heip, P. Herman, and K. Soetaert. Indices of diversity and evenness. Oceanis, 1998.\n[10] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.\n[11] T. Konkle, T. F. Brady, G. A. Alvarez, and A. Oliva. Scene memory is more detailed than you think: the role of categories in visual long-term memory. Psych Science, 2010.\n[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.\n[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, 2006.\n[14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 
Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.\n[15] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In Proc. ICCV, 2007.\n[16] A. Oliva. Scene perception (chapter 51). The New Visual Neurosciences, 2013.\n[17] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int\u2019l Journal of Computer Vision, 2001.\n[18] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Proc. CVPR, 2012.\n[19] A. Quattoni and A. Torralba. Recognizing indoor scenes. In Proc. CVPR, 2009.\n[20] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.\n[21] J. S\u00e1nchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. Int\u2019l Journal of Computer Vision, 2013.\n[22] E. H. Simpson. Measurement of diversity. Nature, 1949.\n[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Proc. CVPR, 2011.\n[24] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Proc. CVPR, 2010.\n[25] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei. Human action recognition by learning bases of action attributes and parts. In Proc. 
ICCV, 2011.\n", "award": [], "sourceid": 316, "authors": [{"given_name": "Bolei", "family_name": "Zhou", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Agata", "family_name": "Lapedriza", "institution": "MIT / Universitat Oberta de Catalunya"}, {"given_name": "Jianxiong", "family_name": "Xiao", "institution": "Princeton University"}, {"given_name": "Antonio", "family_name": "Torralba", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Aude", "family_name": "Oliva", "institution": "Massachusetts Institute of Technology"}]}