Paper ID:316
Title:Learning Deep Features for Scene Recognition using Places Database
Current Reviews

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a new large-scale scene image dataset that is 60 times bigger than the current standard SUN dataset. They show that deep networks trained on object-centric datasets such as ImageNet are not optimal for scene recognition, and that training similar networks with large amounts of scene images improves their performance substantially.

- The density and diversity approach to analyzing datasets relative to each other is quite interesting.

- The dataset has substantially more images than existing scene image classification benchmark datasets and is therefore surely a useful resource.

- It is convincingly demonstrated that features from a CNN trained on scene-centric images, i.e. the proposed dataset, improve performance compared to those from a CNN trained on the object-centric ImageNet dataset. The converse is also demonstrated empirically, i.e. the latter features work better on object-centric image classification tasks.

- It is also demonstrated with visualizations that CNNs trained on scene images capture landscapes and spatial structures in the higher layers of the network.

Overall, the paper is well written and addresses an important problem in computer vision. The analysis of datasets and the cross-dataset performances presented are interesting, and the proposed dataset is an important resource. I recommend acceptance of the paper.
Q2: Please summarize your review in 1-2 sentences
Computer vision is entering a new era in which data might have more value than algorithms. I see this paper as pioneering work in this area.

Submitted by Assigned_Reviewer_19

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper introduces a scene-centric database called PLACES with more than 6 million labeled pictures for scene recognition. The purpose of building such a database is to complement ImageNet, which is an object-centric database, and to improve the performance of scene recognition by learning deep features on PLACES. The database was built by scraping search engines with composite queries and using Amazon Mechanical Turk (AMT) for validation. To compare the density and diversity of PLACES with existing databases such as SUN and ImageNet, the authors propose two metrics and again use AMT to obtain the measurements, showing the advantages of PLACES. The authors also compare the features learned from PLACES and from ImageNet and show that the two databases are complementary to each other: PLACES results in better performance for scene recognition, and ImageNet in better performance for object recognition.

Strong points of this work:
1. A solid work on dataset building.
2. Detailed and convincing results that improve over existing scene databases, showing the power of data.

Minor concerns and comments:
1. Limited algorithmic contribution.
2. For density, nearest neighbors are obtained based on the GIST feature. Though this setting is the same for all the databases, I wonder whether depending on a particular feature introduces any bias into the density metric.
3. For diversity, judgment can be quite difficult when judges are presented with two random images, as it does not make sense to judge the similarity of two unrelated images. I wonder if the authors had similar observations in their AMT experiment.
4. As PLACES and ImageNet are two complementary databases, how about mixing them together to train a model that is good at both object and scene recognition?
Q2: Please summarize your review in 1-2 sentences
This is solid work on database building. The paper is clearly written and shows convincing results for the built database. The reviewer assumes that the database will be made publicly available and believes it will have a positive impact on scene image parsing and scene recognition.

Submitted by Assigned_Reviewer_23

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors analyze why the recent success of CNNs for visual recognition is less pronounced for scene recognition. They hypothesize that it is due to a lack of data and therefore collect the largest scene dataset to date, with over 6 million images. Training the same convolutional architecture that has proven so successful for object recognition on this new dataset also improves on the state of the art in scene recognition.

The authors do a good job at providing further insight by comparing scene datasets using density and diversity measures. In this sense, the new dataset also compares favorably in quality.

The overall conclusion of the paper is that, depending on the input data, one learns either an object-centric or a scene-centric representation. They show that although either representation performs well on both scene and object recognition tasks, state-of-the-art performance is only achieved with a very large dataset of the appropriate type (scene or object). While this conclusion is not so surprising, up to now it was not possible to validate this hypothesis due to the lack of a large scene dataset.

It is unclear to me in which way the proposed visualization differs from previous ones; this is not contrasted in the paper. Please detail this in your rebuttal.

It remains somewhat disappointing that the authors did not try the obvious experiment of combining ImageNet and their new scene database to train a joint network for objects and scenes. Their conclusion reads as if we have to decide whether to learn a good scene or a good object representation. It is not clear whether we could get both by training the network jointly.
Q2: Please summarize your review in 1-2 sentences
The authors show that the success of CNNs in object recognition can be carried over to scene recognition by means of their newly collected huge scene dataset, which is about 60x larger than any other scene database available today. The dataset and the results are of great value, but there is basically no technical novelty.
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their insightful comments. If the paper is accepted, we will make all the data and the pretrained networks available. We are still augmenting the database with more categories, and the updated database will also be released.

Here are the responses to the reviewers' comments:

To R_19

Q: ‘for density, if depending on a particular feature will lead to any bias for the density metric.’

A: We agree with the reviewer that any similarity metric will bias the density. Our goal is to measure density when the metric is driven by human perception. We tried different features (GIST, HOG2x2, deep features) and observed that GIST retrieved images that actually looked more similar than those retrieved by the other features; for this reason we decided to use GIST. To better understand the bias of this choice, we will include results with other similarity metrics in the camera-ready version of the paper. We will also include a density estimate based on human-driven similarity, obtained in two steps: 1) for each query image we will ask participants to select the most similar image to the query from a pool of 100 images (its GIST, or other-feature, nearest neighbors); this provides pairs of very similar images (closer than those provided by GIST alone). 2) We will then run the density experiment as described in the paper with the pairs selected in step 1.
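For concreteness, here is a minimal sketch (not taken from the paper) of the feature-based nearest-neighbor step that produces candidate image pairs for such a density experiment. It assumes images are NumPy arrays of the same size, uses a crude grid-mean descriptor only as a stand-in for GIST, and relies on scikit-learn's NearestNeighbors for retrieval; all function names are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def toy_descriptor(image, grid=4):
    # Crude stand-in for GIST: mean intensity on a coarse spatial grid.
    # A real experiment would plug in an actual GIST (or HOG2x2, deep) feature.
    h, w = image.shape[:2]
    return np.array([image[i*h//grid:(i+1)*h//grid,
                           j*w//grid:(j+1)*w//grid].mean()
                     for i in range(grid) for j in range(grid)])

def nearest_neighbor_pairs(images, n_queries=100, seed=0):
    # Return (query, nearest-neighbor) index pairs under the chosen descriptor.
    feats = np.stack([toy_descriptor(im) for im in images])
    nn = NearestNeighbors(n_neighbors=2).fit(feats)
    rng = np.random.default_rng(seed)
    queries = rng.choice(len(images), size=n_queries, replace=False)
    _, idx = nn.kneighbors(feats[queries])
    # Each query is its own first neighbor, so take the second column.
    return [(int(q), int(nbrs[1])) for q, nbrs in zip(queries, idx)]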

Q: ‘For diversity, judgment can be quite difficult if judges are presented with two random images, as it doesn’t make sense to judge the similarity between two unrelated random images. I wonder if the authors have similar observations in their AMT experiment.'

A: Although the pairs of images are randomly selected, when comparing multiple pairs of images we found a high degree of consistency among participants: they were able to judge similarity in a consistent and repeatable way. We also found that the results obtained by AMT workers matched the results obtained when the authors performed the same experiment in the lab. We also ran several independent tests on AMT and found very high correlation between two different pools of participants.

Q: ‘how about to mix them together to train a model that is good for both object and scene recognition’

A: Yes, training a hybrid CNN is plausible. This will be an interesting experiment, as it is not obvious what the final performance will be. We are working on it and will include results in the camera-ready version of the paper if accepted.

To R_23

Q: 'in which way the proposed visualization is different from previous ones'

A: The previous visualization method (M. Zeiler and R. Fergus) needs to train a deconvnet to reconstruct the CNN features, which is more complicated. The average-image visualization in our paper is purely data-driven, and it is straightforward to compare two networks using it, as shown in the paper.
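To illustrate the data-driven idea, here is a minimal sketch of an average-image visualization. It assumes a callable unit_activation(image) returning the scalar response of one CNN unit and that all images share the same size; both names are placeholders, not the authors' actual code.

import numpy as np

def average_top_images(images, unit_activation, k=100):
    # Average the k images that most strongly activate a given unit.
    scores = np.array([unit_activation(im) for im in images])
    top = np.argsort(scores)[::-1][:k]
    # Pixel-wise mean of the top-k images (cast to float to avoid overflow).
    return np.mean([images[i].astype(np.float64) for i in top], axis=0)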

Q: 'train the CNN jointly for imagenet and PLACES dataset'

A: The hybrid CNN is an attractive idea. Training such a large-scale network requires tuning the network architecture to increase its capacity. We are working on this and will include the results.
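A minimal sketch of one way the two label spaces could be merged for joint training is shown below; it only illustrates the dataset-merging step with hypothetical helper names and is not a description of the actual hybrid-CNN setup.

def merge_label_spaces(object_samples, scene_samples, n_object_classes):
    # object_samples / scene_samples: lists of (image_path, class_index) pairs.
    # Scene labels are offset so both datasets share one softmax label space.
    merged = list(object_samples)
    merged += [(path, label + n_object_classes) for path, label in scene_samples]
    return merged  # train a single CNN with a softmax over all merged classes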