Paper ID: 1152
Title: Visalogy: Answering Visual Analogy Questions
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a model for visual analogy retrieval. For a query image, one wants to find a second image that relates to the query in the same way as the images of a given pair relate to each other. More precisely, an offset pair (i1, i2) defines the analogy. For a query i3, an image i4 is to be found such that i3 relates to i4 in the same way as i1 relates to i2.

The model is formulated as an embedder, which embeds all four images ik -> xk (k = 1, ..., 4) such that (x4 - x3) = (x2 - x1). The embedder is a CNN based on AlexNet. The loss is a hinge loss.
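In symbols, the objective encourages (x2 - x1) ≈ (x4 - x3) for valid analogy quadruples. A minimal numpy sketch of one plausible contrastive hinge formulation (illustrative only; the paper's exact loss, margin, and negative-sampling scheme may differ):

```python
import numpy as np

def analogy_hinge_loss(x1, x2, x3, x4, positive, margin=1.0):
    """Contrastive hinge loss on the gap between the two transformation
    vectors (x2 - x1) and (x4 - x3).

    positive=True  -> valid analogy: pull the transformations together.
    positive=False -> invalid analogy: push them at least `margin` apart.
    """
    d = np.linalg.norm((x2 - x1) - (x4 - x3))
    if positive:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

In practice, the embeddings x1..x4 would come from the shared AlexNet-based CNN, and this loss would be summed over sampled positive and negative quadruples.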

In addition, the authors present a dataset of visual analogies.

The paper is well written and presents a well-defined model which is empirically evaluated on two datasets.

My main concern with this work is that the presented problem might be a bit contrived. The authors define an analogy as a relationship between images given by different properties of these images. This view of an analogy as a pair of two properties is built into both of their datasets.

So if an analogy is defined as a pair of properties, a natural baseline would be to set up a classification problem for these properties. If the analogy is defined by the properties of two images, one can apply the trained classifier to estimate these properties. A second classifier would determine the class of the query image. The analogous image would then be one with the class of the query and the missing one of the two properties. Such a baseline is based on two classifiers and would be simple to set up.
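The baseline proposed above might be sketched roughly as follows (a minimal sketch; `predict_class` and `predict_property` are hypothetical classifier interfaces, not from the paper):

```python
def retrieve_analogous_image(i1, i2, i3, candidates,
                             predict_class, predict_property):
    """Two-classifier baseline (illustrative sketch).

    predict_class(img)    -> object-category label
    predict_property(img) -> attribute label (e.g. pose, viewpoint)

    The offset pair (i1, i2) defines which property the answer should
    carry; the query i3 fixes the object class.
    """
    target_property = predict_property(i2)  # property the answer must have
    target_class = predict_class(i3)        # class of the query image
    for img in candidates:
        if (predict_class(img) == target_class
                and predict_property(img) == target_property):
            return img
    return None
```

The point of the baseline is that it needs no analogy-specific training at all, only the two property/class classifiers.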
Q2: Please summarize your review in 1-2 sentences
This is a well-written paper on an interesting problem. Although the proposed model seems to work well, there are simpler and more off-the-shelf ways of addressing the problem which were not explored. Hence, I am inclined to wait for a stronger justification.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: This paper introduces a method and dataset for answering visual analogy questions (image A is to image B as image C is to image D). The method is based on embedding images into a space where the offset A-B will be similar to the offset C-D.

Quality: In general this is a high quality paper that tackles a creative and interesting problem, uses a simple but reasonable technical algorithm, and contains experiments that adequately support the claims of the paper.

Clarity: The paper is clearly written and well organized.

Originality: The main idea of the paper is creative and novel (to my knowledge).

Significance: The one possible weakness is that it isn't obvious that being able to solve visual analogy questions would be useful.

The notion makes less intuitive sense than its word-based counterpart (SAT analogy questions).

Those are more a test of intelligence, where the person must infer the similarity and difference between A and B and extrapolate them to another pair.

By contrast, the datasets constructed for the experimental results represent a more restricted setting where (on each dataset individually) one already knows that A & C are the same object class and that B & D are similar according to some pre-existing type of attribute (on one dataset B & D always have the same viewpoint, on another they always have similar visual attributes, and on a third they have a similar action).

An experiment on a combined dataset would somewhat strengthen results.
Q2: Please summarize your review in 1-2 sentences
This paper has all the ingredients of a good paper (interesting and novel problem, good technical details, good experimental results, and good clarity).

The one weakness is that it isn't obvious what the practical value of answering visual analogy questions is.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for acknowledging the novelty and interest of our paper as well as our results. (R2: "well-written paper with an interesting problem, the proposed model works well"; R4: "The results are convincing, this is an interesting problem"; R5: "This paper treats a new problem"; R6: "The paper achieves convincing results"; R7: "This paper has all the ingredients of a good paper, main idea of the paper is creative and novel"; R8: "Extensive experiments are done"). We appreciate the reviewers' insightful comments; we will incorporate them in any final version of the paper.

R2,R6: Two-classifier and attribute-based baseline

Direct application of the suggested two-classifier baseline is not appropriate, as we explain below. We ran a modified version of it, and the results show that our method outperforms the suggested baseline.

**Issues with the suggested two-classifier baseline:

***Unseen category/property during training:
The suggested two-classifier baseline directly uses the category/property labels to train classifiers. For solving general analogy questions, the set of properties and categories is not known at test time (Line 261, Figs. 3 & 5). Therefore, we propose to learn a network that can generalize the transformations without explicitly using the category/property labels. We produced the positive analogy pairs using the category and property labels, but we do not directly use any of those labels during training and do not optimize any softmax loss. Our loss function only optimizes the similarity between transformations.

***Extendability to questions with multiple formats:
The suggested two-classifier baseline needs to be aware of the format of the questions (whether the property or the category changes from A to B). Otherwise, a set of rules would have to be devised so that the two-classifier baseline first detects the type of question by comparing A and B, and then retrieves the appropriate image. Setting up such rules for the two-classifier baseline becomes complicated and inapplicable to images with multiple property types.

**Supporting experiment for the two-classifier baseline:
We ran a new experiment on the 3D chairs dataset that has analogy questions of both formats (A and B with the same style but different poses, as well as A and B with the same pose but different styles).

***Generalization:
The results show that our network (trained on only one format of analogy questions) is able to answer the other format of analogy test questions. In this new experiment our method outperformed the AlexNet baseline by a margin of ~12%.

***Poor performance of suggested two-classifier baseline:
As the format of the analogy questions is unknown, we computed style and pose probability vectors, first detected the type of question based on the pose and style similarity between A and B, and then retrieved the image with the highest pose and style similarity to images C and B, respectively. This baseline performed poorly (~30% gap) compared to our method. We will add this experiment in the final version of the paper.
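The question-type detection described here could be sketched roughly as follows (an illustrative sketch, not the rebuttal's actual implementation; `pose` and `style` stand for the two classifiers' probability-vector outputs, and cosine similarity is an assumed similarity measure):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two probability vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def answer_analogy(pose, style, A, B, C, candidates):
    """Two-classifier baseline with question-type detection (illustrative).

    pose(img) / style(img) return the classifiers' probability vectors.
    If pose is preserved from A to B, then style changed (and vice versa);
    the answer D keeps C's unchanged factor and takes B's changed one.
    """
    if cosine(pose(A), pose(B)) >= cosine(style(A), style(B)):
        # Pose preserved, style changed: D matches C's pose and B's style.
        score = lambda X: cosine(pose(X), pose(C)) + cosine(style(X), style(B))
    else:
        # Style preserved, pose changed: D matches C's style and B's pose.
        score = lambda X: cosine(style(X), style(C)) + cosine(pose(X), pose(B))
    return max(candidates, key=score)
```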

R4, R6, Missing citations
Thanks for the suggestions. We will include them.

R6: People agreement on analogy questions
This is a great suggestion. We will include this in our future work.

R7: Experiment on a combined dataset
Thanks for the good suggestion. We trained and tested our network on a combined set of analogy questions (action and attribute) and obtained promising results, with a 5% gap over our baseline (AlexNet). The images in our natural image dataset have only one property label (either attribute or action), and there is a chance that the negative analogy pair of one question type is the positive pair of another. We will reflect this in the final version of the paper.

R7: Practical values
We agree that it does not have a direct application, but it is an important problem that reflects on image understanding. Reasoning based on analogies is a first step toward a deeper understanding of generalization, and this paper is one step toward that challenging goal. To show our generalization capability in answering multiple formats of analogy questions, we ran a new experiment (see the experiment above).

R8: Analogy question format with A, B from different categories but same property
Thanks for the good suggestion. To test the generalization capability of our network in answering multiple question formats, we tested on a new set of questions on the 3D chairs dataset with both question types, and our network obtained a 12% improvement over the AlexNet baseline (see the experiment above).

R8: Lack of contribution
Our contributions are proposing and solving a new problem and introducing a new dataset. We demonstrated results with extensive experiments. The goal we are pursuing is "learning to generalize", which is important in image understanding. The novelty and creativity of our proposed problem is also acknowledged by R2, R4, R5, R6, and R7.