Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a vision method to estimate where within an image a pictured head is looking.
Given the image and a crop of the head, the system returns a saliency map of confidence scores over grid cells indicating how likely each position is to be the target of that person's gaze.
The technique uses a CNN with two pathways, one for the head/gaze and one for the full image/saliency of the scene.
The authors also contribute a dataset of relevant images pulled from SUN/COCO/Actions40/PASCAL/ImageNet, annotated for gaze points by Mechanical Turk workers.
This dataset contains 36K people in total.
The method is compared to a few reasonable baselines that represent alternative approaches one might implement and sanity checks.
(To my knowledge existing methods in this space are more restrictive than the proposed method, so they are not easily comparable without taking subsets of the test data that satisfy those methods' requirements.)
Quantitative results show the method's promise, and the qualitative analysis and figures are interesting and well composed.
The paper tackles the very practical and yet under-explored problem of estimating where people are looking in images.
The solution is simple but effective.
It has low technical novelty; it is largely a straightforward application of CNNs with simple input descriptors for the head and scene.
One notable detail is the use of shifted grids, which was important to the results.
Nonetheless, the dataset collection effort and systematic results (including reasonable baselines) make it valuable.
Others are likely to build upon this work.
Paper clarity is very good.
It is an enjoyable paper to read with many good illustrative figures.
Unclear points / questions:
* More detail could be given about the annotation process, instructions, etc.
For example, how were the annotators told to decide on a gaze point?
Is it literally the point of fixation as they perceive it from that person's line of sight?
Or the center of the object they are fixating on?
What rules were used to decide what constitutes a poor annotation for discarding?
* How many images are in the labeled dataset, after pruning?
There are 36K people instances.
* It is good that the results compare against human agreement on the test set, where 10 people labeled each test image.
Still, why not get multiple gaze annotations for the training data as well, to find a consensus for the ground truth?
* In Sec 3.3 this sentence is unclear "Since we only supervise...subproblems."
* The text could explain why a uniform distribution of gazes was sought in the test set.
Presumably this is to avoid bias.
* The SVM baseline description (Lines 333-335) is not clear.
Related work:
* This sentence about related work in the intro is unclear:
"Only [17] tackles the unrestricted..."
What is meant by "pre-built components" and why can't it handle people looking away from the camera?
If it only handles people looking at the camera, then how is it doing gaze estimation at all?
* The paper notes that [7] uses an eye tracker for gaze in egocentric video.
The same authors have a more recent paper that removes the need for the gaze tracker: Learning to Predict Gaze in Egocentric Video, Yin Li, Alireza Fathi, James M. Rehg, ICCV 2013.
* The proposed work also seems related to the interactee prediction work of Chen & Grauman (ACCV 2014).
It too tries to produce a "saliency map from the point of view of the person inside the picture" (Line 75), uses similar features (but not CNNs), and produces a multi-modal distribution as its estimate (but using a mixture density network instead of a classifier).
The proposed work is still distinct, mostly because it cares strictly about gaze and treats the interacting/gazed-upon objects only implicitly, but I think that work can be cited and the differences explained.
Perhaps the insight of the ACCV work about representing the scale and position of the gazed-upon object would be relevant to improving this system as well.
C.-Y. Chen and K. Grauman. Predicting the Location of "Interactees" in Novel Human-Object Interactions. In Proceedings of the Asian Conference on Computer Vision (ACCV), Singapore, Nov 2014.
Results:
* The results would be more well-rounded by including failure cases and a discussion of failure modes.
* I can guess why the Places-CNN and ImageNet-CNN were used for the saliency and gaze components, respectively, but this motivation could be stated in the text (Sec 3.3).
I am also curious whether it matters in practice to use scene/object pretraining for the two pathways, or whether results would be similar if, e.g., both were initialized with ImageNet or with Places.
* A possible enhanced baseline over Fixed Bias: take a subset of training images for which there is a similar distribution of heads.
And how is "same head location" determined for this baseline?
Typo: Line 353 " the in the"
Q2: Please summarize your review in 1-2 sentences
The paper tackles the very practical and yet under-explored problem of estimating where people are looking in images.
The solution is simple but effective.
It has low technical novelty, yet the dataset collection effort and systematic results (including reasonable baselines) make it valuable.
Others are likely to build upon this work.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Missing citations:
Hierarchical temporal graphical model for head pose estimation and subsequent attribute classification in real-world videos, Meltem Demirkus, Doina Precup, James J. Clark, Tal Arbel
Q2: Please summarize your review in 1-2 sentences
This paper argues that gaze prediction is understudied, proposes a new dataset, and builds a neural network for gaze prediction.
Head pose estimation is a very well studied related area that seems to be ignored a bit by the paper.
Results compare a number of baselines to [12], which is from 2009.
Related approaches include [15] (and citations therein), which actually performs a more complicated task of which gaze prediction is a subtask.
See missing citations above (though this is a "light review").
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper deals with the problem of estimating the gaze directions of people in a given image, and proposes a method based on a convolutional network and saliency. The problem has already been addressed in several previous studies, such as [Fathi+ ECCV12], [Borji+ JoV14] and [Parks+ VS2014]; however, this paper tries to handle more general cases with fewer constraints.
The architecture of the proposed model follows a de facto standard pathway built from a combination of Caffe CNNs. It is technically sound, but no technical contributions can be found in the model.
I am afraid that the use of saliency prediction for gaze following might not be appropriate, since saliency describes visual distinctiveness from the viewpoint of observers (i.e., viewers of the image), not of the target person in the image. Namely, saliency computed from a given image can estimate where observers may focus, but it cannot estimate where people in the image focus.
Q2: Please summarize your review in 1-2 sentences
The problem dealt with in this paper is interesting and significant, but the novelty of the proposed method is limited and the method might contain a critical problem.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Using images containing people "attending" to regions (i.e., objects, other people), the authors propose a CNN-based method combining the image saliency map (pixels), a person's head appearance (pixels), and the head location in the image to predict the gaze "attentional" region in an image.
Quality: The overall quality of the paper is good. The authors build an architecture of deep convolutional neural networks (CNNs) that learns features describing image saliency, face gaze, and face location in parallel. Later, in a forward pass, the CNN predicts the location of the gaze fixation. Technically, the algorithm is straightforward. The authors train the whole model at once using backpropagation. Later they compare their prediction results with other baselines, achieving the best results.
Originality: The concept of the work is original. Nevertheless, fully supervised location "detectors" built from combinations of CNNs are not really novel.
The pioneering work of Ross B. Girshick (R-CNN) and others already embodies this concept.
Clarity: The paper is well written and it is easy to follow. To improve the quality, the authors could relax the use of the word "predict" (e.g., line 25 and others). Prediction is used for generative models (e.g., LSTMs). This paper is fully supervised and is analogous to the object detection papers, so "detection" should be used instead.
Significance: The problem is very interesting and important. Good gaze-following algorithms can help in other tasks such as fine-grained object detection. The authors will release to the community the collected and curated dataset of face-gaze fixations.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a method to predict and follow the gaze of people in unconstrained images. The framework combines image saliency and human gaze to learn the predicted location of the person's "attention". The learning is supervised and performed at once, end-to-end. The results outperform baseline methods.
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
[this is a light review]
minor: supplemental: line 102, red and yellow have to be exchanged.
Q2: Please summarize your review in 1-2 sentences
The clearly written paper contributes a large-scale dataset and a novel model, which combines gaze direction and saliency in a deep learning framework. The approach is convincingly evaluated using different ablations and related work.
== post rebuttal ==
I keep my judgment.
Submitted by Assigned_Reviewer_6
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a framework for predicting where people in images are looking. This is a novel twist on the established problem of saliency / eye-gaze prediction in images. The authors collect and annotate a new dataset containing a set of images as well as annotations of the gaze of each person in an image. The eye-gaze model is then learned using convolutional networks.
At test time, gaze is predicted by fusing the results of fixation prediction (similar to what is done in regular saliency prediction) and actual gaze-following prediction. Experimental results demonstrate that this approach provides superior results compared to regular fixation-only gaze prediction.
Quality: The paper is interesting and theoretically sound, but a few things could be improved:
1) The learning part should be better defined (a mathematical formulation or at least a diagram).
2) The paper should also compare results with approaches where gaze is learned not just from free-viewing a scene, but when people are given a task while viewing a scene, e.g., S. Mathe and C. Sminchisescu, Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths, NIPS, 2013.
3) It would be interesting to analyze the overlap between gaze fixations and gaze-following.
4) This approach requires the location of the human head for gaze prediction. It would be interesting to see results with an off-the-shelf head detector.
Clarity: The paper is overall well written.
Originality: Novelty is incremental.
Significance: The paper covers an interesting aspect of predicting where people are looking, and thus has large potential in different computer vision applications which require a better understanding of human-object and human-human interactions.
Q2: Please summarize your review in 1-2 sentences
The paper presents a novel approach for gaze-following prediction in images. Results are promising, but it would be interesting to see it working with an off-the-shelf head detector, as well as to understand the correlation between gaze-following and gaze fixations.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their careful consideration. We greatly appreciate the positive comments and address major concerns below.
R4 Missing citations: Note that we do cite the work by Parks et al. as [17] and discuss it in L090-100. We further clarify the distinctions below. We will include the work by Demirkus et al. in the next version and discuss head pose estimation below.
R4 Head pose estimation (HPE): While we agree that HPE is an important and well-studied problem that definitely deserves attention here, we would like to highlight certain subtle but important distinctions. Using HPE, we can estimate the angle of gaze, but image evidence is required to accurately estimate the magnitude of the gaze vector to identify exactly where one is looking. To better illustrate this, we highlight our result in Tbl 1 - the comparison between (1) 'No image' and (2) 'Our Full'. (1) uses head location and pose to infer gaze without using any image evidence (unlike (2)) - the angular error is very similar to (2), while the distance error is markedly higher, resulting in a lower AUC (0.78 vs 0.83). Thus, while head pose is certainly important, it is not sufficient.
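As a rough illustration of why these two error measures can disagree, the sketch below computes an angular error and a Euclidean distance error for a predicted gaze point; the exact metric definitions here are illustrative assumptions, not necessarily those used in the paper.

```python
import numpy as np

def gaze_errors(eye, gt_gaze, pred_gaze):
    """Illustrative angular and distance errors for a gaze prediction.

    eye, gt_gaze, pred_gaze: (x, y) points in normalized image coordinates.
    The angle is measured between the ground-truth and predicted gaze vectors
    (both anchored at the eye); the distance is the Euclidean gap between the
    predicted and annotated gaze points. These are assumed definitions, meant
    only to show how a prediction can be angularly accurate yet far off in
    distance (wrong gaze-vector magnitude).
    """
    v_gt = np.asarray(gt_gaze) - np.asarray(eye)
    v_pr = np.asarray(pred_gaze) - np.asarray(eye)
    cos = np.dot(v_gt, v_pr) / (np.linalg.norm(v_gt) * np.linalg.norm(v_pr) + 1e-8)
    angle_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    dist = np.linalg.norm(np.asarray(pred_gaze) - np.asarray(gt_gaze))
    return angle_deg, dist

# Right direction but wrong magnitude: near-zero angular error, large distance error.
print(gaze_errors(eye=(0.2, 0.2), gt_gaze=(0.8, 0.2), pred_gaze=(0.4, 0.2)))
```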
R4, R1 Comparison to [17]: Firstly, note that the goal of [17] is not gaze following, but saliency prediction using gaze following. Thus, while they propose a gaze following system, they do not evaluate it directly. More importantly, the automatic system from [17] relies on the 'face' detector by Zhu & Ramanan, which is not designed for 'head' detection and hence does not perform well on people facing away from the camera. Further, [17] does not use image evidence to identify gaze - they use a prior that combines HPE with its scale. This is similar to our 'No image' method explained above. While not being explicitly trained to use the face scale, we found that our 'No image' model was sensitive to the scale of the face (i.e., higher gaze length for larger faces). Lastly, the code and dataset from [17] are not available, making a direct comparison difficult.
R2 Meaning of saliency - "critical problem": We completely agree with you - 'saliency' typically refers to what observers of an image fixate on (as in [12]). In our case, we re-define saliency to be from the perspective of the person in the image; e.g., in an image with a person looking at a car, typical saliency would likely result in the person's face being salient, while our re-definition would result in the car being salient. We highlight this distinction in Fig 4b. We apologize for overloading this commonly used term and will update it to 'gaze-saliency' to avoid further confusion.
R3 Head detector: We did not use a head detector as the primary focus of our work is gaze following. We did not want to confound the results by combining two imperfect solutions. Given the convolutional nature of our approach, we do not expect somewhat imprecise head detections to lead to a significant performance drop.
R3 Learning better defined: We will add details of the mathematical formulation, such as the loss equation.
R1 Annotation process: We told the turkers to make their 'best guess for where a person is looking' instead of giving more explicit instructions (related to objects, etc.) to avoid biasing them in any particular way. To test them, we selected images where the gaze location was obvious and allowed for reasonable margins of error. This prevented random clicking.
R1 Image count: 27,967 after pruning (Fig 2, right col).
R1 Failure cases: Fig 5 (images 3&4, 1&4 and 4 in rows 1, 2, 3) shows some failures. We believe the main failure mode is the lack of 3D understanding; e.g., image 1 in row 2, Fig 5: the farthest person from the camera is predicted to be looking at a stove that is behind him. We will add more discussion of failures to the paper.
R1 Both paths with ImageNet or Places: If both pathways are initialized with ImageNet, the AUC is 82.2, and 81.5 with Places, as compared to 83.0 with our approach.
R1 Unclear sentence, Sec 3.3: We show that the saliency pathway is computing a gaze-following saliency map and the gaze pathway is computing the head-pose map for the target person. However, we did not explicitly enforce that each of these pathways solve these specific subproblems. Instead, we only supervised the final output of the network (the gaze location), and the network naturally learned to factorize the problem in this way.
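As a rough sketch of this factorization, the snippet below fuses a gaze-saliency map and a head-pose/gaze mask into a single distribution over grid cells. The element-wise product, the 13x13 map size, and the single output layer are illustrative assumptions rather than the exact architecture; the point is that only the final grid distribution would receive supervision.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def combine_pathways(saliency_map, gaze_mask, w_out):
    """Fuse the two pathway outputs and score a grid of candidate gaze cells.

    saliency_map : (13, 13) gaze-saliency scores from the full-image pathway
    gaze_mask    : (13, 13) head-pose / gaze-direction mask from the head pathway
    w_out        : (169, n_cells) weights of an output layer over grid cells

    Only the final grid prediction is compared against the annotated gaze point;
    the intermediate maps are free to specialize on their own subproblems.
    """
    fused = (saliency_map * gaze_mask).reshape(-1)   # element-wise fusion (assumed)
    return softmax(fused @ w_out)                    # distribution over gaze cells

# Toy usage with random maps and a 5x5 output grid.
rng = np.random.default_rng(0)
p = combine_pathways(rng.random((13, 13)), rng.random((13, 13)), rng.random((169, 25)))
print(p.argmax(), p.sum())  # most likely gaze cell; probabilities sum to 1
```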
R1 Uniform test distribution: This distribution was used to avoid bias in the evaluation.
R1 Clarify L333-335: In the SVM baseline, the input elements are the same as in our final model, but instead of a CNN we used an SVM classifier. Shifted grids are also used to ensure a fair comparison.
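For concreteness, here is a minimal scikit-learn sketch of an SVM baseline of this kind; the feature dimensions, the single 5x5 grid standing in for the shifted grids, and the random data are placeholders, not the actual setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Rough sketch: the same inputs as the full model (head crop features, full-image
# features, head position) are flattened into one vector, and a linear SVM
# classifies which grid cell contains the gaze point.
def make_feature(head_feat, image_feat, head_xy):
    return np.concatenate([head_feat.ravel(), image_feat.ravel(), head_xy])

rng = np.random.default_rng(0)
n, grid = 200, 5
X = np.stack([make_feature(rng.random(64), rng.random(64), rng.random(2)) for _ in range(n)])
y = rng.integers(0, grid * grid, size=n)       # index of the gaze cell (toy labels)

clf = LinearSVC(max_iter=5000).fit(X, y)
print(clf.predict(X[:3]))                      # predicted gaze cells for 3 samples
```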
R1 Fixed bias baseline: Similar to your suggestion, we use a 13x13 grid for head location.
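As a small illustration of how a head location could be bucketed for such a baseline, the helper below maps a normalized head position to a cell of a 13x13 grid; the indexing convention is an assumption for illustration.

```python
def head_cell(x, y, grid=13):
    """Map a normalized head location (x, y in [0, 1]) to a (row, col) cell of a
    grid x grid lattice, e.g. to look up the average training-set gaze map for
    heads falling in the same cell (an assumed way to define 'same head location')."""
    col = min(int(x * grid), grid - 1)
    row = min(int(y * grid), grid - 1)
    return row, col

print(head_cell(0.52, 0.10))  # -> (1, 6): head near the top-middle of the image
```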
We apologize for not clarifying all questions given the limited space and many reviews. We will fix all typos and add missing references in the next revision.