Paper ID: 1139
Title: Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
Reviews

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a method for action detection (localization and classification of actions) using weakly supervised information (action labels + eye-gaze information, with no explicit bounding-box annotation). The spatio-temporal search (over a huge spatio-temporal space) is carried out with dynamic programming and a max-path algorithm. Gaze information is introduced into the framework through a loss that accounts for gaze density at a given location.
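To make the search concrete, here is a minimal sketch of a max-path style dynamic program over per-frame candidate boxes. This is an illustration only: the per-frame candidates, scores, and overlap threshold are assumptions for the sketch, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): each frame provides scored
# candidate boxes, and boxes in consecutive frames can be linked when they overlap.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def max_path(frames, min_iou=0.3):
    """frames: list over time of [(box, score), ...] candidate detections.
    Returns the score and box sequence of the best temporally linked path."""
    # best[t][i] = (accumulated score of the best path ending at box i of frame t,
    #               index of the predecessor box in frame t-1, or None)
    best = [[(score, None) for _, score in frames[0]]]
    for t in range(1, len(frames)):
        row = []
        for box, score in frames[t]:
            # Best overlapping predecessor in the previous frame, if any.
            cands = [(best[t - 1][j][0], j)
                     for j, (pbox, _) in enumerate(frames[t - 1])
                     if iou(box, pbox) >= min_iou]
            prev_score, prev_idx = max(cands) if cands else (0.0, None)
            if prev_score <= 0.0:  # start a new path if extending does not help
                prev_score, prev_idx = 0.0, None
            row.append((prev_score + score, prev_idx))
        best.append(row)
    # The path may end at any frame: pick the global maximum and trace back.
    t, i = max(((t, i) for t in range(len(best)) for i in range(len(best[t]))),
               key=lambda ti: best[ti[0]][ti[1]][0])
    total, path = best[t][i][0], []
    while i is not None:
        path.append(frames[t][i][0])
        i, t = best[t][i][1], t - 1
    return total, list(reversed(path))
```

For example, `max_path([[((0, 0, 10, 10), 1.0)], [((1, 1, 11, 11), 2.0)]])` links the two overlapping boxes into a single track with score 3.0.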

QUALITY: The paper seems technically sound and makes for a nice study of the use of gaze information. An interesting extension would be the application of general gaze prediction (rather than gaze points collected on the same dataset). One experimental question that is still unclear is whether bounding-box information is more or less beneficial than gaze information. Since collecting both types of annotation on the dataset requires a fair bit of effort, a useful baseline would have been training with the ground-truth bounding boxes rather than the gaze information.

CLARITY: The paper is reasonably written, though some definitions of terminology and variables are missing.
(1) In line 75, what is meant exactly by top-down saliency? This is not explained until much later in the text (around line 95).
(2) In equation 1, what is w?
(3) In line 166 - what is the variable k?

ORIGINALITY
Reasonably original; the paper presents a nice way of integrating gaze information into the action recognition problem.

SIGNIFICANCE
The introduction of gaze information is becoming more common in computer vision.


MINOR PROBLEMS
Line 70, 71 – miss-classification, miss-alignment should be misclassification and misalignment
Line 134 – further extended
Q2: Please summarize your review in 1-2 sentences
Reasonably written paper which makes use of gaze information. The writing could use a bit of polishing before final publication.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper targets action recognition and localization in video. The main novelty is the use of human gaze data as supervision for learning action localization. The method is formulated within a structured-output learning framework. The loss function combines penalties for both incorrect action classification and incorrect localization. Correct action localization is assumed to coincide with the location of the gaze provided by human subjects for the training videos. Experiments on action classification and localization are reported on the UCF Sports dataset.
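For reference, a generic latent structural-SVM objective of the kind the paper appears to build on can be written as follows; the notation is illustrative and not taken from the paper, with h standing for the latent spatio-temporal localization and g_i for the gaze density of video i:

```latex
\min_{w,\;\xi \ge 0}\;\; \tfrac{1}{2}\lVert w\rVert^2 + C \sum_i \xi_i
\quad \text{s.t.}\quad
\max_{h}\, w^\top \psi(x_i, y_i, h)
\;\ge\;
\max_{\hat y, \hat h}\Big[\, w^\top \psi(x_i, \hat y, \hat h)
  + \Delta\big(y_i, g_i;\, \hat y, \hat h\big) \Big] - \xi_i
\quad \forall i
```

Here the loss Delta combines a misclassification term (is the hypothesized label correct?) with a localization term measuring how much of the gaze density the hypothesized region captures, matching the description above.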

Positive:
- The use of gaze data as supervision for learning action localization is appealing since manual video annotation is very time consuming.
- The paper is clearly written and well-structured. The formulation of the method seems to be fine.
- Experimentally, action classification seems to benefit significantly from the gaze information incorporated at training time.

Negative:
- The hypothesis that gaze locations always coincide with the locations of actions does not seem to be verified. What is the performance of action localization by gaze alone on the test set? Will gaze data be reliable in all cases? What if training videos contain multiple actions?
- The output of an automatic person detector/tracker could be thought of as an alternative cue for learning action localization. Compared to gaze, person detection does not require any manual intervention. It seems your method could easily be adapted to incorporate person likelihood maps (produced by a person detector) instead of the gaze map g. It would be good to see a comparison of the two.
- Weakly supervised action localization has been explored without gaze, e.g., in [Siva and Xiang, BMVC11] (missing reference). This work should be discussed and compared against experimentally, if possible.
- Action classification is reported for a non-standard experimental setup (l.313). To enable comparison with most of the methods that report results on UCF Sports, results for the standard 10-fold cross-validation setup should also be reported.
- Evaluation of action localization seems to be done only for frames where the action is detected, not for the entire duration of the action (l.365-366). If this is the case, the evaluation does not penalize low recall, and the localization results in Table 1 are probably not comparable to other methods.

Detailed comments:
- Since y ∈ {-1, 1}, the "otherwise" value in Eq. (4) should probably be just 1.
Q2: Please summarize your review in 1-2 sentences
Summary: I like the idea of using gaze as supervision for localization, and the gaze data is well integrated into the structured-output learning framework. Due to the negative points above, however, the paper is not as conclusive as it could be.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes the use of eye-gaze data for spatio-temporal action localization in videos, showing improved action classification accuracy and state-of-the-art action localization. The information is incorporated through a latent SVM framework with modified constraints. The authors conduct various experiments to show that their method performs better than existing methods for localization and that eye-gaze data can serve as a better prior than the bounding boxes used by previous works.

Overall, I found the paper to be well written, but the experimental results are a little lacking. My concerns are explained below.

Pros:
- Interesting problem formulation of using eye-gaze data instead of bounding box data for learning
- Experimental results are thorough, and various problem settings are evaluated, such as action classification, localization, and gaze prediction. The experimental results suggest that the method proposed in the paper outperforms existing methods on the given tasks.

Cons:
- The cost and difficulty of collecting eye-gaze data is not well justified by the observed improvements. While eye-gaze data may be more natural in some ways, it is significantly harder to collect than bounding-box annotations, which can easily be gathered via crowdsourcing since no specialized equipment is required. This makes it significantly harder to scale the approach to other datasets.
- How important is it to model gaze in a category specific manner? If gaze could be used for general saliency prediction, then the same method could be applied to other datasets more easily.
- It is not clear why the authors do not compare their results to other existing methods for action classification such as [Z1] and [Z2] (listed below). These methods achieve 85.6% and 86.5%, compared to the reported 82.1%. It should be noted that [Z1] and [Z2] do not report results on action localization, but the same can be said of [16] and [12], which are reported here. [Z1] and [Z2] use chi-square kernels for learning, which is an advantage of their methods, and it is not clear whether the same kernels can be applied in the formulation presented here.
- What are the corresponding localization results of the proposed method on the 3 categories used by Tran and Yuan [18]? It would be nice to include these in the figure caption.
- No baselines are provided for gaze localization and prediction. Simple methods such as exemplar-based detectors [Z3] could be used for gaze prediction to show the necessity/strength of your method.

References (I use Z to be distinct from paper references)
[Z1] Evaluation of local spatio-temporal features for action recognition. In BMVC, 2010.
[Z2] Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[Z3] Ensemble of Exemplar-SVMs for Object Detection and Beyond. In ICCV, 2011.

The paper is generally well-written but here are some minor language errors:
- L51: this comes at expense --> add "the"
- L69-70: miss --> mis
- L82: proved --> have proven
- L134: extend --> extended
Q2: Please summarize your review in 1-2 sentences
This paper proposes a modification of the latent SVM to incorporate eye-gaze information for action classification and localization. While the results are promising, there are some concerns stated in the previous question that should be addressed by the authors.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank the reviewers for their helpful and thoughtful comments.

R1
==
We will address the specific comments to improve the clarity of the paper.

R2
==
Q1: The hypothesis that gaze locations always coincide with locations of actions does not seem to be verified... Will gaze data be reliable in all cases?

There is no guarantee that the eye-gaze data will always coincide with the region of interest of the action (we illustrate this in Fig. 1 (b)). However, a portion of the eye-gaze fixations usually does fall within the location of the action at some point during its execution. This is precisely why we treat eye-gaze as a weak supervisory signal and do not aggressively penalize regions that do not contain eye-gaze during training (see Eq. 5 and 6). Moreover, we experimentally show that our localization results (Fig. 2, action localization reported on the test set) outperform other methods that use a combination of hand annotations and person detections.
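Purely as an illustration of what such a soft, bounded penalty can look like (a hypothetical form for intuition, not the paper's Eq. 5 and 6), one can imagine a localization cost driven by the fraction of gaze density a candidate region captures:

```python
import numpy as np

# Hypothetical illustration only -- NOT the paper's Eq. 5-6.
# A bounded, asymmetric localization penalty: regions covering a reasonable
# fraction of the gaze mass are barely penalized, and even gaze-free regions
# incur only a bounded cost (weak supervision rather than a hard constraint).
def gaze_penalty(gaze_map, box, tau=0.2, low=0.1, high=1.0):
    """gaze_map: 2D non-negative array of gaze density for one frame.
    box: (x1, y1, x2, y2) in pixel coordinates."""
    x1, y1, x2, y2 = box
    captured = gaze_map[y1:y2, x1:x2].sum() / (gaze_map.sum() + 1e-9)
    return low if captured >= tau else high * (1.0 - captured)

# Example: a region covering the gaze peak gets the small penalty.
g = np.zeros((100, 100)); g[40:60, 40:60] = 1.0
print(gaze_penalty(g, (30, 30, 70, 70)))  # -> 0.1
print(gaze_penalty(g, (0, 0, 20, 20)))    # -> 1.0
```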

Q2: What if training videos contain multiple actions?

Since we localize actions in space and time, the presence of multiple actions in the training/test videos can also be handled by the proposed framework. The fact that our model only assumes that a fraction of eye-gaze fixations fall within the action's region of interest (Eq. 5 and 6) further helps in this case.

Q3: The output of an automatic person detector/tracker could be thought of as an alternative ... for learning action localization. ... Weakly-supervised action localization has been explored without gaze ... in [Siva & Xiang, BMVC11]

We compare our work to [9] and [16]: in [9], the action model is driven by an automatic person detector in both the training and test stages, while [16] explores weakly supervised action learning and is similar in nature to [Siva & Xiang, BMVC11] (which we will cite and discuss). Our model outperforms both approaches, as we show in the paper. However, we acknowledge the recommendation and agree that employing a person detector in place of the gaze map in our model is an interesting experiment to conduct in the future.

Q4: Action classification is reported for a non-standard experimental setup ... 10-fold cross-validation setup should be reported

For this dataset (UCF Sports), 10-fold or Leave-One-Out cross-validation is biased: the model tends to learn the scene of the action rather than the action itself (many actions are captured in the same location; please see [9] for more details). Therefore, we use the train/test split provided in [9], which serves as a fairer evaluation. In addition, recent works on this dataset use the same train/test split [9, 12, 16, 18]. Using the split provided in [9] also allows us to compare our results more directly to the two methods most closely related to the proposed model, [9] and [18].

Q5: Evaluation of action localization seems to be done for frames where action is detected ... results in Table 1 are probably non-comparable to other methods.

In footnote 7 (page 7), we mention that all methods (ours, [9], and [18]) are evaluated on different subsets of frames and are therefore not directly comparable. Because all of these methods localize actions in space and time (and hence may, and do, fire on different subsets of frames), a perfectly direct comparison is difficult.

R3
==
Q1: The cost and difficulty of collecting eye-gaze data is not well justified ... bounding-box annotations which can be easily collected via crowdsourcing as no specialized equipment is required.

While eye-trackers are currently more expensive, we believe that collecting eye-gaze data will become more feasible in the future, and it may be easier for users to watch a video in a natural setting than to perform frame-by-frame bounding-box annotation. In addition, indicating strict spatio-temporal boundaries of an action is not a trivial task; e.g., it is an open question whether the bounding box for the action "riding a horse" should include the horse or not. Eye-gaze circumvents such labeling choices. Moreover, our approach shows superior performance over the methods [9, 12, 18] that were trained with hand-annotated bounding boxes.

Q2: It is not clear why the authors do not compare their results to ... [Z1] and [Z2]

We would like to point out that both [Z1] and [Z2] evaluate their approaches using Leave-One-Out cross-validation, which differs from our evaluation based on the train/test split of [9]. As shown in [9], the LOO setup tends to memorize the scene of the action rather than learn the action model, which leads to higher numerical results in action classification than the train/test split evaluation. Due to the high scene correlation between the actions, we believe that using the train/test split for learning and evaluation is fairer than the Leave-One-Out setup.

Q3: [Z1] and [Z2] use chi-square kernels for learning, ... not clear the same set of kernels can be applied in the formulation presented here.

For simplicity, and in order to compare results with [9], we employ a linear kernel for learning. However, our model can easily be extended to include a chi-square kernel, e.g., by using approximate kernels as is done in [12].
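As a concrete, simplified illustration of the approximate-kernel route (a sketch using scikit-learn on synthetic data, not the exact setup of [12]), bag-of-features histograms can be passed through an explicit additive chi-square feature map and fed to a linear SVM:

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic non-negative histograms standing in for bag-of-features descriptors.
rng = np.random.default_rng(0)
X_train = rng.random((100, 50)); y_train = rng.integers(0, 2, 100)
X_test = rng.random((20, 50))

# Explicit additive chi-square feature map + linear SVM approximates a
# chi-square kernel SVM while keeping training linear in the data size.
clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC(C=1.0))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
```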

Q4: "What are the corresponding localization results of the proposed method on the 3 categories used by Tran and Yuan [18]?"

Average Precision (Fig. 2):

                Ours   Tran & Yuan [17]   Tran & Yuan [18]
Diving          100    10                 41
Running          67    70                 77
Horse Riding    100    42                 100

Q5: “.. no baselines provided for gaze localization and prediction.”

For the final version of the paper we will compare our results to the bottom-up HoG-MBH saliency detector made publicly available a few weeks ago by S. Mathe and C. Sminchisescu [11].