Paper ID: 1440
Title: Can Peripheral Representations Improve Clutter Metrics on Complex Scenes?
This is a very interesting study that aims to model gaze-contingent clutter, rather than following current approaches that measure overall clutter in a scene regardless of gaze. This is an important endeavor, since we know that the effects of clutter depend strongly on eccentricity. The authors' approach is to take two standard computational models, one that computes clutter in a scene (irrespective of gaze) and one that models the effects of eccentricity (relative to gaze) on spatial pooling, and combine them to provide a measure of gaze-contingent clutter around a particular region of interest. The authors clearly demonstrate that gaze-contingent clutter provides a better estimate of human performance in detection tasks.
I think this study is important and potentially very impactful. The following is more a collection of questions, some about technical details and some about particular choices made by the authors, than criticism of the paper.

1. The authors' approach to measuring gaze-contingent clutter is rather elegant at its base, but some choices feel a bit arbitrary:
a. It seems somewhat ad hoc to define a 6 deg ROI for the final clutter estimate. If the aim is to truly take eccentricity into account, shouldn't the contribution to clutter be a smooth function of distance from the target?
b. Can this measure not be extended to a pixel-by-pixel clutter map, similar to Feature Congestion? I realize that the method is contingent on the target, but is that really necessary? What happens if you leave the target in the picture? Also, the details of the target may very much affect whether it is easily detectable or not.
c. Taking the difference between the two feature maps feels a bit arbitrary. I understand that the difference would be zero if the ROI is foveated, but this is not an if-and-only-if, right? If the features causing the clutter are at a spatial scale such that the foveal and peripheral maps are similar, the PIFC coefficient would be zero, yet detection would still be hindered.

2. I agree that measuring gaze-contingent clutter is important, but I wonder what the applications of gaze-contingent clutter estimates would be. Standard clutter measures are obviously useful for designing interfaces and layouts, but gaze is not available for most of these applications. I think it might be worth discussing.

Minor comments:
1. I think it would be useful to illustrate the fixation point in the relevant figures (using a cross, for instance).
2. I don't think I quite understood what to look for in Fig. 6, or what the take-home message of the figure is.
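To make point 1a concrete, a smooth eccentricity weighting could look something like the sketch below. The Gaussian falloff, the sigma value, and the pixels-per-degree conversion are my own illustrative assumptions, not the authors' method:

```python
import numpy as np

def eccentricity_weighted_clutter(clutter_map, fixation, sigma_deg=6.0, px_per_deg=30.0):
    """Weight each pixel's clutter by a Gaussian falloff in eccentricity
    from the fixation point, instead of a hard ROI cutoff.
    (sigma_deg and px_per_deg are illustrative assumptions.)"""
    h, w = clutter_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fy, fx = fixation
    ecc_deg = np.hypot(ys - fy, xs - fx) / px_per_deg  # eccentricity in degrees
    weights = np.exp(-0.5 * (ecc_deg / sigma_deg) ** 2)
    return float((clutter_map * weights).sum() / weights.sum())

# Sanity check: a uniform clutter map yields the uniform value regardless of fixation.
score = eccentricity_weighted_clutter(np.ones((240, 320)), fixation=(120, 160))
```

The point is only that the weighting is continuous in distance from fixation, so no pixel is abruptly excluded at a 6 deg boundary.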
2-Confident (read it all; understood it all reasonably well)
The paper describes a method for taking into account the eccentricity of a target when assessing the complexity of a visual search task.
On the positive side, the psychophysics experiments appear to be well conducted, and the improvement over the original model is very significant. On the negative side, the proposed PIFC score seems more like a heuristic than a proper model of search/crowding. The computations seem ad hoc: the foveated and regular maps are subtracted, and the distance between these maps is computed to yield a score which is then used to modulate the original map(!) The most problematic aspect of the work is that, as acknowledged by the authors, a simple measure of eccentricity correlates with human responses about as well as the proposed PIFC. The authors state that eccentricity alone, as opposed to their score, 'does not alter the clutter ranks' and point to Fig. 6d, but I have to confess that I simply do not understand what they mean.

Small comments:
1. It is not clear why the authors focus on the Feature Congestion model, since the proposed approach could be used in conjunction with any of the clutter models reviewed.
2. The paper's readability could be improved. For instance, I could not understand how the Feature Congestion score is actually computed. Both the figure legend and the main text say: "A color, contrast and orientation feature map for each spatial pyramid is extracted, and the max value of each is computed as the final feature map." Are these the raw RGB values, etc., or are these center-surround channels? How exactly is the max computed? Figure 1 does not contain any legend or colorbar, so it does not really help.
3. The previous-work section could also use some reorganization. The literature review should focus on related work: previous clutter models, but also previous work on modeling cortical magnification/eccentricity. Clutter metrics and evaluations seem to belong in a methods section more than in a background section.
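For concreteness, one possible reading of the quoted sentence is sketched below; the 2x2 mean pooling as a stand-in for Gaussian blur, the global-deviation contrast feature, and the upsample-then-pixelwise-max interpretation are all my assumptions, since the paper does not specify them:

```python
import numpy as np

def gaussian_pyramid(img, levels=3):
    """Simple pyramid via 2x2 mean pooling (a crude stand-in for Gaussian blur)."""
    pyr = [img]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape
        pyr.append(pyr[-1][:h - h % 2, :w - w % 2]
                   .reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def local_contrast(img, eps=1e-8):
    """Crude contrast feature: absolute deviation from the image mean."""
    return np.abs(img - img.mean()) / (img.std() + eps)

def pyramid_max_feature(img, levels=3):
    """One reading of 'the max value of each is computed as the final feature
    map': compute the feature at each pyramid scale, upsample back to full
    resolution, and take the pixelwise maximum across scales."""
    h, w = img.shape
    maps = []
    for level in gaussian_pyramid(img, levels):
        feat = local_contrast(level)
        ky = h // feat.shape[0] + 1
        kx = w // feat.shape[1] + 1
        # Nearest-neighbor upsample via block replication, then crop.
        maps.append(np.kron(feat, np.ones((ky, kx)))[:h, :w])
    return np.max(maps, axis=0)

rng = np.random.default_rng(0)
final_map = pyramid_max_feature(rng.random((64, 64)))
```

If the authors mean something else (e.g. a max over channels rather than over scales), the text should say so explicitly.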
2-Confident (read it all; understood it all reasonably well)
The paper proposes a new clutter model that integrates peripheral information on top of an existing clutter model (Feature Congestion). Their model was developed for predicting target detection rate, with targets situated at different eccentricities of a fixation point. The paper also proposed an experimental design for collecting the behavioral data in a forced fixation search experiment. The results showed that the foveated version of the feature congestion model was able to better explain the target detection rates at different eccentricities than the original feature congestion model.
This work suffers from several issues in my opinion: problem statement, novelty and technical contribution, and experiments.

Problem statement: The authors state that most clutter models produce a global metric of clutter perception in which the fovea effect is not taken into account. They then argue that the fovea effect should be part of the clutter model, framing the problem as a target hit-rate modeling study. I wonder, why is this an interesting problem? As current clutter models (other than Feature Congestion) have achieved high correlations with human clutter rankings, why would it even be important to add peripheral structure to Feature Congestion, an existing model that is a good margin behind the current state of the art? Also, the problem is centered around the "forced fixation search" experiment, in which the subject fixates at a specified location while the target is somewhere in the image. This leads the foveated clutter model to take a fixation-location input parameter in order to generate the peripheral structure as part of the pipeline. However, this is troubling, because I do not see any circumstance under which a viewer would be forced to fixate on some part of a UI, control panel, advertisement, or other visual display; in real-world scenarios people view displays freely. Therefore, for clutter perception models to be applicable, the forced-fixation points should at least be chosen meaningfully, for example with a fixation-prediction algorithm, or by obtaining human fixations from an initial free-viewing experiment and then using those as the forced-fixation locations for the actual search experiment.

Novelty and technical contribution: This work has very limited technical scope, as it addresses a very specific corner of clutter perception modeling: the forced fixation search problem.
The work combines an existing peripheral-structure method with an existing clutter model (Feature Congestion) in a very limited scope of clutter perception. The proposed forced-fixation search experiment is also so specific that it does not offer a sufficiently general contribution to the NIPS community. In addition, since the Feature Congestion model is a good margin behind the current state of the art in predicting global human clutter perception, it would be more desirable for the authors to base their method on a state-of-the-art clutter model, such as the proto-object model.

Experiments: As mentioned previously, if the authors want to build on top of existing work, it would be more desirable to adopt the state of the art. For the experiments to be more convincing, the authors should propose a more general framework that can be combined with most clutter models, so that the experiments can include foveated Feature Congestion, a foveated proto-object model, a foveated scale-invariance model, etc. With the current experiments, which compare the proposed model against the original Feature Congestion model, the results are underwhelming: both methods are evaluated on the forced-fixation search data, and the original Feature Congestion model was never designed for that specific task, so of course it does not do as well as the proposed method, which was designed specifically to solve this problem. A better claim would show that the foveated model outperforms other clutter models on the general clutter-ranking task, not on a task specifically designed for the foveated model. Therefore, the comparison is weak and should include more models in a more general setting.
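A model-agnostic foveation wrapper of the kind suggested above might look like the following sketch. The linear growth of pooling radius with eccentricity and the box-mean pooling are placeholder assumptions of mine, not a proposal from the paper:

```python
import numpy as np

def foveate_clutter_map(clutter_map_fn, img, fixation, pool_scale=0.2):
    """Run any clutter model that returns a per-pixel map, then pool its
    output with a box window whose radius grows linearly with eccentricity
    from the fixation point (an assumed stand-in for peripheral pooling)."""
    clutter = np.asarray(clutter_map_fn(img), dtype=float)
    h, w = clutter.shape
    fy, fx = fixation
    out = np.empty_like(clutter)
    for y in range(h):
        for x in range(w):
            r = int(pool_scale * np.hypot(y - fy, x - fx))  # pooling radius
            out[y, x] = clutter[max(0, y - r):y + r + 1,
                                max(0, x - r):x + r + 1].mean()
    return out

# Any per-pixel clutter model plugs in, e.g. a trivial gradient-magnitude map:
grad_clutter = lambda im: np.abs(np.gradient(im)[0]) + np.abs(np.gradient(im)[1])
pooled = foveate_clutter_map(grad_clutter, np.eye(32), fixation=(16, 16))
```

With such a wrapper, the experiments could swap in Feature Congestion, the proto-object model, or any other per-pixel clutter metric as `clutter_map_fn` without changing the foveation step.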
2-Confident (read it all; understood it all reasonably well)
This paper studies clutter perception research by introducing a new foveated clutter model to predict the detrimental effects in target search utilizing a forced fixation search task.
The authors propose a peripheral representation to improve clutter metrics on complex scenes. The approach taken is rather straightforward. The results show that Foveated Feature Congestion clutter scores correlate better with target detection than regular Feature Congestion. The immediate implications for vision problems are not evident. It is not clear whether the paper is really suitable for NIPS or whether it should be submitted to a conference such as VSS.
1-Less confident (might not have understood significant parts)
The authors claim to have found a clutter score based on where the fixation is relative to the target location, and that their work is the first to give such a score taking into account the eccentricity of the target position relative to the gaze. However, in reviewing previous work, they state that van der Berg et al. also had a clutter score that took eccentricity into account. The paper is poorly written: the explanations are not clear enough, there are typos in the math formulas, the mathematical formulation is loose (such that easy improvements were not made), there are punctuation errors in the English, etc.
It is nice to take eccentricity into account in a clutter score. However, the paper is so poorly written that even if the underlying work had good quality, it could not show. For example, on page 4 the authors state: "Example: session 1 had the target at 12 o'clock, while session 2 had the target at 3 o'clock". What does this mean? If the target in all trials was in the same direction from fixation, what does it mean to have the subject search for the target? A subsequent sentence states "all subjects had a unique fixation point for every trial for the same eccentricity values" --- what does this mean? Equation (2) has a typo, and the variable "t'_0" is obviously not needed --- these suggest that the work was done in a hurry, without enough time to clean up and improve the obvious bits. The topic has potential, so perhaps this paper could be improved in the future.
2-Confident (read it all; understood it all reasonably well)
The authors describe a novel measure of object clutter in a scene based on the limitations and capabilities of the human visual system (e.g., foveal vs peripheral vision), validate this novel model based on a laboratory experiment and conclude that their model better assesses the impact of clutter than the models they compare it to.
This is a well-written paper, examining an interesting question about the limits of human perception and how they can be computationally quantified on natural images. My only concern is that the state of the art in models of peripheral vision and clutter (often referred to as crowding in the human vision literature) has moved on since many of the papers the authors cite. In particular, the authors may want to examine Rosenholtz's Texture Tiling Model (Rosenholtz et al., 2012) and more recent work from that group using the Texture Tiling Model to explain a wide variety of crowding effects across a range of stimuli (Keshvari and Rosenholtz, 2016).
3-Expert (read the paper in detail, know the area, quite certain of my opinion)