
Submitted by
Assigned_Reviewer_2
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
UPDATED AFTER AUTHOR FEEDBACK AND OTHER REVIEWS:
Based on the author rebuttal and the comments of the other
reviewers, I still believe the paper is worthy of acceptance. However, I
agree with the other reviewers that the paper merits a ``1'' (incremental)
rather than a ``2'' (major novelty) in the impact score.
ORIGINAL REVIEW:
Summary: This paper proposes an image-based model for
visual clutter perception (``a crowded, disorderly state''). For a given
image, the model begins by applying an existing superpixel clustering then
computing the intensity, colour and orientation histograms of pixels
within each superpixel. Boundaries between adjacent superpixels are then
retained or merged to create ``proto-objects''. The novel merging
algorithm acts on the Earth Mover's Distance (EMD), a measure of the
similarity between two histograms. The distribution of histogram distances
in each image for each image feature is modeled as a mixture of two
Weibull distributions. The crossover point between the two distributions
(or a fixed cumulative percentile if a single distribution is preferred by
model selection) is used as the threshold point for merging: an edge is
labelled ``similar'', and the superpixels merged, if the pair of
superpixels exceeds the threshold point for all three features. The clutter
value for each image is the ratio of the final number of proto-objects to
the initial number of superpixels (i.e. 0 = no proto-objects, not
cluttered; 1 = all superpixels are proto-objects). The model is
validated by comparing to human clutter rankings of a subset of an
existing image database. Human observers rank images from least to most
cluttered, then the median ranking for each image is used as the ground
truth for clutter perception. The new model correlates more highly with
human rankings of clutter than a number of previous clutter perception and
image segmentation models (including human object segmentation from a
previous study).
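To make the final scoring step of this summary concrete, here is a minimal Python sketch of the merge-and-count idea. It is not the authors' implementation: the function name `clutter_score`, the edge-list representation of adjacent superpixels, and the `is_similar` predicate (a stand-in for the per-image threshold test on all three features) are assumptions made purely for illustration.

```python
def clutter_score(num_superpixels, edges, is_similar):
    """Merge superpixels joined by 'similar' edges (union-find) and return
    the ratio of resulting proto-objects to initial superpixels.

    edges      -- iterable of (a, b) index pairs for adjacent superpixels
    is_similar -- hypothetical predicate deciding whether an edge is merged
    """
    parent = list(range(num_superpixels))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        if is_similar(a, b):
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    num_proto_objects = len({find(i) for i in range(num_superpixels)})
    return num_proto_objects / num_superpixels


# Four superpixels in a chain; only the first edge is labelled similar,
# so three proto-objects remain out of four superpixels.
chain = [(0, 1), (1, 2), (2, 3)]
score = clutter_score(4, chain, lambda a, b: (a, b) == (0, 1))
```

Because at least one proto-object always remains, the smallest value this sketch can return is 1 / num_superpixels rather than exactly 0.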
Strengths: In my opinion, scene clutter (a
``crowded, disorderly state'') is not the most interesting quality of
perception, but a workable definition of ``set size'' in complex scenes is
a very important problem. I think this paper is the precursor to an
important contribution: namely, whether this clutter metric relates to
performance in behavioural tasks such as search. The method for clustering
seems to do a good job and to my knowledge is a sufficiently different
approach to previous methods as to offer a novel contribution.
The behavioral ranking task is neat and the dataset will be of interest to the
community for future model testing. The model correlates admirably with
clutter rankings. I believe the paper to be a good contribution to the
NIPS community.
Weaknesses: It is not clear from the
manuscript how strongly the success of the clutter metric depends on the
type and parameters of the superpixel preprocessing. I assume this is
done to make computation more tractable, but more detail should be
provided in the manuscript as to why this does / does not matter.
The paper makes the theoretical (based on previous work) and
empirical (based on three histograms) claim that the similarity distances
follow Weibull distributions. I am not convinced beyond doubt that this is
necessarily true. For example, I wonder whether the histograms in Figure 2
would be equally well fit by other mixture models using e.g. Gamma
distributions. However, I don't think this distinction is important for
the applicability of the method.
Clarity: The manuscript is
generally clearly written, but I believe some reordering of
methodological details could save the reader some effort. For example, we
are told in Section 2.1 that superpixels are extracted, but not until line
336 is the actual method referenced. Line 149: ``However, there is no
evidence showing that similarity distances follow symmetrical or normal
distributions. Therefore, we argue that the similarity distances x_f
computed by EMD follow a Weibull distribution, which is a distribution of
the Exponential family that is skewed in shape.'' The ``therefore''
implies that the Weibull distribution is a logical given, but it's not at
all clear from this how a Weibull distribution was selected. The choice is
justified more in subsequent paragraphs but the paper would be more clear
if these paragraphs were rearranged: first tell us why you selected
Weibull then tell us ``for these reasons we argue... Weibull''.
Minor points / suggestions:
- For the initial MLE guess (line 190): I understand that this would be
set separately for every image. Is this correct?
- Line 269: ``making p and epsilon the only model parameters in our
framework'': Should this read ``tau'', not ``p''? If not, what does ``p''
refer to here?
- Is it correct that each image will contain a potentially different
threshold point, depending on the fitting of the mixture model to the
distance histograms from that image?
- Section 4.1 is unclear to me: does ``image features'' here refer to the
similarity distance metrics in each of intensity, color and orientation,
or the intensity, color and orientation histograms themselves? In the
latter case, is it right to say that a circular statistic such as
orientation is ``upper bounded''?
- Lines 347-350: ``visual clutter perception may ignore feature
dissimilarity on the order of 14% when deciding whether two adjacent
regions are similar and should be merged.'' Given that the largest value
of the epsilon parameter tested was 15%, I would want to see some dramatic
drop of correlation within that 14-15% range to believe this more. A plot
of the average correlations against epsilon (with five curves, one for
each tau) would be informative here.
- It is interesting that human object count is such a poor predictor of
clutter. Perhaps the authors could provide example images from the
database to show where clutter rankings and object counts diverge.
- Is it meaningful to ask why the correlation of the clutter model with
the median ranking is higher than the inter-observer agreement? On line
309 you present the average correlation between pairs of participants.
What is the correlation of each participant with the average of all other
participants? That is, does the clutter model predict the average ranking
better than any individual human?
- The reader is often provided with an average rank order correlation and
a p-value (presumably telling us that this correlation is significantly
different to zero). I would find it more informative to be given the mean
correlation and also the range of observed correlations.
- First sentence of Section 2.4 has an incorrect figure reference; should
be Fig 2d?
- Line 399: missing capital letter to start sentence.
- Line 401: "our work suggests a mean" should be "means".
Q2: Please summarize your review in 1-2 sentences
The paper presents a model of clutter perception using
image feature distance metrics that is validated against human clutter
rankings. The model and data set are novel enough to be of interest to the
NIPS community.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
This paper provides a new clutter quantifying method,
which is still based on low level cues but uses intermediate level
descriptors (proto-objects) between fragments and objects to capture the
clutter in a scene.
The strength of the paper is that it is a
complete work, from the definition of the problem to the presentation of
the method, results and comparison to other methods. It is also a solid
technical work that provides an algorithm that is superior to the other
state-of-the-art clutter perception and image segmentation methods. In
addition the paper is clearly written.
The main question I have is
about originality of the work. The authors emphasize that the main
original idea of the paper is the proposal that feature histogram
similarity based on the Earth Mover's Distance measure should be modeled by the
Weibull function. However, it seems to me that the Yanulevskaya &
Geusebroek 2009 paper (which the authors do cite) essentially presents
this idea. It is definitely the authors' achievement that they put the idea
into a working method, but they cannot claim full ownership of the
original idea.
They also basically choose the next most obvious
step to solve a problem and combine off-the-shelf methods to move to the
next level. This is decent craftsmanship but it lends an incremental feel
to the paper.
Typo: p. 5 last line: the first parameter is τ and
not p.
Quality: Good. Clarity: Good. Originality: Fair.
Significance: Fair.
Q2: Please summarize your review in 1-2 sentences
Good quality incremental craftsmanship.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
This paper presents a method to estimate how much clutter there is in a
given image. The method presented in the paper first extracts superpixels
from the image and then estimates the normalized Earth Mover's Distance
(EMD) between neighboring superpixels using histograms of three
modalities: intensity, color and orientation. For each adjacent pair of
superpixels it is estimated whether they are "similar", in which case they
belong to the same cluster (called a proto-object here), or "dissimilar",
in which case they do not. The distribution of EMD distances is modeled by
a mixture of Weibull densities, which is used to find the threshold of
similarity. Finally, the clutter measure is set to be the ratio of the
number of clusters (proto-objects) to the number of superpixels. The
method is shown to perform well on a new dataset of clutter perception
ground truth data.
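The thresholding step in this summary can be illustrated with a small Python sketch that locates the point where the two weighted mixture components cross. It rests on several assumptions not taken from the paper: each component is a two-parameter Weibull (shape, scale), the mixture has a single weight `w`, the crossover is found by plain bisection inside a caller-supplied bracket, and all names and parameter values are hypothetical.

```python
import math


def weibull_pdf(x, shape, scale):
    """Density of a two-parameter Weibull distribution at x."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def crossover(w, comp1, comp2, lo, hi, tol=1e-9):
    """Find x in [lo, hi] where w*f1(x) equals (1-w)*f2(x) by bisection.

    comp1 and comp2 are (shape, scale) tuples; the bracket [lo, hi] must
    contain exactly one sign change of the weighted density difference.
    """
    def diff(x):
        return w * weibull_pdf(x, *comp1) - (1 - w) * weibull_pdf(x, *comp2)

    if diff(lo) * diff(hi) > 0:
        raise ValueError("no sign change inside the bracket")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(lo) * diff(mid) <= 0:
            hi = mid  # the sign change lies in the lower half
        else:
            lo = mid  # the sign change lies in the upper half
    return (lo + hi) / 2


# Hypothetical mixture: a "similar" component near 0.14 and a "dissimilar"
# component near 0.52; the bracket spans the region between the two modes.
threshold = crossover(0.6, (2.0, 0.2), (3.0, 0.6), 0.14, 0.52)
```

When model selection prefers a single distribution there is no crossover, which is why the reviews mention a fixed cumulative percentile as the fallback.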
Quality: The suggested method is nice, though I feel the paper is not very
interesting, neither from the problem-setting nor from a technical
perspective. Much of the paper is devoted to how to learn a mixture of
Weibull distributions, but there is no explanation as to why two methods
are used, what the fundamental differences are (not the technical ones)
and, most importantly, how they affect the final outcome. Moreover, I am
wondering why EM was not used to estimate the mixture components (as it is
the most popular choice in the literature). Furthermore, I have several
concerns about the model evaluation section (4.2); specifically, it is not
mentioned whether the model parameters were trained over a separate
training set, or over the same images used as a test set (in which case
the results are quite invalid).
Clarity: The paper is nicely written and presented.
Originality and significance: The contributions of this paper
are somewhat underwhelming. Using Weibull statistics for EMD similarity
distances is nice, but it is quite a small step forward, and I feel that
the analysis of how to fit a mixture of Weibull densities is somewhat out
of place for this paper (as I said before, it is technical and there is no
analysis of how this affects the results other than numbers).
Q2: Please summarize your review in 1-2 sentences
In summary, this is a nicely written paper which
suffers from low significance and somewhat lacking
analysis.
Submitted by
Assigned_Reviewer_9
Q1: Comments to author(s).
paper 116
This paper parses scenes into
similar regions (called 'proto-objects') based on the similarity of their
feature (color, intensity, orientation) histograms, and finds that the
number of proto-objects discovered correlates with human rank orderings of
visual clutter. This measure is appealing in that it is fairly intuitive
and fast to compute, and the results (compared to other models) are
impressive.
I think this paper misses the task dependence of
visual clutter: depending on what the visual search target is, different
types of clutter will be more or less influential. A forest is a
relatively uncluttered environment in which to search for a Toyota Prius,
but is very cluttered when searching for a pine tree (and vice versa for a
parking lot). As I understand, visual clutter is a noteworthy measure
insofar as it relates to search efficiency. I believe that the current
measure of clutter might predict subjective rank orderings of clutter, but
would fail to predict search efficiency because it does not respect the
target-dependence of clutter measures.
Another intuitive measure
of clutter is 'how easily could I insert an object that would be salient?'
This is the intuition that ref [25] offers for their feature congestion
model. Again, for this measure, counting the number of fairly homogeneous
regions in a display misses the degree to which some set of features would
be salient in that image.
So, the current model seems very good at
predicting subjective ratings of clutter, but I think it would be much
worse than other clutter measures at predicting the magnitude of more
practical implications of 'clutter' in a scene (which is what at least
some of the prior models were designed to do).
At the very least,
some discussion of this point is necessary. However, I think the current
paper needs further experiments and measurements to assess whether the
subjective ratings of clutter (which the current model predicts) have any
relationship to the practical implications of clutter that researchers
have previously been interested in.
Q2: Please summarize your review in 1-2 sentences
The current proto-object parsing model predicts
subjective measures of visual clutter, but I believe it is not well-suited
to predict the practical implications of clutter: search difficulty and
the ease of inserting a salient object into a scene. I think further
experiments/measures are necessary to show that what the current model can
predict relates meaningfully to practical consequences of visual clutter
for search tasks.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for the feedback. We will
address as many of the comments as possible in the final copy.
First, we’d like to point out the technical significance of our
proposed parametric modeling of Earth Mover's Distance (EMD) using the
Weibull distribution. Parametric models often suffer from difficulties in
satisfying strict data assumptions, while nonparametric models do not.
However, parametric models are more precise and efficient if the
assumptions can be satisfied: we established the link between EMD and
Weibull, and showed that our features satisfy the theoretical assumptions.
Therefore, we believe it is also a more general contribution in all areas
that require working with similarity distances, as we introduced a
parametric way of working with the popular EMD similarity distance.
R2: We appreciate the positive comments and feedback. According to
our experiments, different superpixel algorithms did not affect results
much (except run time), while parameters are not an issue since we always
used default settings. Gamma may also be useful empirically (it is also a
member of the exponential family). We think it is more theoretically
appropriate to use Weibulls due to the connection that we have established
between Weibull and Lp-distance statistics, on the specifics of [7].
Furthermore, we showed that our features do satisfy the self-correlated
and non-identically distributed assumptions which justified the connection
(Sec. 4.1). Minor points:
1. The automatic initial MLE guesses are different for every image, but
based on the same method (line 184~195).
2. Should be “tau”, thanks!
3. Yes, each image contains a different similarity threshold,
automatically computed.
4. Image features in Section 4.1 refer to the actual intensity, color, and
orientation histograms. Upper-bounded means that each feature’s values
have a finite upper bound (intensity: 255, hue: 1, orientation: 359).
5. The correlation steadily drops further from r=0.8094 to r=0.7572 as
epsilon goes from 0.14 to 0.2, at a 0.01 interval. We will try to make
space for this plot.
6. Our intuition was that human object count does not account for complex
vs simple objects, and thus is not a reliable set-size measure.
7. True, but it is not unprecedented as [25] also reported their model
(r=0.83) correlated higher than inter-subject correlation (r=0.72).
8. Aside from the full range of parameters, it would be more meaningful to
compute mean correlation from a more appropriate subset of parameter
values, i.e. tau=0.6~0.9, epsilon=0.05~0.15 results in mean r=0.7298; or
r=0.7787 when tau=0.6~0.9 and epsilon=0.10~0.15.
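The leave-one-out agreement that R2 asks about (the correlation of each participant with the average of all other participants) could be computed along the following lines. This is only a sketch using the standard library: the data layout `rankings[p][i]` (rank given by participant p to image i) and the function names are assumptions, and the Spearman correlation is implemented as a Pearson correlation on tie-averaged ranks.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def leave_one_out_agreement(rankings):
    """Correlate each participant with the mean ranking of all the others."""
    scores = []
    for p, own in enumerate(rankings):
        others = [r for q, r in enumerate(rankings) if q != p]
        mean_of_others = [mean(column) for column in zip(*others)]
        scores.append(spearman(own, mean_of_others))
    return scores
```

Comparing the model's correlation against the range of these per-participant scores would show whether the model predicts the consensus better than any individual human does.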
R5: It is crucial to make the distinction between actual image
statistics, and similarity distance statistics (line 165~167), which are
two very different quantities. Here we stress our use of Weibull to model
EMD (an Lp-based similarity distance) being completely different from using
Weibull to model actual image features (whole image and local patch edge
distributions) as done in [35]: similarity distance values are the results
of computing the similarity between pairs of actual feature distributions.
We provided the theoretical justification for using Weibull to model EMD
(Section 2.2 and 4.1), hence Weibull is the appropriate distribution for
EMD. This conclusion is derived from a theoretical basis that is
completely different from image features being Weibull ([35]). This
theoretical justification of parametric modeling of similarity distance is
novel to the best of our knowledge.
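The distinction between feature histograms and similarity distances can be seen in a short Python sketch: the inputs are two feature histograms, and the output is a single distance value, which is the quantity being modeled. The sketch assumes one-dimensional histograms on a linear (non-circular) bin axis with unit spacing, so it would not treat orientation as circular; the function name and the normalization by the number of bins minus one are illustrative choices, not details from the paper.

```python
from itertools import accumulate


def emd_1d(hist_a, hist_b):
    """Normalized 1-D Earth Mover's Distance between two histograms.

    Both histograms are rescaled to sum to 1. For a linear bin axis the
    EMD equals the summed absolute difference of the two cumulative
    distributions; dividing by (bins - 1) maps the result into [0, 1].
    """
    if len(hist_a) != len(hist_b) or len(hist_a) < 2:
        raise ValueError("histograms need the same length, at least 2 bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    cdf_a = accumulate(v / total_a for v in hist_a)
    cdf_b = accumulate(v / total_b for v in hist_b)
    work = sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))
    return work / (len(hist_a) - 1)


# All mass must travel across the full axis, giving the maximum distance.
far_apart = emd_1d([1, 0, 0], [0, 0, 1])
```

Collecting such values over all adjacent superpixel pairs in an image gives the per-image distance distribution to which the mixture is fitted.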
R8: We point the Reviewer to the comments of R2 on why the problem
setting is interesting. Regarding the technical contributions:
1. MLE and NLS are two different, widely used classes of optimization
methods; our motivation was to test how different optimization methods
affect the final model predictions. Both methods result in high
correlation of our model to human clutter rating (table 1), indicating
that our approach is very robust and mostly invariant to different
optimization methods.
2. EM is a general class of methods for finding the MLE, where the M-step
usually requires taking the derivative of the likelihood function. In our
case, the likelihood function of the 7-parameter Weibull mixture is very
complex (Eq. 3) and taking the gradient of such a complex likelihood
function is often avoided (Kaylan and Harris, 1982, Mumford 1997).
Therefore, we adopted the derivative-free Nelder-Mead algorithm for
obtaining the MLE. In fact, the algorithm is very similar to EM in that
its steps 1 and 2 (Sup Material Sec. 1) are essentially the E-step, while
step 3 is the M-step in a non-derivative way. Fig 2 shows that the
Nelder-Mead algorithm worked well empirically.
3. We have validated in TWO ways: We followed the convention for reporting
the optimal dataset parameters and correlation by other state of the art
clutter models ([6][19][25][18]) using the entire dataset (r=0.8094). We
also applied 10-fold cross validation (line 340~345), which resulted in an
average unseen test-set correlation of r=0.7891. Such high correlation
indicates that our model is very generalizable and robust to unseen data.
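The second validation scheme can be sketched in Python as follows. This is an illustration of k-fold parameter selection in general, not the authors' procedure: the `score(params, indices)` callback (assumed to return the model-human correlation on the given images), the parameter grid, and the contiguous unshuffled folds are all hypothetical.

```python
def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n_items // k + (1 if f < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_items, k, param_grid, score):
    """Pick the best parameters on each training split, then report the
    mean score of those parameters on the corresponding held-out fold."""
    test_scores = []
    for test_idx in k_fold_indices(n_items, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_items) if i not in held_out]
        best = max(param_grid, key=lambda p: score(p, train_idx))
        test_scores.append(score(best, test_idx))
    return sum(test_scores) / len(test_scores)
```

The key property, which addresses R8's concern, is that the parameters scored on a held-out fold were chosen without looking at that fold. In practice the image order would usually be shuffled before splitting.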
R9: The goal of our work is to propose a
quantification of set-size that can be used in modeling clutter
perception. It is the first step leading up to a larger exploration of the
relationship between clutter and search behavior as the first Reviewer
(R2) also points out. On the other hand, Asher et al, Jrnl of Vision 2013
recently found that existing clutter models ([6][19][25]) were rather weak
in predicting search performance using 120 natural scenes (r~0.2), with
mostly non-significant p-values. Our preliminary experiments using Asher’s
human search dataset resulted in r=0.29 and p=0.0015, which is the highest
correlation with the most significant p compared to the existing clutter
models that Asher reported. We anticipate conducting a similar study to
evaluate our model in this regard in a follow-up journal paper.
 