
Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes an algorithm for normalized cuts
of hypergraphs by formulating the cut as the minimization of a ratio
of two convex functions, which can be solved using existing methods
(RatioDCA, with an inner problem solved by a primal-dual method).
Semi-supervised learning on a hypergraph is formulated as a related
optimization problem and solved with a similar primal-dual method. On
several datasets, the proposed approach is shown to outperform an
alternative technique based on transforming the hypergraph into an
ordinary graph, for semi-supervised learning, clustering, and cut
objectives.
The paper is clear and well written. It is
technically sound and provides a significant contribution to the
problem of hypergraph cuts, and possibly to semi-supervised learning
and clustering, assuming a hypergraph-based approach is relevant to
the problem.
Concerning this last point, not much is said about
the relevance of the hypergraph approach. In all examples, the
hypergraph is not provided as a separate structure, but is built from
the covariates by making one hyperedge for all samples which share the
same value of one covariate. If the data points are originally
represented as vectors of features, other semi-supervised and
clustering techniques (e.g., based on scalar products or on an ordinary
graph built from the features) would make sense and should be compared
against.
Similarly, the relevance of semi-supervised learning to
the datasets is not discussed: what kind of performance would be
obtained without using the unlabeled samples? Admittedly, the point of
this paper is how the new formulation improves semi-supervised
learning, not whether semi-supervised learning is relevant.
A few minor points:
- Total variation seems to be accessory in the paper, whose main
achievement is to provide a better algorithm for hypergraph
normalized cuts. The current title is a little misleading from this
point of view.
- The semi-supervised problem (3) is itself a
prox of \lambda\Omega at the point Y. There may be no simpler way to
compute this prox than the proposed algorithm, but it could be useful
to point this out, to avoid confusion between the prox in (3) and the
prox of its first term and the conjugate of the second term, which are
used to compute it.
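In symbols, using the standard definition of the proximity operator (notation as in the paper), the observation is that problem (3) has exactly the form

```latex
% Standard definition of the proximity operator; problem (3) has this
% form with the label vector Y as the input point:
\operatorname{prox}_{\lambda\Omega}(Y)
  \;=\; \operatorname*{arg\,min}_{f}\;
        \tfrac{1}{2}\,\lVert f - Y \rVert_2^2 \;+\; \lambda\,\Omega(f).
```

so the whole semi-supervised problem is one proximity-operator evaluation, computed via the proxes of its building blocks.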
Q2: Please summarize your review in
1-2 sentences
Clear presentation, well-written paper.
Significant contribution, technically sound. Little discussion of
the relevance of using hypergraphs to represent this data (as opposed
to vectors or graphs).
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
Graph cut methods for semi-supervised classification
and clustering have been dominant over the last decade. Hypergraphs can
incorporate higher-order information about data than ordinary graphs and
should thus be preferable. Existing methods all have their own sets
of limitations, as discussed in Section 1.
In contrast, this paper
studies how to deal with the hypergraph cut directly, using the total
variation given by the Lovász extension. Two frameworks are proposed in
Sections 3 and 4, and an algorithm for solving the involved problems is
presented in Section 5. Experimental results in Section 6 are promising.
The
idea of this paper sounds quite interesting. However, the derivation is
not easy to follow. I have not checked the technical details, since I am
not familiar with the optimization problems in Sections 4 and 5. That
said, the extension in Section 3 for semi-supervised learning from graph
Laplacian matrices to the proposed functionals is natural, and the
theoretical results seem correct.
I have two minor questions.
Firstly, in Definition 2.1, it is required that f_1<=...<=f_n. It
seems that later, in the optimizations, f as an optimization variable may
not always satisfy this constraint. So when is it valid? Secondly, in the
experiments, numerical features are converted into categorical ones using
10 equal-size bins. Will this cause a loss of information?
** Comment
after authors' feedback **
The authors have clarified my concern
about the experimental setup.
Q2: Please summarize your review in 1-2 sentences
This paper studies how to deal with the
hypergraph cut directly, using the total variation given by the Lovász
extension. The idea is quite interesting, and the experimental results
are promising.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
The paper defines a notion of total variation on
hypergraphs, proves its convexity and relation to cuts, discusses
applications to learning and normalized cuts (for clustering), computes
proximity operators as part of optimization algorithms, and presents
numerical experiments comparing with a popular method of Zhou from 2006.
Overall, the paper tours several subdisciplines. There are a lot of ideas
and a few theorems (though of course they cannot say much about a global
solution in the case of the normalized cut problem, but no one can). The
paper stays focused despite the amount of material and it is well written,
with only occasional grammar issues.
Section 5 falls most into my
area of expertise, and I am satisfied with it. I am less familiar with
existing work on spectral clustering and existing approaches for
hypergraphs. This paper is essentially claiming to be the first major
advance in 7 years (since Zhou [11]), and I am inclined to believe this,
but am not completely certain. If it is true, then this paper should
definitely be accepted and will likely become well-known. But of course I
am a bit skeptical that the authors have not found any more recent work
than [11], given the amount of attention to the subject (or is it because
everyone is working on tensors and ignoring hypergraphs?).
Other
comments:
- Comparisons were always with [11], which uses the same
hypergraph framework and then reduces it with the CE technique to a graph
problem. But these problems (SSL, clustering) can be attacked from quite
different perspectives. A comparison with an "outside" approach would make
this paper stronger, even a simple algorithm (such as k-means for
clustering... or explain why it is not applicable). Your clustering
approach, which recursively splits clusters, is clearly not optimal
(though I do not doubt that it is a standard technique).
- Line 121: "the
optimal cut". Shouldn't this be "an optimal cut", since the problem is
symmetric and the mirror-image cut would work as well?
- The section around
Thm 4.1 seemed vague to me and could have benefited from more explanation.
For example, it is not clear to me how to apply the second equation from
the theorem.
- I am not sure what the authors mean by "tight
relaxation" (Section 4, and mentioned in the intro). Is it "the
tightest"? (In that case, define your class of "relaxations",
since it is not the class of convex functions here.) Or do you mean
"tighter" than the linear eigenproblem relaxation? (In that case, how do
you justify this? For graphs, it is justified by empirical evidence, so
for hypergraphs, you should derive the linear relaxation and test it.)
- Section 5. The method in [24] has since been extended a lot;
see, for example, Condat 2011
(http://www.optimizationonline.org/DB_HTML/2011/12/3284.html). In
particular (and I think this was already in [24]), it can exploit smooth
functions. In terms of convergence rates, it is usually better to take
derivatives of smooth functions when possible, as opposed to calculating
their proximity operators; so for the first G term in Table 1, why not
treat it as the smooth term? For line 276, "the main idea" (also, note
that this phrase is repeated three times on this page) is not really
described that well, since it is not unique to the method of [24].
Rather, [24] allows one to separate many terms.
- I am quite
familiar with TV on a grid (for 2D). In this case, people prefer using the
isotropic TV, but it does not have a closed-form proximity operator or
an efficient algorithm to compute it. Alternatively, the most common form
of anisotropic TV is much simpler to work with. It appears that your TV
definition generalizes this anisotropic TV. It is not clear whether it is
possible to define an isotropic TV on a hypergraph, but it would be very
interesting to explore.
- Prop 5.1. This is O(n log n) due to the
sort. Is it not possible to avoid the sort and just find the largest and
smallest entries of the input? This would bring it down to O(n). I realize
you may have to worry about cases where the answer has several values
at the max/min, but it still may be possible to deal with this.

- Experiments. Overall, well done. For making the hypergraph edges using
data points that have the same value of a feature, is it not beneficial to
also make edges (but with much lower weight) when data points have
*similar* values?
- The table on page 8: what are the entries? Some kind of
error? This was not explained.
- Line 415: "Our method... minimizes
the normalized cut on the hypergraph." Unless I have misunderstood, this
is not true. You apply a convex method to a non-convex problem, so in the
best case (and this you would need to show) you reach a stationary point
or a local minimum. So at least say that your method "attempts" to
minimize...
Grammar:
- Lines 276 and 301: "proximum" and
"proxima" are not commonly used; just use "proximity operator".

- Lines 416-7 are awkward.
- The authors use the construction "allow
to" a lot, as well as the variants "suggest/recommend to use", "favor to
split off", and "propose to solve" (e.g., lines 17, 69, 107, 192, 201,
385). Since the English is otherwise nearly perfect, it is worth
correcting this. In these cases, the infinitive should be a gerund, e.g.,
"suggest to use" should be "suggest using". Rarely, both forms are
correct (e.g., "I like to play tennis" and "I like playing tennis" are
both fine). In the case of "Hypergraphs allow to encode", it should be
"Hypergraphs allow one to encode".
- "However" is not used correctly a few times. E.g., on line
199 it should be "Hence", not "However", and on line 408 it should be
"Thus".
Q2: Please summarize your review in 1-2
sentences
This seems like a significant paper. It introduces a
new topic and explores it quite a bit, with good results. The numerics are
done well, and the method compares quite favorably with a similar approach
from 2006. My only reservation is whether the authors have missed some
recent work from the past 7 years that has further explored hypergraphs
and would improve on [11].
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read
long, vague rebuttals. It is in your own interest to be concise and to the
point.
We thank all reviewers for their suggestions and
comments. We will incorporate them into the final version.
Please note that we answer the questions and comments in a
thematic order.
Reviewer 1 - Relevance of the hypergraph
approach
The main goal of the experiments is to compare with the
method of Zhou et al. [11]. Thus we essentially repeated their
experiments in the same setup (datasets and hypergraph construction).
However, the relevance of hypergraphs has been pointed out, e.g., for
motion segmentation [2], subspace clustering [9], and the analysis of
cellular networks [3]. We agree with the reviewer that for the chosen
datasets other clustering approaches would be possible; we are not
claiming that this is the best possible way to solve these clustering
problems. If space allows, we will try to incorporate one of the above
problems in the final version.
Reviewer 2 - 'in Definition 2.1, it is
required that f_1<=...<=f_n. It seems later in the optimizations
f as an optimization variable may not always satisfy this constraint.'
We are abusing notation a bit in the definition of the Lovász
extension. What we mean is the following: let pi be a permutation of
{1,...,n} such that f_{pi_1} <= f_{pi_2} <= ... <= f_{pi_n},
and then replace f_i by f_{pi_i} in the definition of S(f). In the
algorithm, this does not play a role, because we derive an analytical
form of the Lovász extension of the hypergraph cut in Proposition 2.2.
Hence, there is no need for an additional ordering of the input vector.
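To make the sorting convention concrete, here is a generic sketch of evaluating the Lovász extension of an arbitrary set function (this is not the analytical form of Proposition 2.2; the function names and the unweighted cut are illustrative):

```python
def lovasz_extension(f, S):
    """Evaluate the Lovasz extension of a set function S at the vector f.

    S maps a frozenset of indices to a number, with S(frozenset()) = 0.
    The input f need not be sorted: the internal sort realizes exactly
    the permutation pi described above.
    """
    n = len(f)
    order = sorted(range(n), key=lambda i: f[i])  # ascending values
    # Telescoping form: start from the full vertex set, then peel off
    # the smallest entries one by one.
    val = f[order[0]] * S(frozenset(range(n)))
    for k in range(1, n):
        A = frozenset(order[k:])  # indices of the n-k largest entries
        val += (f[order[k]] - f[order[k - 1]]) * S(A)
    return val

def hypergraph_cut(A, edges, n):
    """Unweighted hypergraph cut: number of hyperedges split by (A, V\\A)."""
    comp = frozenset(range(n)) - A
    return sum(1 for e in edges if e & A and e & comp)
```

On an indicator vector f = 1_C the extension recovers the set-function value S(C), which is the key property behind the total variation on hypergraphs.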
Reviewer 2 - 'numerical features are converted into
categorical by 10 equal size bins. Will this cause certain loss of
information?'
Yes, even though one can argue that for this
particular dataset the loss is small. However, we are following the
experimental design of [11] in order to have a fair comparison (please
see the above comment to Reviewer 1). We are not arguing that this is
necessarily the best way to solve these clustering problems. However,
the hypergraph approach definitely makes sense for categorical
features.
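"Equal size" can be read as equal-width or equal-frequency binning; a minimal sketch of the equal-frequency (rank-based) reading, with illustrative names (not necessarily the exact binning of [11]):

```python
def bin_feature(x, n_bins=10):
    """Convert a numeric feature into categorical bin indices.

    Equal-frequency (rank-based) binning is assumed here: each of the
    n_bins categories receives roughly the same number of samples.
    Every bin then becomes one hyperedge over the samples it contains.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    bins = [0] * len(x)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(x)
    return bins
```

The loss of information the reviewer asks about is visible here: samples falling into the same bin become indistinguishable for the corresponding hyperedge.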
Reviewer 3 - 'This paper is basically claiming to be
the first major advance in 7 years (since Zhou [11])'
In the
meantime there have been advances in the tensor approach to tackle
k-uniform hypergraphs, e.g., [9] is from AISTATS 2012. But for general
hypergraphs, after an extensive literature review, we have found no
significant step beyond [10,11,12]; e.g., paper [2] from CVPR 2012 uses
[11] but modifies the weight definition of the cliques.
Reviewer 3 - 'A comparison with an "outside" approach would
make this paper stronger, even a simple algorithm (such as k-means for
clustering... or explain why it is not applicable)'
Please see our
comments to Reviewer 1 and Reviewer 2; we will include a comparison to
other clustering methods, but k-means cannot be applied in a
straightforward way to categorical features.
Reviewer 3 - 'Your
clustering approach which recursively splits clusters is clearly not
optimal'
It is not optimal, but it is the standard way to obtain a
multi-cut.
Reviewer 3 - 'The section around Thm. 4.1 seemed
vague to me and could have benefited from more explanation. For
example, it's not clear to me how to apply the second equation from the
theorem.' 'I'm not sure what the authors mean by "tight relaxation"'
Thm 4.1 states an exact relaxation of the balanced hypergraph cut
problem; that is, the discrete combinatorial problem is equivalent to
the continuous optimization problem stated in Thm 4.1 (in this sense
tight). The inequality in Thm 4.1 is very useful, as for the clustering
we finally need a way to go from the continuous problem back to the
combinatorial one. This is done by thresholding: Thm 4.1 says that
thresholding the continuous solution can only improve the balanced
hypergraph cut. Note that for f=1_C the continuous objective equals
cut(C,\overline{C})/\hat{S}(C).
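A sketch of this rounding step, with hypothetical names (cut and balance stand for cut(C,\overline{C}) and \hat{S}(C); this is an illustration of thresholding, not our actual implementation):

```python
def round_by_thresholding(f, cut, balance):
    """Round a continuous solution f to a set by optimal thresholding.

    Among the level sets C_t = {i : f_i > t}, with t ranging over the
    values of f, return the one with the smallest ratio
    cut(C)/balance(C). Per the inequality in Thm 4.1, this ratio is at
    most the continuous objective value at f.
    """
    best_set, best_val = None, float("inf")
    for t in sorted(set(f))[:-1]:  # the largest value yields the empty set
        C = frozenset(i for i, v in enumerate(f) if v > t)
        val = cut(C) / balance(C)
        if val < best_val:
            best_set, best_val = C, val
    return best_set, best_val
```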
Reviewer 3 - Better
optimization techniques beyond [23,24]
We are grateful to the
reviewer for making us aware of the recent improvements in primal-dual
methods. We will explore this for further speedups.
Reviewer 3 - 'is it possible to define an isotropic TV on a hypergraph'
As isotropy is directly related to invariances with respect to the
Euclidean group, we doubt that this can be generalized to hypergraphs.
Reviewer 3 - 'Prop 5.1. This is O(n log n) due to the sort.
Is it not possible to avoid the sort and just find the largest and
smallest entries of the input?'
We are currently exploring
techniques to improve it to O(n).
Reviewer 3 - 'is it not
beneficial to make edges (but with much lower weight) when data points
have *similar* values'
In general yes, but note that for most
datasets all features are categorical; one then needs a notion of
similarity on these features to do this.
Reviewer 3 - 'The
table on page 8: what are the entries? Some kind of error? This was not
explained'
It shows the error ± standard deviation (over 10 runs) for
different numbers of labeled points. We will make this clearer in the
final version.
Reviewer 3 - '"Our method... minimizes the normalized
cut on the hypergraph." Unless I have misunderstood, this is not true.
You apply a convex method to a non-convex problem, so in the best case
(and this you would need to show) you reach a stationary point or a
local minimum. So at least say that your method "attempts" to
minimize...'
We will clarify this. We wanted to emphasize again the
difference between minimizing the normalized hypergraph cut and the
normalized graph cut of the clique expansion, but we are clearly not
claiming to have guarantees of finding the global optimum.
Reviewer 3 - Grammar mistakes
As non-native speakers, we are very grateful for such hints.
Thanks a lot!