Paper ID: 1319
Title: Spectral Norm Regularization of Orthonormal Representations for Graph Transduction
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)


The paper starts by introducing the background and stating the contributions. The authors then rely, for the rest of the paper, on the Zhang and Ando result showing that SSL reduces to an equivalent kernel-based supervised learning problem. In section 2 and the appendix, they provide an error bound showing that minimising the spectral norm of the graph kernel matrix is a good way to improve generalisation. In section 3, they add a spectral norm regularization term to the Zhang and Ando formulation, and provide links between the error convergence and the Lovasz number of the data graph. Section 4 proposes a proximal solver whose distinguishing feature is that it handles the projection step approximately, to cope with a constraint set given by the intersection of a cone and a polytope. Section 5 proposes an MKL extension, and section 6 concludes with experimental results on several datasets.

Overall, the paper is interesting and contains a lot of material (maybe even a bit dense for a single paper), with several important contributions. The link between large-scale topological properties of the data graph (Lovasz number), which reflect its organisation in some space or manifold, and the convergence of the embedding estimation (eq. 5) is particularly enlightening and bridges several fields.

In section 1, after equation (3), I would perhaps make explicit that the correspondence between eqs. (3) and (1) means that a semi-supervised learning problem becomes a supervised learning problem, so that the learning is on labelled samples only. Otherwise it is a bit surprising to see indices over S only (with no \bar{S}) in the sums in equation (3), whereas in eq. (1) the regulariser also covers unlabelled samples while the loss is on labelled data.

In section 2, the notation of equation (5) is confusing - why is there an equal sign just before \omega_C?

In section 3, around line 162, maybe mention how the embedding (U) is recovered from K? There are several (practical) choices for the factorisation.
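For concreteness, one standard recovery (a sketch only, not necessarily what the authors have in mind; it assumes K is PSD) goes through an eigendecomposition:

    import numpy as np

    def embedding_from_kernel(K):
        """Recover U with K ~= U.T @ U from a PSD kernel matrix via
        eigendecomposition; a Cholesky factor would serve equally well."""
        w, V = np.linalg.eigh(K)
        w = np.clip(w, 0.0, None)         # guard against tiny negative eigenvalues
        return np.sqrt(w)[:, None] * V.T  # columns are the embedded vertices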

In section 4,

Algorithm 1 steps 8-9 - ensuring that K_r is PSD is typically quite expensive (for general dense symmetric matrices), and here an eigendecomposition is done every time (supplementary D.3.2). Is this where the algorithm takes most of its time? Would it be feasible to optimise directly over a factorisation of K (say, Cholesky-style with K = LL^T)? The projection onto the feasible set would then be onto matrices with an all-zero upper triangle, which by construction ensures that K is PSD.
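To make the suggestion concrete, here is a minimal numpy sketch (purely illustrative, not the authors' Algorithm 1; function names are hypothetical):

    import numpy as np

    def project_psd(K):
        """PSD projection by clipping negative eigenvalues -- the O(n^3)
        eigendecomposition step the question refers to."""
        w, V = np.linalg.eigh(K)
        return (V * np.clip(w, 0.0, None)) @ V.T

    def psd_from_factor(L):
        """Suggested alternative: keep a lower-triangular factor L and work
        with K = L @ L.T, which is PSD by construction (no projection step)."""
        L = np.tril(L)  # projection onto the factor set: zero the upper triangle
        return L @ L.T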

Zhang and Ando is in NIPS 2005, not 2006

Q2: Please summarize your review in 1-2 sentences


This is an interesting paper showing how spectral regularization impacts error bounds on graph transduction using orthonormal embedding, which provides new theoretical insights and convincing experimental results.


Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper addresses the problem of learning labels on the vertices of a graph. It builds on the work of Ando and Zhang, which suggested that embedding the graph on a unit sphere, by means of an orthonormal representation, results in better generalization error, i.e., error on the test set. Ando and Zhang, however, did not present a concrete solution on how to choose an optimal embedding. The current paper's contribution is a new optimization formulation which jointly optimizes the embedding and the learned labels by introducing a regularization term. The regularization term, the spectral norm of the corresponding graph kernel matrix, is chosen based on the insight obtained from Theorem 1 in the paper, which states an upper bound on the generalization error. Further results in the paper state an asymptotic upper bound on the generalization error of the proposed method, which has favorable scaling compared to previous results. This upper bound can also be translated into an upper bound on the labeled sample complexity. While the proposed optimization can be cast as an SDP, the complexity of solving SDPs is not desirable. Hence, the paper suggests optimizing this problem using an inexact proximal method, which can scale to graphs with thousands of vertices. Some experimental results indicate the superiority of the proposed method compared to previous methods.

The reviewer was not able to understand the main result of the paper, i.e., Theorem 4, at a high level. In the paragraph preceding Theorem 4 the paper states "in the presence of unlabeled nodes, without any assumption on the data, it is impossible to learn labels. Following existing literature [1, 12], we assume an edge links similar instances." However, there is no definition of what \emph{similar} means and there is no mention of it in the assumptions of the Theorem. The reviewer could not find any mention or use of such an assumption in the proof of the Theorem either. Despite this, the claim of the theorem can be roughly stated, in the case of random graphs, as

\mathrm{er}^{0\text{-}1}[\hat{y}] \;=\; O\!\left( \frac{1}{m} \left( n^{3/4} + \log\frac{1}{\delta} \right)^{1/2} \right).

This, as far as the reviewer understands, means that if the vertices of a large random graph are labeled {0,1} at random, and then a diminishing fraction of these labels, e.g., \omega(n^{3/4}) of them, are observed, the algorithm is guaranteed with high probability to find the remaining labels with small error.
Q2: Please summarize your review in 1-2 sentences
The reviewer was not able to understand the main result (see comments above) and can't make any judgment.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors look at optimizing the orthonormal embedding of a graph and the resulting impact on learning functions on the graph.

The authors start off by reviewing material from prior work using a Laplacian and the corresponding generalization bound (eqn. 2). They then discuss their bound based on the maximum eigenvalue of the kernel matrix (eqn. 4). This bound leads to an alternate optimization criterion--equation (5); \lambda_1(K) is not differentiable, but the optimization criterion is convex (sum of convex functions).

The authors use inexact proximal methods for the solution of SEER.

In the experiments section, the authors apply SEER and MKL-SEER to multiple problems and show significant performance improvements.

Overall, this paper is very solid.

There's good theoretical justification.

Also, there are multiple intuitive leaps as the paper progresses that show cleverness in the approach--the new optimization criterion, the willingness to tackle the difficult optimization problem, and the satisfying application to real problems.

This paper is a nice advancement over the prior approaches.

Q2: Please summarize your review in 1-2 sentences
The authors look at the problem of finding an optimal embedding for graph transduction.

They derive a novel method and demonstrate it works well in practice -- this paper is complete and well-explained.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a new framework for spectral regularized embedding, applied to graph transduction. The framework itself, and in particular the benefits of having an orthonormal embedding, might be described more convincingly, but this is not a serious issue.

The authors then propose an interesting and elegant algorithm to solve the SEER problem approximately. This is an original contribution and it seems to provide a good alternative to solve the complex learning problem.

Experimental results tend to demonstrate the benefits of the proposed solution. It would be great to show results that demonstrate the benefits of the SEER framework, or of the orthonormal representation defined by (3) in the first place, and also to show the approximation accuracy of the IIP solution to the SEER problem. This is not visible in the current results, and it would be of high interest for positioning the actual contribution in a more global context.

The multiple graph transduction is unfortunately described at a very high level. It is definitely an interesting extension, but more details, motivation and intuition are necessary to understand clearly how and why the proposed method is also interesting for multiple graphs, and especially to understand how the algorithm is implemented in such settings. In the current form, it is difficult to grasp the main ideas in the multiple graph case, and especially to really appreciate the corresponding experimental results. In that respect, the last column of Table 4 (noisy case) is not very intuitive - it seems that one graph is better than multiple ones, which would contradict the motivation for dealing with multiple graphs.

Finally, the paper is not always easy to read and pretty dense in places. It also refers to the supplementary material in too many places. This is probably a sign that the authors want to give too much information in a single paper, which actually penalises the accessibility and probably the visibility of the work.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a new framework, SEER for orthonormal embeddings that is applied to graph transduction. An interesting algorithm based on IIP is proposed to solve the learning problem in SEER, and relatively convincing results are proposed.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Masked Reviewer ID: Assigned_Reviewer_1

Q1) The paper is interesting and contains a lot of material (maybe even a bit dense for a single paper), with several important contributions.
R1) Agreed, we plan to write an extended journal version.

Q2) In section 1, after (3) I would perhaps make explicit that the correspondence between (3) and (1) means that a semi-supervised learning problem becomes a supervised learning problem, so that the learning is on labelled samples, otherwise it is a bit surprising to see indices over S only (with no \bar{S}) in the sums in (3), whereas in eq. 1 the regularizer is also on non-labeled samples while the loss is on labelled data.
R2) Good suggestion, will make this explicit in the paper.

Q3) In section 2, the notation of equation 5 is confusing - why is there an equal sign just before \omega_C?
R3) The function g(K) is defined as \omega_C plus the spectral norm regularizer, we will rephrase and clarify.
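Schematically, and writing the regularization weight as \beta > 0 purely for illustration (the symbol is not taken from the paper), the definition reads

g(K) \;=\; \omega_{C}(K) \;+\; \beta\,\lambda_{1}(K),

so the equal sign in (5) simply introduces this definition of g(K).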

Q4) In section 3 around line 162, maybe mention how the embedding (U) is recovered from K? There are several (practical) choices for the factorization.
R4) Yes, we will make a note in the paper.

Q5) In section 4, Algorithm 1 steps 8-9 - ensuring that K_r is PSD is quite expensive, and here an eigendecomposition is done every time (supp D.3.2). Is this where the algorithm takes most of its time? Could one optimize over a factorization of K (say, Cholesky with K=LL^T), where the projection to the feasible set would be to a matrix with an all-zero upper triangle?
R5) Yes, this is our performance bottleneck. The suggested alternative is an interesting idea, we are not sure at this point. Will explore more in this direction.

Q6) Zhang and Ando is in NIPS 2005, not 2006.
R6) Apologies, will fix in the final manuscript.

------------------
Masked Reviewer ID: Assigned_Reviewer_2

Q1) The reviewer was not able to understand Theorem 4 at the high level.
R1) The main implication of the result is that we relate the error bound to a graph-topological measure, the Lovasz number (theta), which then connects the bound to structural properties of the graph (lines 204-206).

Q2) "in the presence of unlabeled nodes, without any assumption on the data, it is impossible to learn labels. Following existing literature [1, 12], we assume an edge links similar instances." - there is no definition of what \emph{similar} means and there is no mention of it in the assumptions of the Theorem. The reviewer could not find any mention or use of such assumption in the proof of the Theorem either.
R2) We work with unweighted graphs, and the presence of an edge means the two nodes are similar. Optimal orthonormal embeddings tend to place vertices that share an edge in nearby regions. Hence, the notion of similarity is implicit in the embedding. We will make a note of this in the paper.

------------------
Masked Reviewer ID: Assigned_Reviewer_4

Q1) Experimental results - it would be great to show results that demonstrate the benefits of the SEER framework, or of the orthonormal representation defined by (3) at first, and also to show the approximation accuracy of the IIP solution to the SEER problem.
R1) Preliminary experiments have shown that the SDP solver for SEER takes a long time to converge. However, we have shown a detailed comparison for Lovasz-theta in Table 1 (supplementary material), illustrating the convergence rate and accuracy of the proposed solution. We will include more details in the final draft.

Q2) Multiple graph transduction - more details, motivation and intuition are necessary to understand clearly how and why the proposed method is also interesting for multiple graphs, and to understand how the algorithm is implemented. In the noisy case (last column of Table 4), the results are not very intuitive - it seems that one graph is better than multiple ones, which would contradict the motivation for dealing with multiple graphs.
R2) Agreed; we moved some of the details to Section G of the supplementary material due to space constraints. We will incorporate the suggested modifications in the paper.
The aim of the noisy-case experiment is to show that MKL can extract the correct graph, and hence the labels, even in the presence of noisy graphs; in the absence of noise (Columns 1 and 2, Table 4), MKL strictly improves.

Q3) The paper is not always easy to read and pretty dense in places; it refers in too many places to the supplementary material. Too much information in a single paper, which actually penalises the accessibility and probably the visibility of the work.
R3) Agreed, we plan to write an extended journal version.

------------------
Masked Reviewer ID: Assigned_Reviewer_5

Q1) A minus is the lack of a comparison of computation time in the experiments.
R1) Table 1 (supplementary material) shows a detailed comparison in time and accuracy between an SDP solver and the proposed IIP method for Lovasz-theta. Will include more details in the final manuscript.