
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper studies gradient descent algorithms for rank minimisation subject to affine constraints. The main claim to fame is an analysis of the number of samples required for exact recovery in a Gaussian ensemble setting (which resembles http://arxiv.org/abs/1212.0467).
I am in two minds when I read the paper. On one hand, the paper is well written (perhaps the best-written out of those I have reviewed this year) and I trust all the claims. On the other hand, neither the theoretical nor the implementational aspects are particularly strong.
The theoretical results are based on the work of Jain (going all the way back to http://arxiv.org/abs/1212.0467) and Candes (http://arxiv.org/abs/1407.1065) and are incremental in nature. The paper does attribute the credit, at least to Candes.
The implementation seems awful. When compared to recent implementations, e.g. http://arxiv.org/abs/1408.2467, the performance seems orders of magnitude away from the state of the art; being an order of magnitude faster than a general-purpose SDP solver on the nuclear norm does not make it any better. The authors should acknowledge that and compare the results with other codes on some established benchmark (e.g. Lenna), so as to show that the price in terms of runtime brings about much better performance in terms of objective function values (SNR, RMSE), which is plausible, but far from certain.
Minor comments:
The introduction is misleading. The technical results come with many ifs and buts, and the introduction does not make that clear enough.
Section 6.1 is misleading: the sole fact that one method is more expensive per iteration does not make it the "least efficient", unless there is a very tight analysis of the rates of convergence parametrised by the same m, n, p, r (which there isn't). The section does not make this clear enough.
Q2: Please summarize your review in 1-2 sentences
The paper is well written (perhaps the best-written out of those I have reviewed this year) and I trust all the claims, but neither the theoretical nor the implementational aspects are strong enough for a "strong accept".
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
I was asked to provide a "light review" of this work.
The paper presents results on recovery of low-rank semidefinite matrices from linear measurements, using nonconvex optimization. The approach is inspired by recent work on phase retrieval, and combines spectral initialization with gradient descent. The connection to phase retrieval comes because measurements which are linear in the semidefinite matrix X = Z Z' are quadratic in the factors Z. The paper proves recovery results which imply that correct recovery occurs when the number of measurements m is essentially proportional to n r^2, where n is the dimensionality and r is the rank. The convergence analysis is based on a form of restricted strong convexity (restricted because there is an r(r-1)/2-dimensional set of equivalent solutions along which the objective is flat). This condition also implies linear convergence of the proposed algorithm.
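The scheme just summarized (spectral initialization on a surrogate matrix, then plain gradient descent on the factor Z) can be sketched in a few lines of numpy. This is my own minimal reconstruction from the description above, not the authors' code; the step size, iteration count, and the use of i.i.d. Gaussian measurement matrices are assumptions.

```python
import numpy as np

def recover_lowrank_psd(A, b, r, iters=3000):
    """Sketch (reviewer's reconstruction, not the authors' code) of
    'spectral initialization + gradient descent on the factor Z' for
    recovering X = Z Z' from linear measurements b_k = <A_k, X>."""
    m, n, _ = A.shape
    # Spectral initialization: top-r eigenpairs of (1/m) * sum_k b_k A_k,
    # whose expectation is X when the A_k have i.i.d. standard normal entries.
    M = np.tensordot(b, A, axes=(0, 0)) / m
    M = (M + M.T) / 2
    vals, vecs = np.linalg.eigh(M)
    top = np.argsort(vals)[::-1][:r]
    Z = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
    # Conservative step size ~ 1 / sigma_1(Z0 Z0'); this constant is an
    # assumption, not the one from the paper.
    step = 1.0 / (20.0 * max(np.linalg.norm(Z, 2) ** 2, 1e-12))
    for _ in range(iters):
        resid = np.einsum('kij,ij->k', A, Z @ Z.T) - b      # <A_k, ZZ'> - b_k
        # Gradient of (1/2m) sum_k resid_k^2 w.r.t. Z:
        grad = np.tensordot(resid, A + A.transpose(0, 2, 1),
                            axes=(0, 0)) @ Z / m
        Z = Z - step * grad
    return Z
```

Because the objective is invariant under Z -> Z R for orthogonal R (the flat set mentioned above), success is naturally measured on the product Z Z' rather than on Z itself.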
The paper contains novel results on rank minimization, and a novel technical approach. Understanding the properties of rank minimization problems that enable nonconvex methods to succeed is certainly of interest. However, the results are not completely decisive. The main issue is that the paper does not obtain order-optimal sample complexity: the results are suboptimal by a factor of r. Hence, the theoretical results are not really on par with the current best known results for efficient algorithms for this problem. Also, the results are restricted to semidefinite matrices, and hence miss many compelling machine learning applications of rank minimization.
The results could be compared to those of De Sa, Olukotun and Re (Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems). This submission has a better dependence on the rank parameter, but requires spectral initialization.
In the experiments on nuclear norm minimization, the positive semidefinite constraint should not be dropped.
Q2: Please summarize your review in 1-2 sentences
The paper contains novel results on rank minimization, and a novel technical approach. Understanding the properties of rank minimization problems that enable nonconvex methods to succeed is certainly of interest. However, the results are not completely decisive. The main issue is that the paper does not obtain order-optimal sample complexity: the results are suboptimal by a factor of r. Hence, the theoretical results are not really on par with the current best known results for efficient algorithms for this problem. Also, the results are restricted to semidefinite matrices, and hence miss many compelling machine learning applications of rank minimization.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
++++++++++ Summary:
Pros:
Interesting approach and techniques.
Well-written paper with clear exposition.
Overall idea could spur further improvements.
Cons: References/comparisons to some important existing approaches are missing.
++++++++++ Detailed:
The authors propose an efficient algorithm for the recovery of a low-rank matrix from random affine measurements. The authors maintain a factorized representation of the target matrix; pose the recovery problem as a nonconvex optimization involving the factored representation; and solve the optimization using a technique that they call "Wirtinger Flow". Essentially, the approach consists of a careful initialization followed by a series of gradient descent iterations. The descent converges linearly and the overall algorithm exhibits a similar asymptotic runtime to singular value projection (SVP). Numerical experiments support the authors' claim that their approach is faster than existing methods.
The paper is well written and the exposition is clear. The paper adds to the growing body of work in the algorithmic learning literature that revolves around the "spectral initialization + gradient descent" type of idea. While the theoretical improvement over existing methods (such as SVP) is only by a constant, the empirical gains are evocative, and the techniques developed in the paper could lead to significantly faster algorithms in the future.
One concern regarding prior work: the authors present their algorithm as a considerable improvement over nuclear norm minimization or SVP primarily due to the benefit of not having to incur an SVD per iteration.
However, they do not compare theoretical/numerical performance with the low-rank SDP approaches of Burer and Monteiro [05] and numerous follow-up works. Like the authors', these approaches also do not require an expensive SVD in each iteration; they perform gradient descent and converge rapidly to the optimum. How does the authors' technique improve over these methods?
Some other comments:
- Consider also comparing with the ADMiRA approach of Lee-Bresler '09.
- The variable p is used to denote the matrix dimension in (1), and the sparsity of the measurement matrices in Table 1.
- Why state the results of Table 1 in terms of p? For the GOE (which is what the authors consider in their theoretical exposition), p = n^2 almost surely.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a scalable algorithm for recovering a low-rank matrix from affine measurements. The algorithm is similar to gradient descent with a careful spectral initialization, and exhibits provable linear convergence. While certain comparisons to the literature are missing, the contributions of the paper are compelling and the techniques developed here could spur several follow-up ideas.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a gradient descent type algorithm for rank minimization in the space of symmetric positive semidefinite matrices with linear equality constraints. It is motivated by the recent work of Candes et al. [5]. The basic algorithm, as in [5], is based on the Wirtinger flow algorithm. Convergence analysis and sample complexity analysis are also provided for the proposed algorithm. The paper is well-written, with consistent notation and a high-level outline of the convergence analysis. These make the paper more accessible to readers. In addition, with a nice initialization strategy, it converges to the global optimum under some assumptions.
This is a good paper. I was pleased to review this paper.
My general comments and suggestion are the following.
L177: The authors claim that the method generalizes the Wirtinger flow algorithm, which I'm not sure is right, because the proposed method is defined only over real values (NOT complex): z and the a_i's. Please clarify this.
Please briefly explain "phase retrieval" for readers who are not familiar with the previous work by Candes et al. [5].
Assumption on the true rank:
L183: The paper assumes that the true rank is given prior to running the algorithm. This may be a rather strong assumption, and the nuclear norm method does not require it. Please discuss the effect of misspecifying the rank in the proposed method.
L253: As the authors suggest, if one wants to search for the proper rank by running the algorithm multiple times, is there a systematic way of choosing r, say after running the algorithm n times and looking at the value f(Z) for each r?
Eventually, the problem solved by the method reduces to an optimization over fixed (low) rank solutions. In the literature, this has been addressed by multiple methods, especially ones using manifold frameworks, which are not discussed in the paper. You may want to refer to the relevant works (R1-R3) for readers.
R1. Journee, Michel, et al. "Low-rank optimization on the cone of positive semidefinite matrices." SIAM Journal on Optimization 20.5 (2010): 2327-2351.
R2. Vandereycken, Bart, and Stefan Vandewalle. "A Riemannian optimization approach for computing low-rank solutions of Lyapunov equations." SIAM Journal on Matrix Analysis and Applications 31.5 (2010): 2553-2579.
R3. Vandereycken, Bart. "Low-rank matrix completion by Riemannian optimization." SIAM Journal on Optimization 23.2 (2013): 1214-1236.
Q2: Please summarize your review in 1-2 sentences
This paper presents an efficient, scalable low-rank optimization algorithm on symmetric positive semidefinite matrices with linear convergence to the global optimum under some assumptions. I think this is a good paper and it would inspire further research on this topic.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their constructive comments, and respond briefly to the main comments and criticisms:
Reviewer 1 ==========
1. The reviewer comments that the implementation may be far from state of the art. Please note that we have taken great care to make an accurate and fair comparison between the relevant methods discussed in the paper. In particular, Figure 2 shows a runtime-versus-accuracy comparison of SVP, gradient descent, and nuclear norm. The implementations may not be the best possible, but they fairly compare the competing algorithms on moderately sized problems, supporting and illustrating the analysis. The codes and benchmarks such as Lenna mentioned by the reviewer do not appear to be directly relevant, as they target different problems.
2. The reviewer suggests that Section 6.1 about runtime may be misleading. We agree with the comment that complexity per iteration does not tell the whole story; thus the runtime comparison in Section 6.2 is more germane.
Reviewer 2 ==========
3. We thank the reviewer for mentioning the papers of Burer and Monteiro (2005) and Lee and Bresler (2009). Both papers are certainly relevant related work, and should be discussed. The Burer and Monteiro (B&M) paper (with which we were previously familiar but neglected to cite) is important, and gives a helpful traceback of the factorization and nonconvex optimization idea in the optimization literature. While related, our algorithm and analysis are substantially different from these works. Essentially, B&M target the general semidefinite programming problem and have a more complex set of first-order techniques for nonconvex optimization (BFGS and augmented Lagrangian techniques, etc.). It would not be easy to do a direct numerical comparison, but we would expect our methods to perform comparably. In contrast, our method is clean and simple, targets a more limited class of problems, and correspondingly allows us to obtain a strong theoretical convergence analysis (the pesky extra factor of r notwithstanding). As stated by Burer and Monteiro (2003), "Although we are able to derive some amount of theoretical justification for [global convergence], our belief that the method is not strongly affected by the inherent nonconvexity [of the objective function] is largely experimental." We hope that our work will contribute to and help spur the further development of this important class of techniques.
4. The sparsity parameter should be \rho rather than p. As the reviewer points out, the GOE corresponds to dense matrices. However, as shown in the empirical results, the computational advantages of our algorithm are magnified for sparse measurement matrices, which is why we include this parameter in Table 1.
Reviewer 3 ==========
5. As the reviewer points out, the "Wirtinger flow" algorithm is for complex values in the phase retrieval setting. Our algorithm is thus not, strictly speaking, a generalization; rather, we adapt the core idea of factoring the matrix variable and carrying out gradient descent on the factor.
6. Regarding the choice of rank: when the rank is incorrectly specified, our algorithm still quickly converges, but to a value that has relatively high error. When the correct rank is selected, the algorithm converges to a very low error. This results in a convenient, practical procedure for selecting the rank, as discussed. Further simulations can be included to clarify and reinforce this point.
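The rank-selection procedure alluded to in point 6 can be sketched generically. This is a hypothetical illustration, not code from the paper: `fit` stands in for running the recovery algorithm at a candidate rank and reporting the final objective value f(Z), and the tolerance is an assumption; the toy `make_toy_fit` below replaces the actual solver with an exact best rank-r approximation error, purely so the loop can be demonstrated.

```python
import numpy as np

def select_rank(fit, r_max, tol=1e-6):
    """Hypothetical sketch of the rank-selection heuristic: run the
    solver for r = 1, 2, ... and return the first rank whose final
    objective value falls below `tol` (high error -> rank too small)."""
    for r in range(1, r_max + 1):
        if fit(r) < tol:
            return r
    return r_max

def make_toy_fit(X):
    """Toy stand-in for the recovery algorithm: the squared error of the
    best rank-r approximation of a known PSD matrix X, via its spectrum."""
    vals = np.sort(np.linalg.eigvalsh(X))[::-1]
    return lambda r: float(np.sum(vals[r:] ** 2))
```

With this stand-in, a matrix of exact rank 2 yields a large objective at r = 1 and a near-zero one at r = 2, so the loop stops at 2, mirroring the "high error until the correct rank" behavior described above.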
7. We thank the reviewer for the several relevant references. These papers present related techniques and should be included for completeness.
Reviewer 5 ==========
8. The reviewer correctly points out that the analysis in the paper is suboptimal, with an extra factor of the rank r in the sample complexity. Our experiments strongly suggest that this extra factor is due to our currently loose analysis and is not intrinsic to the algorithm. As much as we would have liked to remove this factor, in our modest opinion, this does not lessen the potential impact of the paper. Rather, it simply points to a specific challenge for future work. A long and distinguished line of work in sparse data analysis (compressed sensing, lasso, etc.) has progressed through a series of refinements in sample complexity and assumptions.
9. The positive semidefinite constraint was dropped for nuclear norm minimization following the discussion in Section 2. If this constraint is imposed in the optimization, the runtime of the nuclear norm method will degrade considerably.
Reviewer 6 ==========
10. We respectfully point out that reference [i] mentioned by the reviewer is the very paper that our work is based on, and is discussed extensively. Paper [ii] addresses a different problem and uses a different method, though it could be listed in the previous-work section.
