| Paper ID | 2254 |
|---|---|
| Title | Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent |

The paper proposes an SGD algorithm with provable guarantees for the matrix completion problem.

The paper is nicely written. It has novel results and should be of interest to many in the community. I have mostly minor comments:

1) Line 206: “a an” ---> “an”.
2) Line 272: “saddle point” ---> “saddle points”, “satisfies” ---> “satisfy”.

2-Confident (read it all; understood it all reasonably well)

This paper proposed an online algorithm for solving the matrix completion problem using the stochastic gradient descent (SGD) method. The authors proved that, with a good initialization, the proposed algorithm is guaranteed to converge linearly to the true matrix.
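The warm-start-then-SGD scheme summarized above can be sketched in a few lines. This is a minimal illustrative simplification, not the paper's exact algorithm: the function name is mine, the update is a plain squared-loss step on each revealed entry, and the warm start uses a crude 1/p rescaling before the top-k SVD.

```python
import numpy as np

def online_mc_sgd(d1, d2, k, init_obs, stream, eta=0.5):
    """Illustrative warm-start + per-entry SGD for matrix completion.

    init_obs / stream: lists of (i, j, value) triples.
    A minimal sketch only -- not the paper's exact algorithm.
    """
    # Warm start: top-k SVD of the zero-filled initial observations,
    # rescaled by roughly 1/p so it approximates the full matrix in expectation
    # (the rescaling ignores duplicate draws, so it is only approximate).
    M0 = np.zeros((d1, d2))
    for i, j, val in init_obs:
        M0[i, j] = val
    M0 *= (d1 * d2) / max(len(init_obs), 1)
    U0, s, Vt = np.linalg.svd(M0, full_matrices=False)
    U = U0[:, :k] * np.sqrt(s[:k])        # d1 x k factor
    V = Vt[:k, :].T * np.sqrt(s[:k])      # d2 x k factor
    # Online phase: each newly revealed entry updates only row i of U
    # and row j of V, using the residual on that single entry.
    for i, j, val in stream:
        r = U[i] @ V[j] - val
        U[i], V[j] = U[i] - eta * r * V[j], V[j] - eta * r * U[i]
    return U, V
```

On a small synthetic low-rank matrix the reconstruction error shrinks as the stream covers more entries; the paper's analysis additionally controls incoherence and the step size, which this sketch ignores.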

1. This paper considered the problem of online matrix completion using stochastic gradient descent (SGD). The key idea of the proposed method is to combine a warm-start phase with an SGD process for new observations. To establish the theoretical guarantees of the proposed method, the authors adopted a martingale convergence technique, which has been used in some previous work on analyzing SGD, e.g., [7]. The proofs of this paper are clear and correct.

2. There are some drawbacks to the current paper. First, at each iteration, the proposed algorithms require a singular value decomposition (SVD), which is very expensive, especially for large-scale problems. Even Algorithm 3 suffers from additional computational overhead. Second, due to the nature of stochastic gradient descent, the proposed algorithm explicitly resamples in the optimization procedure, which introduces an additional \log(1/\epsilon) term in the sample complexity (see Table 1). This extra term is not appealing, because the authors do not use a uniform convergence argument (they use SGD, which essentially resamples). There is a related work [**] that resolves this drawback by using a uniform convergence argument and gives improved results in terms of sample complexity:

[**] Zheng, Q. and Lafferty, J. (2016). Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051.

3. Another issue with the current paper is that it lacks experiments to support the analytical results. It would be great if the authors could add some runtime comparisons among different methods to demonstrate the efficiency of the proposed method.

4. The writing of this paper is great, but there are some typos that undermine its readability, for example: in line 227, the condition of Lemma 4.1 is incomplete; in line 270, there are two U_t; in the appendix, line 452, \Sigma -> S. The authors should carefully check and correct such typos.

2-Confident (read it all; understood it all reasonably well)

The paper studies a stochastic gradient descent algorithm for the problem of matrix completion. With a good initialization, it is guaranteed to converge to the true matrix at a geometric rate under standard assumptions. The algorithm is simple and easy to implement. The authors also make an effort to sketch the high-level idea of the proof.

The main contribution of this work is showing that, in the context of matrix completion, when equipped with a good initialization, the sequence of solutions produced by a widely used (but previously without theoretical guarantees) SGD algorithm converges linearly to the true matrix. The paper reads well and most of the theoretical results look sound. But I have some concerns, as follows.

- To guarantee a good initial solution, Theorem 3.1 requires a sample complexity of O(mu * d * k^2 * log(d)), which is of the same order as the batch solvers. Is it possible to access fewer samples, e.g., O(mu * d * k * log(d)), while still ensuring global convergence?

- The vanishing probability of success with respect to the iteration counter is strange (it does not conform to intuition). Theorem 3.1 says that for any fixed T >= 1, with probability at least 1 - T/(d^10), we have linear convergence. That means that if we run the algorithm (i.e., repeatedly reveal the samples) for a longer time, we have a LOWER confidence of obtaining the true matrix. For example, picking T = 0.9 * d^10, we are only guaranteed to obtain convergent solutions with probability at least 10%. If we pick T much larger, say T = O(d^20), it is not clear whether the algorithm succeeds at all. However, in practice, one should expect that if the algorithm continues to update the solution, the optimum is obtained asymptotically.

----------------------------

Updates: I slightly increase my score on "technical quality" from "2" to "3", given the considerable amount of work involved. I believe the techniques here are interesting to the community, and many people will pick them up. Still, I would be grateful if the authors could present a more careful proof technique to remove the dependency on T, or comment on why T is always very small.

A minor comment: it appears that another parallel work [A] also studies alternating minimization in an online setting. I would be grateful if the authors could comment on the difference.

[A] Fast Algorithms for Robust PCA via Gradient Descent, arXiv:1605.07784
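The reviewer's arithmetic on the failure probability can be checked directly. A throwaway sketch (the function name is mine; it simply evaluates the bound 1 - T/d^10 quoted from Theorem 3.1, clipped at zero where it becomes vacuous):

```python
def success_prob_lower_bound(d, T):
    """Evaluate the success-probability lower bound 1 - T / d**10
    quoted in the review, clipped at 0 once it turns vacuous."""
    return max(0.0, 1.0 - T / d ** 10)

d = 10
print(success_prob_lower_bound(d, T=1))                  # near-certain success for a single step
print(success_prob_lower_bound(d, T=9 * d ** 10 // 10))  # ~0.1: the reviewer's "10%" example
print(success_prob_lower_bound(d, T=d ** 20))            # 0.0: the bound guarantees nothing
```

This makes the complaint concrete: the guarantee degrades linearly in T and says nothing at all once T reaches d^10, whereas one would hope for a bound that is uniform (or at least non-vacuous) over long horizons.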

2-Confident (read it all; understood it all reasonably well)

In this paper, the authors studied the problem of online matrix completion. The authors analyzed a stochastic minimization technique for performing matrix completion when the entries arrive one by one. They proved that, with a good initialization, the algorithm provably converges to the optimal solution at a linear convergence rate. The main results (Theorems 3.1 and 3.3) both require the target matrix to be exactly low-rank, i.e., there is no noise. The quality of the initialization is quantified by the number of entries observed at the beginning. In particular, the required number of initial entries is on the order of dk^2, which almost matches the requirement of exact nuclear norm recovery (although the initialization algorithm is much easier than the nuclear norm minimization framework).

My first concern is that the community is filled with similar results under slightly different settings. Given the close relation between sparse coding and SVD, the presented results might not make a significant enough difference to justify their value to NIPS.

My second concern, which is my main concern, is that the results are not strong enough. First, the results cover only the noiseless case. Other works usually study the noisy case for its greater practical value; noiseless settings, such as that used in Tao's paper, are mostly of theoretical interest. There are also some fundamental conflicts between the noiseless case and online learning in this setting: there is a clear cut-off in the number of observations (well before all the entries have been observed) at which the matrix is uniquely determined, and there are algorithms that can recover the target matrix at that point. It seems like a toy setting with little practical value or insight. When measured by theoretical value, however, the paper also requires an extremely good initialization, under which exact recovery is already possible. The results are solid, but the online component seems artificial.

My suggestion is that the authors improve the strength of the theoretical results: the noisy case should be included in the study, and a weaker (or simply different) initialization requirement should be used.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)