NeurIPS 2020

Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method


Review 1

Summary and Contributions: This paper considers the minimization of a nonsmooth Lipschitz-continuous function over a convex and compact domain. The prototypical method for solving such problems is the projected subgradient method (PGD). At each iteration, PGD queries the subgradient oracle (FO) and the projection oracle (PO) once each. The iteration complexity of PGD for finding an eps-suboptimal solution is of the order eps^{-2}, which is known to be optimal in terms of FO calls. In this setting, the fundamental question that this paper asks is whether we can economize on the PO calls. The paper provides an affirmative answer. Here is a summary of the contributions:
- The paper introduces a new subgradient method (MOPES) that economizes on PO calls. The method essentially solves a smooth approximation of the original problem based on the Moreau envelope (recalled below), using a proximal accelerated gradient method in which the (strongly convex) prox subproblems are solved approximately. The resulting method has O(1/eps^2) FO complexity (which is optimal and already achieved by PGD), but it enjoys O(1/eps) PO complexity (compared to O(1/eps^2) for the state of the art).
- The paper introduces a new projection-free subgradient method (MOLES). MOLES is a modification of MOPES in which the projection oracle is replaced by an inexact projection oracle that can be implemented by the Frank-Wolfe method. MOLES has O(1/eps^2) linear minimization oracle (LMO) complexity and FO complexity, and it is the first method to achieve this (optimal) complexity.
- The paper extends the proposed methods to the stochastic setting (unbiased estimator of the subgradient with bounded variance).

---- Update after the author feedback: I was concerned that the technical error in Lemma 1 was a fundamental flaw in the analysis, but I am happy that it turned out to be a lack of precision in the presentation. I adjusted my score to reflect the novelty and significance that I already underlined in my original review, but please do not neglect to revise the writing.
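For reference, a sketch of the smoothing device behind MOPES, using the standard (unconstrained) Moreau envelope; the paper's constrained variant restricts the inner minimization to \mathbb{X}, and its exact normalization may differ:

```latex
\[
  f_\lambda(x) \;=\; \min_{y} \Big\{ f(y) + \tfrac{1}{2\lambda}\|x - y\|^2 \Big\}
\]
% f_\lambda is (1/\lambda)-smooth, and for G-Lipschitz f it satisfies
\[
  0 \;\le\; f(x) - f_\lambda(x) \;\le\; \tfrac{\lambda G^2}{2},
\]
% so choosing \lambda \approx \varepsilon / G^2 gives an O(\varepsilon)-accurate
% smooth surrogate amenable to accelerated gradient methods.
```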

Strengths: The novelty of the paper is its main strength. I am not aware of any prior work that investigates the fundamental question asked in this paper: can we design subgradient methods that economize on the calls to the projection oracle? I am also not aware of any projection-free method that achieves optimal rates for nonsmooth problems. Projection-free methods have received considerable attention from the ML community, and a major part of this literature appeared in top ML conferences in the last decade. Along this line, I would expect this work to attract attention too. Since there are not many alternatives for projection-free subgradient methods in the literature, the paper might have some practical significance too. However, it is difficult to judge this aspect of the work due to the lack of empirical evidence.

Weaknesses: My primary concern is the correctness of the results; see Section Correctness for details. I can raise my score if the authors clarify my concern. The numerical experiments are weak. There is only a single experiment, on the low-rank SVM problem. The data used in this experiment is very small (a d=64 dimensional problem), so it is not an interesting setting for a projection-free method. There are many parameters involved in the algorithm and the experimental setup, and the results are shown only for a particular choice; it is not clear how sensitive the proposed method's performance is to these parameters. Overall, there is not enough empirical evidence to comment on the practical significance of the proposed method. The quality of the presentation is also weak; the mathematical writing makes the paper difficult to follow (see Section Clarity).

Correctness: Lemma 1, introduced without proof or any reference, plays an important role in the analysis. This lemma presents standard properties of the Moreau envelope, and it is easy to find a proof for the case where \mathbb{X} is the Euclidean vector space. However, in the proposed approach, \mathbb{X} is a Euclidean norm ball. In this case, I do not think that property (c) in Lemma 1 is true. In particular, if you take f to be a constant function, then G = 0, while \hat{x}_\lambda(x) is the projection of x onto \mathbb{X}, so the result does not hold; see the sketch below. In the algorithm, it looks like \mathbb{X} is the Euclidean norm ball (step 1.13 is a projection). I tried to track \mathbb{X} through the analysis. At line 557, it says "for simplicity we assume that ... \mathbb{X} is the whole vector space." I do not see any other remark in the rest of the paper explaining why this assumption is without loss of generality.
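To make the objection concrete, here is a minimal sketch of the constant-function counterexample, assuming property (c) identifies \nabla f_\lambda(x) with a subgradient G \in \partial f(\hat{x}_\lambda(x)) (a reconstruction of the statement, since the lemma is only paraphrased above):

```latex
% Constrained Moreau envelope and prox point:
%   f_\lambda(x) = \min_{y \in \mathbb{X}} \big\{ f(y) + \tfrac{1}{2\lambda}\|x-y\|^2 \big\},
%   \hat{x}_\lambda(x) = the corresponding argmin.
% For constant f \equiv c on a norm ball \mathbb{X}:
\[
  f_\lambda(x) = c + \tfrac{1}{2\lambda}\,\mathrm{dist}(x,\mathbb{X})^2,
  \qquad
  \hat{x}_\lambda(x) = \Pi_{\mathbb{X}}(x),
\]
% so for x outside \mathbb{X} the envelope gradient is nonzero
% while every subgradient of f vanishes:
\[
  \nabla f_\lambda(x) = \frac{x - \Pi_{\mathbb{X}}(x)}{\lambda} \neq 0,
  \qquad
  \partial f(\hat{x}_\lambda(x)) = \{0\}.
\]
```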

Clarity: The paper requires polishing. The high-level claims of the paper are clear, but the mathematical writing is not. The authors drop mathematical details in the definitions and theorems, such as "for all x in X", and this makes the reading cumbersome. The lack of these details can easily mislead the reader; I personally found myself confused about the notation many times on my first read. Examples start from Definitions 1 and 2 (Lipschitz continuity, strong convexity), where x and y are not defined, and similar issues exist in many parts of the paper. Most importantly, \mathbb{X} is defined in Section 2 as a Euclidean norm ball around the origin with a fixed radius, while in many other places \mathbb{X} is treated as a Euclidean vector space (e.g., Definition 3, and in many parts of the proof). This discrepancy also causes me to question the correctness of the results presented in the paper (see Section Correctness).

Relation to Prior Work: The paper presents the prior work in Section 1.1 under 5 sub-categories. The topic of "nonsmooth convex minimization" in general is too broad to cover exhaustively in one page; in my opinion, this section provides a good overview of this broad topic. The novelty of the proposed work (efficiency in terms of the PO calls, optimal LMO complexity of the projection-free version, the fundamental idea of smoothing and splitting behind the framework) is delivered in a clear way. The authors compare the guarantees of the proposed method with the state of the art in Table 1.

Reproducibility: Yes

Additional Feedback: Some other minor comments:
- Typo @ line 114: "sest's" -> "set's".
- Line 183: "PO calls are required in inner steps" -- should this be "PO calls are NOT required in inner steps"?
- Typo in equation (22).
- Typo @ line 667: "Similar to prosposition" -> "proposition".
- Step 3.16 in prox-slide: what is \mathbb{X} in step 3.15? If it is the Euclidean norm ball, then step 3.16 is redundant.
- Section D: consider adding a reference or clarifying the procedure with numbered steps. Also, I did not understand why this section is titled "additional experimental details".


Review 2

Summary and Contributions: This paper provides an improved algorithm for constrained minimization of non-smooth functions. The improvement is in the number of calls to the projection oracle, which may be very expensive in general. For non-smooth optimization, eps^(-2) iterations are required if no further structure is present, and this bound is easily achieved via projected subgradient descent. The algorithm presented here has the same iteration bound. However, under the very mild assumption that the function f to be minimized is defined in a region slightly outside the constraint domain, it improves the number of calls to the projection oracle to eps^(-1). In addition, the authors present an algorithm for the case where, rather than a projection oracle, they have access to a linear optimization oracle for the domain. This algorithm is tight both in the number of iterations and in the number of oracle calls. The main idea behind these algorithms is nice and clean. A first attempt would be to smooth the function and solve using an accelerated first-order method. This alone would not improve the convergence, since each iteration of the accelerated method would now require solving a nontrivial quadratic optimization problem, which adds an extra number of iterations that brings the iteration bound back to eps^(-2). However, this optimization step does not require projections, and this is where the saving comes from (see the illustrative sketch after this summary).

---- Update after the author feedback: The authors seem to have addressed all the reviewer concerns, and I maintain my score.
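To make the mechanism concrete, here is a minimal, illustrative Python sketch of this smoothing-and-splitting idea. It is not the paper's MOPES: the names f_subgrad and project are hypothetical placeholders, the outer loop is plain (non-accelerated) gradient descent for brevity, and the step sizes are textbook choices.

```python
import numpy as np

def approx_moreau_grad(f_subgrad, x, lam, inner_steps):
    """Approximate the gradient of the Moreau envelope f_lam at x.

    The prox subproblem min_y f(y) + ||y - x||^2 / (2*lam) is
    (1/lam)-strongly convex, so plain subgradient steps with weighted
    averaging converge at rate O(1/inner_steps). Crucially, y ranges
    over the ambient space, so NO projections are needed here."""
    mu = 1.0 / lam
    y = x.copy()
    y_avg = np.zeros_like(x)
    for t in range(1, inner_steps + 1):
        g = f_subgrad(y) + (y - x) / lam         # subgradient of the prox objective
        y = y - (2.0 / (mu * (t + 1))) * g       # strongly convex step size
        y_avg += (2.0 * t / (inner_steps * (inner_steps + 1))) * y
    return (x - y_avg) / lam                     # grad f_lam(x) = (x - prox) / lam

def smoothed_outer_loop(f_subgrad, project, x0, lam, outer_steps, inner_steps):
    """Outer loop: gradient descent on the (1/lam)-smooth envelope f_lam,
    with only ONE projection per outer iteration. All the FO calls
    happen inside the projection-free inner loop above."""
    x = project(x0)
    for _ in range(outer_steps):
        g = approx_moreau_grad(f_subgrad, x, lam, inner_steps)
        x = project(x - lam * g)                 # step size 1/L with L = 1/lam
    return x
```

Accelerating the outer loop, as the paper does, is what brings the number of outer iterations, and hence PO calls, down from O(1/eps^2) to O(1/eps); the FO budget stays in the projection-free inner loop.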

Strengths: The paper studies a fundamental problem in optimization, and the result looks quite surprising. It’s an important result that deserves acceptance.

Weaknesses: No major weaknesses.

Correctness: Everything checks out.

Clarity: The paper is well written and reads smoothly.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Minor comments:
- Line 114: "sest's functional form" -- what does this mean?
- Line 115: Wouldn't it make more sense to have E[\hat{g} | x] = g? I think this is a more realistic model, and the same proofs should go through (see the note below).
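For context, the distinction being drawn, written out under the standard stochastic first-order oracle model (a reconstruction; the paper's exact assumption at its line 115 is not reproduced in this review):

```latex
% Unconditional vs. conditional unbiasedness of the stochastic subgradient:
\[
  \mathbb{E}[\hat{g}] = g
  \qquad \text{vs.} \qquad
  \mathbb{E}[\hat{g} \mid x] = g \in \partial f(x),
  \quad
  \mathbb{E}\big[\|\hat{g} - g\|^2 \,\big|\, x\big] \le \sigma^2.
\]
```

Since the iterate x is itself random, the conditional form is what martingale-style arguments in such proofs typically rely on, which supports the reviewer's suggestion.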


Review 3

Summary and Contributions: This paper considers non-smooth constrained convex optimization and provides an algorithm with optimal first-order oracle complexity, eps^{-2}, and a projection oracle complexity of only eps^{-1}, improving over existing methods that require eps^{-2} projection oracle calls. Similarly, when a linear minimization oracle is available instead of a projection oracle, their method requires only eps^{-2} LMO calls rather than the previously known eps^{-4}.

------- update after author response --------
After reading the authors' response, I maintain my score and still recommend accepting the paper. I agree with Reviewer 1 that Lemma 1 should be stated more carefully, but I do not see this as a fatal flaw.

Strengths: This is a nice paper which clearly presents the problem and its solution. Reducing the number of projection oracle calls is very important and can lead to much faster optimization algorithms (e.g. for nuclear norm constraints), and the method is novel to the best of my knowledge. The ideas used for the algorithm and its proof are quite straightforward, which is another strength of the paper.

Weaknesses: This is a strong paper with few weaknesses. The main theoretical limitation seems to be the requirement that the function/first-order oracle be available outside the constraint set. While I generally agree with the authors that this is often reasonable, I think it would be useful to think about (and comment on) cases in which this might fail. Two things might go wrong here: either (1) the Lipschitz constant could explode, or (2) the function could be undefined. This might be an issue if the objective involves an entropic regularizer, or something like that (a concrete instance is sketched below). Again, I don't think this is a huge issue, but thinking through these cases would be good. This is especially true since I suspect it is probably not necessary for \mathbb{X} to be so much larger than \mathcal{X}; a much smaller expansion of the constraint set should still be workable.
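A concrete instance of both failure modes (my example, not from the paper) is the negative entropy over the simplex:

```latex
\[
  f(x) = \sum_{i=1}^{d} x_i \log x_i,
  \qquad
  \mathcal{X} = \Delta_d = \Big\{ x \in \mathbb{R}^d : x \ge 0,\ \textstyle\sum_i x_i = 1 \Big\}.
\]
% (1) \partial f / \partial x_i = \log x_i + 1 is unbounded as x_i -> 0^+,
%     so the Lipschitz constant explodes on any enlargement of \mathcal{X}
%     that reaches the boundary of the nonnegative orthant;
% (2) f is undefined once any coordinate x_i < 0.
```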

Correctness: To the best of my knowledge the claims and methods are correct.

Clarity: Yes, the paper is clearly written. There are a couple of minor typos:
- Line 114: "sest's" -> "set's".
- Lines 183-184: It should be "...PO calls are NOT required...".
- Pseudocode 1.14: u_{k,t} is not defined/used elsewhere. I think this should just be u_t?

Relation to Prior Work: Yes, the paper includes a thorough comparison with prior work and clearly explains its additional contribution.

Reproducibility: Yes

Additional Feedback: