NeurIPS 2020

A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems

Review 1

Summary and Contributions: This paper considers min max optimization problems with a particular focus on problems where the maximum is over a finite set of nonconvex functions. They show that adding a quadratic term to the objective dragging the minimization variables towards a weighted average of their past values gives a converge. ================ EDIT: I stand by my assessment that this paper should be an accept. I feel the authors' rebuttal and addresses all of my complaints and trust a small revision could handle them.

Strengths: This paper proposes a smoothed objective similar in theme to Moreau envelops that occur in typical minimization problems by adding a quadratic term to the minimization. Running GDA with this simple modification to the objective gives an algorithm which converges at the typical rate of O(\epsilon^-4) that one expects for GDA with some form of averaging. The core strength of this paper is in showing that a strict complementarity condition on the dual problem improves this rate to O(\epsilon^-2). This is a substantial improvement. The area of minimax optimization is certainly of great modern importance and even the restricted finite maximum form considered here has many practical applications.

Weaknesses: The arguments given against multi-loop methods seem quite weak. The paper says (without proof) that any multiloop method must have at least O(1/\eps^2) outer iterations. I see no reason why this must be the case. Its true that the multiloop methods referenced have this many outer iterations required, but nothing seems fundamental about that limit. A more clever choice of subproblem could resolve that issue. As written the algorithm requires knowledge of the optimization problems smoothness constant L to select the quadratic added and stepsizes used. An adaptive scheme for removing this dependence would improve the applicability and likely the performance of this method. The core result of the paper relies on strict complementarity holding and the primal iterates staying bounded. Neither of these assumptions are given much motivation in the main text. The key ideas of the argument in favor of complementarity should be included from Appendix E. Some motivation for the boundedness assumption is much needed. In particular, the claim in the contribution section that the paper avoids needing to assume the X domain is compact is undermined by subsequently assuming the primal iterates stay bounded. Once this has been assumed, one could intersect X with a large enough ball to make it compact without changing the iterates, so compactness is effectively assumed. This claim should be removed or put in full context of what will be later assumed.

Correctness: The proofs all appear to be correct and the experiments a reasonable justification of the proposed method's effectiveness.

Clarity: The paper is well-written, but the appendix is difficult to read (there is minimal prose describing what is being done in the proofs and connecting the ideas together).

Relation to Prior Work: The method gives adequate discussion of the recent works in nonconvex concave minimax optimization. Potentially more references could be given to the traditional smoothing literature in optimization (Moreau smoothing and Nesterov smoothing both being related to the technique used here).

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The authors propose a simple single loop scheme for solving non-convex/concave min-max problems which achieves a O(1/epsilon^4) convergence rate for attaining an epsilon-stationary point. Subsequently, this paper focuses on a characteristic class of non-convex/concave (or more precisely non-convex/linear) problems with many important applications. In the latter class an O(1/epsilon^2) rate is achieved, which to the best of my knowledge is the best rate to date for this class of problems. Further, the utility of the algorithm is supported by experiments on a robust neural network training problem on MNIST.

Strengths: Among the elements that add value to this work is the fact that the proposed algorithm involves a single loop. This is important since a large number of known min-max algorithms entail solving sub-problems in each iteration, which leads to complicated double and triple loop schemes. Also, I believe that the focus (and rate improvement) on problem class (1.2) is worthwhile since it captures a wide range of important applications (e.g robust optimization over multiple domains, problems in fair machine learning). Finally, the relaxation of the compactness assumption for the constraint set X is appreciated since that would extend the applicability of the results to unconstrained problems. In conclusion, this work develops a simple single loop algorithm for non-convex/concave problems which improves the state of the art convergence rate for an important subclass of problems. However, the practicality of the assumptions (i.e., if they hold in a number of useful machine learning problems under reasonable scenarios) underlying the latter result is not completely clear. Therefore, I would suggest the authors to revise their work and resubmit it.

Weaknesses: 1. There is a recent work in single loop algorithms for non-convex/concave problems [A]. The latter work appeared on Arxiv on 06/03 and thus it is understandable that the authors were not aware of it at the time of submission. However, I think that it is important to see now how the proposed algorithm and the results of this paper compare to the ones given in [A]. Does this work offer any advantages compared to [A]? 2. One of my main concerns is about the strict complementarity assumption in the results of problem (1.2). The O(1/epsilon^2) result for the problem class (1.2) is the central one and so it is important to establish that this assumption is reasonable and holds in practice in a number of machine learning problems. The effort made on the supplementary material to show that a weaker version of strict complementarity (which still ensures the same convergence results) holds for a typical machine learning problem (robust regression) is appreciated. However, I am still not completely persuaded about the practicality of this assumption (either the strict complementarity or its weaker version). I believe that a wider range of examples (e.g machine learning problems) are required in order to illustrate the rationality of this assumption in practice. 3. The experiments are performed over the MNIST dataset. In my opinion, this does not offer sufficiently strong indications about the superiority of the proposed algorithm. Additional experiments potentially over more complex datasets are needed in order to support the utility and the advantages of the new algorithm.

Correctness: yes

Clarity: yes

Relation to Prior Work: Please see my comments above.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: In this paper, the authors propose a single loop algorithm, they denote as Smoothed-GDA, for solving non-convex concave min-max optimization problems. The proposed method alternatively performs gradient descent and gradient ascent steps on the objective function with an added quadratic proximal term. The authors show that the algorithm finds an epsilon-stationary solution in at least O(epsilon^{-4}). Once applied to minimizing the pointwise maximum of a finite collection of non-convex functions, the algorithm achieves an O(epsilon^{-2}) iteration complexity. The authors further extend the Smoothed-GDA to the multi-block setting while achieving the same convergence rates.

Strengths: The authors used a smoothing technique to propose a variant of gradient descent-ascent (GDA) algorithm for solving non-convex concave min-max optimization problems. Due to its useful applications, solving the former class of problems in the non-convex setting has recently gained significant attention. Hence, the topic under study belongs to a vibrant field of research. In contrast to the (non)-convex strongly-concave case, GDA exhibits an oscillating behavior when the inner maximization problem is concave. To stabilize this oscillating behavior, the authors introduced a smoothing technique by adding a quadratic proximal term to the objective that uses an auxiliary sequence that is updated at every iteration. The authors established convergence of the algorithm and computed the iteration complexity in non-convex concave settings. To show convergence, the paper proposed a novel potential function and showed sufficient decrease at every iteration. When applied to the problem of minimizing the pointwise maximum of a finite collection of non-convex functions, the algorithm converges with optimal rate. The claims and theoretical results seem to be correct. Moreover, the proof technique used to establish the theoretical results are involved. The authors managed to present the ideas in a concise and organized manner.

Weaknesses: 1. The title used for the paper is misleading since the rate O(epsilon^{-2}) can only be achieved for a special class of non-convex concave min-max problems. More specifically, this rate is attained when using the algorithm to minimize the pointwise maximum of finite collection of non-convex functions. For general non-convex concave the rate is O(epsilon^{-4}). This is clarified in the body of the paper, but the title is clearly misleading. 2. To prove the result for the pointwise maximum case, the authors require strict complementarity assumption (Assumption 3.4). It is not clear in the paper when this assumption (or even its relaxed replacement in the appendix) holds in practice. I believe the authors should discuss methods to check whether this assumption holds for a given objective function. 3. Also looking at assumption 3.5, does it imply that the domain X is bounded? It is stated in the contributions that the compact of the domain X is relaxed. However, it is not clear whether assumption 3.5 affects that claim. 4. The authors mentioned that relaxing the compactness of the domain X significantly extends the applicability of the algorithm. I believe this statement should be clarified through practical examples. 5. The proof of lemma B.8 is missing. Below are some minor comments: 1. The authors state that multi-block algorithms cannot be easily adapted to solve problems with the multi-block structure due to the acceleration steps. The acceleration steps in such algorithms are mainly used to improve the iteration complexity and hence can be relaxed. If relaxed, existing algorithms can be easily adapted to solve problems with multi-block structure. 2. It is not clear how B.10 is attained. 3. Line 21 (Typo): convex in x and concave in y. 4. Line 107 (Typo): category instead of categories. 5. Algorithms 1, 2 and 3: the projection operator is not defined in the main body. 6. The proximal function can be confused with the projection operator. 7. Line 378: missing bracket ')'. 8. Lemma B.8: continuous 'in' instead of 'of. 9. Lemma B.10: missing bracket '('. 10. Lemma B.11: lambda should be replaced with \bar{lambda}. 11. Proof of Lemma B.11: in the first expression we should have 2c instead of c. Also the expression under line 417 has a typo. 12. Expression B.48: the label covers part of the inequality. 13. New line not needed on line 445. 14. Line 458: additional 'the'. 15. Line 512 (typo): M(x) instead of M(x)(x). 16. Line 513: should we have I{y^star} instead of I. 17. (D.24): Should we have an added vector [0, ..., 0, 1]. 18. Line 542: Should lambda in sigma_5 be in the numerator?

Correctness: The claims and theoretical results seem to be sound and correct. The empirical efficiency was illustrated by applying the algorithm to a robust neural network training application. In comparing the convergence speed of the proposed algorithm compared to [20], the authors did not mention whether the same step sizes were used in both methods which raises a concern on the viability of the comparison. To illustrate the efficiency of the algorithm, the authors applied the

Clarity: The main body of the paper is concise and clearly written. However, the proof in the supplementary material has many typos and one proof is missing.

Relation to Prior Work: The related material are referenced and well-discussed in the paper. The authors clearly positioned their work in the related field and discussed their contributions in comparison to other similar works.

Reproducibility: Yes

Additional Feedback: Post Rebuttal: Thank your for the author's response. After reading the rebuttal, my main concerns on the title name and compactness of the domain were alleviated.