Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
In this paper, the convergence of the gradient descent-ascent algorithm is studied for solving min-max optimization problems, including nonconvex-PL games and nonconvex-concave games. In general, the paper is well written. My comments are listed below.

1) Table 1 would be better presented by elaborating on the differences between this work and the existing ones in the problem setup of (1), e.g., the convexity assumptions and whether the sets (namely, $\mathcal{\Theta}$ and $\mathcal{A}$) are constrained.

2) I am not convinced by Remark 3.8: "one can easily extend the result of Theorem 3.4 to the stochastic setting". Would the stochastic gradient apply to both the inner maximization and the outer minimization steps? I assume the proposed analysis applies only to SGD; is that right? Would the analysis apply to the variant using Adam? If not, why was Adam used in Table 3? If so, please elaborate.

3) In (4) and (5), is the radius $1$ chosen for ease of analysis?

#################Post-feedback##################
My questions have been answered in the rebuttal. Please be sure to make the additional clarifications and experiments in the final version.
Originality. The paper seems to be a simple extension of known results for non-convex optimization to saddle-point problems. I think it is obvious that, since the saddle-point problem is concave or restricted strongly concave in the variable w.r.t. which we maximize, we can use some method to find an approximate gradient w.r.t. the variable in which we minimize. Then a gradient method with inexact oracle, e.g. http://papers.nips.cc/paper/6565-learning-supervised-pagerank-with-gradient-based-and-gradient-free-optimization-methods, can be applied to find an approximate stationary point. It is quite strange that the classical paper https://epubs.siam.org/doi/pdf/10.1137/S1052623403425629 is not cited in line 31. It also seems that the general algorithm https://link.springer.com/article/10.1007/s10589-014-9673-9 can be applied to the setting considered in this submission; probably the complexity would be the same.

Quality. As far as I can see, the proof of Lemma A.5 is not complete. To use the Taylor expansion for g(.), one first needs to prove that it is a smooth function. Moreover, the expansion is used up to the second-order term, so even more smoothness is needed. Thus, the contribution for the PL games is questionable. The other proofs seem to be correct. Nevertheless, it would be nice to add some more details on how the first inequality in the line after line 530 of the supplementary is obtained; the same holds for the line after line 449 of the supplementary. In the experiments, it would be nice to compare the results with the results of . Does a CNN have an advantage in comparison with a convex classifier?

Clarity. Except for the above points, the paper is well written and easy to follow.

Significance. Although the optimization contribution seems marginal, the algorithm can be interesting to practitioners in the ML community.

=============After rebuttal=============
The authors have mostly resolved my concerns in their rebuttal.
I still think that the paper is borderline in terms of its contribution. Not much is new in terms of the optimization, but the resulting algorithms can be interesting to the ML community. I'm increasing my score.
The basic idea behind their algorithms is that of projected gradient descent for the min player; inside the loop, one performs accelerated or simple gradient ascent for non-convex-concave or non-convex-PL games, respectively. For the case of non-convex-concave games, because the resulting max function may be non-smooth, the authors introduce some regularization to make it smooth. The techniques are quite standard; nevertheless, the result is quite decent.
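For concreteness, the two-loop structure described above can be sketched as follows. This is only an illustrative toy, not the paper's actual algorithm: the function names, step sizes, inner-loop length, and the unit-ball projection set are all assumptions, and plain gradient ascent stands in for the accelerated variant used in the non-convex-concave case.

```python
import numpy as np

def gd_with_inner_ascent(grad_theta, grad_alpha, theta0, alpha0,
                         eta=0.05, gamma=0.1,
                         inner_steps=20, outer_steps=100, radius=1.0):
    """Toy sketch of the two-loop scheme: an outer projected gradient
    descent on theta, with an inner gradient ascent on alpha that
    approximately solves max_alpha f(theta, alpha) before each outer step.
    Projection onto a Euclidean ball of the given radius stands in for
    the constraint sets; all step sizes here are illustrative."""
    def project(x, r):
        n = np.linalg.norm(x)
        return x if n <= r else x * (r / n)

    theta = np.asarray(theta0, dtype=float)
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(outer_steps):
        # Inner loop: (plain) gradient ascent on the max player's variable.
        for _ in range(inner_steps):
            alpha = project(alpha + gamma * grad_alpha(theta, alpha), radius)
        # Outer step: projected gradient descent on theta, evaluated at the
        # approximate inner maximizer.
        theta = project(theta - eta * grad_theta(theta, alpha), radius)
    return theta, alpha
```

For example, on the strongly-convex-strongly-concave toy objective f(theta, alpha) = theta^2 + 2*theta*alpha - alpha^2, whose unique saddle point is at the origin, the iterates converge to (0, 0).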
This is a good paper that studies an interesting problem. The paper is well written, with a good literature review and exposition. The ideas of using the PL condition and the smoothing approach are well motivated from optimization, and the application to game play is interesting. The technical results and analysis seem solid. The experiments are rather limited; it would be interesting to verify the results on more varied examples.