Paper ID: | 7788 |
---|---|

Title: | An Inexact Augmented Lagrangian Framework for Nonconvex Optimization with Nonlinear Constraints |

The authors propose to solve a non-convex optimization problem with non-linear constraints, which is a common template for a variety of problems in machine learning. The authors solve the primal problem inexactly, controlled by error \eps_k, which is gradually decreased, as penalty \beta_k is increased and the iterates approach stationarity. The inexact problem is solved with the help of both first order and second order solvers and this paper analyzes overall computational complexity of iALM under both kinds of substitutions, which is a first. The paper is well written, with good references and presentation. The experimental analysis gives useful insights into the performances of iALM with both Accelerated Proximal Gradient Method, a first order approach and 1BFGS.

Applying ALM to the Burer-Monterio problem and to nonlinear programs in general is natural and well summarized in the monograph Ref [8]. Allowing first-order and second-order approximate solvers for the primal subproblems is also classic, and can be found in, e.g., Ch 8 & 9 of Ref [8]. I think the main novelties here lie at the nonsmooth, convex term g(x) and the convergence rate results. Sec 5 of the paper has provided a comprehensive review of pertinent results under different assumptions. I have several concerns that I hope the authors can address: * The BM example does not quite justify the inclusion of the possibly nonsmooth term g in (1). The authors may want to balance out and briefly discuss other examples as appearing in the experiments. * Suppose there is no g, are there already existing convergence results of similar nature? Also, would the idea of proof be substantially different? If possible, it may help to sketch the high level idea of the proof in the main text. * Compared to the convex case, Alg 1 for the nonconvex case is obviously more complicated. I think for the benefit of the machine learning community which are mostly only familiar with the convex case, the authors may want to explain the main changes, e.g., why \sigma_k needs to be updated. * For the presentation in Sec 5, would it be better to tabulate the main results rather than the current form? It cost a bit of effort to really digest the fragmented pieces. * Maybe I'm missing something---the paper has mentioned twice that the regularity condition in (15) will be checked in Sec 6, i.e., the experiment. But I've not detected the presence. * I was reminded of a possible dual submission of this paper with paper # 6483, which I also reviewed. It seems to me that the linearized ADMM can indeed also be applied to the current model problem (1), although I'm not sure if a convergence rate result can be obtained. On the reverse direction, it's not clear if one can apply modified version of inexact ALM to the problem in # 6483 can still obtain the global results therein. Maybe the authors want to clarify on these.

I have read the rebuttal and some of my concerns are raised. I admit that I have missed the verification for (15) for 3 specific cases in the appendix in E&F, mostly since in the original paper it incorrectly says it will be mentioned in section 6. But anyway overall the paper has some values in the convergence analysis of some specific problems including clustering etc. Therefore I have modified my score. However, I still think the contribution of this paper is overstated and the statements for corollary 4.2/4.3 are not entirely valid. Normally for optimization results, given the function f, and g this work is targeting to optimize upon, the assumption should only rely on the geometry of f and g. Therefore one could tell if the functions satisfy this assumption or not, and whether the algorithm would work. Of course it could also include some tolerance on the optimality of the primal progress as the equation in step 2 of algorithm 1 does. It is valid since given enough optimization steps, it could be achieved. However, equation (15) is some contrived condition that depends on the intermediate x_k and the beta_k that is not clearly what it means *in general*. I agree that the author have discussed and verified (15) for 3 cases all happened *when g=0 or is an indicator function*. And in this scenario, the function almost has nothing to do with g and it only relies on the geometry of f on the valid domain, which is fine. But clearly the paper is overstating the contribution for their conclusion for general convex g. Plus the assumption is simply omitted in the corrollary 4.2/4.3 without verification. This makes it looks like they show such convergence rate for general f and g, which is not correct. I'm fine with accepting the paper, but at least the authors should improve the presentation. Please move some of the valuable results from the appendix to the main text, and modify the corollaries so that they're self-content with the proper assumptions included. ======================================================== 1. My biggest concern is that the theoretical guarantee is not at all rigorous. It uses a weird and contrived assumption (15), assuming the function satisfies some property with parameter beta_k that depends on the number of iteration k. Normally a proper assumption should be only on the geometry of the loss, and be independent with the intermediate iterates. Meanwhile, no explanation is provided for this beta_k. Later the authors simply assume beta_k to be some b^k, still without explainations. For such assumptions, the authors have not even shown any real problems that meet the requirements. 2. The major content of the paper is about literature review. For the real contribution part, it fails to provide any intuition on how the method works, and why does it propose such a weird assumption. 3. The mathematical language is not compact and weird. The theorems are written in a sloppy way, with some key terms only defined in the appendix. 4. Finally, the experimental results are not convincing either. The only example in the main text shows improved efficiency on a single dataset for a single task, while all other experiments in the appendix show negative results.