Paper ID: 3908

Title: ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization

The paper proposes a novel zeroth-order optimization algorithm inspired by the success of adaptive gradient methods (such as Adam, NAdam, and AMSGrad) in first-order optimization. It is the first work to bring these ideas into black-box optimization, which is a very important topic in machine learning, and I believe this work has potential for high impact. The theoretical analysis shows a convergence rate that is poly(d) worse than that of first-order AdaMM methods, which is acceptable for zeroth-order algorithms. The paper is well and clearly written.

The context of this work is derivative-free optimization (DFO). The proposed method is an adaptation of the first-order AdaMM algorithm. However, the DFO literature has been ignored. Many existing algorithms could be applied to solve the black-box optimization problem, and it is not clear what the proposed method is being compared against. It seems that all competitors fall into the same category and may have been re-implemented by the authors, yet many DFO solvers are publicly available. Under these conditions, it is not possible to assess the quality of the proposed algorithm. UPDATE: The authors responded very positively to my comments. They are eager to include an augmented literature review on DFO methods.

This paper proposes a zeroth-order adaptive momentum method (ZO-AdaMM) for black-box optimization, approximating the stochastic gradient by the forward difference of two function values along a random unit direction. The paper provides a convergence analysis in terms of the Mahalanobis distance for both unconstrained and constrained nonconvex optimization, yielding sublinear convergence rates that are roughly a factor of the square root of the dimension worse than those of first-order AdaMM, as well as an analysis for constrained convex optimization. The proposed scheme is quite interesting: it approaches (non)convex optimization from a new perspective and provides some new insight into adaptive momentum methods. In particular, the paper formally shows that the Euclidean projection may cause a non-convergence issue in stochastic optimization. The paper also demonstrates applications to black-box adversarial attacks and validates the method by comparing it with other ZO methods. Overall, the paper is well written and easy to follow, including the proofs I have checked. Although this work is technically sound, the authors may consider the following points to further improve its quality and significance.

1. Since first-order adaptive momentum methods remain popular and, in practice, gradients are often not difficult to obtain, what is the practical advantage of ZO-AdaMM over first-order adaptive momentum methods? Although the authors provide a table making this comparison clear, unfortunately no first-order methods are used for comparison in the experiments.

2. I believe ZO-AdaMM is not intended only for adversarial learning, as in the experiments shown in the paper. Why do the authors not show any experiments on regular image classification to check the performance of the proposed approach? Or is ZO-AdaMM particularly suited to black-box adversarial attacks? If ZO-AdaMM is proposed as a generic optimizer for most deep learning problems, it would be good to see more applications on different datasets.

3. The authors state that the tolerance on the smoothing parameter \mu is an important factor in the convergence performance of ZO optimization methods. How would the authors select \mu so that a good gradient approximation is obtained?

4. In Eq. (65), the definition of A contains an equal sign; what does it mean in the proof?

5. After Proposition 1, the authors introduce the Mahalanobis distance. It would be better to give more motivation and intuition for this metric and why it is a good choice for ZO-AdaMM, beyond just showing the limitations of the Euclidean projection.

6. In the proof of Proposition 1, when the authors state the optimality condition of (26), should it be an inequality? The same applies to (27).

7. In Section 4.2 of the supplementary material, the authors choose the step size by greedy search over the interval [10^{-2}, 10^{-4}] (should this interval be reversed, i.e., [10^{-4}, 10^{-2}]?). However, in Tables A1 and A2, a couple of learning rates lie outside this interval. Why is that? Also, how do the converged objective values relate to attack success in the paper?
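For concreteness, the forward-difference gradient estimator discussed above can be sketched as follows. This is a minimal illustration, not the paper's exact estimator: the function names, the number of random directions, and the averaging scheme are my own assumptions, and the paper's notation may differ.

```python
import numpy as np

def zo_gradient_estimate(f, x, mu=1e-3, num_samples=20, rng=None):
    """Forward-difference zeroth-order gradient estimator (illustrative sketch).

    Approximates grad f(x) using only function evaluations, averaging
    (d / mu) * (f(x + mu * u) - f(x)) * u over random unit directions u.
    `mu` is the smoothing parameter (point 3 above): too large biases the
    estimate, too small amplifies numerical noise in f(x + mu*u) - f(x).
    """
    rng = rng or np.random.default_rng()
    d = x.shape[0]
    fx = f(x)  # reuse the base evaluation across all directions
    grad = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)               # random unit direction
        grad += (d / mu) * (f(x + mu * u) - fx) * u
    return grad / num_samples
```

For a smooth quadratic such as f(x) = ||x||^2, the estimate concentrates around the true gradient 2x as the number of sampled directions grows, at the cost of one extra function evaluation per direction; this query cost is what drives the poly(d) gap to first-order AdaMM noted in the reviews.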