{"title": "ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 7204, "page_last": 7215, "abstract": "The adaptive momentum method (AdaMM), which uses past gradients to update descent directions and learning rates simultaneously, has become one of the most popular first-order optimization methods for solving machine learning problems. However, AdaMM is not suited for solving black-box optimization problems, where explicit gradient forms are difficult or infeasible to obtain. In this paper, we propose a zeroth-order AdaMM (ZO-AdaMM) algorithm, that generalizes AdaMM to the gradient-free regime. We show that the convergence rate of ZO-AdaMM for both convex and nonconvex optimization is roughly a factor of $O(\\sqrt{d})$ worse than that of the first-order AdaMM algorithm, where $d$ is problem size. In particular, we provide a deep understanding on why Mahalanobis distance matters in convergence of ZO-AdaMM and other AdaMM-type methods. As a byproduct, our analysis makes the first step toward understanding adaptive learning rate methods for nonconvex constrained optimization.Furthermore, we demonstrate two applications, designing per-image and universal adversarial attacks from black-box neural networks, respectively. 
We perform extensive experiments on ImageNet and empirically show that ZO-AdaMM converges much faster to a solution of high accuracy compared with $6$ state-of-the-art ZO optimization methods.", "full_text": "ZO-AdaMM: Zeroth-Order Adaptive Momentum\n\nMethod for Black-Box Optimization\n\nXiangyi Chen1,\u2217 Sijia Liu2,\u2217 Kaidi Xu3,\u2217 Xingguo Li4,\u2217\n\nXue Lin3 Mingyi Hong1 David Cox2\n\n1University of Minnesota, USA\n\n2MIT-IBM Watson AI Lab, IBM Research, USA\n\n3Northeastern University, USA\n\n4Princeton University, USA\n\nAbstract\n\nThe adaptive momentum method (AdaMM), which uses past gradients to update\ndescent directions and learning rates simultaneously, has become one of the most\npopular \ufb01rst-order optimization methods for solving machine learning problems.\nHowever, AdaMM is not suited for solving black-box optimization problems,\nwhere explicit gradient forms are dif\ufb01cult or infeasible to obtain. In this paper,\nwe propose a zeroth-order AdaMM (ZO-AdaMM) algorithm, that generalizes\nAdaMM to the gradient-free regime. We show that the convergence rate of ZO-\n\nAdaMM for both convex and nonconvex optimization is roughly a factor of O(\u221ad)\nworse than that of the \ufb01rst-order AdaMM algorithm, where d is problem size. In\nparticular, we provide a deep understanding on why Mahalanobis distance matters\nin convergence of ZO-AdaMM and other AdaMM-type methods. As a byproduct,\nour analysis makes the \ufb01rst step toward understanding adaptive learning rate\nmethods for nonconvex constrained optimization. Furthermore, we demonstrate\ntwo applications, designing per-image and universal adversarial attacks from black-\nbox neural networks, respectively. 
We perform extensive experiments on ImageNet\nand empirically show that ZO-AdaMM converges much faster to a solution of high\naccuracy compared with 6 state-of-the-art ZO optimization methods.\n\n1\n\nIntroduction\n\nThe development of gradient-free optimization methods has become increasingly important to solve\nmany machine learning problems in which explicit expressions of the gradients are expensive or\ninfeasible to obtain [1\u20137]. Zeroth-Order (ZO) optimization methods, one type of gradient-free\noptimization methods, mimic \ufb01rst-order (FO) methods but approximate the full gradient (or stochastic\ngradient) through random gradient estimates, given by the difference of function values at random\nquery points [8, 9]. Compared to Bayesian optimization, derivative-free trust region methods,\ngenetic algorithms and other types of gradient-free methods [10\u201313], ZO optimization has two main\nadvantages: a) ease of implementation, via slight modi\ufb01cation of commonly-used gradient-based\nalgorithms, and b) comparable convergence rates to \ufb01rst-order algorithms.\n\nDue to the stochastic nature of ZO optimization, which arises from both data sampling and random\ngradient estimation, existing ZO methods suffer from large variance of the noisy gradient compared\nto FO stochastic methods [14]. In practice, this causes poor convergence performance and/or function\nquery ef\ufb01ciency. To partially mitigate these issues, ZO sign-based SGD (ZO-signSGD) was proposed\nby [14] with the rationale that taking the sign of random gradient estimates (i.e., normalizing gradient\nestimates elementwise) as the descent direction improves the robustness of gradient estimators\n\n\u2217Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fto stochastic noise. 
Although ZO-signSGD converges faster than many existing ZO algorithms, it is only guaranteed to converge to a neighborhood of a solution. In the FO setting, taking the sign of a stochastic gradient as the descent direction gives rise to signSGD [15]. The use of the sign of stochastic gradients also appears in adaptive momentum methods (AdaMM) such as Adam [16], RMSProp [17], AMSGrad [18], Padam [19], and AdaFom [20]. Indeed, it has been suggested by [21] that AdaMM enjoys the dual advantages of sign descent and variance adaptation.\n\nConsidering the motivation of ZO-signSGD and the success of AdaMM in FO optimization, one question arises: Can we generalize AdaMM to the ZO regime? To answer this question, we develop the zeroth-order adaptive momentum method (ZO-AdaMM) and analyze its convergence properties in both convex and nonconvex settings for constrained optimization.\n\nContributions Theoretically, for both convex and nonconvex optimization, we show that the convergence rate of ZO-AdaMM is roughly a factor of O(√d) worse than that of the FO AdaMM algorithm, where d is the number of optimization variables. We also show that Euclidean projection based AdaMM-type methods can suffer from non-convergence issues in constrained optimization. This highlights the necessity of Mahalanobis distance based projection. We then establish a Mahalanobis distance based convergence analysis, which takes a first step toward understanding adaptive learning rate methods for nonconvex constrained optimization.\n\nPractically, we formalize the experimental comparison of ZO-AdaMM with 6 state-of-the-art ZO algorithms in the application of black-box adversarial attacks to generate both per-image and universal adversarial perturbations. Our proposal could provide an experimental benchmark for future studies on ZO optimization. 
Code to reproduce experiments is released at the link https://github.com/\nKaidiXu/ZO-AdaMM.\n\nRelated work Many types of ZO algorithms have been developed, and their convergence rates have\nbeen rigorously studied under different problem settings. We highlight some recent works as below.\nFor unconstrained stochastic optimization, ZO stochastic gradient descent (ZO-SGD) [9] and ZO\n\nstochastic coordinate descent (ZO-SCD) [22] were proposed, which have O(\u221ad/\u221aT ) convergence\nrate, where T is the number of iterations. Compared to FO stochastic algorithms, ZO optimization\nsuffers a slowdown dependent on the variable dimension d, e.g., O(\u221ad) for ZO-SGD and ZO-SCD.\nIn [23], the tightness of the dimension-dependent factor O(\u221ad) has been proved in the framework\n\nof ZO stochastic mirror descent (ZO-SMD). In order to further improve the iteration complexity of\nZO algorithms, the technique of variance reduction was applied to ZO-SGD and ZO-SCD, leading\nto ZO stochastic variance reduced algorithms with an improved convergence rate in T , namely,\nO(d/T ) [24\u201326]. This improvement is aligned with ZO gradient descent (ZO-GD) for deterministic\nnonconvex programming [8]. Moreover, ZO versions of proximal SGD (ProxSGD) [27], Frank-Wolfe\n(FW) [28, 2, 29], and online alternating direction method of multipliers (OADMM) [1, 30] have been\ndeveloped for constrained optimization. Aside from the recent works on ZO algorithms mentioned\nbefore, there is rich literature in derivative-free optimization (DFO). Traditional DFO methods can\nbe classi\ufb01ed into direct search-based methods and model-based methods. Both the two types of\nmethods are mostly iterative methods. The difference is that direct search-based methods re\ufb01ne their\nsearch directions based on the queried function values directly, while a model-based method builds\na model that approximates the function to be optimized and updates the search direction based on\nthe model. 
Representative methods developed in the DFO literature include NOMAD [31, 32], PSWarm [33], Cobyla [34], and BOBYQA [35]. More comprehensive discussions on DFO methods can be found in [36, 37].\n\n2 Preliminaries: Gradient Estimation via ZO Oracle\n\nThe ZO gradient estimate of a function f is constructed from the forward difference of two function values along a random unit direction:\n\n∇̂f(x) = (d/µ)[f(x + µu) − f(x)]u,  (1)\n\nwhere u is a random vector drawn uniformly from the unit sphere, and µ > 0 is a small step size, known as the smoothing parameter. In many existing works such as [8, 9], the random direction vector u was drawn from the standard Gaussian distribution. Here the use of the uniform distribution ensures that the ZO gradient estimate (1) is defined in a bounded space rather than the whole real space, as required for Gaussian directions. As will be evident later, the boundedness of random gradient estimates is one of the important conditions in the convergence analysis of ZO-AdaMM.\n\nThe rationale behind the ZO gradient estimate (1) is that although it is a biased approximation of the true gradient of f, it is an unbiased estimate of the gradient of the randomized smoothing version of f with parameter µ [23, 24, 30], i.e.,\n\nf_µ(x) = E_{u∼UB}[f(x + µu)],  (2)\n\nwhere u ∼ UB denotes the uniform distribution over the unit Euclidean ball B. We review properties of the smoothing function (2) and connections to the ZO gradient estimator (1) in Appendix 1.\n\n3 AdaMM from First to Zeroth Order\n\nConsider a stochastic optimization problem of the generic form\n\nmin_{x∈X} f(x) = E_ξ[f(x; ξ)],  (3)\n\nwhere x ∈ R^d are optimization variables, X is a closed convex set, f is a differentiable (possibly nonconvex) objective function, and ξ is a certain random variable that captures environmental uncertainties. 
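As a quick illustration, the forward-difference estimator (1) and its smoothing-function interpretation (2) can be sketched in a few lines; this is a minimal sketch, and the function and variable names are our own, not from the paper:

```python
import numpy as np

def zo_grad_estimate(f, x, mu=1e-3, rng=None):
    """Forward-difference ZO gradient estimate in the spirit of (1):
    (d / mu) * [f(x + mu * u) - f(x)] * u, with u uniform on the unit sphere."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform direction on the unit sphere
    return (d / mu) * (f(x + mu * u) - f(x)) * u

# Averaging many single-query estimates approximates the gradient of the
# smoothed function f_mu in (2); for a quadratic this is close to grad f.
f = lambda x: 0.5 * np.sum(x ** 2)          # grad f(x) = x
x0 = np.array([1.0, -2.0, 3.0])
rng = np.random.default_rng(0)
g_avg = np.mean([zo_grad_estimate(f, x0, rng=rng) for _ in range(20000)],
                axis=0)
```

Each single-query estimate is noisy (its variance grows with d), but its average over many directions concentrates around ∇f_µ(x₀) ≈ ∇f(x₀).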
In problem (3), if ξ obeys a uniform distribution built on empirical samples {ξ_i}_{i=1}^n, then we recover a finite-sum formulation with the objective function f(x) = (1/n) Σ_{i=1}^n f(x; ξ_i).\n\nFirst-order AdaMM in terms of AMSGrad [18]. We specify the algorithmic framework of AdaMM by AMSGrad [18], a modified version of Adam [16] with convergence guarantees for both convex and nonconvex optimization. In the algorithm, the descent direction m_t is given by an exponential moving average of the past gradients. The learning rate r_t is adaptively penalized by a square root of exponential moving averages of squared past gradients. It has been proved in [18, 20, 38, 39] that AdaMM can reach an O(1/√T)² convergence rate. Here we omit its possible dependency on d for simplicity; a more accurate analysis will be provided later in Sections 4 and 5.\n\nZO-AdaMM. By integrating AdaMM with the random gradient estimator (1), we obtain ZO-AdaMM in Algorithm 1. Here the square root, the square, the maximum, and the division operators are taken elementwise. Also, Π_{X,H}(a) denotes the projection operation under the Mahalanobis distance with respect to H, i.e., argmin_{x∈X} ‖√H(x − a)‖²₂. If X = R^d, the projection step simplifies to x_{t+1} = x_t − α_t V̂_t^{−1/2} m_t. Clearly, α_t V̂_t^{−1/2} and m_t can be interpreted as the adaptive learning rate and the momentum-type descent direction, which adopt exponential moving averages as follows,\n\nAlgorithm 1 ZO-AdaMM\nInput: x_1 ∈ X, step sizes {α_t}_{t=1}^T, β_{1,t}, β_2 ∈ (0, 1], and set m_0, v_0 and v̂_0\nfor t = 1, 2, . . . , T do\n  let ĝ_t = ∇̂f_t(x_t) by (1), f_t(x_t) := f(x_t; ξ_t)\n  m_t = β_{1,t} m_{t−1} + (1 − β_{1,t}) ĝ_t\n  v_t = β_2 v_{t−1} + (1 − β_2) ĝ_t²\n  v̂_t = max(v̂_{t−1}, v_t), and V̂_t = diag(v̂_t)\n  x_{t+1} = Π_{X,√V̂_t}(x_t − α_t V̂_t^{−1/2} m_t)\nend for\n\nm_t = Σ_{j=1}^t [ (Π_{k=1}^{t−j} β_{1,t−k+1}) (1 − β_{1,j}) ĝ_j ],  v_t = (1 − β_2) Σ_{j=1}^t β_2^{t−j} ĝ_j².  (4)\n\nHere we assume that m_0 = 0, v_0 = 0 and 0⁰ = 1 by convention, and let ĝ_t = ∇̂f_t(x_t) by (1) with f_t(x_t) := f(x_t; ξ_t).\n\nMotivation and rationale behind ZO-AdaMM. First, gradient normalization helps noise reduction in ZO optimization, as shown by [6, 14]. In a similar spirit, ZO-AdaMM also normalizes the descent direction m_t by √v̂_t. In particular, compared to AdaMM, ZO-AdaMM prefers a small value of β₂ in practice, implying a strong preference for normalizing the current gradient estimate; see Fig. A1 in the Appendix. In the extreme case of β_{1,t} = β₂ → 0 and v̂_t = v_t, ZO-AdaMM reduces to ZO-signSGD [14] since V̂_t^{−1/2} m_t = m_t/√v_t = ĝ_t/√(ĝ_t²) = sign(ĝ_t), known from (4). However, the downside of ZO-signSGD is its worse convergence accuracy than ZO-SGD, i.e., it only converges to a neighborhood of a stationary point even for unconstrained optimization. Compared to ZO-signSGD, ZO-AdaMM covers ZO-SGD as a special case when β_{1,t} = 0, β₂ = 1, v_0 = 1 and v̂_0 ≤ 1 in Algorithm 1. Thus, we hope that with appropriate choices of β_{1,t} and β₂, ZO-AdaMM can enjoy the dual advantages of ZO-signSGD and ZO-SGD. Another motivation comes from the possible presence of time-dependent gradient priors [40]. 
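For concreteness, the unconstrained case of Algorithm 1 (X = R^d, so the projection disappears) can be sketched as follows. This is a minimal sketch with illustrative hyperparameters (α, β₁, β₂, µ, T are our own choices), using the uniform-sphere estimator (1):

```python
import numpy as np

def zo_adamm(f, x0, T=3000, alpha=0.02, beta1=0.9, beta2=0.5, mu=1e-3, seed=0):
    """Minimal unconstrained ZO-AdaMM (Algorithm 1 with X = R^d).
    The square root, square, max, and division are taken elementwise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    m = np.zeros(d)       # m_0 = 0
    v = np.zeros(d)       # v_0 = 0
    v_hat = np.zeros(d)   # \hat v_0 = 0
    for _ in range(T):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                         # unit-sphere direction
        g = (d / mu) * (f(x + mu * u) - f(x)) * u      # estimator (1)
        m = beta1 * m + (1 - beta1) * g                # momentum EMA
        v = beta2 * v + (1 - beta2) * g ** 2           # second-moment EMA
        v_hat = np.maximum(v_hat, v)                   # AMSGrad-style max
        x = x - alpha * m / (np.sqrt(v_hat) + 1e-12)   # \hat V_t^{-1/2} m_t
    return x

# Drive a simple quadratic toward its minimizer at the all-ones vector.
x_out = zo_adamm(lambda x: np.sum((x - 1.0) ** 2), np.zeros(5))
```

Note that a small β₂ here (0.5 rather than Adam's usual 0.999) reflects the paper's observation that ZO-AdaMM favors normalizing the current gradient estimate.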
Given this, the use of past gradients in momentum also helps noise reduction.\n\n²In this paper, we omit log(T) factors in the big-O notation.\n\nWhy is ZO-AdaMM difficult to analyze? The convergence analysis of ZO-AdaMM becomes significantly more challenging than that of existing ZO methods due to the coupling among stochastic sampling, ZO gradient estimation, momentum, the adaptive learning rate, and the projection operation. In particular, the use of the Mahalanobis distance in the projection step plays a key role in the convergence guarantees. Moreover, the conventional variance bound on ZO gradient estimates is insufficient for analyzing the convergence of ZO-AdaMM due to the use of the adaptive learning rate. In the next sections, we carefully study the convergence of ZO-AdaMM under different settings.\n\n4 Convergence Analysis of ZO-AdaMM for Nonconvex Optimization\n\nIn this section, we begin by providing a deep understanding of the importance of the Mahalanobis distance used in ZO-AdaMM (Algorithm 1), and then introduce the Mahalanobis distance based convergence analysis for both unconstrained and constrained nonconvex optimization. Our analysis takes a first step toward understanding adaptive learning rate methods for nonconvex constrained optimization. Throughout the section, we make the following assumptions.\nA1: f_t(·) := f(·; ξ_t) has an L_g-Lipschitz continuous gradient, where L_g > 0.\nA2: f_t has an η-bounded stochastic gradient, ‖∇f_t(x)‖_∞ ≤ η.\n\n4.1 Importance of the Mahalanobis distance based projection operation\n\nRecall from Algorithm 1 that ZO-AdaMM takes the projection operation Π_{X,√V̂_t}(·) onto the constraint set X under the Mahalanobis distance with respect to (w.r.t.) V̂_t. In some recent adversarial learning algorithms [41, 42], the Euclidean projection Π_X(·) was used in both FO and ZO AdaMM-type methods rather than the Mahalanobis distance based projection in Algorithm 1. 
However, such an implementation could lead to non-convergence: Proposition 1 shows the non-convergence issue of Algorithm 1 using the Euclidean projection operation when solving a simple linear program subject to an ℓ₁-norm constraint. This is an important point that is ignored in the design of many algorithms on adversarial training [43].\n\nProposition 1 Consider the following problem\n\nminimize_{x=[x₁,x₂]^T} −2x₁ − x₂;  subject to |x₁ + x₂| ≤ 1,  (5)\n\nthen Algorithm 1, initialized by x = [0.5, 0.5]^T and using the Euclidean projection Π_X(·), converges to a fixed point [0.5, 0.5]^T rather than a stationary point of (5).\n\nProof: The proof investigates a special case of Algorithm 1, projected signSGD; see Appendix 2.1.\n\nProposition 1 indicates that replacing the Mahalanobis distance based projection in Algorithm 1 with the Euclidean projection can prevent the algorithm from converging to a stationary point, highlighting the importance of using the Mahalanobis distance. However, the use of the Mahalanobis distance based projection complicates the convergence analysis, especially in constrained optimization. Accordingly, we define a Mahalanobis distance based convergence measure that simplifies the analysis and can be converted into the traditional convergence measure.\n\nLetting x⁺ = x_{t+1}, x⁻ = x_t, g = m_t, ω = α_t and H = V̂_t^{1/2}, the projection step of Algorithm 1 can be written in the generic form\n\nx⁺ = argmin_{x∈X} {⟨g, x⟩ + (1/ω) D_H(x, x⁻)},  (6)\n\nwhere D_H(x, x⁻) = ‖H^{1/2}(x − x⁻)‖²/2 gives the Mahalanobis distance w.r.t. H, and ‖·‖ denotes the ℓ₂ norm. 
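The failure mode in Proposition 1 is easy to reproduce numerically. Below is a small sketch of the special case examined in the proof (projected signSGD with the Euclidean projection onto the slab |x₁ + x₂| ≤ 1); the iterate returns to [0.5, 0.5]^T after every step:

```python
import numpy as np

def proj_slab_euclidean(a):
    """Euclidean projection onto X = {x : |x_1 + x_2| <= 1}: project onto the
    nearer bounding hyperplane x_1 + x_2 = +/-1 when the slab is violated."""
    s = a[0] + a[1]
    if s > 1:
        return a - 0.5 * (s - 1) * np.ones(2)
    if s < -1:
        return a - 0.5 * (s + 1) * np.ones(2)
    return a

grad = np.array([-2.0, -1.0])   # gradient of f(x) = -2 x_1 - x_2 (constant)
x = np.array([0.5, 0.5])
for _ in range(100):            # projected signSGD with step size 0.1
    x = proj_slab_euclidean(x - 0.1 * np.sign(grad))
# x stays at [0.5, 0.5]; yet moving along the feasible direction (1, -1)
# keeps x_1 + x_2 fixed and still decreases the objective, so this point
# is not stationary.
```

The sign step moves both coordinates by the same amount, and the Euclidean projection onto the hyperplane x₁ + x₂ = 1 removes exactly that motion, so the iterate never leaves [0.5, 0.5]^T.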
Based on (6), the concept of gradient mapping [27] is given by\n\nP_{X,H}(x⁻, g, ω) := (x⁻ − x⁺)/ω.  (7)\n\nThe gradient mapping P_{X,H}(x⁻, g, ω) has a natural interpretation: it is a projected version of g at the point x⁻ given the learning rate ω, yielding x⁺ = x⁻ − ω P_{X,H}(x⁻, g, ω). We note that, different from [27, 44], the gradient mapping in (7) is defined on the projection under the Mahalanobis distance D_H(·,·) rather than the Euclidean distance.\n\nWith the aid of (7), we propose the Mahalanobis distance based convergence measure for ZO-AdaMM:\n\n‖G(x_t)‖² := ‖V̂_t^{1/4} P_{X,V̂_t^{1/2}}(x_t, ∇f(x_t), α_t)‖².  (8)\n\nIf X = R^d, then the convergence measure (8) reduces to\n\n‖V̂_t^{−1/4} ∇f(x_t)‖²,  (9)\n\nwhich corresponds to the squared Euclidean norm of the gradient in a linearly transformed coordinate system y_t = V̂_t^{1/4} x_t. As will be evident later, the measure (9) can be transformed into the conventional measure ‖∇f(x_t)‖² for unconstrained optimization.\n\nWe remark that the Mahalanobis (M-) distance facilitates our convergence analysis in an equivalently transformed space, over which the analysis can be generalized from the conventional projected gradient descent framework. To get intuition, let us consider a simpler first-order case with the x-descent step given by Algorithm 1 as β_{1,t} = 0 and X = R^d: x_{t+1} = x_t − α V̂_t^{−1/2} ∇f(x_t). Note that the ZO case is more involved but follows the same intuition. Upon defining y_t ≜ V̂_t^{1/4} x_t, the x-update can then be rewritten as the update rule in y: y_{t+1} = y_t − α V̂_t^{−1/4} ∇f(x_t). Since ∇_{y_t} f(x_t) = (∂x_t/∂y_t)^T ∇f(x_t) = V̂_t^{−1/4} ∇f(x_t), the y-update, y_{t+1} = y_t − α ∇_y f(x_t), obeys the gradient descent framework. 
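The coordinate-change argument above can be checked numerically for a single halfspace constraint and a diagonal H: the M-projection computed in x-coordinates coincides with a plain Euclidean projection carried out in the transformed coordinates y = H^{1/2}x. This is a small sketch under those assumptions, with our own function names:

```python
import numpy as np

def proj_halfspace_weighted(a, h, c, b):
    """argmin_x ||diag(h)^{1/2} (x - a)||^2  s.t.  c^T x <= b (diagonal H).
    Closed form: a - lambda * H^{-1} c with lambda = max(0, (c.a - b)/(c^T H^{-1} c))."""
    lam = max(0.0, (c @ a - b) / (c @ (c / h)))
    return a - lam * c / h

def proj_halfspace_euclidean(a, c, b):
    """Euclidean projection onto {x : c^T x <= b}."""
    lam = max(0.0, (c @ a - b) / (c @ c))
    return a - lam * c

rng = np.random.default_rng(1)
h = rng.uniform(0.5, 2.0, size=4)       # diagonal of H (playing the role of V-hat^{1/2})
a = rng.standard_normal(4)              # point to project
c = rng.standard_normal(4)
b = c @ a - 1.0                         # make the constraint active at a

# Route 1: Mahalanobis projection directly in x-coordinates.
x_plus = proj_halfspace_weighted(a, h, c, b)

# Route 2: map to y = H^{1/2} x (here sqrt(h) elementwise), project in
# Euclidean distance (constraint becomes (H^{-1/2} c)^T y <= b), map back.
y_plus = proj_halfspace_euclidean(np.sqrt(h) * a, c / np.sqrt(h), b)
x_back = y_plus / np.sqrt(h)
```

Both routes produce the same point, which is exactly why projected gradient descent machinery applies in the y-coordinate system.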
In the constrained case, a similar but more involved analysis can be made, showing that the M-projection in the x-coordinate system is equivalent to the Euclidean projection in the y-coordinate system, which makes projected gradient descent applicable to the update in y. By contrast, the direct use of the Euclidean projection in the x-coordinate system leads to non-convergence of ZO-AdaMM (Proposition 1).\n\n4.2 Unconstrained nonconvex optimization\n\nWe next demonstrate the convergence analysis of ZO-AdaMM for unconstrained nonconvex optimization. In Proposition 2, we begin by exploring the relationship between the convergence measure (9) and ZO gradient estimates; see Appendix 2.2 for the proof.\n\nProposition 2 Suppose that A1-A2 hold and let X = R^d, v̂_0^{1/2} ≥ c1, f_µ(x_1) − min_x f_µ(x) ≤ D_f, β_{1,t} = β_1, γ := β_1/β_2 < 1, µ = 1/√(Td), and α_t = 1/√(Td) in Algorithm 1; then ZO-AdaMM yields\n\nE[‖V̂_R^{−1/4} ∇f(x_R)‖²] ≤ (√d/√T) E[2η² + η max_{t∈[T]}{‖ĝ_t‖_∞} / (2(1 − β₁)²(1 − β₂)(1 − γ))] + 2D_f (√d/√T) + (L_g²/(2c) + (2/c) L_g(4 + 5β₁²)/(1 − β₁)) (d/T),  (10)\n\nwhere x_R is picked uniformly at random from {x_t}_{t=1}^T, and ĝ_t = ∇̂f_t(x_t) by (1).\n\nProposition 2 implies that the convergence rate of ZO-AdaMM depends on the ZO gradient estimates through G_zo := max_{t∈[T]}{‖ĝ_t‖_∞}. Moreover, if we consider the FO AdaMM [20, 38], in which the ZO gradient estimate ĝ_t is replaced with the stochastic gradient, then one can simply assume max_{t∈[T]}{‖g_t‖_∞} to be a dimension-independent constant under A2. 
However, in the ZO setting, G_zo is no longer independent of d. For example, it can be directly bounded by ‖∇̂f(x)‖₂ ≤ (d/µ)|f(x + µu) − f(x)| ≤ dL_c under the following assumption:\nA3: f_t is L_c-Lipschitz continuous.\n\nIn Proposition 3, we show that the dimension dependency of G_zo can be further improved by using sphere concentration results; see Appendix 2.3 for the proof.\n\nProposition 3 Under A3 and max{d, T} ≥ 3, given δ ∈ (0, 1), with probability at least 1 − δ,\n\nmax_{t∈[T]}{‖ĝ_t‖_∞} ≤ 2L_c √(d log(dT/δ)).  (11)\n\nHere we provide some insights on Proposition 3. Since the unit random vector used to define ĝ_t is uniformly sampled on a sphere, the bound on ‖ĝ_t‖_∞ can be improved to O(√d) with high probability. This is a tight bound, since when the function difference is a constant the lower bound satisfies ‖ĝ_t‖_∞ = Ω(√d) by sphere concentration. It is also not surprising that our bound (11) grows with T, since we bound the maximum of ‖ĝ_t‖_∞ over T realizations with high probability. The time dependence is required to compensate for the growth, over time, of the probability that there exists an estimate with an extreme ℓ_∞ value. Note that as long as T has a polynomial rather than exponential dependency on d, we always have max_{t∈[T]}{‖ĝ_t‖_∞} = O(√(d log d)). Based on Propositions 2 and 3, the convergence rate of ZO-AdaMM is provided by Theorem 1; see Appendix 2.4 for the proof.\n\nTheorem 1 Suppose that A1 and A3 hold. 
Given the parameter settings in Propositions 2 and 3, with probability at least 1 − 1/(T√d), ZO-AdaMM yields\n\nE[‖V̂_R^{−1/4} ∇f(x_R)‖²] = O(√d/√T + d^{1.5}/T).  (12)\n\nWe can also extend the convergence rate of ZO-AdaMM in Theorem 1 to the measure E[‖∇f(x_R)‖²]. Since V̂_{t,ii}^{−1/2} ≥ 1/max_{t∈[T]}{‖ĝ_t‖_∞} (by the update rule), we obtain from (11) that\n\nE[‖∇f(x_R)‖²] ≤ 2L_c √(d log(dT/δ)) E[‖V̂_R^{−1/4} ∇f(x_R)‖²].  (13)\n\nTheorem 1, together with (13), implies an O(d/√T + d²/T) convergence rate of ZO-AdaMM under the conventional measure. We remark that compared to the FO rate O(√d/√T + d/T) [38] of AdaMM for unconstrained nonconvex optimization under A1-A2, ZO-AdaMM suffers an O(√d) and an O(d) slowdown on the rate terms O(1/√T) and O(1/T), respectively. This dimension-dependent slowdown is similar to that of ZO-SGD versus SGD shown by [9]. We also remark that, compared to FO AdaMM, ZO-AdaMM requires the additional assumption A3 to bound the ℓ_∞ norm of ZO gradient estimates.\n\n4.3 Constrained nonconvex optimization\n\nTo analyze ZO-AdaMM in the general constrained case, one needs to handle the coupling effects of all three factors: momentum, the adaptive learning rate, and the projection operation. Here we focus on addressing the coupling between the last two factors, which yields our results on ZO-AdaMM at β_{1,t} = 0. This is equivalent to the ZO version of RMSProp [17] with Reddi's convergence fix in [18]. When the momentum factor comes into play, the scenario becomes much more complicated; we leave the general case β_{1,t} ≠ 0 for future research. 
Even for SGD with momentum, we are not aware of any successful convergence analysis for stochastic constrained nonconvex optimization.\n\nIt is known from SGD [27] that the presence of projection induces a stochastic bias (independent of the iteration number T) for constrained nonconvex optimization. In Theorem 2, we show that the same challenge holds for ZO-AdaMM. Thus, one has to adopt a variance reduced gradient estimator, which induces a higher query complexity than the estimator (1); see Appendix 2.5 for the proof.\n\nTheorem 2 Suppose that A1-A2 hold, v̂_0^{1/2} ≥ c1, f_µ(x_1) − min_x f_µ(x) ≤ D_f, α_t = α ≤ c/L_g, µ = 1/√(Td), and β_{1,t} = 0 in Algorithm 1; then the convergence rate of ZO-AdaMM under (8) satisfies\n\nE[‖G(x_R)‖²] ≤ 6D_f/(αT) + 3L_g²d/(4cT) + (6η²/(c⁴T))(max_{t∈[T]} E[‖ĝ_t − ∇f_µ(x_t)‖²] + dη²) + ((3c + 9)/c) max_{t∈[T]} E[‖ĝ_t − ∇f_µ(x_t)‖²],\n\nwhere x_R is picked uniformly at random from {x_t}_{t=1}^T, G(x) has been defined in (8), and f_µ is the smoothing function of f defined in (2).\n\nTheorem 2 implies that regardless of the number of iterations T, ZO-AdaMM only converges to a neighborhood of a solution whose size is determined by the variance of the ZO gradient estimates, max_{t∈[T]} E[‖ĝ_t − ∇f_µ(x_t)‖²]. To make this term diminishing, we consider the following variance reduced gradient estimator built on multiple stochastic samples and random direction vectors [14],\n\nĝ_t = (1/(bq)) Σ_{j∈I_t} Σ_{i=1}^q ∇̂f(x_t; u_{i,t}, ξ_j),  ∇̂f(x_t; u_{i,t}, ξ_j) := (d[f(x_t + µu_{i,t}; ξ_j) − f(x_t; ξ_j)]/µ) u_{i,t},  (14)\n\nwhere I_t is a mini-batch containing b stochastic samples at time t, and {u_{i,t}}_{i=1}^q are q random direction vectors at time t. 
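To see why the averaging in (14) helps, the sketch below compares the squared estimation error of a single-direction estimate (q = 1) against an average over q = 20 directions on a deterministic quadratic (so the mini-batch part plays no role, i.e., b = 1); function and variable names are illustrative:

```python
import numpy as np

def zo_grad_avg(f, x, q=1, mu=1e-3, rng=None):
    """Average of q forward-difference estimates over independent uniform
    unit-sphere directions, in the spirit of estimator (14) with b = 1."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (d / mu) * (f(x + mu * u) - f(x)) * u
    return g / q

f = lambda x: 0.5 * np.sum(x ** 2)   # grad f(x) = x
x0 = np.ones(10)
rng = np.random.default_rng(0)
# Mean squared error to the true gradient, q = 1 vs q = 20 directions.
mse = {q: np.mean([np.sum((zo_grad_avg(f, x0, q=q, rng=rng) - x0) ** 2)
                   for _ in range(300)]) for q in (1, 20)}
```

The direction-averaging shrinks the variance term of the error roughly by a factor of q, matching the d²/q behavior stated in Lemma 1 below.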
We present the variance of (14) in Lemma 1, whose proof follows from [14, Proposition 2] by using ‖∇f_t‖₂² ≤ d‖∇f_t‖_∞² = dη² in A2.\n\nLemma 1 Suppose that A1-A2 hold; then for µ ≤ 1/√d, the variance of (14) obeys\n\nE[‖ĝ_t − ∇f_µ(x_t)‖₂²] = O(d/b + d²/q).  (15)\n\nBased on Lemma 1, the rate of ZO-AdaMM in Theorem 2 becomes E[‖G(x_R)‖²] = O(d/T + d/b + d²/q). Note that if A3 holds, then the dimension dependency can be improved by an O(d) factor based on Lemma 1. To the best of our knowledge, even in the FO case we are not aware of an existing convergence rate analysis of adaptive learning rate methods for nonconvex constrained optimization.\n\n5 Extended Analysis of ZO-AdaMM\n\nZO-AdaMM for constrained convex optimization Different from the nonconvex case, the convergence of ZO-AdaMM for convex optimization is commonly measured by the average regret R_T = E[(1/T) Σ_{t=1}^T f_t(x_t) − (1/T) Σ_{t=1}^T f_t(x*)] [18, 19], where recall that f_t(x_t) = f(x_t; ξ_t), and x* is the optimal solution. 
We provide the average regret with ZO gradient estimates by leveraging its connection to the smoothing function of f_t in Proposition 4; see Appendix 3.1 for the proof.\n\nProposition 4 Suppose that α_t = α/√t, β_{1,t} = β₁/t with β_{1,1} = β₁, β₁, β₂ ∈ [0, 1), γ := β₁/√β₂ < 1, and X has bounded diameter D_∞; then ZO-AdaMM for convex optimization yields\n\nR_{T,µ} := E[(1/T) Σ_{t=1}^T f_{t,µ}(x_t) − (1/T) Σ_{t=1}^T f_{t,µ}(x*)]\n ≤ (D_∞² Σ_{i=1}^d E[v̂_{T,i}^{1/2}]) / (α(1 − β₁)√T) + (D_∞² / (2(1 − β₁)T)) Σ_{t=1}^T Σ_{i=1}^d (β₁ E[v̂_{t,i}^{1/2}]) / (α√t) + (α√(1 + log T) Σ_{i=1}^d E‖ĝ_{1:T,i}‖) / ((1 − β₁)²(1 − γ)√(1 − β₂) T),  (16)\n\nwhere f_{t,µ} denotes the smoothing function of f_t defined by (2), v̂_{t,i} denotes the ith element of the vector v̂_t defined in Algorithm 1, and ĝ_{1:T,i} := [ĝ_{1,i}, . . . , ĝ_{T,i}]^⊤.\n\nWe remark that Proposition 4 would reduce to [18, Theorem 4] by replacing the ZO gradient estimates ĝ_{1:T,i} and v̂_{t,i} with the FO gradients g_{1:T} and v_t. However, it was recently shown by [39] that the proof of [18, Theorem 4] is problematic. To address this issue, in Proposition 4 we present a simpler fix than [39, Theorem 4.1] and show that the conclusion of [18, Theorem 4] remains correct. In the FO setting, the rate of AdaMM under A2 for constrained convex optimization is given by O(d/√T) [19, Corollary 4.4]. Here A2 provides a direct η-upper bound on |g_{t,i}| and v̂_{t,i}^{1/2}, and we consider a worst-case rate analysis without imposing extra assumptions such as sparse gradients³. 
In the ZO setting, we further need to bound |ĝ_{t,i}| and v̂_{t,i} and to link R_{T,µ} to R_T; the former is achieved by Proposition 3 and the latter by the relationship between f_t and its smoothing function f_{t,µ} shown in Lemma A1-(a), yielding f_t(x_t) − f_t(x*) ≤ f_{t,µ}(x_t) − f_{t,µ}(x*) + 2µL_c. Thus, given µ ≤ d/√T and assuming the conditions in Proposition 3 hold, the rate of ZO-AdaMM becomes R_T ≤ 2µL_c + R_{T,µ} = O(d^{1.5}/√T), which is O(√d) worse than that of AdaMM.\n\nComparison with other ZO methods Since the existing convergence analyses of different ZO methods are built on different problem settings and assumptions, a direct comparison of convergence rates might not be entirely fair. Thus, in Table 1 we compare ZO-AdaMM with other ZO methods from 4 perspectives: a) the type of gradient estimator, b) the setting of the smoothing parameter µ, c) the convergence rate, and d) the function query complexity.\n\nTable 1 shows that for unconstrained nonconvex optimization, the convergence of ZO-AdaMM has a worse dependency on d than ZO-SGD [9], ZO-SCD [22] and ZO-signSGD [14]. However, it has a milder choice of µ than ZO-SGD, a lower query complexity than ZO-SCD, and no T-independent convergence bias compared to ZO-signSGD. Also, for constrained nonconvex optimization, ZO-AdaMM yields a rate similar to ZO-ProxSGD [27], which also covers ZO projected SGD (ZO-PSGD). For constrained convex optimization, the rate of ZO-AdaMM is O(d) worse than ZO-SMD [23], but ours has a significantly improved dimension dependency in µ. 
We also highlight that, at first glance, ZO-AdaMM has a worse d-dependency (regardless of the choice of µ) than ZO-SGD. However, even in the FO setting, AdaMM has an extra O(√d) dependency in the worst case due to the effect of (coordinate-wise) gradient normalization when bounding the distance between two consecutive updates. Thus, in addition to comparing with different ZO methods, Table 1 also summarizes the convergence performance of FO AdaMM. Note that our rate yields an O(√d) slowdown compared to FO AdaMM, though bounding the norm of ZO gradient estimates requires a stricter assumption.\n\n³The work [40] showed the lack of sparsity in gradients while generating adversarial examples.\n\nMethod | Assumptions | Gradient estimator | Smoothing parameter µ | Rate | Query\nZO-SGD [9] | NC¹, UCons¹, A1, A3² | GauGE¹ | O(1/(d√T)) | O(√d/√T + d/T) | O(T)\nZO-SCD [22] | NC, UCons, A1, A3² | CooGE¹ | O(1/√T + 1/√d) | O(√d/√T + d/T) | O(dT)\nZO-signSGD [14] | NC, UCons, A1, A3 | sign-UniGE¹ | O(1/√(dT)) | O(√d/√T + √d/√b + d/√(bq))³ | O(bqT)\nZO-ProxSGD / ZO-PSGD [27] | NC, Cons, A1, A3 | GauGE | O(1/√(dT)) | O(d²/(qT) + d/q) | O(qT)\nZO-SMD [23] | C, Cons, A3 | GauGE/UniGE | O(1/(dt)) | O(√d/√T) | O(T)\nAdaMM [20, 38] | NC, UCons, A1, A2 | SGE¹ | n/a | O(√d/√T + d/T) | n/a\nAdaMM [18, 19, 39] | C, Cons, A2 | SGE | n/a | O(d/√T) | n/a\nZO-AdaMM | NC, UCons, A1, A3 | UniGE | O(1/√(dT)) | O(d/√T + d²/T) | O(T)\nZO-AdaMM (β_{1,t} = 0) | NC, Cons, A1, A3 | UniGE | O(1/√(dT)) | O(d/T + d/b + d²/q) | O(bqT)\nZO-AdaMM | C, Cons, A3 | UniGE | O(d/√T) | O(d^{1.5}/√T) | O(T)\n\n¹ Abbreviations. NC: Nonconvex; C: Convex; UCons: Unconstrained; Cons: Constrained; GauGE: Gaussian random vector based gradient estimate; UniGE: Uniform random vector based gradient estimate; CooGE: Coordinate-wise gradient estimate; SGE: stochastic (first-order) gradient estimate\n² Assumption of bounded variance of stochastic gradients is implied by A3.\n³ Convergence of ZO-signSGD is measured by E[‖∇f(x_T)‖₂] rather than its square used in other algorithms for nonconvex optimization.\nTable 1: Summary of convergence rate and query complexity of various ZO algorithms given T iterations.\n\n6 Applications to Black-Box Adversarial Attacks\n\nIn this section, we demonstrate the effectiveness of ZO-AdaMM through experiments on generating black-box adversarial examples. Our experiments are performed on Inception V3 [45] using ImageNet [46]. Here we focus on two types of black-box adversarial attacks: per-image adversarial perturbation [47] and universal adversarial perturbation against multiple images [5, 6, 48, 49]. For each type of attack, we allow both constrained and unconstrained optimization problem settings. We compare our proposed ZO-AdaMM method with 6 existing ZO algorithms: ZO-SGD, ZO-SCD and ZO-signSGD for unconstrained optimization, and ZO-PSGD, ZO-SMD and ZO-NES for constrained optimization. The first 5 methods have been summarized in Table 1, and ZO-NES refers to the black-box attack generation method in [6], which applies a projected version of ZO-signSGD using a natural evolution strategy (NES) based random gradient estimator. In our experiments, every method takes the same number of queries per iteration. Accordingly, the total query complexity is consistent with the number of iterations. We refer to Appendix 4 for details on the experiment setups.\n\nPer-image adversarial perturbation In Fig. 
1, we present the attack loss and the resulting ℓ2-distortion versus the number of iterations for solving both the unconstrained and constrained adversarial attack problems, namely, (94) and (93) in Appendix 4, over 100 randomly selected images. Here every algorithm is initialized with zero perturbation. Thus, as the iteration count increases, the attack loss decreases until it converges to 0 (indicating a successful attack), while the distortion could increase. In this sense, the best attack performance corresponds to the best tradeoff between fast convergence to 0 attack loss and low distortion power (evaluated by the ℓ2 norm). As we can see, ZO-AdaMM consistently outperforms the other ZO methods in terms of fast convergence of the attack loss and relatively small perturbation. We also note that ZO-signSGD and ZO-NES have poor convergence accuracy, exhibiting either a large attack loss or a large distortion at the final iterations. This is not surprising, since it has been shown in [14] that ZO-signSGD only converges to a neighborhood of a solution, and ZO-NES can be regarded as a Euclidean projection based ZO-signSGD, which could induce the convergence issues shown by Prop. 1. We refer readers to Table A3 for detailed experiment results.

Universal adversarial perturbation We now focus on designing a universal adversarial perturbation using the constrained attack problem formulation. Here we attack M = 100 randomly selected images from ImageNet. In Fig. 2, we present the attack loss as well as the ℓ2 norm of the universal perturbation at different iteration numbers. As we can see, compared with the other ZO algorithms, ZO-AdaMM has the fastest convergence to the smallest adversarial perturbation (namely, the strongest universal attack). Moreover, in Table 2 we present the detailed attack success rate and ℓ2 distortion over T = 40000 iterations. Consistent with Fig.
2, ZO-AdaMM achieves the highest success rate with the lowest distortion. In Fig. A2 of Appendix A2, we visualize patterns of the generated universal adversarial perturbations, which further confirm the advantage of ZO-AdaMM.

Figure 1: The attack loss and adversarial distortion vs. iterations under (a) the unconstrained setting and (b) the constrained setting. Each box represents results from 100 images.

Figure 2: Attack loss and distortion of the universal attack.

| Methods | Attack success rate | Final ‖δT‖2² |
|---|---|---|
| ZO-NES | 74% | 67.74 |
| ZO-PSGD | 78% | 49.92 |
| ZO-SMD | 79% | 47.36 |
| ZO-AdaMM | 84% | 38.40 |

Table 2: Summary of attack success rate and eventual ℓ2 distortion for the universal attack against 100 images under T = 40000 iterations.

7 Conclusion

In this paper, we propose ZO-AdaMM, the first effort to integrate adaptive momentum methods with ZO optimization. In theory, we show that ZO-AdaMM has convergence guarantees for both convex and nonconvex constrained optimization. Compared with (first-order) AdaMM, it suffers a slowdown factor of O(√d). In particular, we establish a new Mahalanobis distance based convergence measure, whose necessity and importance are demonstrated in characterizing the convergence behavior of ZO-AdaMM on nonconvex constrained problems. To demonstrate the utility of the algorithm, we show the superior performance of ZO-AdaMM in designing adversarial examples from black-box neural networks. Compared with 6 state-of-the-art ZO methods, ZO-AdaMM has the fastest empirical convergence to strong black-box adversarial attacks that require the minimum distortion strength.

Acknowledgement

This work is partly supported by National Science Foundation CNS-1932351. M. Hong is supported in part by NSF under Grant CMMI-172775, CIF-1910385 and by ARO under grant 73202-CS.

References

[1] S. Liu, J. Chen, P.-Y. Chen, and A. O.
Hero, “Zeroth-order online ADMM: Convergence analysis and applications,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, April 2018, vol. 84, pp. 288–297.

[2] A. K. Sahu, M. Zaheer, and S. Kar, “Towards gradient free and projection free stochastic optimization,” arXiv preprint arXiv:1810.03233, 2018.

[3] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in Neural Information Processing Systems, 2015, pp. 2962–2970.

[4] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown, “Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 826–830, Jan. 2017.

[5] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 15–26.

[6] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” in Proceedings of the 35th International Conference on Machine Learning, July 2018.

[7] C.-C. Tu, P. Ting, P.-Y. Chen, S. Liu, H. Zhang, J. Yi, C.-J. Hsieh, and S.-M.
Cheng, “AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks,” arXiv preprint arXiv:1805.11770, 2018.

[8] Y. Nesterov and V. Spokoiny, “Random gradient-free minimization of convex functions,” Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017.

[9] S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.

[10] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.

[11] A. R. Conn, K. Scheinberg, and L. Vicente, “Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points,” SIAM Journal on Optimization, vol. 20, no. 1, pp. 387–415, 2009.

[12] D. Whitley, “A genetic algorithm tutorial,” Statistics and Computing, vol. 4, no. 2, pp. 65–85, 1994.

[13] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, vol. 8, SIAM, 2009.

[14] S. Liu, P.-Y. Chen, X. Chen, and M. Hong, “signSGD via zeroth-order oracle,” in International Conference on Learning Representations, 2019.

[15] J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” arXiv preprint arXiv:1802.04434, 2018.

[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. 3rd Int. Conf. Learn. Representations, 2014.

[17] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.

[18] S.
J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” in International Conference on Learning Representations, 2018.

[19] J. Chen and Q. Gu, “Closing the generalization gap of adaptive gradient methods in training deep neural networks,” arXiv preprint arXiv:1806.06763, 2018.

[20] X. Chen, S. Liu, R. Sun, and M. Hong, “On the convergence of a class of Adam-type algorithms for non-convex optimization,” arXiv preprint arXiv:1808.02941, 2018.

[21] L. Balles and P. Hennig, “Dissecting Adam: The sign, magnitude and variance of stochastic gradients,” in Proceedings of the 35th International Conference on Machine Learning, 2018, vol. 80 of Proceedings of Machine Learning Research, pp. 404–413.

[22] X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu, “A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order,” in Advances in Neural Information Processing Systems, 2016, pp. 3054–3062.

[23] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, “Optimal rates for zero-order convex optimization: The power of two function evaluations,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2788–2806, 2015.

[24] S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2018.

[25] B. Gu, Z. Huo, and H. Huang, “Zeroth-order asynchronous doubly stochastic algorithm with variance reduction,” arXiv preprint arXiv:1612.01425, 2016.

[26] L. Liu, M. Cheng, C.-J. Hsieh, and D. Tao, “Stochastic zeroth-order optimization via variance reduction method,” arXiv preprint arXiv:1805.11811, 2018.

[27] S.
Ghadimi, G. Lan, and H. Zhang, “Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization,” Mathematical Programming, vol. 155, no. 1-2, pp. 267–305, 2016.

[28] K. Balasubramanian and S. Ghadimi, “Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates,” in Advances in Neural Information Processing Systems, 2018, pp. 3455–3464.

[29] J. Chen, J. Yi, and Q. Gu, “A Frank-Wolfe framework for efficient and effective adversarial attacks,” arXiv preprint arXiv:1811.10828, 2018.

[30] X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of the ADMM: An iteration complexity perspective,” Optimization Online, vol. 12, 2014.

[31] S. Le Digabel, “Algorithm 909: NOMAD: Nonlinear optimization with the MADS algorithm,” ACM Transactions on Mathematical Software (TOMS), vol. 37, no. 4, pp. 44, 2011.

[32] C. Audet and J. E. Dennis Jr, “Mesh adaptive direct search algorithms for constrained optimization,” SIAM Journal on Optimization, vol. 17, no. 1, pp. 188–217, 2006.

[33] A. I. F. Vaz and L. N. Vicente, “PSwarm: A hybrid solver for linearly constrained global derivative-free optimization,” Optimization Methods & Software, vol. 24, no. 4-5, pp. 669–685, 2009.

[34] M. J. D. Powell, “A direct search optimization method that models the objective and constraint functions by linear interpolation,” in Advances in Optimization and Numerical Analysis, pp. 51–67. Springer, 1994.

[35] M. J. D. Powell, “The BOBYQA algorithm for bound constrained optimization without derivatives,” Cambridge NA Report NA2009/06, University of Cambridge, Cambridge, pp. 26–46, 2009.

[36] L. M. Rios and N. V.
Sahinidis, “Derivative-free optimization: A review of algorithms and comparison of software implementations,” Journal of Global Optimization, vol. 56, no. 3, pp. 1247–1293, 2013.

[37] C. Audet and W. Hare, Derivative-Free and Blackbox Optimization, Springer, 2017.

[38] D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu, “On the convergence of adaptive gradient methods for nonconvex optimization,” arXiv preprint arXiv:1808.05671, 2018.

[39] T. T. Phuong and L. T. Phong, “On the convergence proof of AMSGrad and a new version,” arXiv preprint arXiv:1904.03590, 2019.

[40] A. Ilyas, L. Engstrom, and A. Madry, “Prior convictions: Black-box adversarial attacks with bandits and priors,” arXiv preprint arXiv:1807.07978, 2018.

[41] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” arXiv preprint arXiv:1607.02533, 2016.

[42] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” arXiv preprint arXiv:1804.08598, 2018.

[43] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.

[44] S. J. Reddi, S. Sra, B. Poczos, and A. J. Smola, “Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization,” in Advances in Neural Information Processing Systems, 2016, pp. 1145–1153.

[45] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.

[46] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.

[47] K. Xu, S. Liu, P. Zhao, P.-Y. Chen, H. Zhang, Q. Fan, D. Erdogmus, Y. Wang, and X. Lin, “Structured adversarial attack: Towards general implementation and better interpretability,” in International Conference on Learning Representations, 2019.

[48] F. Suya, Y. Tian, D. Evans, and P. Papotti, “Query-limited black-box attacks to classifiers,” arXiv preprint arXiv:1712.08713, 2017.

[49] M. Cheng, T. Le, P.-Y. Chen, J. Yi, H. Zhang, and C.-J. Hsieh, “Query-efficient hard-label black-box attack: An optimization-based approach,” arXiv preprint arXiv:1807.04457, 2018.

[50] S. Dasgupta and A. Gupta, “An elementary proof of a theorem of Johnson and Lindenstrauss,” Random Struct. Algorithms, vol. 22, no. 1, pp. 60–65, Jan. 2003.

[51] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy, 2017, pp. 39–57.