{"title": "Zeroth-Order Stochastic Variance Reduction for Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3727, "page_last": 3737, "abstract": "As application demands for zeroth-order (gradient-free) optimization accelerate, the need for variance reduced and faster converging approaches is also intensifying. This paper addresses these challenges by presenting: a) a comprehensive theoretical analysis of variance reduced zeroth-order (ZO) optimization, b) a novel variance reduced ZO algorithm, called ZO-SVRG, and c) an experimental evaluation of our approach in the context of two compelling applications, black-box chemical material classification and generation of adversarial examples from black-box deep neural network models. Our theoretical analysis uncovers an essential difficulty in the analysis of ZO-SVRG: the unbiased assumption on gradient estimates no longer holds. We prove that compared to its first-order counterpart, ZO-SVRG with a two-point random gradient estimator could suffer an additional error of order $O(1/b)$, where $b$ is the mini-batch size. To mitigate this error, we propose two accelerated versions of ZO-SVRG utilizing \n variance reduced gradient estimators, which achieve the best rate known for ZO stochastic optimization (in terms of iterations). 
Our extensive experimental results show that our approaches outperform other state-of-the-art ZO algorithms, and strike a balance between the convergence rate and the function query complexity.", "full_text": "Zeroth-Order Stochastic Variance Reduction for\n\nNonconvex Optimization\n\nSijia Liu1 Bhavya Kailkhura2 Pin-Yu Chen1 Paishun Ting3 Shiyu Chang1 Lisa Amini1\n\n1MIT-IBM Watson AI Lab, IBM Research\n2Lawrence Livermore National Laboratory\n\n3University of Michigan, Ann Arbor\n\nAbstract\n\nAs application demands for zeroth-order (gradient-free) optimization accelerate,\nthe need for variance reduced and faster converging approaches is also intensifying.\nThis paper addresses these challenges by presenting: a) a comprehensive theoretical\nanalysis of variance reduced zeroth-order (ZO) optimization, b) a novel variance\nreduced ZO algorithm, called ZO-SVRG, and c) an experimental evaluation of\nour approach in the context of two compelling applications, black-box chemical\nmaterial classi\ufb01cation and generation of adversarial examples from black-box deep\nneural network models. Our theoretical analysis uncovers an essential dif\ufb01culty\nin the analysis of ZO-SVRG: the unbiased assumption on gradient estimates no\nlonger holds. We prove that compared to its \ufb01rst-order counterpart, ZO-SVRG with\na two-point random gradient estimator could suffer an additional error of order\nO(1/b), where b is the mini-batch size. To mitigate this error, we propose two\naccelerated versions of ZO-SVRG utilizing variance reduced gradient estimators,\nwhich achieve the best rate known for ZO stochastic optimization (in terms of\niterations). 
Our extensive experimental results show that our approaches outperform\nother state-of-the-art ZO algorithms, and strike a balance between the convergence\nrate and the function query complexity.\n\n1\n\nIntroduction\n\nZeroth-order (gradient-free) optimization is increasingly embraced for solving machine learning\nproblems where explicit expressions of the gradients are dif\ufb01cult or infeasible to obtain. Recent\nexamples have shown zeroth-order (ZO) based generation of prediction-evasive, black-box adversarial\nattacks on deep neural networks (DNNs) as effective as state-of-the-art white-box attacks, despite\nleveraging only the inputs and outputs of the targeted DNN [1\u20133]. Additional classes of applications\ninclude network control and management with time-varying constraints and limited computation\ncapacity [4, 5], and parameter inference of black-box systems [6, 7]. ZO algorithms achieve gradient-\nfree optimization by approximating the full gradient via gradient estimators based on only the function\nvalues [8, 9].\nAlthough many ZO algorithms have recently been developed and analyzed [5, 10\u201318], they often\nsuffer from the high variances of ZO gradient estimates, and in turn, hampered convergence rates. In\naddition, these algorithms are mainly designed for convex settings, which limits their applicability in\na wide range of (non-convex) machine learning problems.\nIn this paper, we study the problem of design and analysis of variance reduced and faster converging\nnonconvex ZO optimization methods. To reduce the variance of ZO gradient estimates, one can draw\nmotivations from similar ideas in the \ufb01rst-order regime. The stochastic variance reduced gradient\n(SVRG) is a commonly-used, effective \ufb01rst-order approach to reduce the variance [19\u201323]. 
Due to the variance reduction, it improves the convergence rate of stochastic gradient descent (SGD) from $O(1/\sqrt{T})$1 to $O(1/T)$, where $T$ is the total number of iterations.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Although SVRG has shown great promise, applying similar ideas to ZO optimization is not a trivial task. The main challenge arises from the fact that SVRG relies on the assumption that a stochastic gradient is an unbiased estimate of the true batch/full gradient, which unfortunately does not hold in the ZO case. Therefore, it is an open question whether a ZO stochastic variance reduced gradient could enable faster convergence of ZO algorithms. In this paper, we attempt to fill the gap between ZO optimization and SVRG.

Contributions We propose and evaluate a novel ZO algorithm for nonconvex stochastic optimization, ZO-SVRG, which integrates SVRG with ZO gradient estimators. We show that compared to SVRG, ZO-SVRG achieves a similar convergence rate that decays linearly with $O(1/T)$, but up to an additional error correction term of order $1/b$, where $b$ is the mini-batch size. We show that this correction term is eliminated when the full batch of data is used, corresponding to $b = n$, where $n$ is the number of data samples. In this scenario, ZO-SVRG reduces to ZO gradient descent (ZO-GD) [13]. However, without a careful treatment, this correction term (e.g., when $b$ is small) could be a critical factor affecting the optimization performance. To mitigate this error term, we propose two accelerated ZO-SVRG variants utilizing reduced variance gradient estimators. These yield a faster convergence rate towards $O(d/T)$, the best known iteration complexity bound for ZO stochastic optimization.

Our work offers a comprehensive study of how ZO gradient estimators affect SVRG in terms of both iteration complexity (i.e., convergence rate) and function query complexity.
Compared to the existing ZO algorithms, our methods can strike a balance between iteration complexity and function query complexity. To demonstrate the flexibility of our approach in managing this trade-off, we conduct an empirical evaluation of our proposed algorithms and other state-of-the-art algorithms on two diverse applications: black-box chemical material classification and generation of universal adversarial perturbations from black-box deep neural network models. Extensive experimental results and theoretical analysis validate the effectiveness of our approaches.

2 Related work

In ZO algorithms, a full gradient is typically approximated using either a one-point or a two-point gradient estimator, where the former acquires a gradient estimate $\hat{\nabla} f(x)$ by querying $f(\cdot)$ at a single random location close to $x$ [10, 11], and the latter computes a finite difference using two random function queries [12, 13]. In this paper, we focus on the two-point gradient estimator since it has a lower variance and thus improves the complexity bounds of ZO algorithms.

Despite the meteoric rise of two-point based ZO algorithms, most of the work is restricted to convex problems [5, 14-18]. For example, a ZO mirror descent algorithm proposed by [14] has an exact rate $O(\sqrt{d}/\sqrt{T})$, where $d$ is the number of optimization variables. The same rate is obtained by bandit convex optimization [15] and the ZO online alternating direction method of multipliers [5]. Current studies suggest that ZO algorithms typically agree with the iteration complexity of first-order algorithms up to a small-degree polynomial of the problem size $d$.

In contrast to the convex setting, nonconvex ZO algorithms are comparatively under-studied, except for a few recent attempts [7, 13, 24-26]. Different from convex optimization, the stationary condition is used to measure the convergence of nonconvex methods.
In [13], the ZO gradient descent (ZO-GD) algorithm was proposed for deterministic nonconvex programming, which yields an $O(d/T)$ convergence rate. A stochastic version of ZO-GD (namely, ZO-SGD) studied in [24] achieves the rate $O(\sqrt{d}/\sqrt{T})$. In [25], a ZO distributed algorithm was developed for multi-agent optimization, leading to an $O(1/T + d/q)$ convergence rate. Here $q$ is the number of random directions used to construct a gradient estimate. In [7], an asynchronous ZO stochastic coordinate descent (ZO-SCD) method was derived for parallel optimization and achieved the rate $O(\sqrt{d}/\sqrt{T})$. In [26], a variant of ZO-SCD, known as ZO stochastic variance reduced coordinate (ZO-SVRC) descent, improved the convergence rate from $O(\sqrt{d}/\sqrt{T})$ to $O(d/T)$ under the same parameter setting for the gradient estimation. Although the authors in [26] considered the stochastic variance reduction technique, only a coordinate descent algorithm using a coordinate-wise (deterministic) gradient estimator was studied. This motivates our study of the more general framework ZO-SVRG under different gradient estimators.

1In the big O notation, the constant numbers are ignored, and the dominant factors are kept.

3 Preliminaries

Consider a nonconvex finite-sum problem of the form

$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where $\{f_i(x)\}_{i=1}^{n}$ are $n$ individual nonconvex cost functions. The generic form (1) encompasses many machine learning problems, ranging from generalized linear models to neural networks. We next elaborate on the assumptions for problem (1), and provide a background on ZO gradient estimators.

3.1 Assumptions

A1: Functions $\{f_i\}$ have $L$-Lipschitz continuous gradients ($L$-smooth), i.e., $\|\nabla f_i(x) - \nabla f_i(y)\|_2 \le L \|x - y\|_2$ for any $x$ and $y$, $i \in [n]$, and some $L < \infty$.
Here $\|\cdot\|_2$ denotes the Euclidean norm, and for ease of notation $[n]$ represents the integer set $\{1, 2, \ldots, n\}$.

A2: The variance of stochastic gradients is bounded as $\frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|_2^2 \le \sigma^2$. Here $\nabla f_i(x)$ can be viewed as a stochastic gradient of $\nabla f(x)$ obtained by randomly picking an index $i \in [n]$.

Both A1 and A2 are standard assumptions in the nonconvex optimization literature [7, 13, 23-26]. Note that A2 is milder than the assumption of bounded gradients [5, 25]. For example, if $\|\nabla f_i(x)\|_2 \le \tilde{\sigma}$, then A2 is satisfied with $\sigma = 2\tilde{\sigma}$.

3.2 ZO gradient estimation

Given an individual cost function $f_i$ (or an arbitrary function under A1 and A2), a two-point random gradient estimator $\hat{\nabla} f_i(x)$ is defined by [13, 16]

$$\hat{\nabla} f_i(x) = (d/\mu) \left[ f_i(x + \mu u_i) - f_i(x) \right] u_i, \quad \text{for } i \in [n], \qquad \text{(RandGradEst)}$$

where recall that $d$ is the number of optimization variables, $\mu > 0$ is a smoothing parameter2, and $\{u_i\}$ are i.i.d. random directions drawn from a uniform distribution over a unit sphere [10, 15, 16]. In general, RandGradEst is a biased approximation to the true gradient $\nabla f_i(x)$, and its bias reduces as $\mu$ approaches zero. However, in a practical system, if $\mu$ is too small, then the function difference could be dominated by system noise and fail to represent the function differential [7]. For $\mu > 0$, although the ZO gradient estimate is biased with respect to the true gradient, it remains unbiased with respect to the gradient of a so-called randomized smoothing function with parameter $\mu$; see Lemma 1 of Appendix A.1.

Remark 1 Instead of using a single sample $u_i$ in RandGradEst, the average of $q$ i.i.d.
samples $\{u_{i,j}\}_{j=1}^{q}$ can also be used for gradient estimation [5, 14, 25],

$$\hat{\nabla} f_i(x) = (d/(\mu q)) \sum_{j=1}^{q} \left[ f_i(x + \mu u_{i,j}) - f_i(x) \right] u_{i,j}, \quad \text{for } i \in [n], \qquad \text{(Avg-RandGradEst)}$$

which we call an average random gradient estimator.

In addition to RandGradEst and Avg-RandGradEst, the work in [7, 26, 27] considered a coordinate-wise gradient estimator, in which every partial derivative is estimated via the two-point querying scheme under fixed direction vectors,

$$\hat{\nabla} f_i(x) = \sum_{\ell=1}^{d} (1/(2\mu_\ell)) \left[ f_i(x + \mu_\ell e_\ell) - f_i(x - \mu_\ell e_\ell) \right] e_\ell, \quad \text{for } i \in [n], \qquad \text{(CoordGradEst)}$$

where $\mu_\ell > 0$ is a coordinate-wise smoothing parameter, and $e_\ell \in \mathbb{R}^d$ is a standard basis vector with 1 at its $\ell$th coordinate and 0s elsewhere. Compared to RandGradEst, CoordGradEst is deterministic and requires $d$ times more function queries. However, as will be evident later, it yields an improved iteration complexity (i.e., convergence rate). More details on ZO gradient estimation can be found in Appendix A.1.

2The parameter $\mu$ can be generalized to $\mu_i$ for $i \in [n]$. Here we assume $\mu_i = \mu$ for ease of representation.

Algorithm 1: SVRG$(T, m, \{\eta_k\}, b, \tilde{x}_0)$
1: Input: total number of iterations $T$, epoch length $m$, number of epochs $S = \lceil T/m \rceil$, step sizes $\{\eta_k\}_{k=0}^{m-1}$, mini-batch size $b$, and initial point $\tilde{x}_0$.
2: for $s = 1, 2, \ldots, S$ do
3:   set $g^s = \nabla f(\tilde{x}_{s-1})$, $x_0^s = \tilde{x}_{s-1}$,
4:   for $k = 0, 1, \ldots, m - 1$ do
5:     choose mini-batch $\mathcal{I}_k$ of size $b$,
6:     compute gradient blending via (2): $v_k^s = \nabla f_{\mathcal{I}_k}(x_k^s) - \nabla f_{\mathcal{I}_k}(x_0^s) + g^s$,
7:     update $x_{k+1}^s = x_k^s - \eta_k v_k^s$,
8:   end for
9:   set $\tilde{x}_s = x_m^s$,
10: end for
11: return $\bar{x}$ chosen uniformly at random from $\{\{x_k^s\}_{k=0}^{m-1}\}_{s=1}^{S}$.

Algorithm 2: ZO-SVRG$(T, m, \{\eta_k\}, b, \tilde{x}_0, \mu)$
1: Input: in addition to the parameters in SVRG, smoothing parameter $\mu > 0$.
2: for $s = 1, 2, \ldots, S$ do
3:   compute the ZO estimate $\hat{g}^s = \hat{\nabla} f(\tilde{x}_{s-1})$,
4:   set $x_0^s = \tilde{x}_{s-1}$,
5:   for $k = 0, 1, \ldots, m - 1$ do
6:     choose mini-batch $\mathcal{I}_k$ of size $b$,
7:     compute the ZO gradient blending via (3): $\hat{v}_k^s = \hat{\nabla} f_{\mathcal{I}_k}(x_k^s) - \hat{\nabla} f_{\mathcal{I}_k}(x_0^s) + \hat{g}^s$,
8:     update $x_{k+1}^s = x_k^s - \eta_k \hat{v}_k^s$,
9:   end for
10:  set $\tilde{x}_s = x_m^s$,
11: end for
12: return $\bar{x}$ chosen uniformly at random from $\{\{x_k^s\}_{k=0}^{m-1}\}_{s=1}^{S}$.

4 ZO stochastic variance reduced gradient (ZO-SVRG)

4.1 SVRG: from first-order to zeroth-order

It has been shown in [19, 20] that first-order SVRG achieves the convergence rate $O(1/T)$, yielding $O(\sqrt{T})$ fewer iterations than ordinary SGD for solving finite-sum problems. The key step of SVRG3 (Algorithm 1) is to generate an auxiliary sequence $\hat{x}$ at which the full gradient is used as a reference in building a modified stochastic gradient estimate

$$\hat{g} = \nabla f_{\mathcal{I}}(x) - \left( \nabla f_{\mathcal{I}}(\hat{x}) - \nabla f(\hat{x}) \right), \quad \nabla f_{\mathcal{I}}(x) = (1/b) \sum_{i \in \mathcal{I}} \nabla f_i(x), \qquad (2)$$

where $\hat{g}$ denotes the gradient estimate at $x$, $\mathcal{I} \subseteq [n]$ is a mini-batch of size $b$ (chosen uniformly at random4), and $\nabla f(x) = \nabla f_{[n]}(x)$. The key property of (2) is that $\hat{g}$ is an unbiased gradient estimate of $\nabla f(x)$.
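In code, the blending (2) is a one-liner once per-component gradients are available. The following is a minimal first-order sketch (the helper names `svrg_blend` and `grad_fi` are ours, not from the paper); averaging it over mini-batches recovers the unbiasedness property just stated.

```python
import numpy as np

def svrg_blend(grad_fi, x, x_snap, g_snap, batch):
    """SVRG gradient blending (2), a sketch with hypothetical helper names:
    v = grad_fI(x) - (grad_fI(x_snap) - g_snap), where g_snap = full gradient
    at the snapshot x_snap. grad_fi(i, z) returns the gradient of f_i at z."""
    g_I      = np.mean([grad_fi(i, x) for i in batch], axis=0)       # mini-batch grad at x
    g_I_snap = np.mean([grad_fi(i, x_snap) for i in batch], axis=0)  # mini-batch grad at snapshot
    return g_I - (g_I_snap - g_snap)  # in expectation over the batch, equals the full gradient
```

On a least-squares finite sum, averaging `svrg_blend` over all singleton mini-batches returns exactly the full gradient, which is the unbiasedness property the ZO analysis later loses.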
The gradient blending (2) is also motivated by a variance reduction technique known as the control variate [28-30]. The link between SVRG and control variates is discussed in Appendix A.2.

In the ZO setting, the gradient blending (2) is approximated using only function values,

$$\hat{g} = \hat{\nabla} f_{\mathcal{I}}(x) - \left( \hat{\nabla} f_{\mathcal{I}}(\hat{x}) - \hat{\nabla} f(\hat{x}) \right), \quad \hat{\nabla} f_{\mathcal{I}}(x) = (1/b) \sum_{i \in \mathcal{I}} \hat{\nabla} f_i(x), \qquad (3)$$

where $\hat{\nabla} f(x) = \hat{\nabla} f_{[n]}(x)$, and $\hat{\nabla} f_i$ is a ZO gradient estimate specified by RandGradEst, Avg-RandGradEst or CoordGradEst. Replacing (2) with (3) in SVRG (Algorithm 1) leads to a new ZO algorithm, which we call ZO-SVRG (Algorithm 2). We highlight that although ZO-SVRG differs from SVRG only in the use of ZO gradient estimators for the batch, mini-batch, and blended gradients, this seemingly minor difference yields an essential difficulty in the analysis of ZO-SVRG: the unbiasedness assumption on gradient estimates used in SVRG no longer holds. Thus, a careful analysis of ZO-SVRG is needed to ensure its optimization performance.

4.2 Convergence analysis

In what follows, we focus on the analysis of ZO-SVRG using RandGradEst. Later, we will study ZO-SVRG with Avg-RandGradEst and CoordGradEst. We start by investigating the second-order moment of the blended ZO gradient estimate $\hat{v}_k^s$ in the form of (3); see Proposition 1.

Proposition 1 Suppose A2 holds and RandGradEst is used in Algorithm 2. The blended ZO gradient estimate $\hat{v}_k^s$ in Step 7 of Algorithm 2 satisfies

$$\mathbb{E}[\|\hat{v}_k^s\|_2^2] \le \frac{4(b + 18\delta_n) d}{b} \, \mathbb{E}\!\left[\|\nabla f(x_k^s)\|_2^2\right] + \frac{6(4d+1) L^2 \delta_n}{b} \, \mathbb{E}\!\left[\|x_k^s - x_0^s\|_2^2\right] + \frac{(6\delta_n + b) L^2 d^2 \mu^2}{b} + \frac{72 d \sigma^2 \delta_n}{b}, \qquad (4)$$

where $\delta_n = 1$ if the mini-batch contains i.i.d. samples from $[n]$ with replacement, and $\delta_n = I(b < n)$ if samples are randomly selected without replacement. Here $I(b < n)$ is 1 if $b < n$, and 0 if $b = n$.

Proof: See Appendix A.3.

3Different from the standard SVRG [19], we consider its mini-batch variant in [20].
4For mini-batch $\mathcal{I}$, SVRG [20] assumes i.i.d. samples with replacement, while a variant of SVRG (called SCSG) assumes samples without replacement [23]. This paper considers both sampling strategies.

Compared to SVRG and its variants [20, 23], the error bound (4) involves a new error term $O(d\sigma^2/b)$ for $b < n$, which is induced by the second-order moment of RandGradEst (Appendix A.1).
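For concreteness, here is a compact sketch of Algorithm 2 with RandGradEst (all function and parameter names are ours, not the paper's). As one natural reading of (3), the two mini-batch estimates share the same random direction in each iteration, so their difference acts as a control variate.

```python
import numpy as np

def zo_svrg(fs, x0, eta, mu, m, S, b, rng):
    """Sketch of ZO-SVRG (Algorithm 2) with RandGradEst; names are ours.
    fs: list of n component functions f_i, with f = (1/n) * sum_i f_i."""
    n, d = len(fs), x0.size
    f_full = lambda z: sum(f(z) for f in fs) / n

    def rand_est(f, z, u):
        # RandGradEst with a given unit direction u: (d/mu)[f(z + mu*u) - f(z)] u
        return (d / mu) * (f(z + mu * u) - f(z)) * u

    x_tilde = x0.copy()
    for s in range(S):
        u0 = rng.standard_normal(d)
        u0 /= np.linalg.norm(u0)
        g_hat = rand_est(f_full, x_tilde, u0)          # ZO full-gradient estimate (Step 3)
        x = x_tilde.copy()
        for k in range(m):
            I = rng.integers(0, n, size=b)             # mini-batch with replacement
            f_I = lambda z, I=I: sum(fs[i](z) for i in I) / b
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)
            # ZO gradient blending (3); both mini-batch estimates reuse the direction u
            v = rand_est(f_I, x, u) - rand_est(f_I, x_tilde, u) + g_hat
            x = x - eta * v
        x_tilde = x
    return x_tilde
```

On a toy quadratic finite sum with a step size of order $1/d$, a few hundred inner iterations reliably drive the objective well below its initial value, despite the bias of the ZO estimates.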
With the aid of Proposition 1, Theorem 1 provides the convergence rate of ZO-SVRG in terms of an upper bound on $\mathbb{E}[\|\nabla f(\bar{x})\|_2^2]$ at the solution $\bar{x}$.

Theorem 1 Suppose A1 and A2 hold, and the random gradient estimator (RandGradEst) is used. The output $\bar{x}$ of Algorithm 2 satisfies

$$\mathbb{E}[\|\nabla f(\bar{x})\|_2^2] \le \frac{f(\tilde{x}_0) - f^*}{T \bar{\gamma}} + \frac{L \mu^2}{T \bar{\gamma}} + \frac{S \chi_m}{T \bar{\gamma}}, \qquad (5)$$

where $T = Sm$, $f^* = \min_x f(x)$, $\bar{\gamma} = \min_{k \in [m]} \gamma_k$, $\chi_m = \sum_{k=0}^{m-1} \chi_k$, and

$$\gamma_k = \frac{1}{2} \left( 1 - \frac{c_{k+1}}{\beta_k} \right) \eta_k - \left( \frac{L}{2} + c_{k+1} \right) \frac{4db + 72d\delta_n}{b} \, \eta_k^2, \qquad (6)$$

$$\chi_k = \left( 1 - \frac{c_{k+1}}{\beta_k} \right) \frac{\mu^2 d^2 L^2}{4} \, \eta_k + \left( \frac{L}{2} + c_{k+1} \right) \frac{(6\delta_n + b) L^2 d^2 \mu^2 + 72 d \sigma^2 \delta_n}{b} \, \eta_k^2. \qquad (7)$$

In (6)-(7), $\beta_k$ is a positive parameter ensuring $\gamma_k > 0$, and the coefficients $\{c_k\}$ are given by

$$c_k = \left( 1 + \beta_k \eta_k + \frac{6(4d+1) L^2 \delta_n \eta_k^2}{b} \right) c_{k+1} + \frac{3(4d+1) L^3 \delta_n \eta_k^2}{b}, \qquad c_m = 0. \qquad (8)$$

Proof: See Appendix A.4.

Compared to the convergence rate of SVRG given in [20, Theorem 2], Theorem 1 exhibits two additional errors, $L\mu^2/(T\bar{\gamma})$ and $S\chi_m/(T\bar{\gamma})$, due to the use of ZO gradient estimates. Roughly speaking, if we choose the smoothing parameter $\mu$ reasonably small, then the error $L\mu^2/(T\bar{\gamma})$ shrinks and has a non-dominant effect on the convergence rate of ZO-SVRG. For the term $S\chi_m/(T\bar{\gamma})$, the quantity $\chi_m$ is more involved, relying on the epoch length $m$, the step size $\eta_k$, the smoothing parameter $\mu$, the mini-batch size $b$, and the number of optimization variables $d$.
In order to acquire explicit dependence on these parameters and to explore deeper insights into convergence, we simplify (5) for a specific parameter setting, as formalized below.

Corollary 1 Suppose we set

$$\eta_k = \eta = \frac{\rho}{L d}, \quad \mu = \frac{1}{\sqrt{dT}}, \quad \beta_k = \beta = L, \quad \text{and} \quad m = \left\lceil \frac{d}{31\rho} \right\rceil, \qquad (9)$$

where $0 < \rho \le 1$ is a universal constant that is independent of $b$, $d$, $L$, and $T$. Then Theorem 1 implies $\frac{f(\tilde{x}_0) - f^*}{T \bar{\gamma}} \le O\!\left( \frac{d}{T} \right)$, $\frac{L \mu^2}{T \bar{\gamma}} \le O\!\left( \frac{1}{T^2} \right)$, and $\frac{S \chi_m}{T \bar{\gamma}} \le O\!\left( \frac{d}{T} + \frac{\delta_n}{b} \right)$, which yields

$$\mathbb{E}[\|\nabla f(\bar{x})\|_2^2] \le O\!\left( \frac{d}{T} + \frac{\delta_n}{b} \right). \qquad (10)$$

Proof: See Appendix A.5.

It is worth mentioning that the condition on the smoothing parameter $\mu$ in Corollary 1 is less restrictive than that of several ZO algorithms5. For example, ZO-SGD in [24] required $\mu \le O(d^{-1} T^{-1/2})$, and ZO-ADMM [5] and ZO mirror descent [14] considered $\mu_t = O(d^{-1.5} t^{-1})$. Moreover, similar to [5], we set the step size $\eta$ linearly scaled with $1/d$. Compared to the aforementioned ZO algorithms [5, 14, 24], the convergence rate of ZO-SVRG in (10) has an improved (linear rather than sub-linear) dependence on $1/T$. However, it suffers an additional error of order $O(\delta_n/b)$ inherited from $S\chi_m/(T\bar{\gamma})$ in (5), which is also a consequence of the last error term in (4). We recall from the definition of $\delta_n$ in Proposition 1 that if $b < n$ or samples in the mini-batch are chosen independently from $[n]$, then $\delta_n = 1$. However, the error term is eliminated when $\mathcal{I}_k = [n]$ (corresponding to $\delta_n = 0$). In this case, ZO-SVRG (Algorithm 2) reduces to ZO-GD in [13], since Step 7 of Algorithm 2
becomes $\hat{v}_k^s = \hat{\nabla} f(x_k^s)$. A recent work [25, Theorem 1] also identified a possible side effect of order $O(1/b)$ for $b < n$ in the context of ZO nonconvex multi-agent optimization using a method of multipliers. Note that a large mini-batch reduces the variance of RandGradEst and improves the convergence of ZO optimization methods. Although the tightness of the error bound (10) is not proven, we conjecture that the dependence on $T$ and $b$ could be optimal, since this form is consistent with SVRG, and the latter does not rely on the selected parameters in (9).

Lastly, we highlight that the theoretical analysis of ZO-SVRG is different from that of ZO-SVRC [26]. For the latter, the coordinate-wise (deterministic) gradient estimate is used and hence maintains Lipschitz continuity, which does not hold for a random gradient estimate. As a result, it becomes nontrivial to bound the distance between two random gradient estimates; see Appendix A.3. Moreover, reference [26] does not fully uncover the effect of dimension dependency on the convergence of ZO-SVRC, whereas we clearly analyze this effect for ZO-SVRG in Corollary 1. Furthermore, our convergence analysis is performed under milder assumptions, while ZO-SVRC requires extra assumptions on gradients of coordinate-wise smoothing functions. In Sec. 6, we will compare the empirical performance of ZO-SVRC with our method through two real-life applications.

5One exception is ZO-SCD [7] (and its variant ZO-SVRC [26]), where $\mu \le O(1/\sqrt{T})$.

5 Acceleration of ZO-SVRG: Towards improved iteration complexity

In this section, we improve the iteration complexity of ZO-SVRG (Algorithm 2) by using Avg-RandGradEst and CoordGradEst, respectively.
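Before comparing their errors, the three estimators of Sec. 3.2 can be sketched as follows (a minimal NumPy rendering of the formulas; the function names are ours, not the paper's):

```python
import numpy as np

def rand_grad_est(f, x, mu, rng):
    """RandGradEst: (d/mu)[f(x + mu*u) - f(x)] u, u uniform on the unit sphere."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return (d / mu) * (f(x + mu * u) - f(x)) * u

def avg_rand_grad_est(f, x, mu, q, rng):
    """Avg-RandGradEst: average of q i.i.d. two-point random estimates (q queries of f(x+mu*u))."""
    return np.mean([rand_grad_est(f, x, mu, rng) for _ in range(q)], axis=0)

def coord_grad_est(f, x, mu):
    """CoordGradEst: deterministic central difference along each coordinate (2d queries)."""
    g = np.zeros_like(x)
    for l in range(x.size):
        e = np.zeros_like(x)
        e[l] = 1.0
        g[l] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return g
```

On a smooth quadratic, `coord_grad_est` is already accurate per coordinate, while the random estimators are accurate only after averaging over many directions, which matches the variance ordering formalized in Proposition 2.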
We start by comparing the squared errors of the different gradient estimates with respect to the true gradient $\nabla f$, as formalized in Proposition 2.

Proposition 2 Consider a gradient estimator $\hat{\nabla} f(x) = \nabla f(x) + \omega$. Then the squared error satisfies

$$\begin{cases} \mathbb{E}[\|\omega\|_2^2] \le O(d) \|\nabla f(x)\|_2^2 + O(\mu^2 L^2 d^2) & \text{for RandGradEst}, \\ \mathbb{E}[\|\omega\|_2^2] \le O\!\left( \frac{q+d}{q} \right) \|\nabla f(x)\|_2^2 + O(\mu^2 L^2 d^2) & \text{for Avg-RandGradEst}, \\ \|\omega\|_2^2 \le O\!\left( L^2 d \sum_{\ell=1}^{d} \mu_\ell^2 \right) & \text{for CoordGradEst}. \end{cases} \qquad (11)$$

Proof: See Appendix A.6.

Proposition 2 shows that compared to CoordGradEst, RandGradEst and Avg-RandGradEst involve an additional error term within a factor $O(d)$ and $O((q+d)/q)$ of $\|\nabla f(x)\|_2^2$, respectively. Such an error is introduced by the second-order moment of gradient estimators using random direction samples [13, 14], and it decreases as the number of direction samples $q$ increases. On the other hand, all gradient estimators share a common error bounded by $O(\mu^2 L^2 d^2)$, where we let $\mu_\ell = \mu$ for $\ell \in [d]$ in CoordGradEst. If $\mu$ is specified as in (9), then we obtain the error term $O(d/T)$, consistent with the convergence rate of ZO-SVRG in Corollary 1.

In Theorem 2, we show the effect of Avg-RandGradEst on the convergence rate of ZO-SVRG.

Theorem 2 Suppose A1 and A2 hold, and Avg-RandGradEst is used in Algorithm 2.
Then $\mathbb{E}[\|\nabla f(\bar{x})\|_2^2]$ is bounded in the same way as in (5), where the parameters $\gamma_k$, $\chi_k$ and $c_k$ for $k \in [m]$ are modified by

$$\gamma_k = \frac{1}{2} \left( 1 - \frac{c_{k+1}}{\beta_k} \right) \eta_k - \left( \frac{L}{2} + c_{k+1} \right) \frac{(72\delta_n + 4b)(q+d)}{bq} \, \eta_k^2,$$

$$\chi_k = \left( 1 - \frac{c_{k+1}}{\beta_k} \right) \frac{\mu^2 d^2 L^2}{4} \, \eta_k + \left( \frac{L}{2} + c_{k+1} \right) \frac{(6\delta_n + b)(q+1) L^2 d^2 \mu^2 + 72(q+d) \sigma^2 \delta_n}{bq} \, \eta_k^2,$$

$$c_k = \left( 1 + \beta_k \eta_k + \frac{6(4d + 5q) L^2 \delta_n}{bq} \, \eta_k^2 \right) c_{k+1} + \frac{3(4d + 5q) L^3 \delta_n}{bq} \, \eta_k^2, \qquad c_m = 0.$$

Given the setting in Corollary 1 and $m = \lceil \frac{d}{55\rho} \rceil$, the convergence rate simplifies to

$$\mathbb{E}[\|\nabla f(\bar{x})\|_2^2] \le O\!\left( \frac{d}{T} + \frac{\delta_n}{b \min\{d, q\}} \right). \qquad (12)$$

Proof: See Appendix A.7.

By contrast with Corollary 1, it can be seen from (12) that the use of Avg-RandGradEst reduces the error $O(\delta_n/b)$ in (10) through multiple ($q$) direction samples. If $\frac{T}{db} \le q \le d$, then the convergence error under Avg-RandGradEst is dominated by $O(d/T)$. Our empirical results show that a moderate choice of $q$ can significantly speed up the convergence of ZO-SVRG.

We next study the effect of the coordinate-wise gradient estimator (CoordGradEst) on the convergence rate of ZO-SVRG, as formalized in Theorem 3.

Theorem 3 Suppose A1 and A2 hold, and CoordGradEst with $\mu_\ell = \mu$ is used in Algorithm 2.
Then

$$\mathbb{E}[\|\nabla f(\bar{x})\|_2^2] \le \frac{f(\tilde{x}_0) - f^*}{T \bar{\gamma}} + \frac{S \chi_m}{T \bar{\gamma}}, \qquad (13)$$

where $T$, $f^*$, $\bar{\gamma}$ and $\chi_m$ were defined in (5), the parameters $\gamma_k$, $\chi_k$ and $c_k$ for $k \in [m]$ are given by

$$\gamma_k = \frac{1}{2} \left( 1 - \frac{c_{k+1}}{\beta_k} \right) \eta_k - 4 \left( \frac{L}{2} + c_{k+1} \right) \eta_k^2,$$

$$\chi_k = \left( 1 - \frac{c_{k+1}}{\beta_k} \right) \frac{L^2 \mu^2 d^2}{4} \, \eta_k + \left( \frac{L}{2} + c_{k+1} \right) \mu^2 L^2 d^2 \, \eta_k^2,$$

$$c_k = \left( 1 + \beta_k \eta_k + \frac{2 d L^2 \delta_n \eta_k^2}{b} \right) c_{k+1} + \frac{d L^3 \delta_n \eta_k^2}{b}, \qquad c_m = 0,$$

and $\beta_k$ is a positive parameter ensuring $\gamma_k > 0$. Given the specific setting in Corollary 1 and $m = \lceil \frac{d}{3\rho} \rceil$, the convergence rate simplifies to

$$\mathbb{E}[\|\nabla f(\bar{x})\|_2^2] \le O\!\left( \frac{d}{T} \right). \qquad (14)$$

Proof: See Appendix A.8.

Theorem 3 shows that the use of CoordGradEst improves the iteration complexity: the error of order $O(1/b)$ in Corollary 1, or $O(1/(b \min\{d, q\}))$ in Theorem 2, is eliminated in (14). This improvement benefits from the low variance of CoordGradEst shown in Proposition 2. We can also see this benefit by comparing $\chi_k$ in Theorem 3 with (7): the former avoids the term $(d\sigma^2/b)$. The disadvantage of CoordGradEst is that it needs $d$ times more function queries than RandGradEst for gradient estimation.

Recall that RandGradEst, Avg-RandGradEst and CoordGradEst require $O(1)$, $O(q)$ and $O(d)$ function queries, respectively. In ZO-SVRG (Algorithm 2), the total number of gradient evaluations is given by $nS + bT$, where $T = mS$.
Therefore, by fixing the number of iterations $T$, the function query complexity of ZO-SVRG under the studied estimators is given by $O(nS + bT)$, $O(q(nS + bT))$ and $O(d(nS + bT))$, respectively. In Table 1, we summarize the convergence rates and function query complexities of ZO-SVRG and its two variants, which we call ZO-SVRG-Ave and ZO-SVRG-Coord, respectively. For comparison, we also present the results of ZO-SGD [24] and ZO-SVRC [26], where the latter updates $J$ coordinates per iteration within an epoch. Table 1 shows that ZO-SGD has the lowest query complexity but the worst convergence rate, while ZO-SVRG-Coord yields the best convergence rate at the cost of a high query complexity. By contrast, ZO-SVRG (with an appropriate mini-batch size) and ZO-SVRG-Ave achieve better trade-offs between the convergence rate and the query complexity.

Table 1: Summary of convergence rate and function query complexity of our proposals given $T$ iterations.

Method | Grad. estimator | Stepsize | Convergence rate (worst case, $b < n$) | Query complexity
ZO-SVRG | RandGradEst | $O(1/d)$ | $O(d/T + 1/b)$ | $O(nS + bT)$
ZO-SVRG-Ave | Avg-RandGradEst | $O(1/d)$ | $O(d/T + 1/(b \min\{d, q\}))$ | $O(qnS + qbT)$
ZO-SVRG-Coord | CoordGradEst | $O(1/d)$ | $O(d/T)$ | $O(dnS + dbT)$
ZO-SGD [24] | RandGradEst | $O(\min\{1/d, 1/\sqrt{dT}\})$ | $O(\sqrt{d}/\sqrt{T})$ | $O(bT)$
ZO-SVRC [26] | CoordGradEst | $O(1/n^{\alpha})$, $\alpha \in (0, 1)$ | $O(d/T)$ | $O(dnS + JbT)$

6 Applications and experiments

We evaluate the performance of our proposed algorithms on two applications: black-box classification and generating adversarial examples from black-box DNNs.
The first application is motivated by a real-world material science problem, where a material is classified as either a conductor or an insulator by a density functional theory (DFT) based black-box simulator [31]. The second application arises in testing the robustness of a deployed DNN via iterative model queries [1, 3]. Since ZO-SVRG belongs to the class of ZO counterparts of first-order algorithms using random/deterministic gradient estimation, we compare it with ZO-SGD and ZO-SVRC, the most relevant methods to ZO-SVRG.

Black-box binary classification We consider a non-linear least squares problem [32, Sec. 3.2], i.e., problem (1) with $f_i(x) = (y_i - \phi(x; a_i))^2$ for $i \in [n]$. Here $(a_i, y_i)$ is the $i$th data sample containing feature vector $a_i \in \mathbb{R}^d$ and label $y_i \in \{0, 1\}$, and $\phi(x; a_i)$ is a black-box function that only returns the function value for a given input. The dataset consists of $N = 1000$ crystalline materials/compounds extracted from the Open Quantum Materials Database [33]. Each compound has $d = 145$ chemical features, and its label (0 for conductor, 1 for insulator) is determined by a DFT simulator [34]. Due to the black-box nature of DFT, the true $\phi$ is unknown6. We split the dataset into two equal parts, leading to $n = 500$ training samples and $(N - n)$ testing samples. We refer readers to Appendix A.10 for more details on our dataset and the experimental setting.

Figure 2: Comparison of different ZO algorithms for the task of chemical material classification. (a) Training loss versus iterations. (b) Training loss versus function queries.

Table 2: Testing error for chemical material classification using $7.3 \times 10^6$ function queries.

Method | ZO-SGD [24] | ZO-SVRC [26] | ZO-SVRG | ZO-SVRG-Coord | ZO-SVRG-Ave
# of epochs | 14600 | 100 | 2920 | 50 | 365
Error (%) | 12.56 | 23.70 | 11.18 | 20.67 | 15.26

In Fig.
2, we present the training loss against the number of epochs (i.e., iterations divided by the epoch length m = 50) and against the number of function queries. We compare our proposed algorithms ZO-SVRG, ZO-SVRG-Coord and ZO-SVRG-Ave with ZO-SGD [24] and ZO-SVRC [26]. Fig. 2-(a) presents the convergence trajectories of the ZO algorithms as functions of the number of epochs, where ZO-SVRG is evaluated under different mini-batch sizes b ∈ {1, 10, 40}. We observe that the convergence error of ZO-SVRG decreases as b increases, and for a small mini-batch size b ≤ 10, ZO-SVRG likely converges to a neighborhood of a critical point, as shown by Corollary 1. We also note that our proposed algorithms ZO-SVRG (b = 40), ZO-SVRG-Coord and ZO-SVRG-Ave converge faster (i.e., have lower iteration complexity) than the existing algorithms ZO-SGD and ZO-SVRC. In particular, the use of multiple random direction samples in Avg-RandGradEst significantly accelerates ZO-SVRG, since the error of order O(1/b) is reduced to O(1/(bq)) (see Table 1), making it a non-dominant factor relative to O(d/T) in the convergence rate of ZO-SVRG-Ave. Fig. 2-(b) presents the training loss against the number of function queries. For the same experiment, Table 2 shows the number of iterations and the testing error of the algorithms studied in Fig. 2-(b) using 7.3 × 10⁶ function queries. We observe that the performance of the CoordGradEst-based algorithms (i.e., ZO-SVRC and ZO-SVRG-Coord) degrades due to the large number of function queries needed to construct coordinate-wise gradient estimates. By contrast, the algorithms based on random gradient estimators (i.e., ZO-SGD, ZO-SVRG and ZO-SVRG-Ave) yield better training and testing results, although ZO-SGD consumes an extremely large number of iterations (14600 epochs).
As a result, ZO-SVRG (b = 40) and ZO-SVRG-Ave achieve better tradeoffs between the iteration and the function query complexity.

Generation of adversarial examples from black-box DNNs⁷ In image classification, adversarial examples refer to carefully crafted perturbations that, when added to natural images, are visually imperceptible but lead the target model to misclassify. In the setting of 'zeroth order' attacks [2, 3, 35], the model parameters are hidden and acquiring their gradients is inadmissible; only the model evaluations are accessible. We can then regard the task of generating a universal adversarial perturbation (to n natural images) as a ZO optimization problem of the form (1). We elaborate on the problem formulation for generating adversarial examples in Appendix A.11.

⁶One can mimic the DFT simulator using a logistic function once the parameter x is learned from ZO algorithms.
⁷Code to reproduce experiments can be found at https://github.com/IBM/ZOSVRG-BlackBox-Adv

Figure 3: Comparison of ZO-SGD and ZO-SVRG-Ave for generation of universal adversarial perturbations from a black-box DNN. Left: attack loss versus epochs. Right: ℓ2 distortion and improvement (%) with respect to ZO-SGD.

Method                | ℓ2 distortion
ZO-SGD                | 5.22
ZO-SVRG-Ave (q = 10)  | 4.91 (6%)
ZO-SVRG-Ave (q = 20)  | 3.91 (25%)
ZO-SVRG-Ave (q = 30)  | 3.67 (30%)

We use a well-trained DNN⁸ on the MNIST handwritten digit classification task as the target black-box model, which achieves 99.4% test accuracy on natural examples. Two ZO optimization methods, ZO-SGD and ZO-SVRG-Ave, are performed in our experiment. Note that ZO-SVRG-Ave reduces to ZO-SVRG when q = 1. We choose n = 10 images from the same class, and set the same parameters b = 5 and constant step size 30/d for both ZO methods, where d = 28 × 28 is the image dimension.
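To make the black-box objective concrete, here is a hypothetical sketch of a hinge-like universal attack loss of the form (1). The actual objective is specified in Appendix A.11; `model_probs`, the clipping to [0, 1], and the regularization weight `lam` are our illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def attack_loss(model_probs, images, labels, x, lam=0.1):
    """Hinge-like black-box attack loss for a universal perturbation x.

    model_probs: black-box returning class probabilities (query access only).
    The hinge term is 0 exactly when every perturbed image is misclassified,
    so the total loss drops sharply once the attack succeeds.
    """
    loss = 0.0
    for a, y in zip(images, labels):
        p = np.log(model_probs(np.clip(a + x, 0.0, 1.0)) + 1e-12)
        # margin of the true class over the best other class, floored at 0
        hinge = max(p[y] - max(p[j] for j in range(len(p)) if j != y), 0.0)
        loss += hinge
    return loss / len(images) + lam * float(np.dot(x.ravel(), x.ravel()))
```

Since only `model_probs` evaluations are available, minimizing this loss over x is exactly the kind of problem the ZO estimators above are designed for.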
For ZO-SVRG-Ave, we set m = 10 and vary the number of random direction samples q ∈ {10, 20, 30}.

In Fig. 3, we show the black-box attack loss (against the number of epochs) as well as the least ℓ2 distortion of the successful (universal) adversarial perturbations. We observe that, compared to ZO-SGD, ZO-SVRG-Ave offers faster iteration-wise convergence to a more accurate solution, and its convergence trajectory becomes more stable as q grows (due to the reduced variance of Avg-RandGradEst). Note that the sharp drop of the attack loss in each method is caused by the hinge-like term in the total loss function, which turns to 0 only once the attack becomes successful. In addition, ZO-SVRG-Ave improves the ℓ2 distortion of adversarial examples compared to ZO-SGD (e.g., a 30% improvement when q = 30). We present the corresponding adversarial examples in Appendix A.11. In contrast with the iteration complexity, ZO-SVRG-Ave requires roughly 30× (q = 10), 77× (q = 20) and 380× (q = 30) more function evaluations than ZO-SGD to reach a neighborhood of the smallest attack loss (e.g., 7 in our example). Furthermore, we present the black-box attack loss versus the number of query counts in Fig. A1 (Appendix A.11). As we can see, ZO-SVRG-Ave requires more queries than ZO-SGD to achieve the first significant drop in attack loss. However, by fixing the total number of queries (10⁷), ZO-SVRG-Ave eventually converges to a lower loss than ZO-SGD: the former reaches an average loss of 4.81 with std 0.32 (computed from the last 100 attack losses), while the latter reaches 6.74 ± 0.46.

7 Conclusion

In this paper, we studied ZO-SVRG, a new ZO nonconvex optimization method. We presented new convergence results beyond the existing work on ZO nonconvex optimization. We show that ZO-SVRG improves the convergence rate of ZO-SGD from O(1/√T) to O(1/T) but suffers a new correction term of order O(1/b).
This is the side effect of combining a two-point random gradient estimator with SVRG. We then propose two accelerated variants of ZO-SVRG based on improved gradient estimators with reduced variances. We show an illuminating trade-off between the iteration and the function query complexity. Experimental results and theoretical analysis validate the effectiveness of our approaches compared to other state-of-the-art algorithms. In the future, we will compare ZO-SVRG with other derivative-free (non-gradient-estimation based) methods for solving black-box optimization problems. It will also be interesting to study the problem of ZO distributed optimization, e.g., using CoordGradEst under a block coordinate descent framework [36].

⁸https://github.com/carlini/nn_robust_attacks

Acknowledgments
This work was fully supported by the MIT-IBM Watson AI Lab. Bhavya Kailkhura was supported under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-751658). The authors are also grateful to the anonymous reviewers for their helpful comments.

References
[1] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017, pp. 506–519.

[2] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.

[3] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 15–26.

[4] T. Chen and G. B.
Giannakis, \u201cBandit convex optimization for scalable and dynamic iot management,\u201d\n\narXiv preprint arXiv:1707.09060, 2017.\n\n[5] S. Liu, J. Chen, P.-Y. Chen, and A. O. Hero, \u201cZeroth-order online admm: Convergence analysis and\napplications,\u201d in Proceedings of the Twenty-First International Conference on Arti\ufb01cial Intelligence and\nStatistics, April 2018, vol. 84, pp. 288\u2013297.\n\n[6] M. C. Fu, \u201cOptimization for simulation: Theory vs. practice,\u201d INFORMS Journal on Computing, vol. 14,\n\nno. 3, pp. 192\u2013215, 2002.\n\n[7] X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu, \u201cA comprehensive linear speedup analysis for\nasynchronous stochastic parallel optimization from zeroth-order to \ufb01rst-order,\u201d in Advances in Neural\nInformation Processing Systems, 2016, pp. 3054\u20133062.\n\n[8] R. P. Brent, Algorithms for minimization without derivatives, Courier Corporation, 2013.\n\n[9] J. C. Spall, Introduction to stochastic search and optimization: estimation, simulation, and control, vol. 65,\n\nJohn Wiley & Sons, 2005.\n\n[10] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, \u201cOnline convex optimization in the bandit setting:\ngradient descent without a gradient,\u201d in Proceedings of the sixteenth annual ACM-SIAM symposium on\nDiscrete algorithms, 2005, pp. 385\u2013394.\n\n[11] O. Shamir, \u201cOn the complexity of bandit and derivative-free stochastic convex optimization,\u201d in Conference\n\non Learning Theory, 2013, pp. 3\u201324.\n\n[12] A. Agarwal, O. Dekel, and L. Xiao, \u201cOptimal algorithms for online convex optimization with multi-point\n\nbandit feedback,\u201d in COLT, 2010, pp. 28\u201340.\n\n[13] Y. Nesterov and V. Spokoiny, \u201cRandom gradient-free minimization of convex functions,\u201d Foundations of\n\nComputational Mathematics, vol. 2, no. 17, pp. 527\u2013566, 2015.\n\n[14] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. 
Wibisono, “Optimal rates for zero-order convex optimization: The power of two function evaluations,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2788–2806, 2015.

[15] O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” Journal of Machine Learning Research, vol. 18, no. 52, pp. 1–11, 2017.

[16] X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of the ADMM: an iteration complexity perspective,” Optimization Online, vol. 12, 2014.

[17] P. Dvurechensky, A. Gasnikov, and E. Gorbunov, “An accelerated method for derivative-free smooth stochastic convex optimization,” arXiv preprint arXiv:1802.09022, 2018.

[18] Y. Wang, S. Du, S. Balakrishnan, and A. Singh, “Stochastic zeroth-order optimization in high dimensions,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, April 2018, vol. 84, pp. 1356–1365, PMLR.

[19] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in Neural Information Processing Systems, 2013, pp. 315–323.

[20] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic variance reduction for nonconvex optimization,” in International Conference on Machine Learning, 2016, pp. 314–323.

[21] A. Nitanda, “Accelerated stochastic gradient descent for minimizing finite sums,” in Artificial Intelligence and Statistics, 2016, pp. 195–203.

[22] Z. Allen-Zhu and Y. Yuan, “Improved SVRG for non-strongly-convex or sum-of-non-convex objectives,” in International Conference on Machine Learning, 2016, pp. 1080–1089.

[23] L. Lei, C. Ju, J. Chen, and M. I.
Jordan, “Non-convex finite-sum optimization via SCSG methods,” in Advances in Neural Information Processing Systems, 2017, pp. 2345–2355.

[24] S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.

[25] D. Hajinezhad, M. Hong, and A. Garcia, “Zeroth order nonconvex multi-agent optimization over networks,” arXiv preprint arXiv:1710.09997, 2017.

[26] B. Gu, Z. Huo, and H. Huang, “Zeroth-order asynchronous doubly stochastic algorithm with variance reduction,” arXiv preprint arXiv:1612.01425, 2016.

[27] K. M. Choromanski and V. Sindhwani, “On blackbox backpropagation and Jacobian sensing,” in Advances in Neural Information Processing Systems, 2017, pp. 6524–6532.

[28] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein, “REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models,” in Advances in Neural Information Processing Systems, 2017, pp. 2624–2633.

[29] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud, “Backpropagation through the void: Optimizing control variates for black-box gradient estimation,” arXiv preprint arXiv:1711.00123, 2017.

[30] N. S. Chatterji, N. Flammarion, Y.-A. Ma, P. L. Bartlett, and M. I. Jordan, “On the theory of variance reduction for stochastic gradient Monte Carlo,” arXiv preprint arXiv:1802.05431, 2018.

[31] W. Yang and P. W. Ayers, “Density-functional theory,” in Computational Medicinal Chemistry for Drug Discovery, pp. 103–132. CRC Press, 2003.

[32] P. Xu, F. Roosta-Khorasani, and M. W. Mahoney, “Second-order optimization for non-convex machine learning: An empirical study,” arXiv preprint arXiv:1708.07827, 2017.

[33] S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M.
Aykol, S. Rühl, and C. Wolverton, “The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies,” npj Computational Materials, vol. 1, pp. 15010, 2015.

[34] G. Kresse and J. Furthmüller, “Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set,” Computational Materials Science, vol. 6, no. 1, pp. 15–50, 1996.

[35] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy, 2017, pp. 39–57.

[36] H. Wang and A. Banerjee, “Randomized block coordinate descent for online and stochastic optimization,” arXiv preprint arXiv:1407.0107, 2014.

[37] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.

[38] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, “A general-purpose machine learning framework for predicting properties of inorganic materials,” npj Computational Materials, vol. 2, pp. 16028, 2016.