{"title": "Zeroth-order (Non)-Convex Stochastic Optimization via Conditional Gradient and Gradient Updates", "book": "Advances in Neural Information Processing Systems", "page_first": 3455, "page_last": 3464, "abstract": "In this paper, we propose and analyze zeroth-order stochastic approximation algorithms for nonconvex and convex optimization. Specifically, we propose generalizations of the conditional gradient algorithm achieving rates similar to the standard stochastic gradient algorithm using only zeroth-order information. Furthermore, under a structural sparsity assumption, we first illustrate an implicit regularization phenomenon where the standard stochastic gradient algorithm with zeroth-order information adapts to the sparsity of the problem at hand by just varying the step-size. Next, we propose a truncated stochastic gradient algorithm with zeroth-order information, whose rate of convergence depends only poly-logarithmically on the dimensionality.", "full_text": "Zeroth-order (Non)-Convex Stochastic Optimization\nvia Conditional Gradient and Gradient Updates\n\nKrishnakumar Balasubramanian\n\nDepartment of Statistics\n\nUniversity of California, Davis\n\nkbala@ucdavis.edu\n\nSaeed Ghadimi*\n\nDepartment of Operations Research and Financial Engineering\n\nPrinceton University\n\nsghadimi@princeton.edu\n\nAbstract\n\nIn this paper, we propose and analyze zeroth-order stochastic approximation\nalgorithms for nonconvex and convex optimization. Specifically, we propose\ngeneralizations of the conditional gradient algorithm achieving rates similar to the standard\nstochastic gradient algorithm using only zeroth-order information. Furthermore,\nunder a structural sparsity assumption, we first illustrate an implicit regularization\nphenomenon where the standard stochastic gradient algorithm with zeroth-order\ninformation adapts to the sparsity of the problem at hand by just varying the step-size. 
Next, we propose a truncated stochastic gradient algorithm with zeroth-order\ninformation, whose rate depends only poly-logarithmically on the dimensionality.\n\n1 Introduction\nIn this work, we propose and analyze algorithms for solving the following stochastic optimization\nproblem\n\n$$\\min_{x \\in \\mathcal{X}} \\left\\{ f(x) = \\mathbb{E}_\\xi[F(x, \\xi)] = \\int F(x, \\xi) \\, dP(\\xi) \\right\\}, \\qquad (1.1)$$\n\nwhere $\\mathcal{X}$ is a closed convex subset of $\\mathbb{R}^d$. The case of a nonconvex objective function $f$ is ubiquitous\nin modern deep learning problems, and developing provable algorithms for such problems has been a\ntopic of intense research in recent years [16, 11], along with the more standard convex case [1].\nSeveral methods are available for solving such stochastic optimization problems under access to\ndifferent oracle information, for example, function queries (zeroth-order oracle), gradient queries\n(first-order oracle), and higher-order oracles. In this work, we assume that we only have access to\nnoisy evaluations of $f$ through a stochastic zeroth-order oracle described in detail in Assumption 1.\nThis oracle setting is motivated by several applications where only noisy function queries of problem\n(1.1) are available and obtaining higher-order information might not be possible. Such a situation\noccurs frequently, for example, in simulation-based modeling [29], selecting the tuning parameters\nof deep neural networks [32], and designing black-box attacks on deep networks [3]. It is worth\nnoting that such zeroth-order optimization techniques have recently also been applied in the field of\nreinforcement learning [30, 4, 20]. 
Furthermore, methods using similar oracles have been studied in\nthe literature under the names of derivative-free optimization [33, 5], Bayesian optimization [21], and\noptimization with bandit feedback [2].\n\n*Both authors contributed equally and are listed in alphabetical order.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nAlgorithm | Structure | Function Queries | References\nZSCG (Alg 1) | Nonconvex | $O(d/\\epsilon^4)$ | Theorem 2.1\nZSCG (Alg 1) | Convex | $O(d/\\epsilon^3)$ | Theorem 2.1\nModified ZSCG (Alg 3) | Convex | $O(d/\\epsilon^2)$ | Theorem 2.2\nZSGD (Alg 4) | Nonconvex, s-sparse | $O((s \\log d)^2/\\epsilon^4)$ | Theorem 3.1\nTruncated ZSGD (Alg 5) | Convex, s-sparse | $O(s (\\log d/\\epsilon)^2)$ | Theorem 3.2\nZSGD | Convex | $O(d/\\epsilon^2)$ | [18, 7, 9]\nZSGD | Nonconvex | $O(d/\\epsilon^4)$ | [9]\n\nTable 1: A list of complexity bounds for stochastic zeroth-order methods to find an $\\epsilon$-stationary or\n$\\epsilon$-optimal (see Definition 1.1) point of problem (1.1).\n\nAlgorithms available for solving problem (1.1) also depend crucially on the constraint set $\\mathcal{X}$. First,\nconsider the case of $\\mathcal{X} = \\mathbb{R}^d$. When first-order information is available, the rate of convergence of\nthe standard Gradient Descent (GD) algorithm is dimension-independent [26]. In contrast, when only\nzeroth-order information is available, any algorithm (with estimated gradients) has (at least) linear\ndependence on $d$ [9, 18, 7]. This illustrates the main difference between the availability of different\noracle information. Next, note that depending on the geometry of the constraint set $\\mathcal{X}$, the cost of\ncomputing the projection onto the set might be prohibitive. This has led to the recent re-emergence of Conditional\nGradient (CG) algorithms [12, 15]. But the performance of the CG algorithm under the\nzeroth-order oracle is unexplored in the literature to the best of our knowledge, both under convex and\nnonconvex settings. 
Hence it is natural to ask whether CG algorithms with access to a zeroth-order oracle have\nsimilar (or better) convergence rates compared to GD algorithms with zeroth-order information. We\npropose and analyze in Section 2 a classical version of the CG algorithm with zeroth-order information\nand present convergence results. We then propose a modification in Section 2.2 that has improved\nrates when $f$ is convex.\nNotably, with zeroth-order information, the complexity of CG algorithms also depends linearly on\nthe dimensionality, similar to the GD algorithms. We refer to this situation as the low-dimensional\nsetting in the rest of the paper. This motivates us to examine assumptions under which one can\nachieve weaker dependence on the dimensionality while optimizing with zeroth-order information.\nIn a recent work [34], the authors used a functional sparsity assumption, under which the function\n$f : \\mathbb{R}^d \\to \\mathbb{R}$ to be optimized depends only on $s$ of the $d$ components, and proposed a LASSO-based\nalgorithm that has poly-logarithmic dependence on the dimensionality when $f$ is convex. We\nrefer to this situation as the high-dimensional setting. In this work, we perform a refined analysis\nunder a similar sparsity assumption for both convex and nonconvex objective functions. When the\nperformance is measured by the size of the gradient, we show in Section 3 that the zeroth-order GD\nalgorithm (without using the thresholding or LASSO approach of [34]) has poly-logarithmic dependence\non the dimensionality, thereby demonstrating an implicit regularization phenomenon in this setting.\nNote that this is applicable to both convex and nonconvex objectives. When the performance is\nmeasured by function values (as in the case of a convex objective), we show that a simple thresholded\nzeroth-order GD algorithm achieves a poly-logarithmic dependence on dimensionality. 
This algorithm\nis notably less expensive than the algorithm proposed by [34].\nOur contributions: To summarize the above discussion, in this paper we make the following\ncontributions to the literature on zeroth-order stochastic optimization: (i) We first analyze a classical version\nof the CG algorithm in the nonconvex (and convex) setting, under access to zeroth-order information,\nand provide results on the convergence rates in the low-dimensional setting; (ii) We then propose\nand analyze a modified CG algorithm in the convex setting with zeroth-order information and show\nthat it attains improved rates in the low-dimensional setting; (iii) Finally, we consider a zeroth-order\nstochastic gradient algorithm in the high-dimensional nonconvex (and convex) setting and illustrate an\nimplicit regularization phenomenon. We also show that this algorithm achieves rates that depend only\npoly-logarithmically on dimensionality. Our contributions extend the applicability of zeroth-order\nstochastic optimization to the constrained and high-dimensional setting and also provide theoretical\ninsights in the form of rates of convergence. A summary of the results is provided in Table 1.\n\n1.1 Preliminaries\nWe now list the main assumptions we make in this work. Additional assumptions will be introduced\nin the appropriate sections as needed. We start with the assumption on the zeroth-order oracle.\n\nAssumption 1 Let $\\|\\cdot\\|$ be a norm on $\\mathbb{R}^d$. For any $x \\in \\mathbb{R}^d$, the zeroth-order oracle outputs an\nestimator $F(x, \\xi)$ of $f(x)$ such that $\\mathbb{E}[F(x, \\xi)] = f(x)$, $\\mathbb{E}[\\nabla F(x, \\xi)] = \\nabla f(x)$, and $\\mathbb{E}[\\|\\nabla F(x, \\xi) - \\nabla f(x)\\|_*^2] \\le \\sigma^2$, where $\\|\\cdot\\|_*$ denotes the dual norm.\n\nIt should be noted that in the above assumption, we do not observe $\\nabla F(x, \\xi)$; we just assume that\nit is an unbiased estimator of the gradient of $f$ and that its variance is bounded. 
Furthermore, we make the\nfollowing smoothness assumption about the noisy estimation of $f$.\n\nAssumption 2 Function $F$ has Lipschitz continuous gradient with constant $L$, almost surely for any\n$\\xi$, i.e., $\\|\\nabla F(y, \\xi) - \\nabla F(x, \\xi)\\|_* \\le L\\|y - x\\|$, which consequently implies that\n$|F(y, \\xi) - F(x, \\xi) - \\langle \\nabla F(x, \\xi), y - x \\rangle| \\le \\frac{L}{2}\\|y - x\\|^2$.\n\nIt is easy to see that the above two assumptions imply that $f$ also has Lipschitz continuous gradient\nwith constant $L$ since\n\n$$\\|\\nabla f(y) - \\nabla f(x)\\|_* \\le \\mathbb{E}[\\|\\nabla F(y, \\xi) - \\nabla F(x, \\xi)\\|_*] \\le L\\|y - x\\| \\qquad (1.2)$$\n\ndue to Jensen\u2019s inequality for the dual norm. We now collect some facts about a gradient estimator\nbased on the above zeroth-order information. Let $u \\sim N(0, I_d)$ be a standard Gaussian random vector.\nFor some $\\nu \\in (0, \\infty)$ consider the smoothed function $f_\\nu(x) = \\mathbb{E}_u[f(x + \\nu u)]$. Nesterov [27] has\nshown that\n\n$$\\nabla f_\\nu(x) = \\mathbb{E}_u\\left[\\frac{f(x + \\nu u)}{\\nu} u\\right] = \\mathbb{E}_u\\left[\\frac{f(x + \\nu u) - f(x)}{\\nu} u\\right] = \\frac{1}{(2\\pi)^{d/2}} \\int \\frac{f(x + \\nu u) - f(x)}{\\nu} \\, u \\, e^{-\\|u\\|_2^2/2} \\, du. \\qquad (1.3)$$\n\nThis relation implies that we can estimate the gradient of $f_\\nu$ by only using evaluations of $f$. In particular,\none can define a stochastic gradient of $f_\\nu(x)$ as\n\n$$G_\\nu(x, \\xi, u) = \\frac{F(x + \\nu u, \\xi) - F(x, \\xi)}{\\nu} u, \\qquad (1.4)$$\n\nwhich is an unbiased estimator of $\\nabla f_\\nu(x)$ under Assumption 1 since\n\n$$\\mathbb{E}_{u,\\xi}[G_\\nu(x, \\xi, u)] = \\mathbb{E}_u\\left[\\frac{f(x + \\nu u) - f(x)}{\\nu} u\\right] = \\nabla f_\\nu(x).$$\n\nWe leverage some properties of $f_\\nu$ due to Nesterov [27] in our proofs later, which we replicate in the\nsupplementary material (Section A) for convenience. Finally, we define the following criteria, which\nare used to analyze the complexity of our proposed algorithms.\nDefinition 1.1 Assume that a solution $\\bar{x} \\in \\mathcal{X}$ as output of an algorithm and a target accuracy $\\epsilon > 0$\nare given. 
Then: (i) If $f$ is nonconvex, $\\bar{x}$ is called an $\\epsilon$-stationary point of the unconstrained variant\nof problem (1.1) if $\\mathbb{E}[\\|\\nabla f(\\bar{x})\\|_*] \\le \\epsilon$. For the constrained case, $\\bar{x}$ should satisfy $\\mathbb{E}[\\langle \\nabla f(\\bar{x}), \\bar{x} - u \\rangle] \\le \\epsilon$\nfor all $u \\in \\mathcal{X}$; (ii) If $f$ is convex, $\\bar{x}$ is called an $\\epsilon$-optimal point of problem (1.1) if\n$\\mathbb{E}[f(\\bar{x})] - f(x^*) \\le \\epsilon$, where $x^*$ denotes an optimal solution of the problem.\nIt should be pointed out that while the above performance measures are presented in expectation\nform, one can also use their high-probability counterparts. Since convergence results in this case\ncan be obtained by making sub-Gaussian tail assumptions on the output of the zeroth-order oracle\nand using the standard two-stage process presented in [9, 19], we do not elaborate more on this\napproach. Furthermore, note that the aforementioned measures for evaluating the algorithms are from\nthe derivative-free optimization point of view. In the literature on optimization with bandit feedback,\nthe preferred performance measure is the so-called regret of the algorithm [2, 31], which may behave\ndifferently from our performance measures.\n2 Zeroth-order Stochastic Conditional Gradient Type Method\nIn this section, we study zeroth-order stochastic conditional gradient (ZSCG) algorithms in the\nlow-dimensional setting for solving constrained stochastic optimization problems. In particular, we\nincorporate a variant of the gradient estimate defined in (1.4) into the framework of the classical CG\nmethod and provide its convergence analysis in Subsection 2.1. We also present improved rates for a\nvariant of this method in Subsection 2.2 when $f$ is convex. Throughout this section, we assume that\n$\\mathbb{R}^d$ is equipped with the self-dual Euclidean norm, i.e., $\\|\\cdot\\| = \\|\\cdot\\|_2$. 
We also make the following\nnatural boundedness assumption.\n\nAlgorithm 1 Zeroth-order Stochastic Conditional Gradient Method\n\nInput: $z_0 \\in \\mathcal{X}$, smoothing parameter $\\nu > 0$, non-negative sequence $\\alpha_k$, positive integer sequence\n$m_k$, iteration limit $N \\ge 1$ and probability distribution $P_R(\\cdot)$ over $\\{1, \\ldots, N\\}$.\nfor $k = 1, \\ldots, N$ do\n\n1. Generate $u_k = [u_{k,1}, \\ldots, u_{k,m_k}]$, where $u_{k,j} \\sim N(0, I_d)$, call the stochastic oracle to\ncompute $m_k$ stochastic gradients $G_\\nu^{k,j}$ according to (1.4) and take their average:\n\n$$\\bar{G}_\\nu^k \\equiv \\bar{G}_\\nu(z_{k-1}, \\xi_k, u_k) = \\frac{1}{m_k} \\sum_{j=1}^{m_k} \\frac{F(z_{k-1} + \\nu u_{k,j}, \\xi_{k,j}) - F(z_{k-1}, \\xi_{k,j})}{\\nu} u_{k,j}. \\qquad (2.1)$$\n\n2. Compute\n\n$$x_k = \\arg\\min_{u \\in \\mathcal{X}} \\langle \\bar{G}_\\nu^k, u \\rangle, \\qquad (2.2)$$\n\n$$z_k = (1 - \\alpha_k) z_{k-1} + \\alpha_k x_k. \\qquad (2.3)$$\n\nend for\nOutput: Generate $R$ according to $P_R(\\cdot)$ and output $z_R$.\n\nAssumption 3 The feasible set $\\mathcal{X}$ is bounded such that $\\max_{x,y \\in \\mathcal{X}} \\|y - x\\| \\le D_\\mathcal{X}$ for some $D_\\mathcal{X} > 0$.\nMoreover, for all $x \\in \\mathcal{X}$, there exists a constant $B > 0$ such that $\\|\\nabla f(x)\\| \\le B$.\nWe should point out that under Assumptions 1 and 2, the second statement in Assumption 3 follows\nimmediately from the first one by choosing $B := L D_\\mathcal{X} + \\|\\nabla f(x^*)\\|$. However, we just use $B$ in our\nanalysis for simplicity.\n\n2.1 Zeroth-order Stochastic Conditional Gradient Method\nThe vanilla ZSCG method is formally presented in Algorithm 1 and a few remarks about it follow.\nFirst, note that this algorithm differs from the classical CG method in estimating the gradient using\nzeroth-order information and in outputting a random solution from the generated trajectory. This\nrandomization scheme is the current practice in the literature to provide convergence results for\nnonconvex stochastic optimization (see e.g., [9, 28]). 
Second, $\\bar{G}_\\nu^k$ is the averaged variant of the\ngradient estimator presented in Subsection 1.1 and is still an unbiased estimator of $\\nabla f_\\nu(z_{k-1})$.\nMoreover, it can be easily seen that it has a reduced variance with respect to the individual estimators,\ni.e.,\n\n$$\\mathbb{E}[\\|\\bar{G}_\\nu^k - \\nabla f_\\nu(z_{k-1})\\|^2] \\le \\frac{1}{m_k} \\mathbb{E}[\\|G_\\nu^{k,j} - \\nabla f_\\nu(z_{k-1})\\|^2]. \\qquad (2.4)$$\n\nWe emphasize that the use of the above variance reduction technique in stochastic CG methods is\nstandard and has been previously proposed and leveraged in several works (see e.g., [19, 13, 28,\n22, 23, 10]). Indeed, when the exact gradient is not available, an error term appears in the convergence\nanalysis, which should converge to 0 at a certain rate as the algorithm moves forward. Hence, the\nchoice of $m_k$ plays a key role in the convergence analysis of Algorithm 1. $\\bar{G}_\\nu^k$ can also be viewed\nas a biased estimator for $\\nabla f(z_{k-1})$. Finally, since $f$ is possibly nonconvex, we need a different\ncriterion than the optimality gap to provide convergence analysis of Algorithm 1. The well-known\nFrank-Wolfe Gap given by\n\n$$g_\\mathcal{X}^k \\equiv g_\\mathcal{X}(z_{k-1}) := \\langle \\nabla f(z_{k-1}), z_{k-1} - \\hat{x}_k \\rangle, \\quad \\text{where } \\hat{x}_k = \\arg\\min_{u \\in \\mathcal{X}} \\langle \\nabla f(z_{k-1}), u \\rangle, \\qquad (2.5)$$\n\nhas been widely used in the literature to show the rate of convergence of CG methods when $f$ is convex\n(see e.g., [8, 6, 14]). In this case, it is easy to see that\n\n$$f(z_{k-1}) - f^* \\le g_\\mathcal{X}(z_{k-1}). \\qquad (2.6)$$\n\nWhen $f$ is nonconvex, this criterion is still useful since $\\langle \\nabla f(z_{k-1}), z_{k-1} - u \\rangle \\le g_\\mathcal{X}(z_{k-1})$, $\\forall u \\in \\mathcal{X}$,\nwhich implies that one can obtain an approximate stationary point of problem (1.1) by minimizing\n$g_\\mathcal{X}^k$, in the view of Definition 1.1. Note that in our setting, this quantity is not exactly computable and\nit is only used to provide convergence analysis of Algorithm 1 as shown in the next result.\nTheorem 2.1 Let $\\{z_k\\}_{k \\ge 0}$ be generated by Algorithm 1 and Assumptions 1, 2, and 3 hold.\n\n1. 
Let f be nonconvex, bounded from below by f\u21e4, and let the parameters of the algorithm be\n\nset as\n\n\u232b =s 2BL\n\nN (d + 3)3 ,\u21b5 k =\n\n1\npN\n\n, mk = 2BL(d + 5)N, 8k 1\n\n(2.7)\n\nwe have\n\nfor some constant BL max{pB2 + 2/L, 1} and a given iteration bound N 1. Then\n\nX + 2pB2 + 2\nf (z0) f\u21e4 + LD2\npN\n\nE[gR\nX\n\n(2.8)\nwhere R is uniformly distributed over {1, . . . , N} and gk is de\ufb01ned in (2.5). Hence, the\ntotal number of calls to the zeroth-order stochastic oracle and linear subproblems required\nto be solved to \ufb01nd an \u270f-stationary point of problem (1.1) are, respectively, bounded by\n\n] \uf8ff\n\n,\n\n\u270f4\u25c6 , O\u2713 1\nO\u2713 d\n\u270f2\u25c6 .\n\n2. Let f be convex and let the parameters be set to\n\n\u232b =s 2BL\n\nN 2(d + 3)3 ,\u21b5 k =\n\n6\n\nk + 5\n\n, mk = 2BL(d + 5)N 2, 8k 1.\n\n(2.10)\n\nThen we have\n\nE[f (zN )] f\u21e4 + E[gR\n\nX\n\n] \uf8ff\n\n120[f (z0) f (x\u21e4)]\n\n(N + 3)3\n\n+\n\n36LD2\nX\nN + 5\n\n+\n\npB2 + 2\n\nN\n\n(2.11)\n\nwhere R is random variable from {1, . . . , N} whose probability distribution is given by\n\nPR(R = k) =\n\n\u21b5kN\n\n2N (1 N )\n\n,\n\nk =\n\nkYi=1\u21e31 \n\n\u21b5i\n\n2\u2318 , 0 = 1.\n\nHence, the total number of calls to the zeroth-order stochastic oracle and linear subproblems\nrequired to be solved to \ufb01nd and \u270f-optimal solution of problem (1.1) are, respectively,\nbounded by\n\n\u270f3\u25c6 , O\u2713 1\nO\u2713 d\n\u270f\u25c6 .\n\n(2.9)\n\n(2.12)\n\n(2.13)\n\nRemark 1 Observe that the complexity bounds in (2.9), in terms of \u270f, match the ones obtained in\n[10, 28, 23] for stochastic CG method with \ufb01rst-order oracle applied to nonconvex problems. For\nconvex problems, similar observation can be made for terms in (2.13) which match the ones in\n[13, 10]. Note that the linear dependence of our complexity bounds on d is unimprovable due to the\nlower bounds for zeorth-order algorithms applied to convex optimization problems [7]. 
We conjecture\nthat this is also the case for nonconvex problems.\n\n2.2 Improved Rates for Convex Problems\nOur goal in this subsection is to improve the complexity bounds of the ZSCG method when $f$ is\nconvex. Recall that the ZSCG method presented in Section 2.1 involves two main steps: the gradient\nevaluation step and the linear optimization step. Motivated by [19], we now propose a modified\nalgorithm that allows one to skip the gradient evaluation from time to time. Notice that, as our\ngradients are estimated by calling the zeroth-order oracle, this directly reduces the number of calls to\nthe zeroth-order oracle. We first state a subroutine in Algorithm 2 used in our modified algorithm.\nNote that Algorithm 2 is indeed the zeroth-order conditional gradient method for inexactly solving\nthe following quadratic program\n\n$$P_\\mathcal{X}(x, g, \\gamma) = \\arg\\min_{u \\in \\mathcal{X}} \\left\\{ \\langle g, u \\rangle + \\frac{\\gamma}{2} \\|u - x\\|^2 \\right\\}, \\qquad (2.15)$$\n\nwhich is the standard subproblem of stochastic first-order methods applied to a minimization problem\nwhen $g$ is an unbiased stochastic gradient of the objective function at $x$. We now present Algorithm 3,\nwhich applies the CG method to inexactly solve subproblems of the stochastic accelerated gradient\nmethod. This way of using CG methods can significantly improve the total number of calls to the\nstochastic oracle. Our next result provides convergence analysis of this algorithm.\n\nAlgorithm 2 Inexact Conditional Gradient (ICG) method\n\nInput: $(x, g, \\gamma, \\mu)$.\nSet $\\bar{y}_0 = x$, $t = 1$, and $\\kappa = 0$.\nwhile $\\kappa = 0$ do\n\n$$y_t = \\arg\\min_{u \\in \\mathcal{X}} \\left\\{ h(u) := \\langle g + \\gamma(\\bar{y}_{t-1} - x), u - \\bar{y}_{t-1} \\rangle \\right\\} \\qquad (2.14)$$\n\nIf $h(y_t) \\ge -\\mu$, set $\\kappa = 1$.\nElse $\\bar{y}_t = \\frac{t-1}{t+1} \\bar{y}_{t-1} + \\frac{2}{t+1} y_t$ and $t = t + 1$.\n\nend while\nOutput $\\bar{y}_t$.\n\nAlgorithm 3 Zeroth-order Stochastic Accelerated Gradient Method with Inexact Updates\n\nInput: $z_0 = x_0 \\in \\mathcal{X}$, smoothing parameter $\\nu > 0$, sequences $\\alpha_k$, $m_k$, $\\gamma_k$, $\\mu_k$, and iteration limit\n$N \\ge 1$.\nfor k = 1, . . . 
, N do\n1. Set\n\n(2.16)\n2. Generate uk = [uk,1, . . . , uk,mk ], where uk,j \u21e0 N (0, Id), call the stochastic oracle mk\n\ntimes to compute \u00afGk\n\nwk = (1 \u21b5k)zk1 + \u21b5kxk1\n\u232b \u2318 \u00afG\u232b(wk,\u21e0 k, uk) as given by (2.1), and set\n\u232b, k, \u00b5k),\nwhere ICG(\u00b7) is the output of Algorithm 2 with input (xk1, \u00afGk\nzk = (1 \u21b5k)zk1 + \u21b5kxk\n\nxk = ICG(xk1, \u00afGk\n\n3. Set\n\n(2.17)\n\n(2.18)\n\n\u232b, k).\n\nend for\nOutput: zN\n\nTheorem 2.2 Let {zk}k1 be generated by Algorithm 3, the function f be convex, and\n,s D0\nd(N + 1))\n\nmax( 1\n\nLD0\nX\nkN\n\n1\np2N\n\n, \u00b5k =\n\n, k =\n\n\u21b5k =\n\n4L\nk\n\n,\u232b =\n\n2\n\nk + 1\n\nd + 3\n\nX\n\nmk =\n\nk(k + 1)\n\nD0\nX\n\nand for some constants D0\nAssumptions 1, 2, and 3, we have\n\nmax{(d + 5)BLN, d + 3} , 8k 1,\nX kx0 x\u21e4k2 and BL max{pB2 + 2/L, 1}. Then under\n\n(2.19)\n\nE[f (zN ) f (x\u21e4)] \uf8ff\n\n12LD0\nX\nN (N + 1)\n\n.\n\n(2.20)\n\nHence, the total number of calls to the stochastic oracle and linear subproblems solved to \ufb01nd and\n\u270f-stationary point of problem (1.1) are, respectively, bounded by\n\n\u270f2\u25c6 , O\u2713 1\nO\u2713 d\n\u270f\u25c6 .\n\n(2.21)\n\nRemark 2 Observe that while the number of linear subproblems required to \ufb01nd an \u270f-optimal\nsolution of problem (1.1) is the same for both Algorithms 1 and 3, the number of calls to the stochastic\nzeroth-order oracle in Algorithm 3 is signi\ufb01cantly smaller than that of Algorithm 1. It is also natural\nto ask if such an improvement is achievable when f is nonconvex. This situation is more subtle and\nthe answer depends on the performance measure used to measure the rate of convergence. 
Indeed, we\ncan obtain improved complexity bounds for a different performance measure than the Frank-Wolfe\ngap with a modified algorithm. However, the complexity bounds are of the same order as (2.9) in\nterms of the Frank-Wolfe gap for the modified algorithm. For the sake of completeness, we add this\nalgorithm and its convergence analysis in the supplementary material in Section D.\n\nAlgorithm 4 Zeroth-Order Stochastic Gradient Method\n\nInput: $x_0 \\in \\mathbb{R}^d$, smoothing parameter $\\nu > 0$, iteration limit $N \\ge 1$, a probability distribution $P_R$\nsupported on $\\{0, \\ldots, N - 1\\}$.\nfor $k = 1, \\ldots, N$ do\nGenerate $u_k \\sim N(0, I_d)$, call the stochastic oracle, and compute $G_\\nu(x_{k-1}, \\xi_k, u_k)$ as defined\nin (1.4) and set $x_k = x_{k-1} - \\gamma_k G_\\nu(x_{k-1}, \\xi_k, u_k)$.\nend for\nOutput: Generate $R$ according to $P_R(\\cdot)$ and output $x_R$.\n\n3 Zeroth-order Stochastic Gradient Methods\nIn this section, we study the unconstrained variant of problem (1.1), i.e., $\\mathcal{X} = \\mathbb{R}^d$, under certain sparsity\nassumptions on the objective function $f$ to facilitate zeroth-order optimization in high dimensions.\nRecently, [34] considered the convex case and proposed algorithms for high-dimensional zeroth-order\nstochastic optimization. Motivated by [34], we make the following assumption.\nAssumption 4 For any $x \\in \\mathbb{R}^d$, we have $\\|\\nabla f(x)\\|_0 \\le s$, i.e., the gradient is $s$-sparse, where $s \\ll d$.\nNote that the above assumption implies $\\|\\nabla f(x)\\|_2 \\le \\sqrt{s} \\|\\nabla f(x)\\|_\\infty$ and $\\|\\nabla f(x)\\|_1 \\le s \\|\\nabla f(x)\\|_\\infty$,\nfor all $x \\in \\mathbb{R}^d$. Furthermore, this assumption also implies that $\\|\\nabla f_\\nu(x)\\|_0 \\le s$ for all $x \\in \\mathbb{R}^d$ since\n$\\nabla f_\\nu(x) = \\mathbb{E}_u[\\nabla f(x + \\nu u)]$. To exploit the above sparsity assumption, we assume that the primal\nspace $\\mathbb{R}^d$ is equipped with the $\\ell_\\infty$ norm throughout this section. More specifically, we assume that\nAssumptions 1 and 2 hold with the choice of $\\|\\cdot\\| = \\|\\cdot\\|_\\infty$ and its dual norm $\\|\\cdot\\|_* = \\|\\cdot\\|_1$. 
We now\npresent zeroth-order stochastic gradient methods for solving problem (1.1) when f is nonconvex and\nconvex, in Subsections 3.1 and 3.2 respectively.\n\n3.1 Zeroth-order Stochastic Gradient Method for Nonconvex Problems\nIn this subsection, we consider the zeroth-order stochastic gradient method presented in [9] (provided\nin Algorithm 4 for convenience) and provide a re\ufb01ned convergence analysis for it under the sparsity\nassumption 1, when f is nonconvex. Our main convergence result for Algorithm 4 under the gradient\nsparsity assumption is stated below.\nTheorem 3.1 Let {xk}k0 be generated by Algorithm 4 and stepsizes are chosen such that 8k 1,\nN ) (3.1)\n,r D0\nfor some \u02c6s s, \u02c6C C (the universal constant de\ufb01ned in Lemma C.1), and D0 f (x0) f\u21e4.\nAssume that f is nonconvex. Then under Assumptions 1, 2, and 4, we have\n\n2N 29=;\n,s D0L \u02c6C\n\nmin(r 22\n\n,\u232b \uf8ff\n\n2L \u02c6C log d\n\nk =\n\nL\n\n1\n\n1\n\n150L \u02c6CD0\u02c6ss(log d)2\n\n(3.2)\nwhere \u21e3 = {\u21e0, u, R} and R is uniformly distributed over {0, . . . , N 1}. Hence, the total number of\ncalls to the stochastic oracle (number of iterations) required to \ufb01nd an \u270f-stationary point of problem\n(1.1), in the view of De\ufb01nition 1.1, is bounded by\n\nN\n\n,\n\n1\n\n12\u02c6s log d\n\nmin8<:\nE\u21e3\u21e5krf (xR)k2\n1\u21e4 \uf8ff\n\nqL \u02c6C log d\n54p2L \u02c6CD0 s log d\n\npN\n\n+\n\nO\u2713 (\u02c6s log d)2\n\n\u270f4\n\n\u25c6 .\n\n(3.3)\n\nRemark 3 Note that the above theorem establishes rate of convergence of Algorithm 4 which only\npoly-logarithmically depends on the problem dimension d, by just selecting the step-size appropriately,\nunder additional assumption that the gradient is sparse. 
This significantly improves the linear\ndimensionality dependence of the rate of convergence of this algorithm as presented in [9] for general\nnonconvex smooth problems.\n\nAlgorithm 5 Truncated Zeroth-Order Stochastic Gradient Method\n\nGiven a positive integer $\\hat{s}$, replace the updating step of Algorithm 4 with\n\n$$x_k = P_{\\hat{s}}\\left(x_{k-1} - \\gamma_k G_\\nu(x_{k-1}, \\xi_k, u_k)\\right), \\qquad (3.4)$$\n\nwhere $P_{\\hat{s}}(x)$ keeps the $\\hat{s}$ components of $x$ with the largest absolute values and sets the others to 0.\n\nRemark 4 Remarkably, Algorithm 4 does not require any special operation to adapt to the sparsity\nassumption. This demonstrates an implicit regularization phenomenon exhibited by the zeroth-order\nstochastic gradient method in the high-dimensional setting when the performance is measured by the\nsize of the gradient in the dual norm. We emphasize that the choice of the performance measure is\nmotivated by the fact that we allow $f$ to be nonconvex. Trivially, the result also applies to the case\nwhen $f$ is convex, for the same performance measure.\n\n3.2 Zeroth-order Stochastic Gradient Method for Convex Problems\nWe now consider the case when the function $f$ is convex. In this setting, a more natural performance\nmeasure is the convergence of the optimality gap in terms of the function values. For this situation, we\npropose and analyze a truncated variant of Algorithm 4 that demonstrates similar poly-logarithmic\ndependence on the dimensionality. To proceed, in addition to Assumption 4, we also make the\nfollowing sparsity assumption on the optimal solution of problem (1.1).\nAssumption 5 Problem (1.1) has a sparse optimal solution $x^*$ such that $\\|x^*\\|_0 \\le s^*$, where $s^* \\approx s$.\nOur algorithm for the convex setting is presented in Algorithm 5. Note that this algorithm could be\nconsidered as a truncated variant of Algorithm 4 and a zeroth-order stochastic variant of the truncated\ngradient descent algorithm [17]. 
In the next result, we present convergence analysis of this algorithm.\nTheorem 3.2 Let {xk}k1 be generated by Algorithm 4, f is convex, Assumptions 1, 2, 4, and 5\nhold. Also assume the stepsizes are chosen such that, 8k 1,\nN ) (3.5)\n\n3N 29=;\n,s D0\nfor some \u02c6C C, \u02c6s max{s, s\u21e4}, and D0\nX kx0 x\u21e4k2.\n\nmin8<:\n\n,r \u02c6s2D0\n\n4 \u02c6C \u02c6s log d\n\n12L\u02c6s log d\n\nk =\n\n\u02c6C \u02c6s\n\nX\n\n1\n\n1\n\nX\n\nlog d\n\n,\u232b \uf8ffplog d min( \n69q3 \u02c6CD0\npN\n\n+\n\nX \u02c6s log d\n\nE [f (\u00afxN ) f\u21e4] \uf8ff\n\n52L \u02c6CD0\n\nX \u02c6s2(log d)2\nN\n\n,\n\n(3.6)\n\nwhere \u00afxN = PN1\n\nN\n\nk=0 xk\n\nrequired to \ufb01nd an \u270f-optimal point of problem (1.1) is bounded by\n\n. Hence, the total number of calls to the stochastic oracle (number of iterations)\n\nO \u02c6s\u2713 log d\n\n\u270f \u25c62! .\n\n(3.7)\n\nRemark 5 While for convex case, similar to the nonconvex case, the complexity of Algorithm 5\ndepends poly-logarithmically on d, it only linearly depends on the choice of \u02c6s, facilitating zeroth-order\nstochastic optimization in high-dimensions under sparsity assumptions.\n\nRemark 6 As discussed in detail in [34], both Assumption 4 and 5 are implied when we assume the\nfunction f depends on only s of the d coordinates. But, both Assumption 4 and 5 are comparatively\nweaker than that assumption. Furthermore, unlike [34], we do not make any assumption on the\nsparsity or smoothness of the second-order derivative of the objective function f for our results.\n\nRemark 7 As mentioned before, [34] considers only the convex case. Furthermore, their gradient\nestimator with zeroth-order oracle requires poly(s, s\u21e4, log d) function queries in each iteration\nwhereas our estimator is based on only one function query per iteration. 
Moreover, [34] requires\ncomputationally expensive debiased Lasso estimators whereas our method requires only simple\nthresholding operations (for the convex case) to handle sparsity.\n\n4 Future Work\nSeveral concrete extensions are possible for future work. First, for our results, we focus on performance\nmeasures common in the optimization setting. It would be interesting to extend our results to the bandit\nsetting, where the performance is measured via the regret of the algorithm. Next, the performance\nof the conditional gradient algorithm in the high-dimensional constrained optimization setting is not\nwell-explored; the interaction between the geometry of the constraint set, sparsity structure and\nzeroth-order information is extremely interesting to explore. Finally, lower bounds can be explored\nfor the cases considered in this paper when $f$ is nonconvex.\n\nReferences\n[1] S\u00e9bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends\u00ae\nin Machine Learning, 8(3-4):231\u2013357, 2015.\n\n[2] S\u00e9bastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic\nmulti-armed bandit problems. Foundations and Trends\u00ae in Machine Learning, 5(1):1\u2013122,\n2012.\n\n[3] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order\noptimization based black-box attacks to deep neural networks without training substitute models.\nIn Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15\u201326.\nACM, 2017.\n\n[4] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller.\nStructured evolution with compact architectures for scalable policy optimization. In Proceedings\nof the 35th International Conference on Machine Learning. PMLR, 2018.\n\n[5] Andrew Conn, Katya Scheinberg, and Luis Vicente. Introduction to derivative-free optimization,\nvolume 8. SIAM, 2009.\n\n[6] V. Demyanov and A. Rubinov. 
Approximate methods in optimization problems. American\n\nElsevier Publishing Co, 1970.\n\n[7] John Duchi, Michael Jordan, Martin Wainwright, and Andre Wibisono. Optimal rates for\nzero-order convex optimization: The power of two function evaluations. IEEE Transactions on\nInformation Theory, 61(5):2788\u20132806, 2015.\n\n[8] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research\n\nLogistics Quarterly, 3:95\u2013110, 1956.\n\n[9] S. Ghadimi and G. Lan. Stochastic \ufb01rst- and zeroth-order methods for nonconvex stochastic\n\nprogramming. SIAM Journal on Optimization, 23(4):2341\u20132368, 2013.\n\n[10] Saeed Ghadimi. Conditional gradient type methods for composite nonlinear and stochastic\n\noptimization. Mathematical Programming, 2018.\n\n[11] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1.\n\nMIT press Cambridge, 2016.\n\n[12] Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the 29th\nInternational Coference on International Conference on Machine Learning, pages 1843\u20131850.\nOmnipress, 2012.\n\n[13] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In\n\nInternational Conference on Machine Learning, pages 1263\u20131271, 2016.\n\n[14] Donald Hearn. The gap function of a convex program. Operations Research Letters, 2, 1982.\n[15] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In ICML (1),\n\npages 427\u2013435, 2013.\n\n[16] Prateek Jain and Purushottam Kar. Non-convex optimization for machine learning. Foundations\n\nand Trends R in Machine Learning, 10(3-4):142\u2013336, 2017.\n[17] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for\nhigh-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages\n685\u2013693, 2014.\n\n[18] Kevin Jamieson, Robert Nowak, and Ben Recht. 
Query complexity of derivative-free optimization.\nIn Advances in Neural Information Processing Systems, pages 2672\u20132680, 2012.\n\n[19] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM\nJournal on Optimization, 26(2):1379\u20131409, 2016.\n\n[20] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive\napproach to reinforcement learning. In Advances in Neural Information Processing Systems,\n2018.\n\n[21] Jonas Mockus. Bayesian approach to global optimization: theory and applications, volume 37.\nSpringer Science & Business Media, 2012.\n\n[22] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Conditional gradient method for stochastic\nsubmodular maximization: Closing the gap. In International Conference on Artificial\nIntelligence and Statistics, pages 1886\u20131895, 2018.\n\n[23] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods:\nFrom convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554,\n2018.\n\n[24] A. S. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization.\nWiley-Interscience Series in Discrete Mathematics. John Wiley, XV, 1983.\n\n[25] Y. E. Nesterov. Introductory Lectures on Convex Optimization: a basic course. Kluwer\nAcademic Publishers, Massachusetts, 2004.\n\n[26] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87.\nSpringer Science & Business Media, 2013.\n\n[27] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions.\nFoundations of Computational Mathematics, 17(2):527\u2013566, 2017.\n\n[28] Sashank Reddi, Suvrit Sra, Barnab\u00e1s P\u00f3czos, and Alexander Smola. Stochastic Frank-Wolfe\nMethods for Nonconvex Optimization. 
In 2016 54th Annual Allerton Conference on Communication,\nControl, and Computing (Allerton), pages 1244\u20131251, 2016.\n\n[29] Reuven Rubinstein and Dirk Kroese. Simulation and the Monte Carlo method, volume 10. John\nWiley & Sons, 2016.\n\n[30] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies\nas a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.\n\n[31] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization.\nIn Conference on Learning Theory, pages 3\u201324, 2013.\n\n[32] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical Bayesian optimization of machine\nlearning algorithms. In Advances in Neural Information Processing Systems, pages 2951\u20132959,\n2012.\n\n[33] James Spall. Introduction to stochastic search and optimization: estimation, simulation, and\ncontrol, volume 65. John Wiley & Sons, 2005.\n\n[34] Yining Wang, Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-order\noptimization in high dimensions. In Proceedings of the Twenty-First International Conference on\nArtificial Intelligence and Statistics, 2018.", "award": [], "sourceid": 1775, "authors": [{"given_name": "Krishnakumar", "family_name": "Balasubramanian", "institution": "University of California, Davis"}, {"given_name": "Saeed", "family_name": "Ghadimi", "institution": "Princeton University"}]}