{"title": "On the Local Minima of the Empirical Risk", "book": "Advances in Neural Information Processing Systems", "page_first": 4896, "page_last": 4905, "abstract": "Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex non-smooth losses (such as modern deep networks), the population risk is generally significantly more well behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework which aims to optimize a smooth nonconvex function $F$ (population risk) given only access to an approximation $f$ (empirical risk) that is pointwise close to $F$ (i.e., $\\norm{F-f}_{\\infty} \\le \\nu$). Our objective is to find the $\\epsilon$-approximate local minima of the underlying function $F$ while avoiding the shallow local minima---arising because of the tolerance $\\nu$---which exist only in $f$. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ that is guaranteed \nto achieve our goal as long as $\\nu \\le O(\\epsilon^{1.5}/d)$. We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance $\\nu$ among all algorithms making a polynomial number of queries of $f$. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.", "full_text": "On the Local Minima of the Empirical Risk\n\nChi Jin\u2217\n\nUniversity of California, Berkeley\n\nchijin@cs.berkeley.edu\n\nLydia T. Liu\u2217\n\nUniversity of California, Berkeley\nlydiatliu@cs.berkeley.edu\n\nRong Ge\n\nDuke University\n\nrongge@cs.duke.edu\n\nMichael I. 
Jordan\n\nUniversity of California, Berkeley\n\njordan@cs.berkeley.edu\n\nAbstract\n\nPopulation risk is always of primary interest in machine learning; however, learning\nalgorithms only have access to the empirical risk. Even for applications with\nnonconvex nonsmooth losses (such as modern deep networks), the population\nrisk is generally signi\ufb01cantly more well-behaved from an optimization point of\nview than the empirical risk. In particular, sampling can create many spurious\nlocal minima. We consider a general framework which aims to optimize a smooth\nnonconvex function F (population risk) given only access to an approximation f\n(empirical risk) that is pointwise close to F (i.e., (cid:107)F \u2212 f(cid:107)\u221e \u2264 \u03bd). Our objective\nis to \ufb01nd the \u0001-approximate local minima of the underlying function F while\navoiding the shallow local minima\u2014arising because of the tolerance \u03bd\u2014which\nexist only in f. We propose a simple algorithm based on stochastic gradient descent\n(SGD) on a smoothed version of f that is guaranteed to achieve our goal as long as\n\u03bd \u2264 O(\u00011.5/d). We also provide an almost matching lower bound showing that\nour algorithm achieves optimal error tolerance \u03bd among all algorithms making\na polynomial number of queries of f. As a concrete example, we show that our\nresults can be directly used to give sample complexities for learning a ReLU unit.\n\n1\n\nIntroduction\n\nThe optimization of nonconvex loss functions has been key to the success of modern machine\nlearning. 
While classical research in optimization focused on convex functions having a unique\ncritical point that is both locally and globally minimal, a nonconvex function can have many local\nmaxima, local minima and saddle points, all of which pose signi\ufb01cant challenges for optimization.\nA recent line of research has yielded signi\ufb01cant progress on one aspect of this problem\u2014it has\nbeen established that favorable rates of convergence can be obtained even in the presence of saddle\npoints, using simple variants of stochastic gradient descent [e.g., Ge et al., 2015, Carmon et al., 2016,\nAgarwal et al., 2017, Jin et al., 2017a]. These research results have introduced new analysis tools\nfor nonconvex optimization, and it is of signi\ufb01cant interest to begin to use these tools to attack the\nproblems associated with undesirable local minima.\nIt is NP-hard to avoid all of the local minima of a general nonconvex function. But there are some\nclasses of local minima where we might expect that simple procedures\u2014such as stochastic gradient\ndescent\u2014may continue to prove effective. In particular, in this paper we consider local minima that\nare created by small perturbations to an underlying smooth objective function. 
Such a setting is natural in statistical machine learning problems, where data arise from an underlying population, and the population risk, F, is obtained as an expectation over a continuous loss function and is hence smooth; i.e., we have F(θ) = E_{z∼D}[L(θ; z)] for a loss function L and population distribution D.\n\n∗The first two authors contributed equally.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: a) Function error ν; b) Population risk vs. empirical risk\n\nThe sampling process turns this smooth risk into an empirical risk, f(θ) = (1/n) Σ_{i=1}^n L(θ; z_i), which may be nonsmooth and which generally may have many shallow local minima. From an optimization point of view f can be quite poorly behaved; indeed, it has been observed in deep learning that the empirical risk may have exponentially many shallow local minima, even when the underlying population risk is well-behaved and smooth almost everywhere [Brutzkus and Globerson, 2017, Auer et al., 1996]. From a statistical point of view, however, we can make use of classical results in empirical process theory [see, e.g., Boucheron et al., 2013, Bartlett and Mendelson, 2003] to show that, under certain assumptions on the sampling process, f and F are uniformly close:\n\n‖F − f‖∞ ≤ ν,    (1)\n\nwhere the error ν typically decreases with the number of samples n.
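The uniform closeness in (1) is easy to see in a toy one-dimensional simulation (our own illustration, not from the paper): for a squared loss with Gaussian data, the population risk has the closed form F(t) = t² + 1, and the sup-norm gap to the empirical risk shrinks as the sample size n grows.

```python
import random

# Toy illustration (not from the paper) of the uniform closeness in Eq. (1):
# population risk F(t) = E[(t - z)^2] with z ~ N(0, 1), so F(t) = t^2 + 1
# in closed form, while f_n(t) = (1/n) * sum_i (t - z_i)^2 is the empirical risk.

def population_risk(t):
    return t * t + 1.0

def empirical_risk(t, zs):
    return sum((t - z) ** 2 for z in zs) / len(zs)

def sup_gap(n, grid, rng):
    """Approximate nu = sup_t |F(t) - f_n(t)| by a maximum over a grid of t values."""
    zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return max(abs(population_risk(t) - empirical_risk(t, zs)) for t in grid)

rng = random.Random(0)
grid = [i / 10.0 for i in range(-20, 21)]  # t in [-2, 2]
nu_small_n = sup_gap(50, grid, rng)
nu_large_n = sup_gap(50_000, grid, rng)
print(nu_small_n, nu_large_n)  # the gap shrinks as n grows
```

Here ν is only approximated by a grid maximum; controlling the true supremum over all of R^d is exactly what the empirical-process results cited above provide.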
See Figure 1(a) for a depiction of\nthis result, and Figure 1(b) for an illustration of the effect of sampling on the optimization landscape.\nWe wish to exploit this nearness of F and f to design and analyze optimization procedures that \ufb01nd\napproximate local minima (see De\ufb01nition 1) of the smooth function F , while avoiding the local\nminima that exist only in the sampled function f.\nAlthough the relationship between population risk and empirical risk is our major motivation, we\nnote that other applications of our framework include two-stage robust optimization and private\nlearning (see Section 5.2). In these settings, the error \u03bd can be viewed as the amount of adversarial\nperturbation or noise due to sources other than data sampling. As in the sampling setting, we hope to\nshow that simple algorithms such as stochastic gradient descent are able to escape the local minima\nthat arise as a function of \u03bd.\nMuch of the previous work on this problem studies relatively small values of \u03bd, leading to \u201cshallow\u201d\nlocal minima, and applies relatively large amounts of noise, through algorithms such as simulated\nannealing [Belloni et al., 2015] and stochastic gradient Langevin dynamics (SGLD) [Zhang et al.,\n2017]. While such \u201clarge-noise algorithms\u201d may be justi\ufb01ed if the goal is to approach a stationary\ndistribution, it is not clear that such large levels of noise is necessary in the optimization setting in\norder to escape shallow local minima. The best existing result for the setting of nonconvex F requires\nthe error \u03bd to be smaller than O(\u00012/d8), where \u0001 is the precision of the optimization guarantee (see\nDe\ufb01nition 1) and d is the problem dimension [Zhang et al., 2017] (see Figure 2). A fundamental\nquestion is whether algorithms exist that can tolerate a larger value of \u03bd, which would imply that they\ncan escape \u201cdeeper\u201d local minima. 
In the context of empirical risk minimization, such a result would\nallow fewer samples to be taken while still providing a strong guarantee on avoiding local minima.\nWe thus focus on the two central questions: (1) Can a simple, optimization-based algorithm avoid\nshallow local minima despite the lack of \u201clarge noise\u201d? (2) Can we tolerate larger error \u03bd in\nthe optimization setting, thus escaping \u201cdeeper\u201d local minima? What is the largest error that\nthe best algorithm can tolerate?\nIn this paper, we answer both questions in the af\ufb01rmative, establishing optimal dependencies between\nthe error \u03bd and the precision of a solution \u0001. We propose a simple algorithm based on SGD\n(Algorithm 1) that is guaranteed to \ufb01nd an approximate local minimum of F ef\ufb01ciently if \u03bd \u2264\nO(\u00011.5/d), thus escaping all saddle points of F and all additional local minima introduced by f.\nMoreover, we provide a matching lower bound (up to logarithmic factors) for all algorithms making a\npolynomial number of queries of f. The lower bound shows that our algorithm achieves the optimal\n\n2\n\nf\fFigure 2: Complete characterization of error \u03bd vs accuracy \u0001 and dimension d.\n\ntradeoff between \u03bd and \u0001, as well as the optimal dependence on dimension d. We also consider the\ninformation-theoretic limit for identifying an approximate local minimum of F regardless of the\nnumber of queries. 
We give a sharp information-theoretic threshold: \u03bd = \u0398(\u00011.5) (see Figure 2).\nAs a concrete example of the application to minimizing population risk, we show that our results\ncan be directly used to give sample complexities for learning a ReLU unit, whose empirical risk is\nnonsmooth while the population risk is smooth almost everywhere.\n1.1 Related Work\n\nA number of other papers have examined the problem of optimizing a target function F given only\nfunction evaluations of a function f that is pointwise close to F . Belloni et al. [2015] proposed an\nalgorithm based on simulated annealing. The work of Risteski and Li [2016] and Singer and Vondrak\n[2015] discussed lower bounds, though only for the setting in which the target function F is convex.\nFor nonconvex target functions F , Zhang et al. [2017] studied the problem of \ufb01nding approximate\nlocal minima of F , and proposed an algorithm based on Stochastic Gradient Langevin Dynamics\n(SGLD) [Welling and Teh, 2011], with maximum tolerance for function error \u03bd scaling as O(\u00012/d8)2.\nOther than difference in algorithm style and \u03bd tolerance as shown in Figure 2, we also note that we\ndo not require regularity assumptions on top of smoothness, which are inherently required by the\nMCMC algorithm proposed in Zhang et al. [2017]. Finally, we note that in parallel, Kleinberg et al.\n[2018] solved a similar problem using SGD under the assumption that F is one-point convex.\nPrevious work has also studied the relation between the landscape of empirical risks and the landscape\nof population risks for nonconvex functions. Mei et al. [2016] examined a special case where\nthe individual loss functions L are also smooth, which under some assumptions implies uniform\nconvergence of the gradient and Hessian of the empirical risk to their population versions. 
Loh and\nWainwright [2013] showed for a restricted class of nonconvex losses that even though many local\nminima of the empirical risk exist, they are all close to the global minimum of population risk.\nOur work builds on recent work in nonconvex optimization, in particular, results on escaping saddle\npoints and \ufb01nding approximate local minima. Beyond the classical result by Nesterov [2004] for\n\ufb01nding \ufb01rst-order stationary points by gradient descent, recent work has given guarantees for escaping\nsaddle points by gradient descent [Jin et al., 2017a] and stochastic gradient descent [Ge et al., 2015].\nAgarwal et al. [2017] and Carmon et al. [2016] established faster rates using algorithms that make use\nof Nesterov\u2019s accelerated gradient descent in a nested-loop procedure [Nesterov, 1983], and Jin et al.\n[2017b] have established such rates even without the nested loop. There have also been empirical\nstudies on various types of local minima [e.g. Keskar et al., 2016, Dinh et al., 2017].\nFinally, our work is also related to the literature on zero-th order optimization or more generally,\nbandit convex optimization. Our algorithm uses function evaluations to construct a gradient estimate\nand perform SGD, which is similar to standard methods in this community [e.g., Flaxman et al., 2005,\nAgarwal et al., 2010, Duchi et al., 2015]. Compared to \ufb01rst-order optimization, however, the conver-\ngence of zero-th order methods is typically much slower, depending polynomially on the underlying\ndimension even in the convex setting [Shamir, 2013]. Other derivative-free optimization methods\n\n2The difference between the scaling for \u03bd asserted here and the \u03bd = O(\u00012) claimed in [Zhang et al., 2017]\nis due to difference in assumptions. In our paper we assume that the Hessian is Lipschitz with respect to the\nstandard spectral norm; Zhang et al. 
[2017] make such an assumption with respect to nuclear norm.\n\n3\n\n\finclude simulated annealing [Kirkpatrick et al., 1983] and evolutionary algorithms [Rechenberg and\nEigen, 1973], whose convergence guarantees are less clear.\n\n2 Preliminaries\nNotation We use bold lower-case letters to denote vectors, as in x, y, z. We use (cid:107)\u00b7(cid:107) to denote the\n(cid:96)2 norm of vectors and spectral norm of matrices. For a matrix, \u03bbmin denotes its smallest eigenvalue.\nFor a function f : Rd \u2192 R, \u2207f and \u22072f denote its gradient vector and Hessian matrix respectively.\nWe also use (cid:107)\u00b7(cid:107)\u221e on a function f to denote the supremum of its absolute function value over entire\ndomain, supx\u2208Rd |f|. We use B0(r) to denote the (cid:96)2 ball of radius r centered at 0 in Rd. We use\nnotation \u02dcO(\u00b7), \u02dc\u0398(\u00b7), \u02dc\u2126(\u00b7) to hide only absolute constants and poly-logarithmic factors. A multivariate\nGaussian distribution with mean 0 and covariance \u03c32 in every direction is denoted as N (0, \u03c32I).\nThroughout the paper, we say \u201cpolynomial number of queries\u201d to mean that the number of queries\ndepends polynomially on all problem-dependent parameters.\nObjectives in nonconvex optimization Our goal is to \ufb01nd a point that has zero gradient and\npositive semi-de\ufb01nite Hessian, thus escaping saddle points. We formalize this idea as follows.\nDe\ufb01nition 1. 
x is called a second-order stationary point (SOSP) or approximate local minimum of a function F if\n\n‖∇F(x)‖ = 0 and λ_min(∇²F(x)) ≥ 0.\n\nWe note that there is a slight difference between an SOSP and a local minimum—an SOSP as defined here does not preclude higher-order saddle points, which themselves can be NP-hard to escape from [Anandkumar and Ge, 2016].\nSince an SOSP is characterized by its gradient and Hessian, and since convergence of algorithms to an SOSP will depend on these derivatives in a neighborhood of an SOSP, it is necessary to impose smoothness conditions on the gradient and Hessian. A minimal set of conditions that have become standard in the literature is the following.\nDefinition 2. A function F is ℓ-gradient Lipschitz if ∀x, y: ‖∇F(x) − ∇F(y)‖ ≤ ℓ‖x − y‖.\nDefinition 3. A function F is ρ-Hessian Lipschitz if ∀x, y: ‖∇²F(x) − ∇²F(y)‖ ≤ ρ‖x − y‖.\nAnother common assumption is that the function is bounded.\nDefinition 4. A function F is B-bounded if |F(x)| ≤ B for all x.\nFor any finite-time algorithm, we cannot hope to find an exact SOSP. Instead, we can define ε-approximate SOSPs that satisfy relaxations of the first- and second-order optimality conditions. Letting ε vary allows us to obtain rates of convergence.\nDefinition 5. x is an ε-second-order stationary point (ε-SOSP) of a ρ-Hessian Lipschitz function F if\n\n‖∇F(x)‖ ≤ ε and λ_min(∇²F(x)) ≥ −√(ρε).\n\nGiven these definitions, we can ask whether it is possible to find an ε-SOSP in polynomial time under the Lipschitz properties. Various authors have answered this question in the affirmative.\nTheorem 6. [e.g.
Carmon et al., 2016, Agarwal et al., 2017, Jin et al., 2017a] If the function F: R^d → R is B-bounded, ℓ-gradient Lipschitz and ρ-Hessian Lipschitz, given access to the gradient (and sometimes Hessian) of F, it is possible to find an ε-SOSP in poly(d, B, ℓ, ρ, 1/ε) time.\n\n3 Main Results\nIn the setting we consider, there is an unknown function F (the population risk) that has regularity properties (bounded, gradient and Hessian Lipschitz). However, we only have access to a function f (the empirical risk) that may not even be everywhere differentiable. The only information we use is that f is pointwise close to F. More precisely, we assume:\nAssumption A1. We assume that the function pair (F: R^d → R, f: R^d → R) satisfies the following properties:\n\n1. F is B-bounded, ℓ-gradient Lipschitz, ρ-Hessian Lipschitz.\n2. f, F are ν-pointwise close; i.e., ‖F − f‖∞ ≤ ν.\n\nAlgorithm 1 Zeroth-order Perturbed Stochastic Gradient Descent (ZPSGD)\nInput: x_0, learning rate η, noise radius r, mini-batch size m.\nfor t = 0, 1, . . . do\n    sample (z_t^(1), · · · , z_t^(m)) ∼ N(0, σ²I)\n    g_t(x_t) ← Σ_{i=1}^m z_t^(i) [f(x_t + z_t^(i)) − f(x_t)]/(mσ²)\n    x_{t+1} ← x_t − η(g_t(x_t) + ξ_t),  ξ_t uniformly ∼ B_0(r)\nreturn x_T\n\nAs we explained in Section 2, our goal is to find second-order stationary points of F given only function-value access to f. More precisely:\nProblem 1. Given a function pair (F, f) that satisfies Assumption A1, find an ε-second-order stationary point of F with only access to values of f.\n\nThe only way our algorithms are allowed to interact with f is to query a point x and obtain a function value f(x). This is usually called a zeroth-order oracle in the optimization literature.
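To make the update in Algorithm 1 concrete, here is a minimal self-contained sketch (our own toy instantiation; the hyperparameters below are ad hoc rather than the ones prescribed by the analysis), run on a "wiggly" f that is pointwise close to the smooth F(x) = ‖x‖²:

```python
import math
import random

rng = random.Random(0)

def zpsgd(f, x0, eta, sigma, r, m, T):
    """Sketch of Algorithm 1 (ZPSGD): batch zeroth-order gradient estimates
    z * [f(x+z) - f(x)] / (m * sigma^2), plus a uniform perturbation from B_0(r)."""
    d = len(x0)
    x = list(x0)
    for _ in range(T):
        fx = f(x)
        g = [0.0] * d
        for _ in range(m):
            z = [rng.gauss(0.0, sigma) for _ in range(d)]
            scale = (f([xj + zj for xj, zj in zip(x, z)]) - fx) / (m * sigma ** 2)
            for j in range(d):
                g[j] += z[j] * scale
        # xi_t drawn uniformly from the ball of radius r
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(uj * uj for uj in u))
        rad = r * rng.random() ** (1.0 / d)
        xi = [rad * uj / norm for uj in u]
        x = [xj - eta * (gj + pj) for xj, gj, pj in zip(x, g, xi)]
    return x

# f = F + bounded wiggles: pointwise close to the smooth F(x) = ||x||^2
def f(x):
    return sum(xj * xj + 0.05 * abs(math.sin(20.0 * xj)) for xj in x)

x_hat = zpsgd(f, [1.0, 1.0], eta=0.01, sigma=0.3, r=0.01, m=200, T=300)
final_norm = math.sqrt(sum(v * v for v in x_hat))
print(final_norm)  # should end up near the minimum of F at 0
```

Each iteration spends m function queries on the batch gradient estimate and then adds the uniform-ball perturbation ξ_t before stepping; with σ larger than the wiggle period, the smoothed gradient tracks ∇F ≈ 2x and the iterates contract toward 0 despite the shallow local minima of f.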
In this paper\nwe give tight upper and lower bounds for the dependencies between \u03bd, \u0001 and d, both for algorithms\nwith polynomially many queries and in the information-theoretic limit.\n\n3.1 Optimal algorithm with polynomial number of queries\n\nThere are three main dif\ufb01culties in applying stochastic gradient descent to Problem 1: (1) in order\nto converge to a second-order stationary point of F , the algorithm must avoid being stuck in saddle\npoints; (2) the algorithm does not have access to the gradient of f; (3) there is a gap between the\nobserved f and the target F , which might introduce non-smoothness or additional local minima.\nThe \ufb01rst dif\ufb01culty was addressed in Jin et al. [2017a] by perturbing the iterates in a small ball; this\npushes the iterates away from any potential saddle points. For the latter two dif\ufb01culties, we apply\nGaussian smoothing to f and use z[f (x + z) \u2212 f (x)]/\u03c32 (z \u223c N (0, \u03c32I)) as a stochastic gradient\nestimate. This estimate, which only requires function values of f, is well known in the zero-th order\noptimization literature [e.g. Duchi et al., 2015]. For more details, see Section 4.1.\nIn short, our algorithm (Algorithm 1) is a variant of SGD, which uses z[f (x + z) \u2212 f (x)]/\u03c32 as\nthe gradient estimate (computed over mini-batches), and adds isotropic perturbations. Using this\nalgorithm, we can achieve the following trade-off between \u03bd and \u0001.\nTheorem 7 (Upper Bound (ZPSGD)). 
Suppose the function pair (F, f) satisfies Assumption A1 with ν ≤ O(√(ε³/ρ) · (1/d)). Then for any δ > 0, with smoothing parameter σ = Θ(√(ε/(ρd))), learning rate η = 1/ℓ, perturbation r = Θ̃(ε), and mini-batch size m = poly(d, B, ℓ, ρ, 1/ε, log(1/δ)), ZPSGD will find an ε-second-order stationary point of F with probability 1 − δ, in poly(d, B, ℓ, ρ, 1/ε, log(1/δ)) number of queries.\n\nTheorem 7 shows that, assuming a small enough function error ν, ZPSGD will solve Problem 1 within a number of queries that is polynomial in all the problem-dependent parameters. The tolerance on the function error ν varies inversely with the number of dimensions, d. This rate is in fact optimal for all polynomial-query algorithms. In the following result, we show that the ε, ρ, and d dependencies in the function difference ν are tight up to logarithmic factors in d.\nTheorem 8 (Polynomial Queries Lower Bound). For any B > 0, ℓ > 0, ρ > 0 there exists ε_0 = Θ(min{ℓ²/ρ, (B²ρ/d²)^{1/3}}) such that for any ε ∈ (0, ε_0], there exists a function pair (F, f) satisfying Assumption A1 with ν = Θ̃(√(ε³/ρ) · (1/d)), so that any algorithm that only queries a polynomial number of function values of f will fail, with high probability, to find an ε-SOSP of F.\n\nThis theorem establishes that for any ρ, ℓ, B and any ε small enough, we can construct a randomized 'hard' instance (F, f) such that any (possibly randomized) algorithm with a polynomial number of queries will fail to find an ε-SOSP of F with high probability. Note that the error ν here is only a poly-logarithmic factor larger than the requirement for our algorithm. In other words, the guarantee
In other words, the guarantee\nof our Algorithm 1 in Theorem 7 is optimal up to a logarithmic factor.\n\n5\n\n\f3.2\n\nInformation-theoretic guarantees\n\nIf we allow an unlimited number of queries, we can show that the upper and lower bounds on\nthe function error tolerance \u03bd no longer depends on the problem dimension d. That is, Problem 1\nexhibits a statistical-computational gap\u2014polynomial-queries algorithms are unable to achieve the\ninformation-theoretic limit. We \ufb01rst state that an algorithm (with exponential queries) is able to \ufb01nd\nan \u0001-SOSP of F despite a much larger value of error \u03bd. The basic algorithmic idea is that an \u0001-SOSP\nmust exist within some compact space, such that once we have a subroutine that approximately\ncomputes the gradient and Hessian of F at an arbitrary point, we can perform a grid search over this\ncompact space (see Section D for more details):\nTheorem 9. There exists an algorithm so that if the function pair (F, f) satis\ufb01es Assumption A1 with\n\u03c1\u0001, then the algorithm will \ufb01nd an \u0001-second-order stationary point of F\n\n\u03bd \u2264 O((cid:112)\u00013/\u03c1) and (cid:96) >\n\n\u221a\n\nwith an exponential number of queries.\n\nWe also show a corresponding information-theoretic lower bound that prevents any algorithm from\neven identifying a second-order stationary point of F . This completes the characterization of function\nerror tolerance \u03bd in terms of required accuracy \u0001.\nTheorem 10. 
For any B > 0, ℓ > 0, ρ > 0, there exists ε_0 = Θ(min{ℓ²/ρ, (B²ρ/d)^{1/3}}) such that for any ε ∈ (0, ε_0] there exists a function pair (F, f) satisfying Assumption A1 with ν = O(√(ε³/ρ)), so that any algorithm will fail, with high probability, to find an ε-SOSP of F.\n\n3.3 Extension: Gradients pointwise close\n\nWe may extend our algorithmic ideas to solve the problem of optimizing an unknown smooth function F when given only a gradient vector field g: R^d → R^d that is pointwise close to the gradient ∇F. Specifically, we answer the question: what is the error in the gradient oracle that we can tolerate while still obtaining optimization guarantees for the true function F? We observe that our algorithm's tolerance on the gradient error is much better than that in Theorem 7. Details can be found in Appendices E and F.\n4 Overview of Analysis\nIn this section we present the key ideas underlying our theoretical results. We will focus on the results for algorithms that make a polynomial number of queries (Theorems 7 and 8).\n4.1 Efficient algorithm for Problem 1\nWe first argue the correctness of Theorem 7. As discussed earlier, there are two key ideas in the algorithm: Gaussian smoothing and perturbed stochastic gradient descent. Gaussian smoothing allows us to transform the (possibly nonsmooth) function f into a smooth function f̃_σ that has similar second-order stationary points as F; at the same time, it can also convert function evaluations of f into a stochastic gradient of f̃_σ.
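This conversion from function values to stochastic gradients can be checked numerically (a toy of our own, not from the paper): for f(x) = ‖x‖², the Gaussian smoothing is f̃_σ(x) = ‖x‖² + dσ² in closed form, so the Monte Carlo average of z[f(x + z) − f(x)]/σ² should approach ∇f̃_σ(x) = 2x.

```python
import random

rng = random.Random(1)
sigma = 0.5
x = [0.3, -0.7, 1.1]
d = len(x)

def f(v):
    # f(v) = ||v||^2; its Gaussian smoothing is ||v||^2 + d*sigma^2,
    # so the smoothed gradient is exactly 2v.
    return sum(vj * vj for vj in v)

# Monte Carlo estimate of the smoothed gradient from function values only
n = 200_000
est = [0.0] * d
fx = f(x)
for _ in range(n):
    z = [rng.gauss(0.0, sigma) for _ in range(d)]
    s = (f([xj + zj for xj, zj in zip(x, z)]) - fx) / (sigma ** 2 * n)
    for j in range(d):
        est[j] += z[j] * s

max_err = max(abs(e - 2.0 * xj) for e, xj in zip(est, x))
print(est, max_err)  # est should be close to [0.6, -1.4, 2.2]
```

A single sample of this estimator is very noisy; it is the mini-batch averaging in Algorithm 1 (and the concentration argument behind it) that makes the estimate usable.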
We can use this stochastic gradient information to find a second-order stationary point of f̃_σ, which by the choice of the smoothing radius is guaranteed to be an approximate second-order stationary point of F.\nFirst, we introduce Gaussian smoothing, which perturbs the current point x using a multivariate Gaussian and then takes an expectation over the function value.\nDefinition 11 (Gaussian smoothing). Given f satisfying Assumption A1, define its Gaussian smoothing as f̃_σ(x) = E_{z∼N(0,σ²I)}[f(x + z)]. The parameter σ is henceforth called the smoothing radius.\nIn general f need not be smooth or even differentiable, but its Gaussian smoothing f̃_σ will be a differentiable function. Although it is in general difficult to calculate the exact smoothed function f̃_σ, it is not hard to give an unbiased estimate of the function value and gradient of f̃_σ:\nLemma 12. [e.g. Duchi et al., 2015] Let f̃_σ be the Gaussian smoothing of f (as in Definition 11). Then the gradient of f̃_σ can be computed as ∇f̃_σ(x) = (1/σ²) E_{z∼N(0,σ²I)}[(f(x + z) − f(x)) z].\nLemma 12 allows us to query the function value of f to get an unbiased estimate of the gradient of f̃_σ. This stochastic gradient is used in Algorithm 1 to find a second-order stationary point of f̃_σ.\nTo make sure the optimizer is effective on f̃_σ and that guarantees on f̃_σ carry over to the target function F, we need two sets of properties: the smoothed function f̃_σ should be gradient and Hessian Lipschitz, and at the same time should have gradients and Hessians close to those of the true function F. These properties are summarized in the following lemma:\nLemma 13 (Property of smoothing). Assume that the function pair (F, f) satisfies Assumption A1, and let f̃_σ(x) be as given in Definition 11.
Then the following hold:\n\n1. f̃_σ(x) is O(ℓ + ν/σ²)-gradient Lipschitz and O(ρ + ν/σ³)-Hessian Lipschitz.\n2. ‖∇f̃_σ(x) − ∇F(x)‖ ≤ O(ρdσ² + ν/σ) and ‖∇²f̃_σ(x) − ∇²F(x)‖ ≤ O(ρ√d σ + ν/σ²).\n\nThe proof is deferred to Appendix A. Part (1) of the lemma says that the gradient (and Hessian) Lipschitz constants of f̃_σ are similar to those of F up to a term involving the function difference ν and the smoothing parameter σ. This means that as f is allowed to deviate further from F, we must smooth over a larger radius—choose a larger σ—to guarantee the same smoothness as before. On the other hand, part (2) implies that choosing a large σ increases the upper bound on the gradient and Hessian difference between f̃_σ and F. Smoothing is a form of local averaging, so choosing too large a radius will erase information about the local geometry. The choice of σ must strike the right balance between making f̃_σ smooth (to guarantee that ZPSGD finds an ε-SOSP of f̃_σ) and keeping the derivatives of f̃_σ close to those of F (to guarantee that any ε-SOSP of f̃_σ is also an O(ε)-SOSP of F). In Appendix A.3, we show that this can be satisfied by choosing σ = √(ε/(ρd)).\n\nPerturbed stochastic gradient descent. In ZPSGD, we use the stochastic gradients suggested by Lemma 12. Perturbed Gradient Descent (PGD) [Jin et al., 2017a] was shown to converge to a second-order stationary point. Here we use a simple modification of PGD that relies on batch stochastic gradients.
In order for PSGD to converge, we require that the stochastic gradients are\nwell-behaved; that is, they are unbiased and have good concentration properties, as asserted in the\nfollowing lemma. It is straightforward to verify given that we sample z from a zero-mean Gaussian\n(proof in Appendix A.2).\nLemma 14 (Property of stochastic gradient). Let g(x; z) = z[f (x + z) \u2212 f (x)]/\u03c32, where z \u223c\nN (0, \u03c32I). Then Ezg(x; z) = \u2207 \u02dcf\u03c3(x), and g(x; z) is sub-Gaussian with parameter B\n\u03c3 .\nAs it turns out, these assumptions suf\ufb01ce to guarantee that perturbed SGD (PSGD), a simple adaptation\nof PGD in Jin et al. [2017a] with stochastic gradient and large mini-batch size, converges to the\nsecond-order stationary point of the objective function.\nTheorem 15 (PSGD ef\ufb01ciently escapes saddle points [Jin et al., 2018], informal). Suppose f (\u00b7)\nis (cid:96)-gradient Lipschitz and \u03c1-Hessian Lipschitz, and stochastic gradient g(x, \u03b8) with Eg(x; \u03b8) =\n\u221a\n\u2207f (x) has a sub-Gaussian tail with parameter \u03c3/\nd, then for any \u03b4 > 0, with proper choice\nof hyperparameters, PSGD (Algorithm 3) will \ufb01nd an \u0001-SOSP of f with probability 1 \u2212 \u03b4, in\npoly(d, B, (cid:96), \u03c1, \u03c3, 1/\u0001, log(1/\u03b4)) number of queries.\n\nFor completeness, we include the formal version of the theorem and its proof in Appendix H.\n\u221a\nCombining this theorem and the second part of Lemma 13, we see that by choosing an appropriate\nd-SOSP for \u02dcf\u03c3 which is also an \u0001-SOSP\nsmoothing radius \u03c3, our algorithm ZPSGD \ufb01nds an C\u0001/\nfor F for some universal constant C.\n4.2 Polynomial queries lower bound\nThe proof of Theorem 8 depends on the construction of a \u2018hard\u2019 function pair. The argument\ncrucially depends on the concentration of measure in high dimensions. 
We provide a proof sketch in\nAppendix B and the full proof in Appendix C.\n5 Applications\nIn this section, we present several applications of our algorithm. We \ufb01rst show a simple example of\nlearning one recti\ufb01ed linear unit (ReLU), where the empirical risk is nonconvex and nonsmooth. We\nalso brie\ufb02y survey other potential applications for our model as stated in Problem 1.\n\n5.1 Statistical Learning Example: Learning ReLU\nConsider the simple example of learning a ReLU unit. Let ReLU(z) = max{z, 0} for z \u2208 R. Let\nw(cid:63)((cid:107)w(cid:63)(cid:107) = 1) be the desired solution. We assume data (xi, yi) is generated as yi = ReLU(x(cid:62)\ni w(cid:63))+\n\n7\n\n\fFigure 3: Population (left) and Empirical (right) risk for learning ReLU Unit , d = 2. Sharp corners\npresent in the empirical risk are not found in the population version.\n\u03b6i where noise \u03b6i \u223c N (0, 1). We further assume the features xi \u223c N (0, I) are also generated from a\nstandard Gaussian distribution. The empirical risk with a squared loss function is:\n\nn(cid:88)\n\ni=1\n\n\u02c6Rn(w) =\n\n1\nn\n\n(yi \u2212 ReLU(x(cid:62)\n\ni w))2.\n\nIts population version is R(w) = E[ \u02c6Rn(w)]. In this case, the empirical risk is highly nonsmooth\u2014in\nfact, not differentiable in all subspaces perpendicular to each xi. The population risk turns out to be\nsmooth in the entire space Rd except at 0. This is illustrated in Figure 3, where the empirical risk\ndisplays many sharp corners.\nDue to nonsmoothness at 0 even for population risk, we focus on a compact region B = {w|w(cid:62)w(cid:63) \u2265\n} \u2229 {w|(cid:107)w(cid:107) \u2264 2} which excludes 0. This region is large enough so that a random initialization\n1\u221a\nd\nhas at least constant probability of being inside it. We also show the following properties that allow\nus to apply Algorithm 1 directly:\nLemma 16. 
The population and empirical risks R, R̂_n of the ReLU learning problem satisfy:

1. If w_0 ∈ B, then running ZPSGD (Algorithm 1) gives w_t ∈ B for all t, with high probability.

2. Inside B, R is O(1)-bounded, O(√d)-gradient Lipschitz, and O(d)-Hessian Lipschitz.

3. sup_{w∈B} |R̂_n(w) − R(w)| ≤ Õ(√(d/n)) with high probability.

4. Inside B, R is a nonconvex function, and w⋆ is the only SOSP of R(w).

These properties show that the population loss has a well-behaved landscape, while the empirical risk is pointwise close to it. This is exactly what we need for Algorithm 1. Using Theorem 7, we immediately obtain the following sample complexity, which guarantees an approximate population risk minimizer. We defer all proofs to Appendix G.

Theorem 17. For the problem of learning a ReLU unit, suppose the sample size is n ≥ Õ(d⁴/ε³) and the initialization is w_0 ∼ N(0, (1/d)I). Then, with at least constant probability, Algorithm 1 outputs an estimator ŵ such that ‖ŵ − w⋆‖ ≤ ε.

5.2 Other applications

Private machine learning. Data privacy is a significant concern in machine learning, as it creates a trade-off between privacy preservation and successful learning. Previous work on differentially private machine learning [e.g., Chaudhuri et al., 2011] has studied objective perturbation, that is, adding noise to the original (convex) objective and optimizing the perturbed objective, as a way to simultaneously guarantee differential privacy and generalization: f = F + p(ε).
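A minimal sketch of this objective-perturbation template follows; the linear form of p and the 1/ε noise scale are illustrative assumptions only (the exact calibration in Chaudhuri et al. [2011] depends on the loss and regularizer), and `perturbed_objective` is our own name:

```python
import numpy as np

def perturbed_objective(F, d, eps, rng):
    """Objective-perturbation sketch: return f = F + p(eps), where p is a
    random linear term whose scale grows as the privacy parameter eps
    shrinks. Illustrative only; calibrating the noise to a formal
    differential-privacy guarantee is problem-specific."""
    b = rng.normal(0.0, 1.0 / eps, size=d)
    return lambda x: F(x) + float(b @ x)
```

On a bounded domain ‖x‖ ≤ R, such a perturbation satisfies ‖F − f‖_∞ ≤ ‖b‖R, so it fits the framework of Problem 1 with ν = ‖b‖R.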
Our results may be used to extend such guarantees to nonconvex objectives, characterizing when it is possible to optimize F even if the data owner does not want to reveal the true value of F(x) and instead only reveals f(x) after adding a perturbation p(ε), which depends on the privacy guarantee ε.

Two-stage robust optimization. Motivated by the problem of adversarial examples in machine learning, there has been much recent interest [e.g., Steinhardt et al., 2017, Sinha et al., 2018] in a form of robust optimization that involves a minimax formulation: min_x max_u G(x, u). The function F(x) = max_u G(x, u) tends to be nonconvex in such problems, since G can be very complicated. It can be intractable or costly to compute the solution to the inner maximization exactly, but it is often possible to obtain a good enough approximation f such that sup_x |F(x) − f(x)| = ν. It is then possible to solve min_x f(x) by ZPSGD, with guarantees for the original optimization problem.

Acknowledgments

We thank Aditya Guntuboyina, Yuanzhi Li, Yi-An Ma, Jacob Steinhardt, and Yang Yuan for valuable discussions.

References

Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM Symposium on Theory of Computing, pages 1195–1199. ACM, 2017.

Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Proceedings of the 29th Annual Conference on Learning Theory (COLT), volume 49, pages 81–102, 2016.

Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons.
In Advances in Neural Information Processing Systems (NIPS), pages 316–322, 1996.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3, 2003.

Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, and Alexander Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Proceedings of the 28th Conference on Learning Theory (COLT), pages 240–265, 2015.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In Proceedings of the International Conference on Machine Learning (ICML), volume 70, pages 605–614. PMLR, 2017.

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069–1109, July 2011. ISSN 1532-4435.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.

John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Trans. Information Theory, 61(5):2788–2806, 2015.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 385–394, 2005.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan.
Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the International Conference on Machine Learning (ICML), pages 1724–1732, 2017a.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. CoRR, abs/1711.10456, 2017b.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. SGD escapes saddle points efficiently. Personal communication, 2018.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Scott Kirkpatrick, C. D. Gelatt, and Mario Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? CoRR, abs/1802.06175, 2018.

Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems (NIPS), pages 476–484, 2013.

Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27:372–376, 1983.

Yurii Nesterov. Introductory Lectures on Convex Programming. Springer, 2004.

Ingo Rechenberg and Manfred Eigen. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution.
Frommann-Holzboog, Stuttgart, 1973.

Andrej Risteski and Yuanzhi Li. Algorithms and matching lower bounds for approximately-convex optimization. In Advances in Neural Information Processing Systems (NIPS), pages 4745–4753, 2016.

Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), volume 30, 2013.

Yaron Singer and Jan Vondrak. Information-theoretic lower bounds for convex optimization with erroneous oracles. In Advances in Neural Information Processing Systems (NIPS), pages 3204–3212, 2015.

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Jacob Steinhardt, Pang W. Koh, and Percy Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems (NIPS), 2017.

Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Machine Learning (ICML), pages 681–688, 2011.

Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of the 30th Conference on Learning Theory (COLT), pages 1980–2022, 2017.