{"title": "Beyond Convexity: Stochastic Quasi-Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1594, "page_last": 1602, "abstract": "Stochastic convex optimization is a basic and well studied primitive in machine learning. It is well known that convex and Lipschitz functions can be minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized Gradient Descent (NGD) algorithm is an adaptation of Gradient Descent, which updates according to the direction of the gradients, rather than the gradients themselves. In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens the concept of unimodality to multiple dimensions and allows for certain types of saddle points, which are a known hurdle for first-order optimization methods such as gradient descent. Locally-Lipschitz functions are only required to be Lipschitz in a small region around the optimum. This assumption circumvents gradient explosion, which is another known hurdle for gradient descent variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient descent algorithm provably requires a minimal minibatch size.", "full_text": "Beyond Convexity: Stochastic Quasi-Convex Optimization\n\nElad Hazan\nPrinceton University\nehazan@cs.princeton.edu\n\nKfir Y. Levy\nTechnion\nkfiryl@tx.technion.ac.il\n\nShai Shalev-Shwartz\nThe Hebrew University\nshais@cs.huji.ac.il\n\nAbstract\n\nStochastic convex optimization is a basic and well studied primitive in machine learning. 
It is well known that convex and Lipschitz functions can be minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized Gradient Descent (NGD) algorithm is an adaptation of Gradient Descent, which updates according to the direction of the gradients, rather than the gradients themselves. In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens the concept of unimodality to multiple dimensions and allows for certain types of saddle points, which are a known hurdle for first-order optimization methods such as gradient descent. Locally-Lipschitz functions are only required to be Lipschitz in a small region around the optimum. This assumption circumvents gradient explosion, which is another known hurdle for gradient descent variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient descent algorithm provably requires a minimal minibatch size.\n\n1 Introduction\n\nThe benefits of using the Stochastic Gradient Descent (SGD) scheme for learning cannot be stressed enough. For convex and Lipschitz objectives, SGD is guaranteed to find an \u03b5-optimal solution within O(1/\u03b5\u00b2) iterations and requires only an unbiased estimator for the gradient, which is obtained with only one (or a few) data samples. However, when applied to non-convex problems several drawbacks are revealed. In particular, SGD is widely used for deep learning [2], one of the most interesting fields where stochastic non-convex optimization problems arise. Often, the objective in these kinds of problems demonstrates two extreme phenomena [3]: on the one hand plateaus, i.e., regions with vanishing gradients; and on the other hand cliffs, i.e., exceedingly high gradients. 
As expected, applying SGD to such problems is often reported to yield unsatisfactory results.\n\nIn this paper we analyze a stochastic version of the Normalized Gradient Descent (NGD) algorithm, which we denote by SNGD. Each iteration of SNGD is as simple and efficient as SGD, but is much more appropriate for non-convex optimization problems, overcoming some of the pitfalls that SGD may encounter. In particular, we define a family of locally-quasi-convex and locally-Lipschitz functions, and prove that SNGD is suitable for optimizing such objectives.\n\nLocal-quasi-convexity is a generalization of unimodal functions to multiple dimensions, which includes quasi-convex and convex functions as a subclass. Locally-quasi-convex functions allow for certain types of plateaus and saddle points which are difficult for SGD and other gradient descent variants. Local-Lipschitzness is a generalization of Lipschitz functions that only assumes Lipschitzness in a small region around the minima, whereas farther away the gradients may be unbounded. Gradient explosion is, thus, another difficulty that is successfully tackled by SNGD and poses difficulties for other stochastic gradient descent variants.\n\nOur contributions:\n\n\u2022 We introduce local-quasi-convexity, a property that extends quasi-convexity and captures unimodal functions which are not quasi-convex. We prove that NGD finds an \u03b5-optimal minimum for such functions within O(1/\u03b5\u00b2) iterations. As a special case, we show that the above rate can be attained for quasi-convex functions that are Lipschitz in an \u03a9(\u03b5)-region around the optimum (gradients may be unbounded outside this region). 
For objectives that are also smooth in an \u03a9(\u221a\u03b5)-region around the optimum, we prove a faster rate of O(1/\u03b5).\n\n\u2022 We introduce a new setup: stochastic optimization of locally-quasi-convex functions; and show that this setup captures Generalized Linear Models (GLM) regression [14]. For this setup, we devise a stochastic version of NGD (SNGD), and show that it converges within O(1/\u03b5\u00b2) iterations to an \u03b5-optimal minimum.\n\n\u2022 The above positive result requires that at each iteration of SNGD, the gradient should be estimated using a minibatch of a minimal size. We provide a negative result showing that if the minibatch size is too small then the algorithm might indeed diverge.\n\n\u2022 We report experimental results supporting our theoretical guarantees and demonstrate an accelerated convergence attained by SNGD.\n\n1.1 Related Work\n\nQuasi-convex optimization problems arise in numerous fields, spanning economics [20, 12], industrial organization [21], and computer vision [8]. It is well known that quasi-convex optimization tasks can be solved by a series of convex feasibility problems [4]; however, generally solving such feasibility problems may be very costly [6]. There exists a rich literature concerning quasi-convex optimization in the offline case [17, 22, 9, 18]. A pioneering paper by [15] was the first to suggest an efficient algorithm, namely Normalized Gradient Descent, and prove that this algorithm attains an \u03b5-optimal solution within O(1/\u03b5\u00b2) iterations given a differentiable quasi-convex objective. This work was later extended by [10], establishing the same rate for upper semi-continuous quasi-convex objectives. 
In [11] faster rates for quasi-convex optimization are attained, but they assume knowledge of the optimal value of the objective, an assumption that generally does not hold in practice.\n\nAmong the deep learning community there have been several attempts to tackle plateaus and gradient explosion. Ideas spanning gradient clipping [16], smart initialization [5], and more [13] have been shown to improve training in practice. Yet, none of these works provides a theoretical analysis showing better convergence guarantees. To the best of our knowledge, there are no previous results on stochastic versions of NGD, nor results regarding locally-quasi-convex/locally-Lipschitz functions.\n\n1.2 Plateaus and Cliffs - Difficulties for GD\n\nGradient descent with fixed step sizes, including its stochastic variants, is known to perform poorly when the gradients are too small in a plateau area of the function, or alternatively when the other extreme happens: gradient explosions. These two phenomena have been reported in certain types of non-convex optimization, such as training of deep networks.\n\nFigure 1 depicts a one-dimensional family of functions for which GD behaves provably poorly. With a large step size, GD will hit the cliffs and then oscillate between the two boundaries. Alternatively, with a small step size, the low gradients will cause GD to miss the middle valley, which has constant size of 1/2. On the other hand, this exact function is quasi-convex and locally-Lipschitz, and hence the NGD algorithm provably converges to the optimum quickly.\n\nFigure 1: A quasi-convex locally-Lipschitz function with plateaus and cliffs.\n\n2 Definitions and Notations\n\nWe use \u2016\u00b7\u2016 to denote the Euclidean norm. B_d(x, r) denotes the d-dimensional Euclidean ball of radius r, centered around x, and B_d := B_d(0, 1). [N] denotes the set {1, . . . 
, N}.\n\nFor simplicity, throughout the paper we always assume that functions are differentiable (but if not stated explicitly, we do not assume any bound on the norm of the gradients).\n\nDefinition 2.1. (Local-Lipschitzness and Local-Smoothness) Let z \u2208 R^d, G, \u03b5 \u2265 0. A function f : K \u21a6 R is called (G, \u03b5, z)-locally-Lipschitz if for every x, y \u2208 B_d(z, \u03b5), we have\n\n|f(x) \u2212 f(y)| \u2264 G\u2016x \u2212 y\u2016 .\n\nSimilarly, the function is (\u03b2, \u03b5, z)-locally-smooth if for every x, y \u2208 B_d(z, \u03b5) we have\n\n|f(y) \u2212 f(x) \u2212 \u27e8\u2207f(y), x \u2212 y\u27e9| \u2264 (\u03b2/2)\u2016x \u2212 y\u2016\u00b2 .\n\nNext we define quasi-convex functions:\n\nDefinition 2.2. (Quasi-Convexity) We say that a function f : R^d \u21a6 R is quasi-convex if for all x, y \u2208 R^d such that f(y) \u2264 f(x), it follows that\n\n\u27e8\u2207f(x), y \u2212 x\u27e9 \u2264 0 .\n\nWe further say that f is strictly-quasi-convex if it is quasi-convex and its gradients vanish only at the global minima, i.e., \u2200y : f(y) > min_{x \u2208 R^d} f(x) \u21d2 \u2016\u2207f(y)\u2016 > 0.\n\nInformally, the above characterization states that the (opposite) gradient of a quasi-convex function directs us in a global descent direction. Following is an equivalent (more common) definition:\n\nDefinition 2.3. (Quasi-Convexity) We say that a function f : R^d \u21a6 R is quasi-convex if any \u03b1-sublevel-set of f is convex, i.e., \u2200\u03b1 \u2208 R the set\n\nL_\u03b1(f) = {x : f(x) \u2264 \u03b1}\n\nis convex.\n\nThe equivalence between the above definitions can be found in [4]. Throughout this paper we denote the sublevel-set of f at x by\n\nS_f(x) = {y : f(y) \u2264 f(x)} .    (1)\n\n3 Local-Quasi-Convexity\n\nQuasi-convexity does not fully capture the notion of unimodality in several dimensions. 
As an example let x = (x_1, x_2) \u2208 [\u221210, 10]\u00b2, and consider the function\n\ng(x) = (1 + e^{\u2212x_1})^{\u22121} + (1 + e^{\u2212x_2})^{\u22121} .    (2)\n\nIt is natural to consider g as unimodal since it has no local minima other than the unique global minimum at x* = (\u221210, \u221210). However, g is not quasi-convex: consider the points x = (log 16, \u2212log 4), y = (\u2212log 4, log 16), which belong to the 1.2-sublevel set; their average does not belong to the same sublevel set since g(x/2 + y/2) = 4/3.\n\nQuasi-convex functions always enable us to explore, meaning that the gradient always directs us in a global descent direction. Intuitively, from an optimization point of view, we only need such a direction whenever we do not exploit, i.e., whenever we are not approximately optimal.\n\nIn what follows we define local-quasi-convexity, a property that enables us to either explore or exploit. This property captures a wider class of unimodal functions (such as g above) than mere quasi-convexity. Later we justify this definition by showing that it captures Generalized Linear Models (GLM) regression, see [14, 7].\n\nDefinition 3.1. (Local-Quasi-Convexity) Let x, z \u2208 R^d, \u03ba, \u03b5 > 0. We say that f : R^d \u21a6 R is (\u03b5, \u03ba, z)-Strictly-Locally-Quasi-Convex (SLQC) in x, if at least one of the following applies:\n\n1. f(x) \u2212 f(z) \u2264 \u03b5 .\n\n2. \u2016\u2207f(x)\u2016 > 0, and for every y \u2208 B(z, \u03b5/\u03ba) it holds that \u27e8\u2207f(x), y \u2212 x\u27e9 \u2264 0 .\n\nNote that if f is a G-Lipschitz and strictly-quasi-convex function, then \u2200x, z \u2208 R^d, \u2200\u03b5 > 0, it holds that f is (\u03b5, G, z)-SLQC in x. 
Recalling the function g that appears in Equation (2), it can be shown that \u2200\u03b5 \u2208 (0, 1], \u2200x \u2208 [\u221210, 10]\u00b2, this function is (\u03b5, 1, x*)-SLQC in x, where x* = (\u221210, \u221210).\n\n3.1 Generalized Linear Models (GLM)\n\n3.1.1 The Idealized GLM\n\nIn this setup we have a collection of m samples {(x_i, y_i)}_{i=1}^m \u2208 B_d \u00d7 [0, 1], and an activation function \u03c6 : R \u21a6 R. We are guaranteed to have w* \u2208 R^d such that y_i = \u03c6\u27e8w*, x_i\u27e9, \u2200i \u2208 [m] (we denote \u03c6\u27e8w, x\u27e9 := \u03c6(\u27e8w, x\u27e9)). The performance of a predictor w \u2208 R^d is measured by the average square error over all samples:\n\nerr\u0302_m(w) = (1/m) \u2211_{i=1}^m (y_i \u2212 \u03c6\u27e8w, x_i\u27e9)\u00b2 .    (3)\n\nIn [7] it is shown that the Perceptron problem with \u03b3-margin is a special case of GLM regression.\n\nThe sigmoid function \u03c6(z) = (1 + e^{\u2212z})^{\u22121} is a popular activation function in the field of deep learning. The next lemma states that in the idealized GLM problem with sigmoid activation, the error function is SLQC (but not quasi-convex). As we will see in Section 4, this implies that Algorithm 1 finds an \u03b5-optimal minimum of err\u0302_m(w) within poly(1/\u03b5) iterations.\n\nLemma 3.1. Consider the idealized GLM problem with the sigmoid activation, and assume that \u2016w*\u2016 \u2264 W. Then the error function appearing in Equation (3) is (\u03b5, e^W, w*)-SLQC in w, \u2200\u03b5 > 0, \u2200w \u2208 B_d(0, W) (but it is not generally quasi-convex).\n\n3.1.2 The Noisy GLM\n\nIn the noisy GLM setup (see [14, 7]), we may draw i.i.d. samples {(x_i, y_i)}_{i=1}^m \u2208 B_d \u00d7 [0, 1] from an unknown distribution D. We assume that there exists a predictor w* \u2208 R^d such that E_{(x,y)\u223cD}[y|x] = \u03c6\u27e8w*, x\u27e9, where \u03c6 is an activation function. 
Given w \u2208 R^d we define its expected error as follows:\n\nE(w) = E_{(x,y)\u223cD}(y \u2212 \u03c6\u27e8w, x\u27e9)\u00b2 ,\n\nand it can be shown that w* is a global minimum of E. We are interested in schemes that obtain an \u03b5-optimal minimum of E within poly(1/\u03b5) samples and optimization steps. Given m samples from D, their empirical error err\u0302_m(w) is defined as in Equation (3). The following lemma states that in this setup, letting m = \u03a9(1/\u03b5\u00b2), err\u0302_m is SLQC with high probability. This property will enable us to apply Algorithm 2 to obtain an \u03b5-optimal minimum of E within poly(1/\u03b5) samples from D and poly(1/\u03b5) optimization steps.\n\nLemma 3.2. Let \u03b4, \u03b5 \u2208 (0, 1). Consider the noisy GLM problem with the sigmoid activation, and assume that \u2016w*\u2016 \u2264 W. Given a fixed point w \u2208 B(0, W), then w.p. \u2265 1 \u2212 \u03b4, after m \u2265 (8e^{2W}(W + 1)\u00b2/\u03b5\u00b2) log(1/\u03b4) samples, the empirical error function appearing in Equation (3) is (\u03b5, e^W, w*)-SLQC in w.\n\nNote that if we had required the SLQC to hold \u2200w \u2208 B(0, W), then we would need the number of samples to depend on the dimension d, which we would like to avoid. Instead, we require SLQC to hold for a fixed w. This satisfies the conditions of Algorithm 2, enabling us to find an \u03b5-optimal solution with a sample complexity that is independent of the dimension.\n\n4 NGD for Locally-Quasi-Convex Optimization\n\nHere we present the NGD algorithm, and prove the convergence rate of this algorithm for SLQC objectives. Our analysis is simple, enabling us to extend the convergence rate presented in [15] beyond quasi-convex functions. 
We then show that quasi-convex and locally-Lipschitz objectives are SLQC, implying that NGD converges even if the gradients are unbounded outside a small region around the minima. For quasi-convex and locally-smooth objectives, we show that NGD attains a faster convergence rate.\n\nAlgorithm 1 Normalized Gradient Descent (NGD)\nInput: #Iterations T, x_1 \u2208 R^d, learning rate \u03b7\nfor t = 1 . . . T do\n  Update: x_{t+1} = x_t \u2212 \u03b7\u011d_t, where g_t = \u2207f(x_t), \u011d_t = g_t/\u2016g_t\u2016\nend for\nReturn: x\u0304_T = arg min_{x_1,...,x_T} f(x_t)\n\nNGD is presented in Algorithm 1. NGD is similar to GD, except we normalize the gradients. It is intuitively clear that to obtain robustness to plateaus (where the gradient can be arbitrarily small) and to exploding gradients (where the gradient can be arbitrarily large), one must ignore the size of the gradient. It is more surprising that the information in the direction of the gradient suffices to guarantee convergence.\n\nFollowing is the main theorem of this section:\n\nTheorem 4.1. Fix \u03b5 > 0, let f : R^d \u21a6 R, and x* \u2208 arg min_{x \u2208 R^d} f(x). Suppose that f is (\u03b5, \u03ba, x*)-SLQC in every x \u2208 R^d. Then running the NGD algorithm with T \u2265 \u03ba\u00b2\u2016x_1 \u2212 x*\u2016\u00b2/\u03b5\u00b2 and \u03b7 = \u03b5/\u03ba, we have that f(x\u0304_T) \u2212 f(x*) \u2264 \u03b5.\n\nTheorem 4.1 states that (\u00b7, \u00b7, x*)-SLQC functions admit a poly(1/\u03b5) convergence rate using NGD. The intuition behind this lies in Definition 3.1, which asserts that at a point x either the (opposite) gradient points out a global descent direction, or we are already \u03b5-optimal. 
Note that the requirement of (\u03b5, \u00b7, \u00b7)-SLQC in any x is not restrictive: as we have seen in Section 3, there are interesting examples of functions that admit this property \u2200\u03b5 \u2208 (0, 1], and for any x.\n\nFor simplicity we have presented NGD for unconstrained problems. Using projections we can easily extend the algorithm and its analysis to constrained optimization over convex sets. This will enable us to achieve a convergence rate of O(1/\u03b5\u00b2) for the objective presented in Equation (2), and for the idealized GLM problem presented in Section 3.1.1. We are now ready to prove Theorem 4.1:\n\nProof of Theorem 4.1. First note that if the gradient of f vanishes at x_t, then by the SLQC assumption we must have that f(x_t) \u2212 f(x*) \u2264 \u03b5. Assume next that we perform T iterations and the gradient of f at x_t never vanishes in these iterations. Consider the update rule of NGD (Algorithm 1); then by standard algebra we get\n\n\u2016x_{t+1} \u2212 x*\u2016\u00b2 = \u2016x_t \u2212 x*\u2016\u00b2 \u2212 2\u03b7\u27e8\u011d_t, x_t \u2212 x*\u27e9 + \u03b7\u00b2 .\n\nAssume that \u2200t \u2208 [T] we have f(x_t) \u2212 f(x*) > \u03b5. Take y = x* + (\u03b5/\u03ba)\u011d_t, and observe that \u2016y \u2212 x*\u2016 \u2264 \u03b5/\u03ba. 
The (\u03b5, \u03ba, x*)-SLQC assumption implies that \u27e8\u011d_t, y \u2212 x_t\u27e9 \u2264 0, and therefore\n\n\u27e8\u011d_t, x* + (\u03b5/\u03ba)\u011d_t \u2212 x_t\u27e9 \u2264 0 \u21d2 \u27e8\u011d_t, x_t \u2212 x*\u27e9 \u2265 \u03b5/\u03ba .\n\nSetting \u03b7 = \u03b5/\u03ba, the above implies\n\n\u2016x_{t+1} \u2212 x*\u2016\u00b2 \u2264 \u2016x_t \u2212 x*\u2016\u00b2 \u2212 2\u03b7\u03b5/\u03ba + \u03b7\u00b2 = \u2016x_t \u2212 x*\u2016\u00b2 \u2212 \u03b5\u00b2/\u03ba\u00b2 .\n\nThus, after T iterations for which f(x_t) \u2212 f(x*) > \u03b5 we get\n\n0 \u2264 \u2016x_{T+1} \u2212 x*\u2016\u00b2 \u2264 \u2016x_1 \u2212 x*\u2016\u00b2 \u2212 T\u03b5\u00b2/\u03ba\u00b2 .\n\nTherefore, we must have T \u2264 \u03ba\u00b2\u2016x_1 \u2212 x*\u2016\u00b2/\u03b5\u00b2 .\n\nAlgorithm 2 Stochastic Normalized Gradient Descent (SNGD)\nInput: #Iterations T, x_1 \u2208 R^d, learning rate \u03b7, minibatch size b\nfor t = 1 . . . T do\n  Sample: {\u03c8_i}_{i=1}^b \u223c D^b, and define f_t(x) = (1/b) \u2211_{i=1}^b \u03c8_i(x)\n  Update: x_{t+1} = x_t \u2212 \u03b7\u011d_t, where g_t = \u2207f_t(x_t), \u011d_t = g_t/\u2016g_t\u2016\nend for\nReturn: x\u0304_T = arg min_{x_1,...,x_T} f_t(x_t)\n\n4.1 Locally-Lipschitz/Smooth Quasi-Convex Optimization\n\nIt can be shown that strict-quasi-convexity and (G, \u03b5/G, x*)-local-Lipschitzness of f imply that f is (\u03b5, G, x*)-SLQC \u2200x \u2208 R^d, \u2200\u03b5 \u2265 0, and x* \u2208 arg min_{x \u2208 R^d} f(x). Therefore the following is a direct corollary of Theorem 4.1:\n\nCorollary 4.1. Fix \u03b5 > 0, let f : R^d \u21a6 R, and x* \u2208 arg min_{x \u2208 R^d} f(x). Suppose that f is strictly quasi-convex and (G, \u03b5/G, x*)-locally-Lipschitz. 
Then running the NGD algorithm with T \u2265 G\u00b2\u2016x_1 \u2212 x*\u2016\u00b2/\u03b5\u00b2 and \u03b7 = \u03b5/G, we have that f(x\u0304_T) \u2212 f(x*) \u2264 \u03b5.\n\nIn case f is also locally-smooth, we state an even faster rate:\n\nTheorem 4.2. Fix \u03b5 > 0, let f : R^d \u21a6 R, and x* \u2208 arg min_{x \u2208 R^d} f(x). Suppose that f is strictly quasi-convex and (\u03b2, \u221a(2\u03b5/\u03b2), x*)-locally-smooth. Then running the NGD algorithm with T \u2265 \u03b2\u2016x_1 \u2212 x*\u2016\u00b2/(2\u03b5) and \u03b7 = \u221a(2\u03b5/\u03b2), we have that f(x\u0304_T) \u2212 f(x*) \u2264 \u03b5.\n\nRemark 1. The above corollary (resp. theorem) implies that f could have arbitrarily large gradients and second derivatives outside B(x*, \u03b5/G) (resp. B(x*, \u221a(2\u03b5/\u03b2))), yet NGD is still ensured to output an \u03b5-optimal point within G\u00b2\u2016x_1 \u2212 x*\u2016\u00b2/\u03b5\u00b2 (resp. \u03b2\u2016x_1 \u2212 x*\u2016\u00b2/(2\u03b5)) iterations. We are not familiar with a similar guarantee for GD even in the convex case.\n\n5 SNGD for Stochastic SLQC Optimization\n\nHere we describe the setting of stochastic SLQC optimization. Then we describe our SNGD algorithm, which is ensured to yield an \u03b5-optimal solution within poly(1/\u03b5) queries. We also show that the (noisy) GLM problem described in Section 3.1.2 is an instance of stochastic SLQC optimization, allowing us to provably solve this problem within poly(1/\u03b5) samples and optimization steps using SNGD.\n\nThe stochastic SLQC optimization setup: Consider the problem of minimizing a function f : R^d \u21a6 R, and assume there exists a distribution over functions D, such that\n\nf(x) := E_{\u03c8\u223cD}[\u03c8(x)] .\n\nWe assume that we may access f by randomly sampling minibatches of size b, and querying the gradients of these minibatches. 
Thus, upon querying a point x_t \u2208 R^d, a random minibatch {\u03c8_i}_{i=1}^b \u223c D^b is sampled, and we receive \u2207f_t(x_t), where f_t(x) = (1/b) \u2211_{i=1}^b \u03c8_i(x). We make the following assumption regarding the minibatch averages:\n\nAssumption 5.1. Let T, \u03b5, \u03b4 > 0, x* \u2208 arg min_{x \u2208 R^d} f(x). There exist \u03ba > 0 and a function b_0 : R\u00b3 \u21a6 R such that for b \u2265 b_0(\u03b5, \u03b4, T), w.p. \u2265 1 \u2212 \u03b4 and \u2200t \u2208 [T], the minibatch average f_t(x) = (1/b) \u2211_{i=1}^b \u03c8_i(x) is (\u03b5, \u03ba, x*)-SLQC in x_t. Moreover, we assume |f_t(x)| \u2264 M, \u2200t \u2208 [T], x \u2208 R^d.\n\nNote that we assume that b_0 = poly(1/\u03b5, log(T/\u03b4)).\n\nJustification of Assumption 5.1. Noisy GLM regression (see Section 3.1.2) is an interesting instance of a stochastic optimization problem where Assumption 5.1 holds. Indeed, according to Lemma 3.2, given \u03b5, \u03b4, T > 0, for b \u2265 \u03a9(log(T/\u03b4)/\u03b5\u00b2) samples, the average minibatch function is (\u03b5, \u03ba, x*)-SLQC in x_t, \u2200t \u2208 [T], w.p. \u2265 1 \u2212 \u03b4.\n\nLocal-quasi-convexity of minibatch averages is a plausible assumption when we optimize an expected sum of quasi-convex functions that share common global minima (or when the different global minima are close by). As seen from the examples presented in Equation (2), and in Sections 3.1.1, 3.1.2, this sum is generally not quasi-convex, but is more often locally-quasi-convex.\n\nNote that in the general case when the objective is a sum of quasi-convex functions, the number of local minima of such an objective may grow exponentially with the dimension d, see [1]. This might imply that a general setup where each \u03c8 \u223c D is quasi-convex may be generally hard.\n\n5.1 Main Results\n\nSNGD is presented in Algorithm 2. 
SNGD is similar to SGD, except we normalize the gradients. The normalization is crucial in order to take advantage of the SLQC assumption, and in order to overcome the hurdles of plateaus and cliffs. Following is our main theorem:\n\nTheorem 5.1. Fix \u03b4, \u03b5, G, M, \u03ba > 0. Suppose we run SNGD with T \u2265 \u03ba\u00b2\u2016x_1 \u2212 x*\u2016\u00b2/\u03b5\u00b2 iterations, \u03b7 = \u03b5/\u03ba, and b \u2265 max{M\u00b2 log(4T/\u03b4)/(2\u03b5\u00b2), b_0(\u03b5, \u03b4, T)}. Assume that for b \u2265 b_0(\u03b5, \u03b4, T), w.p. \u2265 1 \u2212 \u03b4 and \u2200t \u2208 [T], the function f_t defined in the algorithm is M-bounded, and is also (\u03b5, \u03ba, x*)-SLQC in x_t. Then, with probability of at least 1 \u2212 2\u03b4, we have that f(x\u0304_T) \u2212 f(x*) \u2264 3\u03b5.\n\nWe prove Theorem 5.1 at the end of this section.\n\nRemark 2. Since strict-quasi-convexity and (G, \u03b5/G, x*)-local-Lipschitzness imply SLQC, the theorem implies that f could have arbitrarily large gradients outside B(x*, \u03b5/G), yet SNGD is still ensured to output an \u03b5-optimal point within G\u00b2\u2016x_1 \u2212 x*\u2016\u00b2/\u03b5\u00b2 iterations. We are not familiar with a similar guarantee for SGD even in the convex case.\n\nRemark 3. Theorem 5.1 requires the minibatch size to be \u03a9(1/\u03b5\u00b2). In the context of learning, the number of functions, n, corresponds to the number of training examples. By standard sample complexity bounds, n should also be on the order of 1/\u03b5\u00b2. Therefore, one may wonder whether the size of the minibatch should be on the order of n. This is not true, since the required training set size is 1/\u03b5\u00b2 times the VC dimension of the hypothesis class. In many practical cases, the VC dimension is more significant than 1/\u03b5\u00b2, and therefore n will be much larger than the required minibatch size. 
The reason our analysis requires a minibatch of size 1/\u03b5\u00b2, without the VC dimension factor, is that we are just \u201cvalidating\u201d and not \u201clearning\u201d.\n\nIn SGD and for the case of convex functions, even a minibatch of size 1 suffices for guaranteed convergence. In contrast, for SNGD we require a minibatch of size 1/\u03b5\u00b2. The theorem below shows that the requirement for a large minibatch is not an artifact of our analysis but is truly required.\n\nTheorem 5.2. Let \u03b5 \u2208 (0, 0.1]. There exists a distribution over convex functions such that, when running SNGD with a minibatch size of b = 0.2/\u03b5, with high probability it never reaches an \u03b5-optimal solution.\n\nThe gap between the upper bound of 1/\u03b5\u00b2 and the lower bound of 1/\u03b5 remains an open question. We now provide a sketch for the proof of Theorem 5.1:\n\nProof of Theorem 5.1. Theorem 5.1 is a consequence of the following two lemmas. In the first we show that whenever all f_t\u2019s are SLQC, there exists some t such that f_t(x_t) \u2212 f_t(x*) \u2264 \u03b5. In the second lemma, we show that for a large enough minibatch size b, for any t \u2208 [T] we have f(x_t) \u2264 f_t(x_t) + \u03b5, and f(x*) \u2265 f_t(x*) \u2212 \u03b5. Combining these two lemmas we conclude that f(x\u0304_T) \u2212 f(x*) \u2264 3\u03b5.\n\nLemma 5.1. Let \u03b5, \u03b4 > 0. Suppose we run SNGD for T \u2265 \u03ba\u00b2\u2016x_1 \u2212 x*\u2016\u00b2/\u03b5\u00b2 iterations, b \u2265 b_0(\u03b5, \u03b4, T), and \u03b7 = \u03b5/\u03ba. Assume that w.p. \u2265 1 \u2212 \u03b4 all f_t\u2019s are (\u03b5, \u03ba, x*)-SLQC in x_t, whenever b \u2265 b_0(\u03b5, \u03b4, T). Then w.p. \u2265 1 \u2212 \u03b4 we must have some t \u2208 [T] for which f_t(x_t) \u2212 f_t(x*) \u2264 \u03b5.\n\nLemma 5.1 is proved similarly to Theorem 4.1. 
We omit the proof due to space constraints.\n\nThe second lemma relates f_t(x_t) \u2212 f_t(x*) \u2264 \u03b5 to a bound on f(x_t) \u2212 f(x*).\n\nLemma 5.2. Suppose b \u2265 M\u00b2 log(4T/\u03b4)/(2\u03b5\u00b2); then w.p. \u2265 1 \u2212 \u03b4 and for every t \u2208 [T]:\n\nf(x_t) \u2264 f_t(x_t) + \u03b5 ,\n\nand also,\n\nf(x*) \u2265 f_t(x*) \u2212 \u03b5 .\n\nLemma 5.2 is a direct consequence of Hoeffding\u2019s bound. Using the definition of x\u0304_T (Alg. 2), together with Lemma 5.2 gives:\n\nf(x\u0304_T) \u2212 f(x*) \u2264 f_t(x_t) \u2212 f_t(x*) + 2\u03b5, \u2200t \u2208 [T] .\n\nCombining the latter with Lemma 5.1 establishes Theorem 5.1.\n\nFigure 2: Comparison between optimization schemes. Left (a): test error. Middle (b): objective value (on training set). Right (c): the objective of SNGD for different minibatch sizes.\n\n6 Experiments\n\nA better understanding of how to train deep neural networks is one of the greatest challenges in current machine learning and optimization. Since learning NN (Neural Network) architectures essentially requires solving a hard non-convex program, we have decided to focus our empirical study on this type of task. As a test case, we train a Neural Network with a single hidden layer of 100 units over the MNIST data set. We use a ReLU activation function, and minimize the square loss. We employ a regularization over weights with a parameter of \u03bb = 5 \u00b7 10^{\u22124}.\n\nAt first we were interested in comparing the performance of SNGD to MSGD (Minibatch Stochastic Gradient Descent), and to a stochastic variant of Nesterov\u2019s accelerated gradient method [19], which is considered to be state-of-the-art. For MSGD and Nesterov\u2019s method we used a step size rule of the form \u03b7_t = \u03b7_0(1 + \u03b3t)^{\u22123/4}, with \u03b7_0 = 0.01 and \u03b3 = 10^{\u22124}. For SNGD we used the constant step size of 0.1. 
In Nesterov\u2019s method we used a momentum of 0.95. The comparison appears in Figures 2(a), 2(b). As expected, MSGD converges relatively slowly. Conversely, the performance of SNGD is comparable with Nesterov\u2019s method. All methods employed a minibatch size of 100.\n\nLater, we were interested in examining the effect of minibatch size on the performance of SNGD. We employed SNGD with different minibatch sizes. As seen in Figure 2(c), the performance improves significantly with the increase of minibatch size.\n\n7 Discussion\n\nWe have presented the first provable gradient-based algorithm for stochastic quasi-convex optimization. This is a first attempt at generalizing the well-developed machinery of stochastic convex optimization to the challenging non-convex problems facing machine learning, and better characterizing the border between NP-hard non-convex optimization and tractable cases such as the ones studied herein.\n\nAmongst the numerous challenging questions that remain, we note that there is a gap between the upper and lower bound of the minibatch size sufficient for SNGD to provably converge.\n\nAcknowledgments\n\nThe research leading to these results has received funding from the European Union\u2019s Seventh Framework Programme (FP7/2007-2013) under grant agreement n\u00b0 336078 \u2013 ERC-SUBLRN. Shai S-Shwartz is supported by ISF n\u00b0 1673/14 and by Intel\u2019s ICRI-CI.\n\nReferences\n\n[1] Peter Auer, Mark Herbster, and Manfred K Warmuth. Exponentially many local minima for single neurons. 
Advances in neural information processing systems, pages 316\u2013322, 1996.\n\n[2] Yoshua Bengio. Learning deep architectures for AI. Foundations and trends in Machine Learning, 2(1):1\u2013127, 2009.\n\n[3] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157\u2013166, 1994.\n\n[4] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.\n\n[5] Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75\u201380, 1993.\n\n[6] Jean-Louis Goffin, Zhi-Quan Luo, and Yinyu Ye. Complexity analysis of an interior cutting plane method for convex feasibility problems. SIAM Journal on Optimization, 6(3):638\u2013652, 1996.\n\n[7] Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.\n\n[8] Qifa Ke and Takeo Kanade. Quasiconvex optimization for robust geometric reconstruction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(10):1834\u20131847, 2007.\n\n[9] Rustem F Khabibullin. A method to find a point of a convex set. Issled. Prik. Mat., 4:15\u201322, 1977.\n\n[10] Krzysztof C Kiwiel. Convergence and efficiency of subgradient methods for quasiconvex minimization. Mathematical programming, 90(1):1\u201325, 2001.\n\n[11] Igor V Konnov. On convergence properties of a subgradient method. Optimization Methods and Software, 18(1):53\u201362, 2003.\n\n[12] Jean-Jacques Laffont and David Martimort. The theory of incentives: the principal-agent model. Princeton university press, 2009.\n\n[13] James Martens and Ilya Sutskever. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033\u20131040, 2011.\n\n[14] P. McCullagh and JA Nelder. 
Generalised linear models. London: Chapman and Hall/CRC, 1989.\n\n[15] Yu E Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon, 29:519\u2013531, 1984.\n\n[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of The 30th International Conference on Machine Learning, pages 1310\u20131318, 2013.\n\n[17] Boris T Polyak. A general method of solving extremum problems. Dokl. Akademii Nauk SSSR, 174(1):33, 1967.\n\n[18] Jaros\u0142aw Sikorski. Quasi subgradient algorithms for calculating surrogate constraints. In Analysis and algorithms of optimization problems, pages 203\u2013236. Springer, 1986.\n\n[19] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139\u20131147, 2013.\n\n[20] Hal R Varian. Price discrimination and social welfare. The American Economic Review, pages 870\u2013875, 1985.\n\n[21] Elmar Wolfstetter. Topics in microeconomics: Industrial organization, auctions, and incentives. Cambridge University Press, 1999.\n\n[22] Yaroslav Ivanovich Zabotin, AI Korablev, and Rustem F Khabibullin. The minimization of quasicomplex functionals. Izv. Vyssh. Uch. Zaved. Mat., (10):27\u201333, 1972.\n", "award": [], "sourceid": 994, "authors": [{"given_name": "Elad", "family_name": "Hazan", "institution": "Princeton University"}, {"given_name": "Kfir", "family_name": "Levy", "institution": "Technion"}, {"given_name": "Shai", "family_name": "Shalev-Shwartz", "institution": "Hebrew University"}]}