{"title": "Uniform Convergence of Gradients for Non-Convex Learning and Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 8745, "page_last": 8756, "abstract": "We investigate 1) the rate at which refined properties of the empirical risk---in particular, gradients---converge to their population counterparts in standard non-convex learning tasks, and 2) the consequences of this convergence for optimization. Our analysis follows the tradition of norm-based capacity control. We propose vector-valued Rademacher complexities as a simple, composable, and user-friendly tool to derive dimension-free uniform convergence bounds for gradients in non-convex learning problems. As an application of our techniques, we give a new analysis of batch gradient descent methods for non-convex generalized linear models and non-convex robust regression, showing how to use any algorithm that finds approximate stationary points to obtain optimal sample complexity, even when dimension is high or possibly infinite and multiple passes over the dataset are allowed.\n\nMoving to non-smooth models we show----in contrast to the smooth case---that even for a single ReLU it is not possible to obtain dimension-independent convergence rates for gradients in the worst case. On the positive side, it is still possible to obtain dimension-independent rates under a new type of distributional assumption.", "full_text": "Uniform Convergence of Gradients for\nNon-Convex Learning and Optimization\n\nDylan J. 
Foster\nCornell University\n\ndjfoster@cornell.edu\n\nAyush Sekhari\nCornell University\n\nsekhari@cs.cornell.edu\n\nKarthik Sridharan\nCornell University\n\nsridharan@cs.cornell.edu\n\nAbstract\n\nWe investigate 1) the rate at which re\ufb01ned properties of the empirical risk\u2014in\nparticular, gradients\u2014converge to their population counterparts in standard non-\nconvex learning tasks, and 2) the consequences of this convergence for optimization.\nOur analysis follows the tradition of norm-based capacity control. We propose\nvector-valued Rademacher complexities as a simple, composable, and user-friendly\ntool to derive dimension-free uniform convergence bounds for gradients in non-\nconvex learning problems. As an application of our techniques, we give a new\nanalysis of batch gradient descent methods for non-convex generalized linear\nmodels and non-convex robust regression, showing how to use any algorithm that\n\ufb01nds approximate stationary points to obtain optimal sample complexity, even\nwhen dimension is high or possibly in\ufb01nite and multiple passes over the dataset\nare allowed.\nMoving to non-smooth models we show\u2014-in contrast to the smooth case\u2014that\neven for a single ReLU it is not possible to obtain dimension-independent conver-\ngence rates for gradients in the worst case. On the positive side, it is still possible to\nobtain dimension-independent rates under a new type of distributional assumption.\n\n1\n\nIntroduction\n\nThe last decade has seen a string of empirical successes for gradient-based algorithms solving large\nscale non-convex machine learning problems [24, 16]. Inspired by these successes, the theory\ncommunity has begun to make progress on understanding when gradient-based methods succeed\nfor non-convex learning in certain settings of interest [17]. 
The goal of the present work is to introduce learning-theoretic tools to—in a general sense—improve understanding of when and why gradient-based methods succeed for non-convex learning problems.

In a standard formulation of the non-convex statistical learning problem, we aim to solve

    minimize  L_D(w) := E_{(x,y)∼D} ℓ(w ; x, y),

where w ∈ W ⊆ R^d is a parameter vector, D is an unknown probability distribution over the instance space X × Y, and the loss ℓ is a potentially non-convex function of w. The learner cannot observe D directly and instead must find a model ŵ ∈ W that minimizes L_D given only access to i.i.d. samples (x_1, y_1), ..., (x_n, y_n) ∼ D. Their performance is quantified by the excess risk L_D(ŵ) − L*, where L* = inf_{w∈W} L_D(w).

Given only access to samples, a standard ("sample average approximation") approach is to attempt to minimize the empirical risk L̂_n(w) := (1/n) Σ_{t=1}^n ℓ(w ; x_t, y_t). If one succeeds at minimizing L̂_n, classical statistical learning theory provides a comprehensive set of tools to bound the excess risk of the procedure. The caveat is that when ℓ is non-convex, global optimization of the empirical risk may be far from easy. It is not typically viable to even verify whether one is at a global minimizer of L̂_n. Moreover, even if the population risk L_D has favorable properties that make it amenable to gradient-based optimization, the empirical risk may not inherit these properties due to stochasticity. In the worst case, minimizing L_D or L̂_n is simply intractable.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

However, recent years have seen a number of successes showing that for non-convex problems arising in machine learning, iterative optimizers can succeed both in theory and in practice (see [17] for a survey). Notably, while minimizing L̂_n might be challenging, there is an abundance of gradient methods that provably find approximate stationary points of the empirical risk, i.e. ‖∇L̂_n(w)‖ ≤ ε [33, 10, 38, 3, 26]. In view of this, the present work has two aims: First, to provide a general set of tools to prove uniform convergence results for gradients, with the goal of bounding how many samples are required to ensure that, with high probability over samples, simultaneously over all w ∈ W, ‖∇L_D(w)‖ ≤ ‖∇L̂_n(w)‖ + ε; Second, to explore concrete non-convex problems where one can establish that the excess risk L_D(ŵ) − L* is small as a consequence of this gradient uniform convergence. Together, these two directions yield direct bounds on the convergence of non-convex gradient-based learning algorithms to low excess risk.

Our precise technical contributions are as follows:

• We bring vector-valued Rademacher complexities [30] and associated vector-valued contraction principles to bear on the analysis of uniform convergence for gradients. This approach enables norm-based capacity control, meaning that the bounds are independent of dimension whenever the predictor norm and data norm are appropriately controlled. We introduce a "chain rule" for Rademacher complexity, which enables one to decompose the complexity of gradients of compositions into complexities of their components, and makes deriving dimension-independent complexity bounds for common non-convex classes quite simple.

• We establish variants of the Gradient Domination condition for the population risk in certain non-convex learning settings. The condition bounds excess risk in terms of the magnitude of gradients, and is satisfied in non-convex learning problems including generalized linear models and robust regression.
As a consequence of the gradient uniform convergence bounds, we show how to use any algorithm that finds approximate stationary points for smooth functions in a black-box fashion to obtain optimal sample complexity for these models—both in high- and low-dimensional regimes. In particular, standard algorithms including gradient descent [33], SGD [10], Non-convex SVRG [38, 3], and SCSG [26] enjoy optimal sample complexity, even when allowed to take multiple passes over the dataset.

• We show that for non-smooth losses dimension-independent uniform convergence is not possible in the worst case, but that this can be circumvented using a new type of margin assumption.

Related Work This work is inspired by [31], who gave dimension-dependent gradient and Hessian convergence rates and optimization guarantees for the generalized linear model and robust regression setups we study. We move beyond the dimension-dependent setting by providing norm-based capacity control. Our bounds are independent of dimension whenever the predictor norm and data norm are sufficiently controlled (they work in infinite dimension in the ℓ2 case), but even when the norms are large we recover the optimal dimension-dependent rates.

Optimizing the empirical risk under assumptions on the population risk has begun to attract significant attention (e.g. [12, 18]). Without attempting a complete survey, we remark that these results typically depend on dimension, e.g. [18] require poly(d) samples before their optimization guarantees take effect. We view these works as complementary to our norm-based analysis.

Notation For a given norm ‖·‖, the dual norm is denoted ‖·‖*. ‖·‖_p represents the standard ℓ_p norm on R^d and ‖·‖_σ denotes the spectral norm. 1 denotes the all-ones vector, with dimension made clear from context.
For a function f : R^d → R, ∇f(x) ∈ R^d and ∇²f(x) ∈ R^{d×d} will denote the gradient and the Hessian of f at x respectively. f is said to be L-Lipschitz with respect to a norm ‖·‖ if |f(x) − f(y)| ≤ L‖x − y‖ for all x, y. Similarly, f is said to be H-smooth w.r.t. the norm ‖·‖ if its gradients are H-Lipschitz with respect to ‖·‖, i.e. ‖∇f(x) − ∇f(y)‖* ≤ H‖x − y‖ for some H.

2 Gradient Uniform Convergence: Why and How

2.1 Utility of Gradient Convergence: The Why

Before introducing our tools for establishing gradient uniform convergence, let us introduce a family of losses for which this convergence has immediate consequences for the design of non-convex statistical learning algorithms.

Definition 1. The population risk L_D satisfies the (α, μ)-Gradient Domination condition with respect to a norm ‖·‖ if there are constants μ > 0, α ∈ [1, 2] such that

    L_D(w) − L_D(w*) ≤ μ‖∇L_D(w)‖^α   ∀w ∈ W,   (GD)

where w* ∈ argmin_{w∈W} L_D(w) is any population risk minimizer.

The case α = 2 is often referred to as the Polyak-Łojasiewicz inequality [36, 23]. The general GD condition implies that all critical points are global, and is itself implied (under technical restrictions) by many other well-known conditions including one-point convexity [28], star convexity and τ-star convexity [14], and so-called "regularity conditions" [44]; for more see [23].
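As a concrete illustration of the α = 2 case (our own example, not one from the paper): for a well-specified linear model with square loss, the population risk has the form L(w) = (w − w*)ᵀΣ(w − w*) + const with ∇L(w) = 2Σ(w − w*), so L(w) − L(w*) ≤ μ‖∇L(w)‖² with μ = 1/(4λ_min(Σ)). A minimal numeric sanity check of this Polyak-Łojasiewicz-type inequality on a synthetic covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                      # positive-definite covariance
w_star = rng.standard_normal(d)                  # population minimizer
mu = 1.0 / (4.0 * np.linalg.eigvalsh(Sigma)[0])  # 1 / (4 * lambda_min(Sigma))

def excess_risk(w):
    v = w - w_star
    return v @ Sigma @ v                         # L(w) - L(w*)

def grad_norm_sq(w):
    return np.sum((2.0 * Sigma @ (w - w_star)) ** 2)   # ||grad L(w)||_2^2

# (2, mu)-GD check at random points: L(w) - L* <= mu * ||grad L(w)||^2.
for _ in range(100):
    w = w_star + rng.standard_normal(d)
    assert excess_risk(w) <= mu * grad_norm_sq(w) + 1e-9
print("(2, mu)-GD condition holds at all sampled points")
```

The inequality follows from vᵀΣv ≥ λ_min(Σ)·vᵀv applied to ‖∇L(w)‖² = 4vᵀΣ²v with v = w − w*; the check above merely confirms the algebra numerically.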
The GD condition is satisfied—sometimes locally rather than globally, and usually under distributional assumptions—by the population risk in settings including neural networks with one hidden layer [28], ResNets with linear activations [13], phase retrieval [44], matrix factorization [29], blind deconvolution [27], and—as we show here—generalized linear models and robust regression.

The GD condition states that to optimize the population risk it suffices to find a (population) stationary point. What are the consequences of this statement for the learning problem, given that the learner only has access to the empirical risk L̂_n, which itself may not satisfy GD? The next proposition shows, via gradient uniform convergence, that GD is immediately useful for non-convex learning even when it is only satisfied at the population level.

Proposition 1. Suppose that L_D satisfies the (α, μ)-GD condition. Then, for any δ > 0, with probability at least 1 − δ over the draw of the data {(x_t, y_t)}_{t=1}^n, every algorithm ŵ_alg satisfies

    L_D(ŵ_alg) − L* ≤ 2μ( ‖∇L̂_n(ŵ_alg)‖^α + E sup_{w∈W} ‖∇L̂_n(w) − ∇L_D(w)‖^α + c·(log(1/δ)/n)^{α/2} ),   (1)

where the constant c depends only on the range of ‖∇ℓ‖.

Note that if W is a finite set, then standard concentration arguments for norms along with the union bound imply that E sup_{w∈W} ‖∇L̂_n(w) − ∇L_D(w)‖ ≤ O(√(log|W|/n)). For smooth losses, if W ⊂ R^d is contained in a bounded ball, then by simply discretizing the set W up to precision ε (with O(ε^{−d}) elements), one can easily obtain a bound of E sup_{w∈W} ‖∇L̂_n(w) − ∇L_D(w)‖ ≤ O(√(d/n)).
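The finite-class quantity above can also be probed empirically. The following sketch (our own illustration; the sigmoid link and the synthetic, well-specified distribution are assumptions, not a construction from the paper) estimates sup_{w∈W} ‖∇L̂_n(w) − ∇L_D(w)‖₂ for a small finite W by treating a much larger sample pool as a stand-in for the population:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, n_pop = 10, 200, 50_000
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def avg_grad(w, X, y):
    """Average gradient of (sigmoid(<w, x>) - y)^2 over the sample (X, y)."""
    p = sigmoid(X @ w)
    coef = 2.0 * (p - y) * p * (1.0 - p)       # scalar chain-rule factor
    return (coef[:, None] * X).mean(axis=0)

# Finite hypothesis class W and a synthetic, well-specified data distribution.
W = [rng.standard_normal(d) / np.sqrt(d) for _ in range(20)]
X_pop = rng.standard_normal((n_pop, d)) / np.sqrt(d)
y_pop = (rng.random(n_pop) < sigmoid(X_pop @ W[0])).astype(float)

# First n points play the role of the training sample; the pool stands in for D.
X_train, y_train = X_pop[:n], y_pop[:n]
sup_dev = max(
    np.linalg.norm(avg_grad(w, X_train, y_train) - avg_grad(w, X_pop, y_pop))
    for w in W
)
print(f"sup over W of ||grad L_n - grad L_D||_2 ~ {sup_dev:.3f}")
```

With |W| = 20 and n = 200 the observed supremum is small, consistent with the O(√(log|W|/n)) scaling (up to the Lipschitz constants of this particular loss).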
This approach recovers the dimension-dependent gradient convergence rates obtained in [31]. Our goal is to go beyond this type of analysis and provide dimension-free rates that apply even when the dimension is larger than the number of examples, or possibly infinite. Our bounds take the following "norm-based capacity control" form: E sup_{w∈W} ‖∇L̂_n(w) − ∇L_D(w)‖ ≤ O(√(C(W)/n)), where C(W) is a norm-dependent, but dimension-independent, measure of the size of W. Given such a bound, any algorithm that guarantees ‖∇L̂_n(ŵ_alg)‖ ≤ O(1/√n) for an (α, μ)-GD loss will obtain an overall excess risk bound of order O(μ/n^{α/2}). For (1, μ1)-GD this translates to an overall O(μ1/√n) rate, whereas (2, μ2)-GD implies an O(μ2/n) rate. The first rate becomes favorable when μ1 ≤ √n·μ2, which typically happens for very high-dimensional problems. For the examples we study, μ1 is related only to the radius of the set W, while μ2 depends inversely on the smallest eigenvalue of the population covariance and so is well-behaved only for low-dimensional problems unless one makes strong distributional assumptions.

An important feature of our analysis is that we need to establish the GD condition only for the population risk; for the examples we consider this is easy as long as we assume the model is well-specified. Once this is done, our convergence results hold for any algorithm that works on the dataset {(x_t, y_t)}_{t=1}^n and finds an approximate first-order stationary point with ‖∇L̂_n(ŵ_alg)‖ ≤ ε.
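For intuition about the interface such a stationary-point finder provides, here is a minimal sketch (ours; a toy one-dimensional non-convex regression, not a setting from the paper) in which plain gradient descent drives the empirical gradient norm below a tolerance ε:

```python
import numpy as np

# Toy 1-d non-convex empirical risk: L_n(w) = (1/n) sum_t (tanh(w * x_t) - y_t)^2.
rng = np.random.default_rng(2)
x = rng.standard_normal(500)
y = np.tanh(1.5 * x) + 0.1 * rng.standard_normal(500)

def emp_grad(w):
    """Derivative of the empirical risk with respect to the scalar parameter w."""
    t = np.tanh(w * x)
    return np.mean(2.0 * (t - y) * (1.0 - t ** 2) * x)

w, eta, eps = 0.0, 0.5, 1e-4
for _ in range(10_000):
    g = emp_grad(w)
    if abs(g) <= eps:          # approximate stationary point of L_n reached
        break
    w -= eta * g
print(f"w = {w:.3f}, |grad L_n(w)| = {abs(emp_grad(w)):.2e}")
```

The loop only certifies a small empirical gradient; it is the gradient uniform convergence bounds of this paper that convert such a certificate into a small population gradient, and the GD condition that converts that into small excess risk.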
First-order algorithms that find approximate stationary points assuming only smoothness of the loss have enjoyed a surge of recent interest [10, 38, 3, 1], so this is an appealing proposition.

2.2 Vector Rademacher Complexities: The How

The starting point for our uniform convergence bounds for gradients is to apply the standard tool of symmetrization—a vector-valued version, to be precise. To this end let us introduce a normed variant of Rademacher complexity.

Definition 2 (Normed Rademacher Complexity). Given a vector-valued class of functions F that maps the space Z to a vector space equipped with norm ‖·‖, we define the normed Rademacher complexity for F on instances z_{1:n} via

    R_{‖·‖}(F ; z_{1:n}) := E_ε sup_{f∈F} ‖ Σ_{t=1}^n ε_t f(z_t) ‖.   (2)

With this definition we are ready to provide a straightforward generalization of the standard real-valued symmetrization lemma.

Proposition 2. For any δ > 0, with probability at least 1 − δ over the data {(x_t, y_t)}_{t=1}^n,

    sup_{w∈W} ‖∇L̂_n(w) − ∇L_D(w)‖ ≤ (4/n)·E R_{‖·‖}(∇ℓ∘W ; x_{1:n}, y_{1:n}) + c√(log(1/δ)/n),   (3)

where the constant c depends only on the range of ‖∇ℓ‖.

To bound the complexity of the gradient class ∇ℓ∘W, we introduce a chain rule for the normed Rademacher complexity that allows one to easily control gradients of compositions of functions.

Theorem 1 (Chain Rule for Rademacher Complexity). Let sequences of functions G_t : R^K → R and F_t : R^d → R^K be given. Suppose there are constants L_G and L_F such that for all 1 ≤ t ≤ n, ‖∇G_t‖_2 ≤ L_G and √(Σ_{k=1}^K ‖∇F_{t,k}(w)‖²) ≤ L_F. Then,

    (1/2)·E_ε sup_{w∈W} ‖ Σ_{t=1}^n ε_t ∇(G_t(F_t(w))) ‖ ≤ L_F·E_ε sup_{w∈W} Σ_{t=1}^n ⟨ε_t, ∇G_t(F_t(w))⟩ + L_G·E_ε sup_{w∈W} ‖ Σ_{t=1}^n ∇F_t(w) ε_t ‖,   (4)

where ∇F_t denotes the Jacobian of F_t, which lives in R^{d×K}, and ε ∈ {±1}^{K×n} is a matrix of Rademacher random variables with ε_t denoting the tth column.

The concrete learning settings we study—generalized linear models and robust regression—all involve composing non-convex losses and non-linearities or transfer functions with a linear predictor. That is, ℓ(w ; x_t, y_t) can be written as ℓ(w ; x_t, y_t) = G_t(F_t(w)), where G_t(a) is some L-Lipschitz function that possibly depends on x_t and y_t and F_t(w) = ⟨w, x_t⟩. In this case, the chain rule for derivatives gives us that ∇ℓ(w ; x_t, y_t) = G′_t(F_t(w))·∇F_t(w) = G′_t(⟨w, x_t⟩)x_t. Using the chain rule (with K = 1), we conclude that

    E_ε sup_{w∈W} ‖ Σ_{t=1}^n ε_t ∇ℓ(w ; x_t, y_t) ‖ ≤ E_ε[ sup_{w∈W} Σ_{t=1}^n ε_t G′_t(⟨w, x_t⟩) ] + L·E_ε ‖ Σ_{t=1}^n ε_t x_t ‖.   (5)

Thus, we have reduced the problem to controlling the Rademacher average for a real-valued function class of linear predictors and controlling the vector-valued random average E_ε‖Σ_{t=1}^n ε_t x_t‖. The first term is handled using classical Rademacher complexity tools. As for the second term, it is a standard result ([35]; see [19] for discussion in the context of learning theory) that for all smooth Banach spaces, and more generally Banach spaces of Rademacher type 2, one has E_ε‖Σ_{t=1}^n ε_t x_t‖ = O(√n); see Appendix A for details.

The key tool used to prove Theorem 1, which appears throughout the technical portions of this paper, is the vector-valued Rademacher complexity due to [30].

Definition 3. For a function class G ⊆ (Z → R^K), the vector-valued Rademacher complexity is

    →R(G ; z_{1:n}) := E_ε sup_{g∈G} Σ_{t=1}^n ⟨ε_t, g(z_t)⟩.

The vector-valued Rademacher complexity arises through an elegant contraction trick due to Maurer.

Theorem 2 (Vector-valued contraction [30]). Let G ⊆ (Z → R^K), and let h_t : R^K → R be a sequence of functions for t ∈ [n], each of which is L-Lipschitz with respect to ℓ2. Then

    E_ε sup_{g∈G} Σ_{t=1}^n ε_t h_t(g(z_t)) ≤ √2·L·→R(G ; z_{1:n}).   (6)

We remark that while our applications require only gradient uniform convergence, we anticipate that the tools of this section will find use in settings where convergence of higher-order derivatives is needed to ensure success of optimization routines. To this end, we have extended the chain rule (Theorem 1) to handle Hessian convergence; see Appendix E.

3 Application: Smooth Models

In this section we instantiate the general gradient uniform convergence tools and the GD condition to derive optimization consequences for two standard settings previously studied by [31]: generalized linear models and robust regression.

Generalized Linear Model We first consider the problem of learning a generalized linear model with the square loss.1 Fix a norm ‖·‖, take X = {x ∈ R^d | ‖x‖ ≤ R}, W = {w ∈ R^d | ‖w‖* ≤ B}, and Y = {0, 1}. Choose a link function σ : R → [0, 1] and define the loss to be ℓ(w ; x, y) = (σ(⟨w, x⟩) − y)².
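To make the composition structure ℓ = G ∘ F with F(w) = ⟨w, x⟩ concrete, the sketch below (ours; it assumes the logistic link, one of the standard choices) evaluates the chain-rule gradient ∇ℓ(w ; x, y) = G′(⟨w, x⟩)·x for this square-loss GLM and checks it against central finite differences:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def glm_loss(w, x, y):
    return (sigmoid(w @ x) - y) ** 2

def glm_grad(w, x, y):
    """grad_w (sigmoid(<w, x>) - y)^2 = G'(<w, x>) * x (chain rule with K = 1)."""
    p = sigmoid(w @ x)
    return 2.0 * (p - y) * p * (1.0 - p) * x     # scalar G' times the input x

rng = np.random.default_rng(3)
w, x, y = rng.standard_normal(4), rng.standard_normal(4), 1.0

# Central finite differences agree with the chain-rule gradient.
eps = 1e-6
fd = np.array([
    (glm_loss(w + eps * e, x, y) - glm_loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(4)
])
assert np.allclose(fd, glm_grad(w, x, y), atol=1e-6)
print("chain-rule gradient verified by finite differences")
```

Note that the scalar factor G′(⟨w, x⟩) = 2(σ(⟨w, x⟩) − y)σ′(⟨w, x⟩) is exactly the real-valued component whose Rademacher complexity Theorem 1 (with K = 1) isolates from the linear part x.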
Standard choices for include the logistic link function (s)=(1+ e\u2212s)\u22121 and\nthe probit link function (s)= (s), where is the gaussian cumulative distribution function.\nAssumption 1 (Generalized Linear Model Regularity). LetS=[\u2212BR, BR].\n\nTo establish the GD property and provide uniform convergence bounds, we make the following\nregularity assumptions on the loss.\n\n(a)\u2203C\u2265 1 s.t. max{\u2032(s),\u2032\u2032(s)}\u2264 C for all s\u2208S.\n(b)\u2203c> 0 s.t. \u2032(s)\u2265 c for all s\u2208S.\n(c) E[y\uffff x]= (\uffffw\uffff, x\uffff) for some w\uffff\u2208W.\n\nleads to three \ufb01nal rates: a dimension-independent \u201cslow rate\u201d that holds for any smooth norm, a\ndimension-dependent fast rate for the `2 norm, and a sparsity-dependent fast rate that holds under an\nadditional restricted eigenvalue assumption. This gives rise to a family of generic excess risk bounds.\n\nAssumption (a) suf\ufb01ces to bound the normed Rademacher complexity R\uffff\u22c5\uffff(\u2207`\u25cbW), and combined\nwith (b) and (c) the assumption implies that LD satis\ufb01es three variants of GD condition, and this\nTo be precise, let us introduce some additional notation: \u2303= Ex[xx\uffff] is the data covariance matrix\nand min(\u2303) denotes the minimum non-zero eigenvalue. For sparsity dependent fast rates, de\ufb01ne\nC(S, \u21b5)\u2236=\uffff\u232b\u2208 Rd\uffff\uffff\u232bSC\uffff1\u2264 \u21b5\uffff\u232bS\uffff1\uffff and let min(\u2303)= inf \u232b\u2208C(S(w\uffff),1) \uffff\u232b,\u2303\u232b\uffff\n\uffff\u232b,\u232b\uffff be the restricted\neigenvalue.2 Lastly, recall that a norm\uffff\u22c5\uffff is said to be -smooth if the function (x)= 1\n2\uffffx\uffff2 has\n-Lipschitz gradients with respect to\uffff\u22c5\uffff.\nt=1 for any algorithm\u0302walg:\nhold with probability at least 1\u2212 over the draw of the data{(xt, yt)}n\n\u25cf Norm-Based/High-Dimensional Setup. 
WhenX is the ball for -smooth norm\uffff\u22c5\uffff andW is the\nLD(\u0302walg)\u2212 L\uffff\u2264 \u00b5h\u22c5\uffff\u2207\u0302Ln(\u0302walg)\uffff+ Ch\u221an\n\u25cf Low-Dimensional `2\uffff`2 Setup. WhenX andW are both `2 balls:\nmin(\u2303)\uffff\u00b5l\u22c5\uffff\u2207\u0302Ln(\u0302walg)\uffff2+ Cl\nn\uffff.\n\nTheorem 3. For the generalized linear model setting, the following excess risk inequalities each\n\n1[31] refer to the model as \u201cbinary classi\ufb01cation\u201d, since can model conditional probabilities of two classes.\n\nLD(\u0302walg)\u2212 L\uffff\u2264\n\n2Recall that S(w\uffff)\u2286[d] is the set of non-zero entries of w\uffff, and for any vector w, wS\u2208 Rd refers to the\n\nvector w with all entries in SC set to zero (as in [37]).\n\ndual ball,\n\n1\n\n.\n\n5\n\n\fModel\n\nAlgorithm\n\nSample Complexity\n\nNorm-based/In\ufb01nite dim. Low-dim.\n\nn/a\n\n[31] Theorem 6 n/a\n\nGeneralized Linear Proposition 3\n\nO(\"\u22122)\nO(\"\u22122)\nO(\"\u22122)\n\nTable 1: Sample complexity comparison. Highlighted cells indicate optimal sample complexity.\n\n[31] Theorem 4 n/a\nGLMtron [21]\nRobust Regression Proposition 3\n\nO(d\"\u22121)\nO(d\"\u22121)\nO(d\"\u22121)\nO(d\"\u22121)\n\u25cf Sparse `\u221e\uffff`1 Setup. 
WhenX is the `\u221e ball,W is the `1 ball, and\uffffw\uffff\uffff1= B:3\nThe quantities Ch\uffffCl\uffffCs and \u00b5h\uffff\u00b5l\uffff\u00b5s are constants depending on(B, R, C, c,, log(\u22121)) but\nnot explicitly on the dimension d (beyond logarithmic factors) or complexity of the classW (beyond\n\n min(\u2303)\uffff\u00b5s\u22c5\uffff\u2207\u0302Ln(\u0302walg)\uffff2+ Cs\nn\uffff.\n\nLD(\u0302walg)\u2212 L\uffff\u2264 \uffffw\uffff\uffff0\n\nPrecise statements for the problem dependent constants in Theorem 3 including dependence on the\nnorms R and B can be found in Appendix C.\nWe now formally introduce the robust regression setting and provide a similar guarantee.\n\nB and R).\n\nSimilar to the generalized linear model setup, the robust regression setup satis\ufb01es three variants of\n\nTheorem 4. For the robust regression setting, the following excess risk inequalities each hold with\n\nwith a canonical example being Tukey\u2019s biweight loss.4 While optimization is clearly not possible\nfor arbitrary choices of \u21e2, the following assumption is suf\ufb01cient to guarantee that the population risk\n\nRobust Regression Fix a norm\uffff\u22c5\uffff and takeX =\uffffx\u2208 Rd\uffff\uffffx\uffff\u2264 R\uffff,W=\uffffw\u2208 Rd\uffff\uffffw\uffff\uffff\u2264 B\uffff,\nandY=[\u2212Y, Y] for some constant Y . We pick a potentially non-convex function \u21e2\u2236 R\u2192 R+ and\nde\ufb01ne the loss via `(w ; x, y)= \u21e2(\uffffw, x\uffff\u2212 y). Non-convex choices for \u21e2 arise in robust statistics,\nLD satis\ufb01es the GD property.\nAssumption 2 (Robust Regression Regularity). LetS=[\u2212(BR+ Y),(BR+ Y)].\n\n(a)\u2203C\u21e2\u2265 1 s.t. 
max{\u21e2\u2032(s),\u21e2\u2032\u2032(s)}\u2264 C\u21e2 for all s\u2208S.\n(b) \u21e2\u2032 is odd with \u21e2\u2032(s)> 0 for all s> 0 and h(s)\u2236= E\u21e3[\u21e2\u2032(s+ \u21e3)] has h\u2032(0)> c\u21e2.\n(c) There is w\uffff\u2208W such that y=\uffffw\uffff, x\uffff+ \u21e3, and \u21e3 is symmetric zero-mean given x.\nt=1 for any algorithm\u0302walg:\n\nthe GD depending on assumptions on the norm\uffff\u22c5\uffff and the data distribution.\nprobability at least 1\u2212 over the draw of the data{(xt, yt)}n\n\u25cf Norm-Based/High-Dimensional Setup. WhenX is the ball for -smooth norm\uffff\u22c5\uffff andW is the\nLD(\u0302walg)\u2212 L\uffff\u2264 \u00b5h\u22c5\uffff\u2207\u0302Ln(\u0302walg)\uffff+ Ch\u221an\n\u25cf Low-Dimensional `2\uffff`2 Setup. WhenX andW are both `2 balls:\nmin(\u2303)\uffff\u00b5l\u22c5\uffff\u2207\u0302Ln(\u0302walg)\uffff2+ Cl\nn\uffff.\n\u25cf Sparse `\u221e\uffff`1 Setup. WhenX is the `\u221e ball,W is the `1 ball, and\uffffw\uffff\uffff1= B:\n min(\u2303)\uffff\u00b5s\u22c5\uffff\u2207\u0302Ln(\u0302walg)\uffff2+ Cs\nn\uffff.\n3The constraint\uffffw\uffff\uffff1= B simpli\ufb01es analysis of generic algorithms in the vein of constrained LASSO [41].\n\ufffft\uffff\u2264 c.\n4For a \ufb01xed parameter c> 0 the biweight loss is de\ufb01ned via \u21e2(t)= c2\n\ufffft\uffff\u2265 c.\n\nLD(\u0302walg)\u2212 L\uffff\u2264\nLD(\u0302walg)\u2212 L\uffff\u2264 \uffffw\uffff\uffff0\n\n6 \u22c5\uffff 1\u2212(1\u2212(t\uffffc)2)3,\n\ndual ball,\n\n1,\n\n1\n\n.\n\n6\n\n\fThe constants Ch\uffffCl\uffffCs and \u00b5h\uffff\u00b5l\uffff\u00b5s depend on(B, R, C\u21e2, c\u21e2,, log(\u22121)), but not explicitly on\nthe dimension d (beyond log factors) or complexity of the classW (beyond the range parameters B\n\nand R).\n\nTheorem 3 and Theorem 4 immediately imply that standard non-convex optimization algorithms for\n\ufb01nding stationary points can be converted to non-convex learning algorithms with optimal sample\ncomplexity; this is 
summarized by the following theorem, focusing on the \u201chigh-dimensional\u201d and\n\u201clow-dimensional\u201d setups above in the case of the `2 norm for simplicity.\n\nconvex generalized linear model (under Assumption 1) and robust regression (under Assumption 2)\nsetting.\n\nd I,\uffff\u22c5\uffff= `2. Consider the following meta-algorithm for the non-\n\nProposition 3. Suppose that \u2303\uffff 1\n\ndimension-independent5 constant.\n\n\" samples{(xt, yt)}n\n\"2\u2227 d\n1. Gather n= 1\nt=1.\n2. Find a point\u0302walg\u2208W with\uffff\u2207\u0302Ln(\u0302walg)\uffff\u2264 O\uffff 1\u221an\uffff, which is guaranteed to exist.\n\nThere are many non-convex optimization algorithms that provably \ufb01nd an approximate stationary\npoint of the empirical risk, including gradient descent [33], SGD [10], and Non-convex SVRG [38, 3].\n\na-priori. We can circumvent this dif\ufb01culty and take advantage of these generic algorithms by instead\n\ufb01nding stationary points of the regularized empirical risk. We show that any algorithm that \ufb01nds a\n\nThis meta-algorithm guarantees E LD(\u0302walg)\u2212 L\uffff \u2264 C\u22c5 \", where C is a problem-dependent but\nNote, however, that these algorithms are not generically guaranteed to satisfy the constraint\u0302walg\u2208W\n(unconstrained) stationary point of the regularized empirical risk indeed the obtains optimal O\uffff 1\n\"2\uffff\nn(w) = \u0302Ln(w)+ \nsetting. Let \u0302L\n2. For any > 0 there is a setting of the regularization\n2\uffffw\uffff2\nn(\u0302walg)= 0 guarantees LD(\u0302walg)\u2212 L\uffff\u2264 \u02dcO\uffff\uffff log(\u22121)n\nparameter such that any \u0302walg with\u2207\u0302L\n\uffff\nwith probability at least 1\u2212 .\n\nsample complexity in the norm-based regime.\nTheorem 9 (informal). 
Suppose we are in the generalized linear model setting or robust regression\n\nSee Appendix C.3 for the full theorem statement and proof.\nNow is a good time to discuss connections to existing work in more detail.\n\nregime is particularly interesting, and goes beyond recent analyses to non-convex statistical\n\nhave unavoidable dimension-dependence. This highlights the power of the norm-based\ncomplexity analysis.\n\na) The sample complexity O\uffff 1\n\"\uffff for Proposition 3 is optimal up to dependence on Lip-\n\"2\u2227 d\nschitz constants and the range parameters B and R [42]. The \u201chigh-dimensional\u201d O\uffff 1\n\"2\uffff\nlearning [31], which use arguments involving pointwise covers of the spaceW and thus\nb) In the low-dimensional O\uffff d\n\"\uffff sample complexity regime, Theorem 3 and Theorem 4 recovers\npossible to guarantee c\u2265 e\u2212BR, and so it may be more realistic to assume BR is constant.\n\"2\uffff for the GLM setting. Our analysis\nc) The GLMtron algorithm of [21] also obtains O\uffff 1\nstationary point \ufb01nding algorithm will do. GLMtron has no guarantees in the O\uffff d\n\"\uffff regime,\n\nthe rates of [31] under the same assumptions\u2014see Appendix C.3 for details. Notably, this is\nthe case even when the radius R is not constant. Note however that when B and R are large\nthe constants in Theorem 3 and Theorem 4 can be quite poor. For the logistic link it is only\n\nwhereas our meta-algorithm works in both high- and low-dimensional regimes. A signi\ufb01cant\nbene\ufb01t of GLMtron, however, is that it does not require a lower bound on the derivative of\nthe link function . 
It is not clear if this assumption can be removed from our analysis.\n\nshows that this sample complexity does not require specialized algorithms; any \ufb01rst-order\n\nd) As an alternative approach, stochastic optimization methods for \ufb01nding \ufb01rst-order stationary\npoints can be used to directly \ufb01nd an approximate stationary point of the population risk\n\n5Whenever B and R are constant.\n\n7\n\n\fregime it is possible to show that stochastic gradient descent (and for general smooth norms,\n\nin Appendix C.3. This approach relies on returning a randomly selected iterate from the\nsequence and only gives an in-expectation sample complexity guarantee, whereas Theorem 9\ngives a high-probability guarantee.\n\n\uffffLD(w)\uffff\u2264 \", so long as they draw a fresh sample at each step. In the high-dimensional\nmirror descent) obtains O\uffff 1\n\"2\uffff sample complexity through this approach; this is sketched\nAlso, note that many stochastic optimization methods can exploit the(2, \u00b5)-GD condition.\nSuppose we are in the low-dimensional regime with \u2303\uffff 1\noptimization method that we are aware of is SNVRG [45], which under the(2, O(d))-GD\ncondition will obtain \" excess risk with O\uffff d\n\n\"+ d3\uffff2\n\"1\uffff2\uffff sample complexity.\n\nd I. The fastest GD-based stochastic\n\nThis discussion is summarized in Table 3.\n\n4 Non-Smooth Models\n\nIn the previous section we used gradient uniform convergence to derive immediate optimization\nand generalization consequences by \ufb01nding approximate stationary points of smooth non-convex\nfunctions. 
In practice—notably in deep learning—it is common to optimize non-smooth non-convex functions; deep neural networks with rectified linear units (ReLUs) are the canonical example [24, 16]. In theory, it is trivial to construct non-smooth functions for which finding approximate stationary points is intractable (see discussion in [2]), but it appears that in practice stochastic gradient descent can indeed find approximate stationary points of the empirical loss in standard neural network architectures [43]. It is desirable to understand whether gradient generalization can occur in this setting.

The first result of this section is a lower bound showing that even for the simplest possible non-smooth model—a single ReLU—it is impossible to achieve dimension-independent uniform convergence results similar to those of the previous section. On the positive side, we show that it is possible to obtain dimension-independent rates under an additional margin assumption.

The full setting is as follows: X ⊆ {x ∈ R^d : ‖x‖₂ ≤ 1}, W ⊆ {w ∈ R^d : ‖w‖₂ ≤ 1}, Y = {−1, +1}, and ℓ(w; x, y) = σ(−⟨w, x⟩ · y), where σ(s) = max{s, 0}; this essentially matches the classical Perceptron setup. Note that the loss is not smooth, and so the gradient is not well-defined everywhere. Thus, to make the problem well-defined, we consider convergence for the following representative from the subgradient: ∇ℓ(w; x, y) := −y · 1{y⟨w, x⟩ ≤ 0} · x.6 Our first theorem shows that gradient uniform convergence for this setup must depend on dimension, even when the weight norm B and data norm R are held constant.

Theorem 5. Under the problem setting defined above, for all n ∈ N there exists a sequence of instances {(x_t, y_t)}_{t=1}^n such that

    E_ε sup_{w∈W} ‖ Σ_{t=1}^n ε_t ∇ℓ(w; x_t, y_t) ‖₂ = Ω(√(dn) ∧ n).

This result contrasts the setting where σ is smooth, where the techniques from Section 2 easily yield a dimension-independent O(√n) upper bound on the Rademacher complexity. This is perhaps not surprising since the gradients are discrete functions of w, and indeed VC-style arguments suffice to establish the lower bound.

6 For general non-convex and non-smooth functions one can extend this approach by considering convergence for a representative from the Clarke sub-differential [7, 8].

In the classical statistical learning setting, the main route to overcoming dimension dependence—e.g., for linear classifiers—is to assume a margin, which allows one to move from a discrete class to a real-valued class upon which a dimension-independent Rademacher complexity bound can be applied [39]. Such arguments have recently been used to derive dimension-independent function value uniform convergence bounds for deep ReLU networks as well [6, 11]. However, this analysis relies on one-sided control of the loss, so it is not clear whether it extends to the inherently directional problem of gradient convergence. Our main contribution in this section is to introduce additional machinery to prove dimension-free gradient convergence under a new type of margin assumption.

Definition 4. Given a distribution P over the support X and an increasing function φ : [0, 1] → [0, 1], any w ∈ W is said to satisfy the φ-soft-margin condition with respect to P if

    ∀γ ∈ [0, 1],  E_{x∼P} [ 1{ |⟨w, x⟩| / (‖w‖₂‖x‖₂) ≤ γ } ] ≤ φ(γ).    (7)

We call φ a margin function.
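Definition 4 can be checked exactly against a finite sample. The sketch below is our own illustration (the helper name, the margin function φ(γ) = γ, and the toy datasets are ours, not from the paper): it tests whether a weight vector satisfies the φ-soft-margin condition with respect to the empirical distribution of a dataset. Since the empirical probability of {margin ≤ γ} only jumps at observed margin values and φ is increasing, it suffices to verify the inequality at those values.

```python
import numpy as np

def satisfies_soft_margin(w, X, phi):
    """Check the phi-soft-margin condition of Definition 4 with respect to
    the empirical distribution of the rows of X."""
    margins = np.abs(X @ w) / (np.linalg.norm(w) * np.linalg.norm(X, axis=1))
    m = np.sort(margins)
    # P_n[margin <= gamma] is a step function that jumps only at observed
    # margin values, and phi is increasing, so checking there is exact.
    emp_cdf = np.searchsorted(m, m, side="right") / len(m)
    return bool(np.all(emp_cdf <= phi(m)))

phi = lambda g: g  # margin function phi(gamma) = gamma
w = np.array([1.0, 0.0])
n = 200
ks = np.arange(1, n + 1) / n

# Unit-norm points whose normalized margins spread over (0.5, 1]:
m_good = 0.5 + 0.5 * ks
X_good = np.stack([m_good, np.sqrt(1.0 - m_good**2)], axis=1)

# Unit-norm points whose normalized margins spread over (0, 0.5]:
m_bad = 0.5 * ks
X_bad = np.stack([m_bad, np.sqrt(1.0 - m_bad**2)], axis=1)

print(satisfies_soft_margin(w, X_good, phi))  # True
print(satisfies_soft_margin(w, X_bad, phi))   # False
```

For the population version of Definition 4 one would replace the empirical average with an expectation over P.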
We define the set of all weights that satisfy the φ-soft-margin condition with respect to a distribution P via:

    W(φ, P) = { w ∈ W : ∀γ ∈ [0, 1], E_{x∼P} [ 1{ |⟨w, x⟩| / (‖w‖₂‖x‖₂) ≤ γ } ] ≤ φ(γ) }.    (8)

Of particular interest is W(φ, D̂_n), the set of all the weights that satisfy the φ-soft-margin condition with respect to the empirical data distribution. That is, any w ∈ W(φ, D̂_n) predicts with at least a γ margin on all but a φ(γ) fraction of the data. The following theorem provides a dimension-independent uniform convergence bound for the gradients over the class W(φ, D̂_n) for any margin function φ fixed in advance.

Theorem 6. Let φ : [0, 1] → [0, 1] be a fixed margin function. With probability at least 1 − δ over the draw of the data {(x_t, y_t)}_{t=1}^n,

    sup_{w ∈ W(φ, D̂_n)} ‖∇L_D(w) − ∇L̂_n(w)‖₂ ≤ Õ( inf_{γ>0} { φ(4γ) + 1/(γ^(1/2) n^(1/4)) } + √(log(1/δ)/n) ),

where Õ(·) hides log log(1/γ) and log n factors.

As a concrete example, when φ(γ) = γ^(1/2), Theorem 6 yields a dimension-independent uniform convergence bound of O(n^(−1/12)), thus circumventing the lower bound of Theorem 5 for large values of d.

5 Discussion

We showed that vector Rademacher complexities are a simple and effective tool for deriving dimension-independent uniform convergence bounds and used these bounds in conjunction with the (population) Gradient Domination property to derive optimal algorithms for non-convex statistical learning in high and infinite dimension. We hope that these tools will find broader use for norm-based capacity control in non-convex learning settings beyond those considered here.
Of particular interest are models where convergence of higher-order derivatives is needed to ensure success of optimization routines. Appendix E contains an extension of Theorem 1 for Hessian uniform convergence, which we anticipate will find use in such settings.

In Section 3 we analyzed generalized linear models and robust regression using both the (1, µ)-GD property and the (2, µ)-GD property. In particular, the (1, µ)-GD property was critical to obtain dimension-independent norm-based capacity control. While there are many examples of models for which the population risk satisfies the (2, µ)-GD property (phase retrieval [40, 44], ResNets with linear activations [13], matrix factorization [29], blind deconvolution [27]), we do not know whether the (1, µ)-GD property holds for these models. Establishing this property and consequently deriving dimension-independent optimization guarantees is an exciting future direction.

Lastly, an important question is to analyze non-smooth problems beyond the simple ReLU example considered in Section 4. See [9] for subsequent work in this direction.

Acknowledgements K.S. acknowledges support from the NSF under grants CDS&E-MSS 1521544 and NSF CAREER Award 1750575, and the support of an Alfred P. Sloan Fellowship. D.F. acknowledges support from the NDSEG PhD fellowship and Facebook PhD fellowship.

References

[1] Zeyuan Allen-Zhu. Natasha 2: Faster Non-Convex Optimization Than SGD. Advances in Neural Information Processing Systems, 2018.

[2] Zeyuan Allen-Zhu. How To Make the Gradients Small Stochastically. Advances in Neural Information Processing Systems, 2018.

[3] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707, 2016.

[4] Peter L Bartlett and John Shawe-Taylor.
Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods, pages 43–54. MIT Press, 1999.

[5] Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

[6] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.

[7] Jonathan Borwein and Adrian S Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media, 2010.

[8] Frank H Clarke. Optimization and nonsmooth analysis, volume 5. SIAM, 1990.

[9] Damek Davis and Dmitriy Drusvyatskiy. Uniform graphical convergence of subgradients in nonconvex optimization and learning. arXiv preprint arXiv:1810.07590, 2018.

[10] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[11] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. Conference on Learning Theory, 2018.

[12] Alon Gonen and Shai Shalev-Shwartz. Fast rates for empirical risk minimization of strict saddle problems. Conference on Learning Theory, 2017.

[13] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. International Conference on Learning Representations, 2017.

[14] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 2018.

[15] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] Prateek Jain and Purushottam Kar. Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning, 10(3-4):142–336, 2017.

[18] Chi Jin, Lydia T Liu, Rong Ge, and Michael I Jordan. Minimizing nonconvex population risk from rough empirical risk. Advances in Neural Information Processing Systems, 2018.

[19] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 21, pages 793–800. MIT Press, 2009.

[20] Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793–800, 2009.

[21] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927–935, 2011.

[22] Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865–1890, 2012.

[23] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[25] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces.
Springer-Verlag, New York, 1991.

[26] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.

[27] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and Computational Harmonic Analysis, 2018.

[28] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.

[29] Huikang Liu, Weijie Wu, and Anthony Man-Cho So. Quadratic optimization with orthogonality constraints: Explicit Łojasiewicz exponent and linear convergence of line-search methods. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of PMLR, pages 1158–1167, 2016.

[30] Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.

[31] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. To appear in Annals of Statistics, 2016.

[32] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2012.

[33] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

[34] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4):1679–1706, 1994.

[35] Gilles Pisier. Martingales with values in uniformly convex spaces. Israel Journal of Mathematics, 20:326–350, 1975. ISSN 0021-2172.

[36] Boris Teodorovich Polyak. Gradient methods for minimizing functionals.
Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.

[37] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[38] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.

[39] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

[40] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2379–2383. IEEE, 2016.

[41] Robert Tibshirani, Martin Wainwright, and Trevor Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.

[42] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

[43] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations, 2017.

[44] Huishuai Zhang, Yuejie Chi, and Yingbin Liang. Median-truncated nonconvex approach for phase retrieval with outliers. International Conference on Machine Learning, 2016.

[45] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization.
Advances in Neural Information Processing Systems, 2018.