Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

Surbhi Goel
Department of Computer Science
University of Texas at Austin
surbhi@cs.utexas.edu

Sushrut Karmalkar
Department of Computer Science
University of Texas at Austin
sushrutk@cs.utexas.edu

Adam R. Klivans
Department of Computer Science
University of Texas at Austin
klivans@cs.utexas.edu

Abstract

We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). Let opt < 1 be the population loss of the best-fitting ReLU. We prove:

• Finding a ReLU with square-loss opt + ε is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply –unconditionally– that gradient descent cannot converge to the global minimum in polynomial time.

• There exists an efficient approximation algorithm for finding the best-fitting ReLU that achieves error O(opt^{2/3}). The algorithm uses a novel reduction to noisy halfspace learning with respect to 0/1 loss.

Prior work due to Soltanolkotabi [Sol17] showed that gradient descent can find the best-fitting ReLU with respect to Gaussian marginals, if the training set is exactly labeled by a ReLU.

1 Introduction

A Rectified Linear Unit (ReLU) is a function parameterized by a weight vector w ∈ R^d that maps R^d → R as follows: ReLU_w(x) = max(0, w · x). ReLUs are now the nonlinearity of choice in modern deep networks. The computational complexity of learning simple neural networks that use the ReLU activation is an intensely studied area, and many positive results rely on assuming that the marginal distribution on the examples is a spherical Gaussian [ZYWG19, GLM18, ZSJ+17, MR18]. Recent work due to Soltanolkotabi [Sol17] shows that gradient descent will learn a single ReLU in polynomial time, if the marginal distribution is Gaussian (see also [BG17]).
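This noiseless guarantee is easy to observe empirically. The following is a hypothetical sketch, not the algorithm analyzed in [Sol17]; the dimension, step size, sample size, and initialization are illustrative assumptions. When the labels are exactly ReLU_{w*}(x) and the marginals are Gaussian, plain gradient descent on the empirical square loss drives the iterate to w*:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, lr, steps = 5, 20_000, 0.5, 300

# Hidden unit-norm target and a training set labeled exactly by its ReLU.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((m, d))          # Gaussian marginals
y = np.maximum(0.0, X @ w_star)          # noiseless labels: exactly a ReLU

# Plain gradient descent on the empirical square loss.
w = 0.1 * rng.standard_normal(d)         # small random initialization
for _ in range(steps):
    margins = X @ w
    residual = np.maximum(0.0, margins) - y
    grad = 2.0 * ((residual * (margins > 0)) @ X) / m
    w -= lr * grad

print(np.linalg.norm(w - w_star))        # close to 0: the target is recovered
```

The success of this sketch depends on the labels being noiseless; the results of this paper show that no comparable polynomial-time guarantee is possible once the labels are arbitrary.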
His result, however, requires that the training set is noiseless; i.e., there is a ReLU that correctly classifies all elements of the training set.

Here we consider the more realistic scenario of empirical risk minimization, or learning a ReLU with noise (often referred to as agnostically learning a ReLU). We assume that a learner has access to a training set from a joint distribution D on R^d × R where the marginal distribution on R^d is Gaussian but the distribution on the labels can be arbitrary within [0, 1]. We define opt = min_{w, ∥w∥≤1} E_{x,y∼D}[(ReLU_w(x) − y)²], and the goal is to output a function of the form max(0, w · x) with square-loss at most opt + ε.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Our Results

Our main results give a trade-off between the accuracy of the output hypothesis and the running time of the algorithm. We give the first evidence that there is no polynomial-time algorithm for finding a ReLU with error opt + ε, even when the marginal distribution is Gaussian:

Theorem 1 (Informal version of Theorem 3). Assuming hardness of the problem of learning sparse parities with noise, any algorithm for finding a ReLU on data drawn from a distribution with Gaussian marginals that has error at most opt + ε runs in time d^{Ω(log(1/ε))}.

Since gradient descent is known to be a statistical-query algorithm (see Section 4), a consequence of Theorem 1 is the following:

Corollary 1. Gradient descent fails to converge to the global minimum for learning the best-fitting ReLU with respect to square-loss in polynomial time, even when the marginals are Gaussian.

The above corollary is unconditional (i.e., it does not rely on any hardness assumptions) and shows the necessity of the realizable/noiseless setting in the work of Soltanolkotabi [Sol17] and Brutzkus and Globerson [BG17].
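For concreteness, the objective in this agnostic setup can be sketched in a few lines (hypothetical code; the uniform label rule below is just one adversarial choice, since the labels may be arbitrary in [0, 1]):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 100_000

# Spherical Gaussian examples; labels arbitrary in [0, 1] (here: uniform noise).
X = rng.standard_normal((m, d))
y = rng.uniform(0.0, 1.0, size=m)

def sq_loss(w, X, y):
    """Empirical square loss of the hypothesis ReLU_w(x) = max(0, w . x)."""
    return np.mean((np.maximum(0.0, X @ w) - y) ** 2)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)          # a candidate with ||w|| <= 1
print(sq_loss(w, X, y))         # the quantity to be driven down to opt + eps
```

Under such labels even the best choice of w may incur a large loss (opt itself is bounded away from 0), which is exactly the regime that Theorems 1 and 2 address.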
We also give the first approximation algorithm for finding the best-fitting ReLU with respect to Gaussian marginals:

Theorem 2 (Informal version of Theorem 5). There exists a polynomial-time algorithm for finding a ReLU with error O(opt^{2/3}) + ε.

The above result uses a novel reduction from learning a ReLU to the problem of learning a halfspace with respect to 0/1 loss. We note that the problem of finding a ReLU with error O(opt) + ε remains an outstanding open problem.

1.2 Our Techniques

Hardness Result. For our hardness result, we follow the same approach as Klivans and Kothari [KK14], who gave a reduction from learning sparse parity with noise to the problem of agnostically learning halfspaces with respect to Gaussian distributions. The idea is to embed examples drawn from {−1, 1}^d into R^d by multiplying each coordinate with a random draw from a half-normal distribution. The key technical component in their result is a correlation lemma showing that for a parity function on variables indicated by an index set S, the majority function on the same index set is weakly correlated with a Gaussian lift of the parity function on S.

In our work we must overcome two technical difficulties. First, in the Klivans and Kothari result, it is obvious that for distributions induced by learning sparse parity with noise, the best-fitting majority function will be the one that is defined on inputs specified by S. In our setting with respect to ReLUs, however, the constant function 1/2 will have square-loss 1/4, and this may be much lower than the square-loss of any function of the form max(0, w · x).
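Both ingredients just described (the half-normal embedding, and the fact that the constant 1/2 is hard to beat) can be checked numerically. This is a hypothetical sketch; the index set S, the dimension, and the noise rate η are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, eta = 10, 200_000, 0.1
S = [0, 1]                                   # hidden parity indices (hypothetical)

# Samples from {-1, +1}^d with noisy parity labels mapped into {0, 1}.
X_bool = rng.choice([-1.0, 1.0], size=(m, d))
chi = np.prod(X_bool[:, S], axis=1)          # parity value in {-1, +1}
flip = rng.random(m) < eta                   # flip each label w.p. eta
y = (chi * np.where(flip, -1.0, 1.0) + 1) / 2

# Half-normal embedding: each coordinate times an independent |N(0, 1)| draw.
G = np.abs(rng.standard_normal((m, d)))
X_lift = G * X_bool
assert np.array_equal(np.sign(X_lift), X_bool)   # signs still encode the bits
print(X_lift.mean(), X_lift.var())               # marginally N(0, 1): near 0 and 1

# The constant predictor 1/2 already achieves square-loss exactly 1/4 here.
print(np.mean((0.5 - y) ** 2))
```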
Thus, we need to prove the existence of a gap between the correlation of ReLUs with random noise (see Claim 3) and the correlation of ReLUs with parity (see Claim 4).

Second, Klivans and Kothari use known formulas for the discrete Fourier coefficients of the majority function and an application of the central limit theorem to analyze how much the best-fitting majority correlates with the Gaussian lift of parity. No such bounds are known, however, for the ReLU function. As such, we must perform a (somewhat involved) analysis of the ReLU function's Hermite expansion in order to obtain quantitative correlation bounds.

Approximation Algorithm. For our polynomial-time algorithm that outputs a ReLU with error O(opt^{2/3}) + ε, we apply a novel reduction to agnostically learning halfspaces. We give a simple transformation of the training set into a Boolean learning problem and show that the weight vector w corresponding to the best-fitting halfspace on this transformed data set is not too far from the weight vector corresponding to the best-fitting ReLU. We can then apply recent work for agnostically learning halfspaces with respect to Gaussians that has constant-factor approximation error guarantees. The exponent 2/3 appears due to the use of an averaging argument (see Section 5).

1.3 Related Work

Several recent works have proved hardness results for finding the best-fitting ReLU with respect to square loss (equivalently, agnostically learning a ReLU with respect to square loss). Results showing NP-hardness (e.g., [MR18, BDL18]) use marginal distributions that encode hard combinatorial problems; the resulting marginals are far from Gaussian. Work due to Goel et al.
[GKKT17] uses a reduction from sparse parity with noise but only obtains hardness results for learning with respect to discrete distributions (uniform on {0, 1}^d).

Using parity functions as a source of hardness for learning deep networks has been explored recently by Shalev-Shwartz et al. [SSSS17] and Abbe and Sandon [AS18]. Their results, however, do not address the complexity of learning a single ReLU or consider the case of Gaussian marginals. Shamir [Sha18] proved that gradient descent fails to learn certain classes of neural networks with respect to Gaussian marginals, but these results do not apply to learning a single ReLU [VW19].

In terms of positive results for learning a ReLU, work due to Kalai and Sastry [KS09] (and follow-up work [KKKS11]) gave the first efficient algorithm for learning any generalized linear model (GLM) that is monotone and Lipschitz, a class that includes ReLUs. Their algorithms work for any distribution and can tolerate bounded, mean-zero, additive noise. Soltanolkotabi [Sol17] and Brutzkus and Globerson [BG17] were the first to prove that gradient descent converges to the unknown ReLU in polynomial time with respect to Gaussian marginals as long as the labels have no noise. Other works for learning one-layer ReLU networks with respect to Gaussian marginals, or marginals satisfying milder distributional assumptions [ZYWG19, GLM18, ZSJ+17, GKLW19, GKM18, MR18], also assume a noiseless training set or a training set with mean-zero i.i.d. (typically sub-Gaussian) noise. This is in contrast to the setting here (agnostic learning), where we assume nothing about the noise model.

There are several works for the related (but different) problem of agnostically learning halfspaces with respect to Gaussian marginals [KKMS08, ABL14, Zha18, DKS18].
While agnostically learning ReLUs may seem like an easier problem than agnostically learning halfspaces (at first glance the learner sees "more information" from the ReLU's real-valued labels), the quantitative relationship between the two problems is still open. In the halfspace setting, we can assume without loss of generality that an adversary has flipped an opt fraction of the labels. In contrast, in the setting with ReLUs and square loss, it is possible for the adversary to corrupt every label.

2 Preliminaries

Define ReLU(a) = max(0, a) and the set of functions C_ReLU := {ReLU_w | w ∈ R^d, ∥w∥_2 ≤ 1}, where ReLU_w(x) = max(0, w · x). Define sign(a) to be 1 if a ≥ 0 and −1 otherwise. Let err_D(h) := E_{(x,y)∼D}[(h(x) − y)²]. Also define opt_D(C) = min_{c∈C} err_D(c) to be the error of the best-fitting c ∈ C for distribution D. We will use x_{−i} to denote the vector x restricted to the indices other than i. The "half-normal" distribution will refer to the standard normal distribution truncated to R_{≥0}. We will use n and its subscripted versions to denote natural numbers unless otherwise stated. In this paper, we will suppress the confidence parameter δ, since one can use standard techniques to amplify the probability of success of our learning algorithms.

Agnostic learning. The model of learning we work with in this paper is the agnostic model of learning. In this model the labels are allowed to be arbitrary, and the task of the learner is to output a hypothesis within an ε error of the optimal. More formally,

Definition 1.
A class C is said to be agnostically learnable in time t over the Gaussian distribution to error ε if there exists an algorithm A such that for any distribution D on X × Y with the marginal on X being Gaussian, A uses at most t draws from D, runs in time at most t, and outputs a hypothesis h ∈ C such that err_D(h) ≤ opt_D(C) + ε.

We assume that A succeeds with constant probability. Note that the algorithm above outputs the "best-fitting" c ∈ C with respect to D up to an additive ε. We will denote by êrr_S(h) the empirical error of h over a sample S.

Learning Sparse Parities with Noise. In this work we will show that agnostically learning C_ReLU over the Gaussian distribution is as hard as the problem of learning sparse parities with noise over the uniform distribution on the hypercube.

Definition 2 (k-SLPN). Given access to samples drawn from the uniform distribution over {±1}^d with target function y being the parity function over an unknown set S ⊆ [d] of size k, the problem of learning sparse parities with noise is the problem of recovering the set S given access to noisy labels, where each label is flipped with probability η.

Learning sparse parities with noise is generally considered to be a computationally hard problem and has been used to give hardness results for both supervised [GKKT17] and unsupervised [BGS14] learning problems. The current best known algorithm for solving sparse parities with constant noise rate is due to Valiant [Val15] and runs in time ≈ d^{0.8k}.

Assumption 1. Any algorithm for solving k-SLPN up to constant error must run in time d^{Ω(k)}.

Gaussian Lift of a Function. Our reduction will require the following definition of a Gaussian lift of a Boolean function from [KK14].

Definition 3 (Gaussian lift [KK14]).
The Gaussian lift of a function f : {±1}^d → R is the function f^γ : R^d → R such that for any x ∈ R^d, f^γ(x) = f(sign(x_1), . . . , sign(x_d)).

Hermite Analysis and Gaussian Density. We will assume that the marginal over our samples x is the standard normal distribution N(0, I_d). This implies that w · x for a vector w is distributed as N(0, ∥w∥²). We recall the basics of Hermite analysis. We say a function f : R → R is square integrable if E_{N(0,1)}[f²] < ∞. For any square integrable function f, define its Hermite expansion as f(x) = Σ_{i=0}^∞ f̂_i H̄_i(x), where the H̄_i(x) = H_i(x)/√(i!) are the normalized Hermite polynomials and the H_i are the unnormalized (probabilists') Hermite polynomials. The normalized Hermite polynomials form an orthonormal basis with respect to the univariate standard normal distribution (E[H̄_i(x) H̄_j(x)] = δ_ij). The associated inner product for square integrable functions f, g : R → R is defined as ⟨f, g⟩ := E_{x∼N(0,1)}[f(x)g(x)]. Each coefficient f̂_i in the expansion of f(x) satisfies f̂_i = E_{x∼N(0,1)}[f(x) H̄_i(x)]. We will need the following facts about Hermite polynomials.

Fact 1. For all m ≥ 0, H_{2m+1}(0) = 0 and H_{2m}(0) = (−1)^m (2m)!/(m! 2^m).

Fact 2 ([KKMS08]). ŝign_0 = 0 and for i ≥ 1, ŝign_i = √(2/(π i!)) · H_{i−1}(0).

3 Hardness of Learning ReLU

In this section, we will show that if there is an algorithm that agnostically learns a ReLU in polynomial time, then there is an algorithm for learning sparse parities with noise in time d^{o(k)}, violating Assumption 1. We follow the approach of [KK14]. Let χ_S be an unknown parity for some S ⊆ [d]. We will show that there is an unbiased ReLU that is correlated with the Gaussian lift of the unknown sparse parity function.
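Before the formal argument, this correlation can be sanity-checked by Monte Carlo (a hypothetical sketch with k = 2, the smallest size of the form 4l + 2 used in Lemma 1 below; the sample size is an arbitrary choice). The expectation estimated here is the one that drives the proof of Lemma 1:

```python
import numpy as np

rng = np.random.default_rng(2)
k, m = 2, 1_000_000            # k = 4l + 2 with l = 0

# Estimate E[ ReLU(sum_{i in S} z_i / sqrt(k)) * prod_{i in S} sign(z_i) ]
# for z ~ N(0, I_k); the proof below shows this is at least 2^{-O(k)}.
Z = rng.standard_normal((m, k))
relu_part = np.maximum(0.0, Z.sum(axis=1) / np.sqrt(k))
parity_part = np.prod(np.sign(Z), axis=1)
corr = float(np.mean(relu_part * parity_part))
print(corr)                    # clearly positive (roughly 0.165 for k = 2)
```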
Notice that dropping a coordinate j ∈ S from the input samples makes the labels of the resulting training set totally independent of the input. In contrast, dropping j ∉ S results in a training set that is still labeled by a noisy parity. Therefore, we can use an agnostic learner for ReLUs to detect a correlated ReLU and distinguish between the two cases. This allows us to identify the variables in S one by one.

We formalize the above approach by first proving the following key property.

Lemma 1 (ReLU Correlation Lemma). Let χ^γ_S denote the Gaussian lift of the parity on the variables in S ⊂ [d]. For every S ⊂ [d] with |S| ≤ k and k = 4l + 2 for some l ≥ 0, there exists ReLU_{w_S} such that ⟨ReLU_{w_S}, χ^γ_S⟩ ≥ 2^{−O(k)}, where ReLU_{w_S} depends only on the variables in S.

Proof. Let $w_S = \frac{1}{\sqrt{2\pi k}} \sum_{i \in S} e^{(i)}$, where $e^{(i)}$ is 1 at coordinate i and 0 everywhere else. We will show that

$$\langle \mathrm{ReLU}_{w_S}, \chi^\gamma_S \rangle = \frac{1}{\sqrt{2\pi}}\, \mathbb{E}_{z \sim N(0, I_d)}\!\left[ \mathrm{ReLU}\!\left(\frac{\sum_{i \in S} z_i}{\sqrt{k}}\right) \prod_{i \in S} \mathrm{sign}(z_i) \right] \ge 2^{-O(k)}.$$

Let $\widehat{\mathrm{sign}}_n$ and $\widehat{\mathrm{ReLU}}_n$ denote the degree-$n$ Hermite coefficients of the sign function and the ReLU function, respectively. It is easy to see that the Hermite expansion of the Gaussian lift of a parity supported on S is

$$\chi^\gamma_S(z) = \prod_{i \in S} \mathrm{sign}(z_i) = \prod_{i \in S} \left( \sum_{n=0}^{\infty} \widehat{\mathrm{sign}}_n \bar{H}_n(z_i) \right) = \sum_{n_1, \ldots, n_k} \prod_{i \in S} \widehat{\mathrm{sign}}_{n_i} \bar{H}_{n_i}(z_i). \quad (1)$$

In order to finish the proof of Lemma 1 we will need the expansion of $\mathrm{ReLU}\big(\sum_i z_i/\sqrt{k}\big)$ in terms of products of univariate Hermite polynomials. Toward this end we establish the following two claims (see proofs in the supplemental).

Claim 1 (Hermite expansion: univariate ReLU). $\widehat{\mathrm{ReLU}}_0 = 1/\sqrt{2\pi}$, $\widehat{\mathrm{ReLU}}_1 = 1/2$, and for $i \ge 2$, $\widehat{\mathrm{ReLU}}_i = \frac{1}{\sqrt{2\pi i!}}\,(H_i(0) + i H_{i-2}(0))$.

Claim 2 (Hermite expansion: multivariate ReLU). For any S ⊆ [d] with |S| = k,

$$\mathrm{ReLU}\!\left(\frac{\sum_{i \in S} z_i}{\sqrt{k}}\right) = \sum_{n=0}^{\infty} \frac{\widehat{\mathrm{ReLU}}_n}{k^{n/2}} \sum_{n_1 + \cdots + n_k = n} \left( \frac{n!}{n_1! \cdots n_k!} \right)^{1/2} \prod_{j=1}^{k} \bar{H}_{n_j}(z_j).$$

Combining Equation 1 and Claim 2 with the orthonormality of the normalized Hermite polynomials ($\mathbb{E}[\bar{H}_n(z_i)\bar{H}_m(z_i)] = \delta_{nm}$) now yields

$$\mathbb{E}_{z \sim N(0,I_d)}\!\left[ \mathrm{ReLU}\!\left(\frac{\sum_{i\in S} z_i}{\sqrt{k}}\right) \prod_{i \in S} \mathrm{sign}(z_i) \right] = \sum_{n=0}^{\infty} \frac{\widehat{\mathrm{ReLU}}_n}{k^{n/2}} \sum_{n_1 + \cdots + n_k = n} \left( \frac{n!}{n_1! \cdots n_k!} \right)^{1/2} \prod_{i=1}^{k} \widehat{\mathrm{sign}}_{n_i}.$$

From Fact 2 and Claim 1 we see that $\widehat{\mathrm{sign}}_{2m} = 0$ and $\widehat{\mathrm{ReLU}}_{2m+1} = 0$ for $m \ge 1$. Additionally, since $\widehat{\mathrm{sign}}_0 = 0$, each $n_i \ge 1$. This gives us

$$\mathbb{E}_{z}\!\left[ \mathrm{ReLU}\!\left(\frac{\sum_{i\in S} z_i}{\sqrt{k}}\right) \prod_{i \in S} \mathrm{sign}(z_i) \right] = \sum_{n=k}^{\infty} \frac{H_n(0) + n H_{n-2}(0)}{\sqrt{2\pi n!}\, k^{n/2}} \sum_{\substack{n_1,\ldots,n_k \ge 1 \\ n_1 + \cdots + n_k = n}} \left( \frac{n!}{n_1! \cdots n_k!} \right)^{1/2} \prod_{j=1}^{k} \sqrt{\frac{2}{\pi n_j!}}\, H_{n_j - 1}(0).$$

To finish the proof of Lemma 1, we will look at each term in the outer summation above; let $T_n$ denote the term for a fixed $n \ge k$. Since $H_i(0) = 0$ for odd $i$, observe that $T_n$ is non-zero if and only if $n$ is even and each $n_i = 2n'_i + 1$ for some $n'_i \ge 0$. Using Fact 1 to evaluate $H_n(0) + n H_{n-2}(0) = (-1)^{1 + n/2}\, \frac{n!}{(n-1)\,(n/2)!\, 2^{n/2}}$ and each $H_{n_j - 1}(0) = H_{2n'_j}(0)$, we have

$$T_n = \frac{(-1)^{1 + k/2}}{\sqrt{2\pi}\, k^{n/2}} \cdot \frac{n!}{(n-1)\,(n/2)!\, 2^{n/2}} \left( \frac{2}{\pi} \right)^{k/2} \sum_{\substack{n'_1,\ldots,n'_k \ge 0 \\ n'_1 + \cdots + n'_k = (n-k)/2}} \prod_{j=1}^{k} \frac{(2n'_j)!}{n'_j!\, 2^{n'_j}\, (2n'_j + 1)!}.$$

Since $k = 4l + 2$ (by assumption), $(-1)^{1+k/2} = 1$, so $T_n > 0$ for all even $n \ge k$ and $T_n = 0$ for all odd $n$. Thus $\sum_{n=k}^{\infty} T_n \ge T_k$. Lower bounding $T_k$: for $n = k$ the inner sum has a single term ($n'_1 = \cdots = n'_k = 0$), equal to 1, so with $k = 4l + 2$,

$$T_k = \frac{(4l+2)!}{\sqrt{2\pi}\,(4l+2)^{2l+1}\,(4l+1)\,(2l+1)!\,\pi^{2l+1}} \approx \frac{1}{\sqrt{\pi}\,(4l+1)} \left( \frac{2}{e\pi} \right)^{2l+1} = 2^{-O(k)},$$

where the approximation uses Stirling's formula, $(4l+2)! \approx \sqrt{2\pi(4l+2)}\,\big(\tfrac{4l+2}{e}\big)^{4l+2}$ and $(2l+1)! \approx \sqrt{2\pi(2l+1)}\,\big(\tfrac{2l+1}{e}\big)^{2l+1}$.

Now we present our main algorithm (Algorithm 1), which reduces learning sparse parities with noise to agnostically learning ReLUs, and a proof of its correctness.

Algorithm 1: Learning Sparse Parities with Noise using an Agnostic ReLU Learner
Input: training set S of M_1 samples (x_i, y_i)_{i=1}^{M_1}, validation set V of M_2 samples (x_i, y_i)_{i=M_1+1}^{M_1+M_2}, error parameter ε, agnostic ReLU learner A
Output: set of relevant variables V_rel
1: Set V_rel := ∅
2: Set S_1, . . . , S_d, V_1, . . . , V_d := ∅
3: for i = 1 to M_1 + M_2 do
4:   Draw d independent univariate half-Gaussians g_1, . . . , g_d
5:   Construct x′ such that for all j ∈ [d], x′_j := g_j (x_i)_j, and set y′ := (y_i + 1)/2
6:   For all j ∈ [d], if i ≤ M_1 add (x′_{−j}, y′) to S_j, else to V_j
7: for j ∈ [d] do
8:   Run A on S_j to obtain hypothesis h_j
9:   Compute êrr_{V_j}(h_j)
10:  if êrr_{V_j}(h_j) ≥ 1/2 − 1/(4π) − ε/4 then
11:    Add j to V_rel
12: Return V_rel

Theorem 3. If there is an algorithm to agnostically learn unbiased ReLUs over the Gaussian distribution in time and samples T(d, 1/ε), then there is an algorithm to solve k-SLPN in time O((2^{O(k)}/(1−2η)²) log d) + O(d) · T(d, 2^{O(k)}/(1−2η)), where η is the noise rate.

In particular, if Assumption 1 is true, then any algorithm for agnostically learning (unbiased) ReLUs over the Gaussian distribution must run in time d^{Ω(log(1/ε))}.

Proof.
Given a set of samples from the k-SLPN problem, we claim that Algorithm 1 can recover all indices j belonging to the sparse parity when run with appropriate parameters. We will first show that if a variable is relevant then the error is larger than when it is irrelevant. It is easy to see that y′ equals (∏_{i∈S} sign(x′_i) + 1)/2 with probability 1 − η, and 1 − (∏_{i∈S} sign(x′_i) + 1)/2 otherwise. Let D_j denote the distribution obtained by dropping the j-th coordinate from the lifted distribution, and let S denote the set of active indices of the parity. The proof of the theorem follows from the following claims.

Claim 3. If j ∈ S then for all w, err_{D_j}(ReLU_w) = ∥w∥²/2 − ∥w∥/√(2π) + 1/2 ≥ 1/2 − 1/(4π).

Claim 4. If j ∉ S then there exists w* with ∥w*∥ = 1/√(2π) such that err_{D_j}(ReLU_{w*}) < 1/2 − 1/(4π) − 2^{−O(k)}/(1 − 2η).

Claims 3 and 4 imply that we have a gap of at least 2^{−ck}/(1 − 2η), for some c > 0, between the relevant- and irrelevant-variable cases. Setting ε = 2^{−ck}/(1 − 2η) in Algorithm 1 will let us detect this gap. Since A is an agnostic learner for ReLUs, as long as M_1 = T(d, 2/ε) we know that with probability 2/3, for all j ∉ S, A runs on S_j and outputs h_j such that err_{D_j}(h_j) ≤ min_w err_{D_j}(ReLU_w) + ε/2 ≤ 1/2 − 1/(4π) − ε/2, while for all j ∈ S, err_{D_j}(h_j) ≥ 1/2 − 1/(4π). Using standard concentration inequalities for sub-Gaussian and sub-exponential random variables [Ver], we see that using a validation set of M_2 = 100/ε² samples, we have for all j, |êrr_{V_j}(h_j) − err_{D_j}(h_j)| ≤ ε/4. Therefore, we can differentiate the two cases as in the algorithm with confidence > 1/2.
It is easy to see that the running time of the algorithm is O(d)T(d, 2/ε) + O(1/ε²), and that this can be amplified to obtain an algorithm with any desired confidence using standard techniques.

4 Lower Bounds for SQ Algorithms

A consequence of Theorem 3 is that any statistical-query algorithm for agnostically learning a ReLU with respect to Gaussian marginals yields a statistical-query algorithm for learning parity functions on k unknown input bits. This implies that there is no polynomial-time statistical-query (SQ) algorithm that learns a ReLU with respect to Gaussian marginals for a certain restricted class of queries. We present the formal theorem and defer the proof to the supplemental.

Theorem 4. Any SQ algorithm for agnostically learning a ReLU with respect to any distribution D with Gaussian marginals over the attributes requires d^{Ω(log(1/ε))} unit-norm correlation queries, or queries independent of the target, with tolerance 1/poly(d, 1/ε) to an oracle that returns τ-approximate expectations with respect to D.

Remark: Note that this implies there is no d^{o(log(1/ε))}-time gradient descent algorithm that can agnostically learn ReLU(w · x), under the reasonable assumption that for every i the gradients of E_{(x,y)∼D}[(ReLU_w(ν(x)_{−i}) − (y+1)/2)²] can be computed by O(d) queries whose norms are polynomially bounded.

5 Approximation Algorithm

In this section we give a learning algorithm that runs in time polynomial in all input parameters and outputs a ReLU that has error O(opt^{2/3}) + ε, where opt is the error of the best-fitting ReLU. The main reduction is a hard thresholding of the labels to create a training set with Boolean labels. We then apply a recent result giving a polynomial-time approximation algorithm for
agnostically learning halfspaces over the Gaussian distribution due to Awasthi et al. [ABL14]. We present our algorithm and give a proof of its correctness.

Algorithm 2
Input: training set S of m samples (x_i, y_i)_{i=1}^m, the agnostic halfspace learning algorithm A from [ABL14], and a parameter α
Output: weight vector ŵ
1: Construct S′ := {(x, sign(y − α)) | (x, y) ∈ S}.
2: Run A on S′ to recover ŵ close in err_{0/1}.
3: Return ŵ

Theorem 5. There is an algorithm (Algorithm 2) that, given poly(d, 1/ε) samples (x, y) such that x is drawn from N(0, I_d) and y ∈ [0, 1], recovers a unit vector w such that err(ReLU_w) ≤ O(opt^{2/3}) + ε, where opt := min_{∥w∥=1} err(ReLU_w).

Proof. Let w* = argmin_{∥w∥=1} err(ReLU_w), so that err(ReLU_{w*}) = opt. Define S_good to be the set of points that are α-close to the optimal ReLU, i.e., S_good = {x : |y − ReLU_{w*}(x)| ≤ α}. By Markov's inequality,

Pr[x ∉ S_good] = Pr[|y − ReLU_{w*}(x)| ≥ α] ≤ opt/α².

This implies that all but an opt/α² fraction of the points are α-close to their corresponding y's. In the first step of Algorithm 2, the labels become Boolean. Define the 0/1 error of the vector w as err_{0/1}(w) = Pr[sign(y − α) ≠ sign(w · x)]. Let w† be the argmin of err_{0/1}(w) over all vectors w with ∥w∥_2 ≤ 1. Since for all elements in S_good \ {v : w* · v ∈ (0, 2α)} we have sign(y − α) = sign(w* · x),

err_{0/1}(w*) ≤ Pr[x ∉ S_good \ {v : w* · v ∈ (0, 2α)}]
            ≤ Pr[x ∉ S_good] + Pr[x ∈ {v : w* · v ∈ (0, 2α)}]
            ≤ opt/α² + (1/√(2π)) ∫_0^{2α} e^{−g²/2} dg
            ≤ opt/α² + 2α.

We now apply Theorem 8 from [ABL14], which gives an algorithm with running time polynomial in d and 1/ε that outputs a w such that ∥w∥ = 1 and ∥w − w†∥ ≤ O(opt/α² + 2α) + ε. For unit vectors a, b, θ(a, b) < C Pr[sign(a · x) ≠ sign(b · x)] for some absolute constant C, where θ(a, b) is the angle between the vectors (see Lemma 2 in [ABL14]). The triangle inequality and the fact that ∥a − b∥ ≤ θ(a, b) imply that if err_{0/1}(a), err_{0/1}(b) < η then ∥a − b∥ ≤ C Pr[sign(a · x) ≠ sign(b · x)] ≤ O(η). Applying this to w† and w* yields ∥w† − w*∥ < O(opt/α² + 2α). Since the ReLU function is 1-Lipschitz, we have

err(ReLU_w) = E[(y − ReLU(w · x))²]
            ≤ 2E[(y − ReLU(w* · x))²] + 2E[(ReLU(w* · x) − ReLU(w · x))²]
            ≤ 2opt + 2E[((w* − w) · x)²]
            = 2opt + 2∥w* − w∥²
            ≤ O(opt + (opt/α² + 2α)²) + ε.

Setting α = opt^{1/3} and rescaling ε, we have err(ReLU_w) ≤ O(opt^{2/3}) + ε.

6 Conclusions and Open Problems

We have shown hardness for solving the empirical risk minimization problem for just one ReLU with respect to Gaussian distributions, and we have given the first nontrivial approximation algorithm. Can we achieve approximation O(opt) + ε? Note that our results hold only for the case of unbiased ReLUs, as the constant function 1/2 may achieve smaller square-loss than any unbiased ReLU.
Interestingly, all positive results that we are aware of for learning ReLUs (or one-layer ReLU networks) with respect to Gaussians also assume the ReLU activations are unbiased (e.g., [BG17, Sol17, GKM18, GKLW19, GLM18, ZYWG19]). How difficult is the biased case?

Acknowledgments

Surbhi Goel and Adam R. Klivans were supported by NSF Award CCF-1717896. Sushrut Karmalkar was supported by NSF Award CNS-1414023.

References

[ABL14] Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 449–458. ACM, 2014.

[AS18] Emmanuel Abbe and Colin Sandon. Provable limitations of deep learning. CoRR, abs/1812.06369, 2018.

[BDL18] Digvijay Boob, Santanu S. Dey, and Guanghui Lan. Complexity of training ReLU neural network. CoRR, abs/1809.10787, 2018.

[BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning, pages 605–614. JMLR.org, 2017.

[BGS14] Guy Bresler, David Gamarnik, and Devavrat Shah. Structure learning of antiferromagnetic Ising models. In Advances in Neural Information Processing Systems 27, pages 2852–2860, 2014.

[DKS18] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2018), pages 1061–1073.
ACM, 2018.

[GKKT17] Surbhi Goel, Varun Kanade, Adam R. Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, volume 65 of Proceedings of Machine Learning Research, pages 1004–1042. PMLR, 2017.

[GKLW19] Rong Ge, Rohith Kuditipudi, Zhize Li, and Xiang Wang. Learning two-layer neural networks with symmetric inputs. In International Conference on Learning Representations, 2019.

[GKM18] Surbhi Goel, Adam R. Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 1778–1786. JMLR.org, 2018.

[GLM18] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In International Conference on Learning Representations, 2018.

[KK14] Adam Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2014.

[KKKS11] Sham M. Kakade, Adam Kalai, Varun Kanade, and Ohad Shamir. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, pages 927–935, 2011.

[KKMS08] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.

[KS09] Adam Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.

[MR18] Pasin Manurangsi and Daniel Reichman. The computational complexity of training ReLU(s). CoRR, abs/1810.04207, 2018.

[Sha18] Ohad Shamir.
Distribution-specific hardness of learning neural networks. Journal of Machine Learning Research, 19:32:1–32:29, 2018.

[Sol17] Mahdi Soltanolkotabi. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017, 2017.

[SSSS17] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, pages 3067–3075, 2017.

[Val15] Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem. Journal of the ACM (JACM), 62(2):13, 2015.

[Ver] Roman Vershynin. Four lectures on probabilistic methods for data science.

[VW19] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and SQ lower bounds. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 3115–3117. PMLR, 2019.

[Zha18] Chicheng Zhang. Efficient active learning of sparse halfspaces. In Conference On Learning Theory, COLT 2018, volume 75 of Proceedings of Machine Learning Research, pages 1856–1880. PMLR, 2018.

[ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4140–4149. JMLR.org, 2017.

[ZYWG19] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu.
Learning one-hidden-layer ReLU networks via gradient descent. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1524–1534, 2019.