{"title": "Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity", "book": "Advances in Neural Information Processing Systems", "page_first": 15558, "page_last": 15569, "abstract": "We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate arbitrary $N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $\\Omega(\\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $\\Theta(\\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an $L$-layer network with $W$ parameters in the hidden layers can memorize $N$ data points if $W = \\Omega(N)$. Combined with a recent upper bound $O(WL\\log W)$ on VC dimension, our construction is nearly tight for any fixed $L$. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of $N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.", "full_text": "Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

Chulhee Yun
MIT
Cambridge, MA 02139
chulheey@mit.edu

Suvrit Sra
MIT
Cambridge, MA 02139
suvrit@mit.edu

Ali Jadbabaie
MIT
Cambridge, MA 02139
jadbabai@mit.edu

Abstract

We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require N hidden nodes to memorize/interpolate arbitrary N data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with Ω(√N) hidden nodes can perfectly memorize most datasets with N points. We also prove that width Θ(√N) is necessary and sufficient for memorizing N data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an L-layer network with W parameters in the hidden layers can memorize N data points if W = Ω(N). Combined with a recent upper bound O(WL log W) on VC dimension, our construction is nearly tight for any fixed L. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of N hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.

1 Introduction

Recent results in deep learning indicate that over-parameterized neural networks can memorize arbitrary datasets [2, 53]. This phenomenon is closely related to the expressive power of neural networks, which have long been studied as universal approximators [12, 18, 21]. These results suggest that sufficiently large neural networks are expressive enough to fit any dataset perfectly. With the widespread use of deep networks, recent works have focused on better understanding the power of depth [13, 17, 30, 33, 37, 38, 44, 45, 49, 50].
However, most existing results consider expressing functions (i.e., infinitely many points) rather than a finite number of observations; thus, they do not provide a precise understanding of the memorization ability of finite-sized networks.

When studying finite sample memorization, several questions arise: Is a neural network capable of memorizing arbitrary datasets of a given size? How large must a neural network be to possess such capacity? These questions are the focus of this paper, and we answer them by studying universal finite sample expressivity and memorization capacity; these concepts are formally defined below.

Definition 1.1. We define (universal) finite sample expressivity of a neural network f_θ(·) (parametrized by θ) as the network's ability to satisfy the following condition:

For all inputs {x_i}_{i=1}^N ∈ R^{d_x × N} and for all {y_i}_{i=1}^N ∈ [−1, +1]^{d_y × N}, there exists a parameter θ such that f_θ(x_i) = y_i for 1 ≤ i ≤ N.

We define memorization capacity of a network to be the maximum value of N for which the network has finite sample expressivity when d_y = 1.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Memorization capacity is related to, but is different from, the VC dimension of neural networks [3, 4]. Recall the definition of the VC dimension of a neural network f_θ(·):

The maximum value N such that there exists a dataset {x_i}_{i=1}^N ∈ R^{d_x × N} such that for all {y_i}_{i=1}^N ∈ {±1}^N there exists θ such that f_θ(x_i) = y_i for 1 ≤ i ≤ N.

Notice that the key difference between memorization capacity and VC dimension is in the quantifiers in front of the x_i's.
Memorization capacity is always less than or equal to the VC dimension, which means that an upper bound on the VC dimension is also an upper bound on memorization capacity.

The study of finite sample expressivity and memorization capacity of neural networks has a long history, dating back to the days of perceptrons [6, 11, 22–24, 26, 36, 42, 48]; however, the older studies focus on shallow networks with traditional activations such as sigmoids, delivering limited insight for deep ReLU networks. Since the advent of deep learning, some recent results on modern architectures have appeared, e.g., fully-connected neural networks (FNNs) [53], residual networks (ResNets) [20], and convolutional neural networks (CNNs) [35]. However, they impose assumptions on architectures that are neither practical nor realistic. For example, they require a hidden layer as wide as the number of data points N [35, 53], or as many hidden nodes as N [20], causing their theoretical results to be applicable only to very large neural networks; this can be unrealistic especially when N is large.

1.1 Summary of our contributions

Before stating our contributions, a brief comment on "network size" is in order. The size of a neural network can be somewhat vague; it could mean width/depth, the number of edges, or the number of hidden nodes. We use "size" to refer to the number of hidden nodes in a network. This also applies to notions related to size; e.g., by a "small network" we mean a network with a small number of hidden nodes. For other measures of size such as width, we will use those words explicitly.

1. Finite sample expressivity of neural networks. Our first set of results is on the finite sample expressivity of FNNs (Section 3), under the assumption that the data points x_i are distinct.
For simplicity, we only summarize our results for ReLU networks, but they include hard-tanh networks as well.

• Theorem 3.1 shows that any 3-layer (i.e., 2-hidden-layer) ReLU FNN with hidden layer widths d_1 and d_2 can fit any arbitrary dataset if d_1 d_2 ≥ 4N d_y, where N is the number of data points and d_y is the output dimension. For scalar outputs, this means d_1 = d_2 = 2√N suffices to fit arbitrary data. This width requirement is significantly smaller than existing results on ReLU.

• The improvement is more dramatic for classification. If we have d_y classes, Proposition 3.2 shows that a 4-layer ReLU FNN with hidden layer widths d_1, d_2, and d_3 can fit any dataset if d_1 d_2 ≥ 4N and d_3 ≥ 4d_y. This means that 10^6 data points in 10^3 classes (e.g., ImageNet) can be memorized by a 4-layer FNN with hidden layer widths 2k-2k-4k.

• For d_y = 1, note that Theorem 3.1 shows a lower bound of Ω(d_1 d_2) on memorization capacity. We prove a matching upper bound in Theorem 3.3: we show that for shallow neural networks (2 or 3 layers), lower bounds on memorization capacity are tight.

• Proposition 3.4 extends Theorem 3.1 to deeper and/or narrower networks, and shows that if the sum of the number of edges between pairs of adjacent layers satisfies d_{l_1} d_{l_1+1} + ··· + d_{l_m} d_{l_m+1} = Ω(N d_y), then universal finite sample expressivity holds. This gives a lower bound Ω(W) on memorization capacity, where W is the number of edges in the network. Due to an upper bound O(WL log W) (L is depth) on VC dimension [4], our lower bound is almost tight for fixed L.

Next, in Section 4, we focus on classification using ResNets; here d_x denotes the input dimension and d_y the number of classes. We assume here that the data lies in general position.

• Theorem 4.1 proves that deep ResNets with 4N/d_x + 6d_y ReLU hidden nodes can memorize arbitrary datasets.
Using the same proof technique, we also show in Corollary 4.2 that a 2-layer ReLU FNN can memorize arbitrary classification datasets if d_1 ≥ 4N/d_x + 4d_y. With the general position assumption, we can reduce the existing requirements of N to a more realistic number.

2. Trajectory of SGD near memorizing global minima. Finally, in Section 5 we study the behavior of stochastic gradient descent (SGD) on the empirical risk of universally expressive FNNs.

• Theorem 5.1 shows that for any differentiable global minimum that memorizes, SGD initialized close enough (say ε away) to the minimum quickly finds a point that has empirical risk O(ε^4) and is at most 2ε far from the minimum. We emphasize that this theorem holds not only for the memorizers explicitly constructed in Sections 3 and 4, but for all global minima that memorize. We note that we analyze without-replacement SGD, which is closer to practice than the simpler with-replacement version [19, 40]; thus, our analysis may be of independent interest in optimization.

1.2 Related work

Universal finite sample expressivity of neural networks. Literature on finite sample expressivity and memorization capacity of neural networks dates back to the 1960s. Earlier results [6, 11, 26, 36, 42] study the memorization capacity of linear threshold networks.

Later, results on 2-layer FNNs with sigmoids [24] and other bounded activations [23] show that N hidden nodes are sufficient to memorize N data points. It was later shown that the requirement of N hidden nodes can be improved by exploiting depth [22, 48]. Since these two works are highly relevant to our own results, we defer a detailed discussion/comparison until we present the precise theorems (see Sections 3.2 and 3.3).

With the advent of deep learning, there have been new results on modern activation functions and architectures. Zhang et al.
[53] prove that one-hidden-layer ReLU FNNs with N hidden nodes can memorize N real-valued data points. Hardt and Ma [20] show that deep ResNets with N + d_y hidden nodes can memorize arbitrary d_y-class classification datasets. Nguyen and Hein [35] show that deep CNNs with one of the hidden layers as wide as N can memorize N real-valued data points.

Soudry and Carmon [43] show that under a dropout noise setting, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. However, this result is not comparable to ours because they assume that there is a multiplicative "dropout noise" at each hidden node and each data point. At the i-th node of the l-th layer, the slope of the activation function for the j-th data point is either ε^(j)_{i,l} · 1 (if the input is positive) or ε^(j)_{i,l} · s (if the input is negative, s ≠ 0), where ε^(j)_{i,l} is the multiplicative random (e.g., Gaussian) dropout noise. Their theorem statements hold for all realizations of these dropout noise factors except a set of measure zero. In contrast, our setting is free of these noise terms, and hence corresponds to a specific realization of such ε^(j)_{i,l}'s.

Convergence to global minima. There exist numerous papers that study convergence of gradient descent or SGD to global optima of neural networks. Many previous results [9, 14, 29, 41, 46, 54, 55] study settings where data points are sampled from a distribution (e.g., Gaussian), and labels are generated from a "teacher network" that has the same architecture as the one being trained (i.e., realizability). Here, the goal of training is to recover the unknown (but fixed) true parameters. In comparison, we consider arbitrary datasets and networks, under a mild assumption (especially for overparametrized networks) that the network can memorize the data; the results are not directly comparable.
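As a concrete reference point for the width-N results cited above, a one-hidden-layer memorizer in the spirit of Zhang et al. [53] can be written out in a few lines. The following is a minimal sketch, not the construction of this paper (whose Ω(√N) construction is far more intricate); the sizes, the random projection direction, and the bias placement are illustrative choices. It projects the inputs to distinct scalars, places one ReLU breakpoint between consecutive projections, and solves the resulting triangular linear system for the output weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dx = 20, 5                         # toy sizes (illustrative)
X = rng.normal(size=(N, dx))          # distinct inputs x_1, ..., x_N
y = rng.uniform(-1, 1, size=N)        # arbitrary real-valued targets

# Project inputs to distinct scalars; a random direction works with probability 1.
a = rng.normal(size=dx)
z = X @ a
order = np.argsort(z)
zs, ys = z[order], y[order]

# Biases placed so that hidden unit j activates exactly on points j, j+1, ..., N-1.
b = np.empty(N)
b[0] = zs[0] - 1.0
b[1:] = (zs[:-1] + zs[1:]) / 2

# Solve the lower-triangular system sum_j w_j * relu(z_i - b_j) = y_i.
A = np.maximum(zs[:, None] - b[None, :], 0.0)
w = np.linalg.solve(A, ys)

def f(x):
    """One-hidden-layer ReLU network with N hidden nodes."""
    return np.maximum(x @ a - b, 0.0) @ w

preds = np.array([f(xi) for xi in X[order]])
assert np.allclose(preds, ys)         # all N points are memorized exactly
```

Each hidden unit here is responsible for "correcting" the output at a single data point, which is exactly why this approach needs width N and why beating it requires a different strategy.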
Others [10, 47] study SGD on hinge loss under the rather strong assumption that the data is linearly separable.

Other recent results [1, 15, 16, 28, 58] focus on over-parameterized neural networks. In these papers, the widths of hidden layers are assumed to be huge, of polynomial order in N, such as Ω(N^4), Ω(N^6), or even greater. Although these works provide insights on how GD/SGD finds global minima easily, their width requirement is still far from being realistic.

A recent work [57] provides a mixture of observation and theory about convergence to global minima. The authors assume that networks can memorize the data, and that SGD follows a star-convex path to global minima, which they validate through experiments. Under these assumptions, they prove convergence of SGD to global minimizers. We believe our result is complementary: we provide sufficient conditions for networks to memorize the data, and our result does not assume anything about SGD's path but proves that SGD can find a point close to the global minimum.

Remarks on generalization. The ability of neural networks to memorize and generalize at the same time has been one of the biggest mysteries of deep learning [53]. Recent results on interpolation and the "double descent" phenomenon indicate that memorization may not necessarily mean lack of generalization [5, 7, 8, 31, 32, 34]. We note that our paper focuses mainly on the ability of neural networks to memorize the training dataset, and that our results are separate from the discussion of generalization.

2 Problem setting and notation

In this section, we introduce the notation used throughout the paper. For integers a and b, a < b, we denote [a] := {1, ..., a} and [a : b] := {a, a + 1, ..., b}. We denote by {(x_i, y_i)}_{i=1}^N the set of training data points, and our goal is to choose the network parameters θ so that the network output f_θ(x_i) is equal to y_i for all i ∈ [N].
Let d_x and d_y denote input and output dimensions, respectively. Given input x ∈ R^{d_x}, an L-layer fully-connected neural network computes output f_θ(x) as follows:

a_0(x) = x,
z_l(x) = W_l a_{l−1}(x) + b_l,   a_l(x) = σ(z_l(x)),   for l ∈ [L − 1],
f_θ(x) = W_L a_{L−1}(x) + b_L.

Let d_l (for l ∈ [L − 1]) denote the width of the l-th hidden layer. For convenience, we write d_0 := d_x and d_L := d_y. Here, z_l ∈ R^{d_l} and a_l ∈ R^{d_l} denote the input and output (a for activation) of the l-th hidden layer, respectively. The output of a hidden layer is the entry-wise map of the input by the activation function σ. The bold-cased symbols denote parameters: W_l ∈ R^{d_l × d_{l−1}} is the weight matrix, and b_l ∈ R^{d_l} is the bias vector. We define θ := (W_l, b_l)_{l=1}^L to be the collection of all parameters. We write the network output as f_θ(·) to emphasize that it depends on the parameters θ.

Our results in this paper consider piecewise linear activation functions. Among them, Sections 3 and 4 consider ReLU-like (σ_R) and hard-tanh (σ_H) activations, defined as follows:

σ_R(t) := s_+ t if t ≥ 0, and s_− t if t < 0,   where s_+ > s_− ≥ 0;
σ_H(t) := −1 if t ≤ −1, t if t ∈ (−1, 1], and 1 if t > 1,
        = (σ_R(t + 1) − σ_R(t − 1) − s_+ − s_−) / (s_+ − s_−).

Note that σ_R includes ReLU and Leaky ReLU. The hard-tanh activation (σ_H) is a piecewise linear approximation of tanh. Since σ_H can be represented with two σ_R, any result on hard-tanh networks can be extended to ReLU-like networks with twice the width.

3 Finite sample expressivity of FNNs

In this section, we study universal finite sample expressivity of FNNs. For the training dataset, we make the following mild assumption that ensures consistent labels:

Assumption 3.1.
In the dataset {(x_i, y_i)}_{i=1}^N, assume that all x_i's are distinct and all y_i ∈ [−1, 1]^{d_y}.

3.1 Main results

We first state the main theorems on shallow FNNs, showing tight lower and upper bounds on memorization capacity. A detailed discussion will follow in the next subsection.

Theorem 3.1. Consider any dataset {(x_i, y_i)}_{i=1}^N that satisfies Assumption 3.1. If

• a 3-layer hard-tanh FNN f_θ satisfies 4⌊d_1/2⌋⌊d_2/(2d_y)⌋ ≥ N; or
• a 3-layer ReLU-like FNN f_θ satisfies 4⌊d_1/4⌋⌊d_2/(4d_y)⌋ ≥ N,

then there exists a parameter θ such that y_i = f_θ(x_i) for all i ∈ [N].

Theorem 3.1 shows that if d_1 d_2 = Ω(N d_y) then we can memorize arbitrary datasets; this means that Ω(√(N d_y)) hidden nodes are sufficient for memorization, in contrast to the Ω(N d_y) requirements of recent results. By adding one more hidden layer, the next theorem shows that we can perfectly memorize any classification dataset using Ω(√N + d_y) hidden nodes.

Proposition 3.2. Consider any dataset {(x_i, y_i)}_{i=1}^N that satisfies Assumption 3.1. Assume that y_i ∈ {0, 1}^{d_y} is the one-hot encoding of d_y classes. Suppose one of the following holds:

• a 4-layer hard-tanh FNN f_θ satisfies 4⌊d_1/2⌋⌊d_2/2⌋ ≥ N, and d_3 ≥ 2d_y; or
• a 4-layer ReLU-like FNN f_θ satisfies 4⌊d_1/4⌋⌊d_2/4⌋ ≥ N, and d_3 ≥ 4d_y.

Then, there exists a parameter θ such that y_i = f_θ(x_i) for all i ∈ [N].

Notice that for scalar regression (d_y = 1), Theorem 3.1 proves a lower bound on the memorization capacity of 3-layer neural networks: Ω(d_1 d_2). The next theorem shows that this bound is in fact tight.

Theorem 3.3. Consider FNNs with d_y = 1 and piecewise linear activation σ with p pieces.
If

• a 2-layer FNN f_θ satisfies (p − 1)d_1 + 2 < N; or
• a 3-layer FNN f_θ satisfies p(p − 1)d_1 d_2 + (p − 1)d_2 + 2 < N,

then there exists a dataset {(x_i, y_i)}_{i=1}^N satisfying Assumption 3.1 such that for all θ, there exists i ∈ [N] such that y_i ≠ f_θ(x_i).

Theorems 3.1 and 3.3 together show tight lower and upper bounds Θ(d_1 d_2) on the memorization capacity of 3-layer FNNs, which differ only in constant factors. Theorem 3.3 and the existing result on 2-layer FNNs [53, Theorem 1] also show that the memorization capacity of 2-layer FNNs is Θ(d_1).

Proof ideas. The proof of Theorem 3.1 is based on an intricate construction of parameters. Roughly speaking, we construct parameters that make each data point have its own unique activation pattern in the hidden layers; more details are in Appendix B. The proof of Proposition 3.2 is largely based on Theorem 3.1. By assigning each class j a unique real number ρ_j (which is similar to the trick in Hardt and Ma [20]), we modify the dataset into a 1-D regression dataset; we then fit this dataset using the techniques in Theorem 3.1, and use the extra layer to recover the one-hot representation of the original y_i. Please see Appendix C for the full proof. The main proof idea of Theorem 3.3 is based on counting the number of "pieces" in the network output f_θ(x) (as a function of x), inspired by Telgarsky [44]. For the proof, please see Appendix D.

3.2 Discussion

Depth-width tradeoffs for finite samples. Theorem 3.1 shows that if the two ReLU hidden layers satisfy d_1 = d_2 = 2√(N d_y), then the network can fit a given dataset perfectly. Proposition 3.2 is an improvement for classification, which shows that a 4-layer ReLU FNN can memorize any d_y-class classification dataset if d_1 = d_2 = 2√N and d_3 = 4d_y.

As in other expressivity results, our results show that there are depth-width tradeoffs in the finite sample setting. For ReLU FNNs it is known that one hidden layer with N nodes can memorize any scalar regression (d_y = 1) dataset with N points [53]. By adding a hidden layer, the hidden node requirement is reduced to 4√N, and Theorem 3.3 also shows that Θ(√N) hidden nodes are necessary and sufficient. The ability to memorize N data points with N nodes is perhaps not surprising, because the weights of each hidden node can be tuned to memorize a single data point. In contrast, the fact that width-2√N networks can memorize is far from obvious; each hidden node must handle √N/2 data points on average, so a more elaborate construction is required.

For d_y-class classification, by adding one more hidden layer, the requirement is improved from 4√(N d_y) to 4√N + 4d_y nodes. This again highlights the power of depth in expressive power. Proposition 3.2 tells us that we can fit ImageNet¹ (N ≈ 10^6, d_y = 10^3) with three ReLU hidden layers, using only 2k-2k-4k nodes. This "sufficient" size for memorization is surprisingly smaller (disregarding optimization aspects) than practical networks.

Implications for ERM. It is widely observed in experiments that deep neural networks can achieve zero empirical risk, but a concrete understanding of this phenomenon is still elusive. It is known that all local minima are global minima for the empirical risk of linear neural networks [25, 27, 51, 52, 56], but this property fails to extend to nonlinear neural networks [39, 52]. This suggests that studying the gap between local minima and global minima could provide explanations for the success of deep neural networks.
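The width conditions discussed earlier in this subsection can be sanity-checked arithmetically. The sketch below plugs the ImageNet-scale numbers from the text into the ReLU-like conditions of Theorem 3.1 and Proposition 3.2; it verifies only the stated inequalities, nothing about the constructions themselves.

```python
import math

N, dy = 10**6, 10**3                  # ImageNet-scale: ~1M points, 1k classes

# Proposition 3.2 (ReLU-like, 4 layers): 4*floor(d1/4)*floor(d2/4) >= N and d3 >= 4*dy
d1, d2, d3 = 2000, 2000, 4000         # the "2k-2k-4k" widths from the text
assert 4 * (d1 // 4) * (d2 // 4) >= N
assert d3 >= 4 * dy

# Theorem 3.1 for scalar regression (dy = 1): d1 = d2 = 2*sqrt(N) suffices,
# via the condition 4*floor(d1/4)*floor(d2/(4*dy)) >= N.
d = 2 * math.isqrt(N)                 # 2*sqrt(N) = 2000 for N = 10**6
assert 4 * (d // 4) * (d // 4) >= N
```

For N = 10^6 the scalar-regression widths come out to exactly 2000-2000, matching the 2k-2k pattern above.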
In order to study the gap, however, we have to know the risk value attained by global minima, which is already non-trivial even for shallow neural networks. In this regard, our theorems provide theoretical guarantees that even a shallow and narrow network can have zero empirical risk at its global minima, regardless of data and loss functions; e.g., in a regression setting, for a 3-layer ReLU FNN with d_1 = d_2 = 2√(N d_y) there exists a global minimum that has zero empirical risk.

The number of edges. We note that our results do not contradict the common "insight" that at least N edges are required to memorize N data points. Our "small" network means a small number of hidden nodes, and it still has more than N edges. The existing result [53] requires (d_x + 2)N edges, while our construction for ReLU requires 4N + (2d_x + 6)√N + 1 edges, which is much fewer.

¹ After omitting the inconsistently labeled items.

Relevant work on sigmoid. Huang [22] proves that 2-hidden-layer sigmoid FNNs with d_1 = N/K + 2K and d_2 = K, where K is a positive integer, can approximate N arbitrary distinct data points. The author first partitions the N data points into K groups of size N/K each. Then, from the fact that the sigmoid function is strictly increasing and non-polynomial, it is shown that if the weights between the input and the first hidden layer are sampled randomly, then the output matrix of the first hidden layer for each group is full rank with probability one. This is not the case for ReLU or hard-tanh, because they have "flat" regions in which rank can be lost.
In addition, Huang [22] requires 2K extra hidden nodes in d_1 that serve as "filters" which let only certain groups of data points pass through. Our construction is not an extension of this result because we take a different strategy (Appendix B); we carefully choose parameters (instead of sampling) that achieve memorization with d_1 = N/K and d_2 = K (in the hard-tanh case) without the need of the extra 2K nodes, which enjoys a smaller width requirement and allows for more flexibility in the architecture. Moreover, we provide a converse result (Theorem 3.3) showing that our construction is rate-optimal in the number of hidden nodes.

3.3 Extension to deeper and/or narrower networks

What if the network is deeper than three layers and/or narrower than √N? Our next theorem shows that universal finite sample expressivity is not limited to 3-layer neural networks, and is still achievable by exploiting depth even for narrower networks.

Proposition 3.4. Consider any dataset {(x_i, y_i)}_{i=1}^N that satisfies Assumption 3.1. For an L-layer FNN with hard-tanh activation (σ_H), assume that there exist indices l_1, ..., l_m ∈ [L − 2] that satisfy

• l_j + 1 < l_{j+1} for j ∈ [m − 1],
• 4 Σ_{j=1}^m ⌊(d_{l_j} − r_j)/2⌋ ⌊(d_{l_j+1} − r_j)/(2d_y)⌋ ≥ N, where r_j = d_y 1{j > 1} + 1{j < m}, for j ∈ [m],
• d_k ≥ d_y for all k ∈ [l_m + 2 : L − 1],
• d_k ≥ d_y + 1 for all k ∈ ∪_{j∈[m−1]} [l_j + 2 : l_{j+1} − 1],

where 1{·} is the 0-1 indicator function. Then, there exists θ such that y_i = f_θ(x_i) for all i ∈ [N].

As a special case, note that for L = 3 (hence m = 1), the conditions boil down to those of Theorem 3.1. An immediate corollary of this fact is that the same result holds for ReLU(-like) networks with twice the width.
Moreover, using the same proof technique as in Proposition 3.2, this theorem can also be improved for classification datasets, by inserting one additional hidden layer between layer l_m + 1 and the output layer. Due to space limits, we defer the statements of these corollaries to Appendix A.

The proof of Proposition 3.4 is in Appendix E. We use Theorem 3.1 as a building block and construct a network (see Figure 2 in the appendix) that fits a subset of the dataset at each pair of hidden layers l_j–(l_j + 1). If any two adjacent hidden layers satisfy d_l d_{l+1} = Ω(N d_y), this network can fit N data points (m = 1), even when all the other hidden layers have only one hidden node. Even with networks narrower than √(N d_y) (thus m > 1), we can still achieve universal finite sample expressivity as long as there are Ω(N d_y) edges between disjoint pairs of adjacent layers. However, we pay the "cost" r_j in the width of the hidden layers; this is because we fit subsets of the dataset using multiple pairs of layers. To do this, we need r_j extra nodes to propagate input and output information to the subsequent layers. For more details, please refer to the proof.

Proposition 3.4 gives a lower bound Ω(Σ_{l=1}^{L−2} d_l d_{l+1}) on the memorization capacity of L-layer networks. For fixed input/output dimensions, this is indeed Ω(W), where W is the number of edges in the network. On the other hand, Bartlett et al. [4] showed an upper bound O(WL log W) on VC dimension, which is also an upper bound on memorization capacity. Thus, for any fixed L, our lower bound is nearly tight. We conjecture that, as we have proved in the 2- and 3-layer cases, the memorization capacity is Θ(W), independent of L; we leave closing this gap for future work.

For sigmoid FNNs, Yamasaki [48] claimed that a scalar regression dataset can be memorized if d_x⌈d_1/2 − 1⌉ + ⌊d_1/2⌋⌈d_2/2 − 1⌉ + ··· + ⌊d_{L−2}/2⌋⌈d_{L−1}/2⌉ ≥ N. However, this claim was made under the stronger assumption of data lying in general position (see Assumption 4.1). Unfortunately, Yamasaki [48] does not provide a full proof of the claim, making it impossible to validate the veracity of their construction (and we could not find their extended manuscript elsewhere).

4 Classification under the general position assumption

This section presents results specialized to the multi-class classification task under a slightly stronger assumption, namely the general position assumption. Since we only consider classification in this section, we also assume that y_i ∈ {0, 1}^{d_y} is the one-hot encoding of d_y classes.

Assumption 4.1. For a finite dataset {(x_i, y_i)}_{i=1}^N, assume that no d_x + 1 data points lie on the same affine hyperplane. In other words, the data points x_i are in general position.

We consider residual networks (ResNets), defined by the following architecture:

h_0(x) = x,
h_l(x) = h_{l−1}(x) + V_l σ(U_l h_{l−1}(x) + b_l) + c_l,   l ∈ [L − 1],
g_θ(x) = V_L σ(U_L h_{L−1}(x) + b_L) + c_L,

which is similar to the previous work by Hardt and Ma [20], except for the extra bias parameters c_l. In this model, we denote the number of hidden nodes in the l-th residual layer as d_l; e.g., U_l ∈ R^{d_l × d_x}. We now present a theorem showing that any dataset can be memorized with small ResNets.

Theorem 4.1. Consider any dataset {(x_i, y_i)}_{i=1}^N that satisfies Assumption 4.1.
Assume also that\n\n\u2022 a hard-tanh ResNet g\u03b8 satis\ufb01es(cid:80)L\u22121\n\u2022 a ReLU-like ResNet g\u03b8 satis\ufb01es(cid:80)L\u22121\n\n+ 2dy and dL \u2265 dy; or\n+ 4dy and dL \u2265 2dy.\n\nhidden nodes (i.e.,(cid:80)L\u22121\n\nl=1 dl \u2265 2N\nl=1 dl \u2265 4N\nThen, there exists \u03b8 such that yi = g\u03b8(xi) for all i \u2208 [N ].\nThe previous work by Hardt and Ma [20] proves universal \ufb01nite sample expressivity using N + dy\nl=1 dl \u2265 N and dL \u2265 dy) for ReLU activation, under the assumption that xi\u2019s\nare distinct unit vectors. Note that neither this assumption nor Assumption 4.1 implies the other;\nhowever, our assumption is quite mild in the sense that for any given dataset, adding small random\nGaussian noise to xi\u2019s makes the dataset satisfy the assumption, with probability 1.\nThe main idea for the proof is that under the general position assumption, for any choice of dx points\nthere exists an af\ufb01ne hyperplane that contains only these dx points. Each hidden node can choose dx\ndata points and \u201cpush\u201d them to the right direction, making perfect classi\ufb01cation possible. We defer\nthe details to Appendix F.1. Using the same technique, we can also prove an improved result for\n2-layer (1-hidden-layer) FNNs. The proof of the following corollary can be found in Appendix F.2.\nCorollary 4.2. Consider any dataset {(xi, yi)}N\ni=1 that satis\ufb01es Assumption 4.1. Suppose one of the\nfollowing holds:\n\ndx\n\ndx\n\n+ 2dy; or\n\n\u2022 a 2-layer hard-tanh FNN f\u03b8 satis\ufb01es d1 \u2265 2N\n\u2022 a 2-layer ReLU-like FNN f\u03b8 satis\ufb01es d1 \u2265 4N\nThen, there exists \u03b8 such that yi = f\u03b8(xi) for all i \u2208 [N ].\nOur results show that under the general position assumption, perfect memorization is possible with\nonly \u2126(N/dx + dy) hidden nodes rather than N, in both ResNets and 2-layer FNNs. 
Considering that $d_x$ is typically on the order of hundreds or thousands, our results reduce the hidden node requirement down to more realistic network sizes. For example, consider the CIFAR-10 dataset: $N = 50{,}000$, $d_x = 3{,}072$, and $d_y = 10$. Previous results require at least 50k ReLUs to memorize this dataset, while our results require 126 ReLUs for ResNets and 106 ReLUs for 2-layer FNNs.

5 Trajectory of SGD near memorizing global minima

In this section, we study the behavior of without-replacement SGD near memorizing global minima. We restrict $d_y = 1$ for simplicity. We use the same notation as defined in Section 2, and introduce some additional definitions here. We assume that each activation function $\sigma$ is piecewise linear with at least two pieces (e.g., ReLU or hard-tanh). Throughout this section, we slightly abuse notation and let $\theta$ denote the concatenation of the vectorizations of all the parameters $(W^l, b^l)_{l=1}^L$.

We are interested in minimizing the empirical risk $R(\theta)$, defined as

$$R(\theta) := \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i); y_i),$$

where $\ell(z; y) : \mathbb{R} \mapsto \mathbb{R}$ is the loss function parametrized by $y$. We assume the following:

Assumption 5.1. The loss function $\ell(z; y)$ is a strictly convex and three times differentiable function of $z$. Also, for any $y$, there exists $z \in \mathbb{R}$ such that $z$ is a global minimum of $\ell(z; y)$.

Assumption 5.1 on $\ell$ is satisfied by standard losses such as the squared error loss. Note that the logistic loss does not satisfy Assumption 5.1 because its global minimum is not attained at any finite $z$.

Given the assumption on $\ell$, we now formally define the memorizing global minimum.

Definition 5.1.
A point $\theta^*$ is a memorizing global minimum of $R(\cdot)$ if $\ell'(f_{\theta^*}(x_i); y_i) = 0$ for all $i \in [N]$.

By convexity, $\ell'(f_{\theta^*}(x_i); y_i) = 0$ for all $i$ implies that $R(\theta)$ is (globally) minimized at $\theta^*$. Also, the existence of a memorizing global minimum of $R$ implies that all global minima are memorizing.

Although $\ell$ is a differentiable function of $z$, the empirical risk $R(\theta)$ is not necessarily differentiable in $\theta$ because we use piecewise linear activations. In this paper, we only consider differentiable points of $R(\cdot)$; since nondifferentiable points lie in a set of measure zero and SGD never reaches such points in practice, this is a reasonable assumption.

We consider minimizing the empirical risk $R(\theta)$ using without-replacement mini-batch SGD. We use $B$ for the mini-batch size, so it takes $E := N/B$ steps to go over the $N$ data points in the dataset. For simplicity, we assume that $N$ is a multiple of $B$. At iteration $t = kE$, the algorithm partitions the dataset at random into $E$ sets of cardinality $B$: $\mathcal{B}^{(kE)}, \mathcal{B}^{(kE+1)}, \ldots, \mathcal{B}^{(kE+E-1)}$, and uses these sets to estimate gradients. After each epoch (one pass through the dataset), the data is "reshuffled" and a new partition is used. Without-replacement SGD is known to be more difficult to analyze than with-replacement SGD (see [19, 40] and references therein), although it is more widely used in practice.

More concretely, our SGD algorithm uses the update rule $\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta g^{(t)}$, where we fix the step size $\eta$ to be a constant throughout the entire run and $g^{(t)}$ is the gradient estimate

$$g^{(t)} = \frac{1}{B} \sum_{i \in \mathcal{B}^{(t)}} \ell'(f_{\theta^{(t)}}(x_i); y_i) \nabla_\theta f_{\theta^{(t)}}(x_i).$$

For each $k$, $\bigcup_{t=kE}^{kE+E-1} \mathcal{B}^{(t)} = [N]$. Note also that if $B = N$, we recover vanilla gradient descent.

Now consider a memorizing global minimum $\theta^*$. We define vectors $\nu_i := \nabla_\theta f_{\theta^*}(x_i)$ for all $i \in [N]$.
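The without-replacement scheme just described is easy to state in code. The following is a minimal NumPy sketch that runs it on a hypothetical least-squares model $f_\theta(x) = \theta^T x$ with squared loss, a stand-in for the networks of this section, chosen so that a memorizing global minimum exists (the data interpolate exactly at $\theta^*$).

```python
import numpy as np

def without_replacement_sgd(theta, grad_fn, N, B, eta, epochs, rng):
    """Each epoch draws a fresh random partition of [N] into E = N/B batches
    and takes one constant-step-size step per batch."""
    E = N // B                        # steps per epoch (assume B divides N)
    for _ in range(epochs):
        perm = rng.permutation(N)     # reshuffle once per epoch
        for batch in perm.reshape(E, B):
            theta = theta - eta * grad_fn(theta, batch)
    return theta

rng = np.random.default_rng(0)
N, dx = 32, 4
X = rng.standard_normal((N, dx))
theta_star = rng.standard_normal(dx)
y = X @ theta_star                    # interpolable data: zero risk at theta_star

def grad_fn(theta, batch):            # g = (1/B) Σ_{i∈B} ℓ'(fθ(xi); yi) ∇θ fθ(xi)
    residual = X[batch] @ theta - y[batch]
    return X[batch].T @ residual / len(batch)

theta = without_replacement_sgd(np.zeros(dx), grad_fn, N, B=8, eta=0.1,
                                epochs=500, rng=rng)
print(np.allclose(theta, theta_star, atol=1e-4))  # True
```

Setting `B=N` collapses the inner loop to a single full-batch step per epoch, recovering vanilla gradient descent as noted above.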
We can then express any iterate $\theta^{(t)}$ of SGD as $\theta^{(t)} = \theta^* + \xi^{(t)}$, and then further decompose the "perturbation" $\xi^{(t)}$ as the sum of two orthogonal components $\xi^{(t)}_{\parallel}$ and $\xi^{(t)}_{\perp}$, where $\xi^{(t)}_{\parallel} \in \mathrm{span}(\{\nu_i\}_{i=1}^N)$ and $\xi^{(t)}_{\perp} \in \mathrm{span}(\{\nu_i\}_{i=1}^N)^{\perp}$. Also, for a vector $v$, let $\|v\|$ denote its $\ell_2$ norm.

5.1 Main results and discussion

We now state the main theorem of the section. For the proof, please refer to Appendix G.

Theorem 5.1. Suppose a memorizing global minimum $\theta^*$ of $R(\theta)$ is given, and that $R(\cdot)$ is differentiable at $\theta^*$. Then, there exist positive constants $\rho$, $\gamma$, $\lambda$, and $\tau$ satisfying the following: if the initialization $\theta^{(0)}$ satisfies $\|\xi^{(0)}\| \le \rho$, then

$$R(\theta^{(0)}) - R(\theta^*) = O(\|\xi^{(0)}\|^2),$$

and SGD with step size $\eta < \gamma$ satisfies

$$\|\xi^{(kE+E)}_{\parallel}\| \le (1 - \eta\lambda)\|\xi^{(kE)}_{\parallel}\|, \quad \text{and} \quad \|\xi^{(kE+E)}\| \le \|\xi^{(kE)}\| + \eta\lambda\|\xi^{(kE)}_{\parallel}\|,$$

as long as $\|\xi^{(t)}_{\parallel}\| \ge \tau\|\xi^{(t)}\|^2$ holds for all $t \in [kE, kE + E - 1]$.
As a consequence, at the first iterate $t^* \ge 0$ at which the condition $\|\xi^{(t)}_{\parallel}\| \ge \tau\|\xi^{(t)}\|^2$ is violated, we have

$$\|\xi^{(t^*)}\| \le 2\|\xi^{(0)}\|, \quad \text{and} \quad R(\theta^{(t^*)}) - R(\theta^*) \le C\|\xi^{(0)}\|^4,$$

for some positive constant $C$.

The full description of the constants $\rho$, $\gamma$, $\lambda$, $\tau$, and $C$ can be found in Appendix G. They depend on a number of terms, such as $N$, $B$, the Taylor expansions of the loss $\ell(f_{\theta^*}(x_i); y_i)$ and the network output $f_{\theta^*}(x_i)$ around the memorizing global minimum $\theta^*$, and the maximum and minimum strictly positive eigenvalues of $H = \sum_{i=1}^N \ell''(f_{\theta^*}(x_i); y_i)\, \nu_i \nu_i^T$. The constant $\rho$ must be small enough that, as long as $\|\xi\| \le \rho$, the slopes of the piecewise linear activation functions evaluated at the data points $x_i$ do not change from $\theta^*$ to $\theta^* + \xi$.

Notice that for a small perturbation $\xi$, the Taylor expansion of the network output is written as $f_{\theta^*+\xi}(x_i) = f_{\theta^*}(x_i) + \nu_i^T \xi_{\parallel} + O(\|\xi\|^2)$, because $\nu_i \perp \xi_{\perp}$ by definition. From this perspective, Theorem 5.1 shows that if initialized near a global minimum, the component of the perturbation $\xi$ that induces a first-order perturbation of $f_{\theta^*}(x_i)$, namely $\xi_{\parallel}$, decays exponentially fast until SGD finds a nearby point that has much smaller risk ($O(\|\xi^{(0)}\|^4)$) than the initialization ($O(\|\xi^{(0)}\|^2)$).
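The decomposition of $\xi$ used here, and the fact that only $\xi_{\parallel}$ contributes at first order, can be checked numerically. In this small sketch the $\nu_i$ are random stand-ins for the true gradients $\nabla_\theta f_{\theta^*}(x_i)$.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 10, 4                       # number of parameters, number of data points
nu = rng.standard_normal((N, P))   # rows play the role of ν_i = ∇θ f_{θ*}(x_i)
xi = rng.standard_normal(P)        # a perturbation ξ = θ - θ*

# Split ξ into ξ∥ ∈ span({ν_i}) and ξ⊥ ∈ span({ν_i})^⊥ via orthogonal projection.
Q, _ = np.linalg.qr(nu.T)          # orthonormal basis for span({ν_i})
xi_par = Q @ (Q.T @ xi)
xi_perp = xi - xi_par

print(np.isclose(xi_par @ xi_perp, 0.0))   # True: the components are orthogonal
print(np.allclose(nu @ xi, nu @ xi_par))   # True: ν_iᵀξ = ν_iᵀξ∥, so ξ⊥ has no
                                           # first-order effect on f_{θ*}(x_i)
```

The second check is exactly why the theorem tracks $\xi_{\parallel}$: perturbing along $\xi_{\perp}$ leaves every $\nu_i^T \xi$ unchanged, so it only affects the outputs at second order.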
Note also that our result is completely deterministic, and independent of the partitions of the dataset taken by the algorithm; the theorem holds true even if the algorithm is not "stochastic" and just cycles through the dataset in a fixed order without reshuffling.

We would like to emphasize that Theorem 5.1 holds for any memorizing global minimum of FNNs, not only for the ones explicitly constructed in Sections 3 and 4. Moreover, the result does not depend on the network size or the data distribution. As long as the global minimum memorizes the data, our theorem holds without any depth/width requirements or distributional assumptions, a noteworthy difference that makes our result hold in more realistic settings than existing ones.

The remaining question is: what happens after $t^*$? Unfortunately, if $\|\xi^{(t)}_{\parallel}\| \le \tau\|\xi^{(t)}\|^2$, we cannot ensure exponential decay of $\|\xi^{(t)}_{\parallel}\|$, especially if it is small. Without exponential decay, one cannot show an upper bound on $\|\xi^{(t)}\|$ either. This means that after $t^*$, SGD may even diverge or oscillate near the global minimum. Fully understanding the behavior of SGD after $t^*$ appears to be a more difficult problem, which we leave for future work.

6 Conclusion and future work

In this paper, we show that fully-connected neural networks (FNNs) with $\Omega(\sqrt{N})$ hidden nodes are expressive enough to perfectly memorize $N$ arbitrary data points, which is a significant improvement over recent results in the literature. We also prove the converse, stating that $\Omega(\sqrt{N})$ nodes are necessary; these two results together provide tight bounds on the memorization capacity of neural networks. We further extend our expressivity results to deeper and/or narrower networks, providing a nearly tight bound on memorization capacity for these networks as well.
Under the assumption that the data points are in general position, we prove that classification datasets can be memorized with $\Omega(N/d_x + d_y)$ hidden nodes in deep residual networks and one-hidden-layer FNNs, reducing the existing requirement of $\Omega(N)$. Finally, we study the dynamics of stochastic gradient descent (SGD) on the empirical risk, and show that if SGD is initialized near a global minimum that perfectly memorizes the data, it quickly finds a nearby point with small empirical risk. Several future topics remain open, e.g., 1) tight bounds on memorization capacity for deep FNNs and other architectures, and 2) a deeper understanding of SGD dynamics in the presence of memorizing global minima.

Acknowledgments

We thank Alexander Rakhlin for helpful discussions. All the authors acknowledge support from DARPA Lagrange. Chulhee Yun also thanks the Korea Foundation for Advanced Studies for their support. Suvrit Sra also acknowledges support from an NSF-CAREER grant and an Amazon Research Award.

References

[1] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[2] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233–242, 2017.

[3] P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. In Advances in Neural Information Processing Systems, pages 190–196, 1999.

[4] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019. URL http://jmlr.org/papers/v20/17-612.html.

[5] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler.
Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.

[6] E. B. Baum. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3):193–215, 1988.

[7] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.

[8] M. Belkin, A. Rakhlin, and A. B. Tsybakov. Does data interpolation contradict statistical optimality? arXiv preprint arXiv:1806.09471, 2018.

[9] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In International Conference on Machine Learning, pages 605–614, 2017.

[10] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations, 2018.

[11] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, (3):326–334, 1965.

[12] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[13] O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.

[14] S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.

[15] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[16] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[17] R. Eldan and O.
Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.

[18] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.

[19] J. Z. HaoChen and S. Sra. Random shuffling beats SGD after finite epochs. arXiv preprint arXiv:1806.10077, 2018.

[20] M. Hardt and T. Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017.

[21] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[22] G.-B. Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274–281, 2003.

[23] G.-B. Huang and H. A. Babri. Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural Networks, 9(1):224–229, 1998.

[24] S.-C. Huang and Y.-F. Huang. Bounds on the number of hidden neurons in multilayer perceptrons. IEEE Transactions on Neural Networks, 2(1):47–55, 1991.

[25] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[26] A. Kowalczyk. Estimates of storage capacity of multilayer perceptron with threshold logic hidden units. Neural Networks, 10(8):1417–1433, 1997.

[27] T. Laurent and J. Brecht. Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, pages 2908–2913, 2018.

[28] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8168–8177, 2018.

[29] Y. Li and Y. Yuan.
Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.

[30] S. Liang and R. Srikant. Why deep neural networks for function approximation? In International Conference on Learning Representations, 2017.

[31] T. Liang and A. Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint arXiv:1808.00387, 2018.

[32] T. Liang, A. Rakhlin, and X. Zhai. On the risk of minimum-norm interpolants and restricted lower isometry of kernels. arXiv preprint arXiv:1908.10292, 2019.

[33] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239, 2017.

[34] S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

[35] Q. Nguyen and M. Hein. Optimization landscape and expressivity of deep CNNs. arXiv preprint arXiv:1710.10928, 2017.

[36] N. J. Nilsson. Learning Machines. 1965.

[37] D. Rolnick and M. Tegmark. The power of deeper networks for expressing natural functions. In International Conference on Learning Representations, 2018.

[38] I. Safran and O. Shamir. Depth-width tradeoffs in approximating natural functions with neural networks. In International Conference on Machine Learning, pages 2979–2987, 2017.

[39] I. Safran and O. Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

[40] O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46–54, 2016.

[41] M. Soltanolkotabi. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017, 2017.

[42] E. D.
Sontag. Shattering all sets of 'k' points in "general position" requires (k−1)/2 parameters. Neural Computation, 9(2):337–348, 1997.

[43] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[44] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.

[45] M. Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517–1539, 2016.

[46] Y. Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, pages 3404–3413, 2017.

[47] G. Wang, G. B. Giannakis, and J. Chen. Learning ReLU networks on linearly separable data: Algorithm, optimality, and generalization. arXiv preprint arXiv:1808.04685, 2018.

[48] M. Yamasaki. The lower bound of the capacity for a neural network with multiple hidden layers. In ICANN'93, pages 546–549. Springer, 1993.

[49] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

[50] D. Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620, 2018.

[51] C. Yun, S. Sra, and A. Jadbabaie. Global optimality conditions for deep neural networks. In International Conference on Learning Representations, 2018.

[52] C. Yun, S. Sra, and A. Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. In International Conference on Learning Representations, 2019.

[53] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization.
In International Conference on Learning Representations, 2017.

[54] X. Zhang, Y. Yu, L. Wang, and Q. Gu. Learning one-hidden-layer ReLU networks via gradient descent. arXiv preprint arXiv:1806.07808, 2018.

[55] K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning, pages 4140–4149, 2017.

[56] Y. Zhou and Y. Liang. Critical points of neural networks: Analytical forms and landscape properties. In International Conference on Learning Representations, 2018.

[57] Y. Zhou, J. Yang, H. Zhang, Y. Liang, and V. Tarokh. SGD converges to global minimum in deep learning via star-convex path. In International Conference on Learning Representations, 2019.

[58] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.