{"title": "How degenerate is the parametrization of neural networks with the ReLU activation function?", "book": "Advances in Neural Information Processing Systems", "page_first": 7790, "page_last": 7801, "abstract": "Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the networks parameters, the main loss function generally only depends on the realization of the neural network, i.e. the function it computes. Studying the optimization problem over the space of realizations opens up new ways to understand neural network training. In particular, usual loss functions like mean squared error and categorical cross entropy are convex on spaces of neural network realizations, which themselves are non-convex. Approximation capabilities of neural networks can be used to deal with the latter non-convexity, which allows us to establish that for sufficiently large networks local minima of a regularized optimization problem on the realization space are almost optimal. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space needs not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. We present pathologies which prevent inverse stability in general, and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability w.r.t. to a Sobolev norm. 
Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.", "full_text": "How degenerate is the parametrization of neural networks with the ReLU activation function?

Julius Berner
Faculty of Mathematics, University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
julius.berner@univie.ac.at

Dennis Elbrächter
Faculty of Mathematics, University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
dennis.elbraechter@univie.ac.at

Philipp Grohs
Faculty of Mathematics and Research Platform DataScience@UniVienna, University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
philipp.grohs@univie.ac.at

Abstract

Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the network's parameters, the main loss function generally only depends on the realization of the neural network, i.e. the function it computes. Studying the optimization problem over the space of realizations opens up new ways to understand neural network training. In particular, usual loss functions like mean squared error and categorical cross entropy are convex on spaces of neural network realizations, which themselves are non-convex. Approximation capabilities of neural networks can be used to deal with the latter non-convexity, which allows us to establish that for sufficiently large networks local minima of a regularized optimization problem on the realization space are almost optimal. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space need not correspond to a local minimum in the realization space.
To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. We present pathologies which prevent inverse stability in general, and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability w.r.t. a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction and Motivation

In recent years much effort has been invested into explaining and understanding the overwhelming success of deep learning based methods. On the theoretical side, impressive approximation capabilities of neural networks have been established [9, 10, 16, 20, 32, 33, 37, 39]. No less important are recent results on the generalization of neural networks, which deal with the question of how well networks, trained on limited samples, perform on unseen data [2, 3, 5-7, 17, 29]. Last but not least, the optimization error, which quantifies how well a neural network can be trained by applying stochastic gradient descent to an optimization problem, has been analyzed in different scenarios [1, 11, 13, 22, 24, 25, 27, 38]. While there are many interesting approaches to the latter question, they tend to require very strong assumptions (e.g. (almost) linearity, convexity, or extreme over-parametrization). Thus a satisfying explanation for the success of stochastic gradient descent for a non-smooth, non-convex problem remains elusive.

In the present paper we intend to pave the way for a functional perspective on the optimization problem.
This allows for new mathematical approaches towards understanding the training of neural networks, some of which are demonstrated in Section 1.2. To this end we examine degenerate parametrizations with undesirable properties in Section 2. These can be roughly classified as

C.1 unbalanced magnitudes of the parameters,
C.2 weight vectors with the same direction, and
C.3 weight vectors with directly opposite directions.

Under conditions designed to avoid these degeneracies, Theorem 3.1 establishes inverse stability for shallow networks with ReLU activation function. This is accomplished by a refined analysis of the behavior of ReLU networks near a discontinuity of their derivative. Proposition 1.2 shows how inverse stability connects the loss surface of the parametrized minimization problem to the loss surface of the realization space problem. In Theorem 1.3 we showcase a novel result on almost optimality of local minima of the parametrized problem obtained by analyzing the realization space problem. Note that this approach of analyzing the loss surface is conceptually different from previous approaches as in [11, 18, 23, 30, 31, 36].

1.1 Inverse Stability of Neural Networks

We will focus on neural networks with the ReLU activation function $\rho(x) := x^+$, and adapt the mathematically convenient notation from [33], which distinguishes between the parametrization of a neural network and its realization. Let us define the set $\mathcal{A}_L$ of all network architectures with depth $L \in \mathbb{N}$, input dimension $d \in \mathbb{N}$, and output dimension $D \in \mathbb{N}$ by

$$\mathcal{A}_L := \{(N_0, \dots, N_L) \in \mathbb{N}^{L+1} : N_0 = d,\ N_L = D\}. \quad (1)$$

The architecture $N \in \mathcal{A}_L$ simply specifies the number of neurons $N_\ell$ in each of the $L$ layers.
We can then define the space $\mathcal{P}_N$ of parametrizations with architecture $N \in \mathcal{A}_L$ as

$$\mathcal{P}_N := \prod_{\ell=1}^{L} \left( \mathbb{R}^{N_\ell \times N_{\ell-1}} \times \mathbb{R}^{N_\ell} \right), \quad (2)$$

the set $\mathcal{P} := \bigcup_{N \in \mathcal{A}_L} \mathcal{P}_N$ of all parametrizations with architecture in $\mathcal{A}_L$, and the realization map

$$\mathcal{R} \colon \mathcal{P} \to C(\mathbb{R}^d, \mathbb{R}^D), \qquad \Theta = ((A_\ell, b_\ell))_{\ell=1}^{L} \mapsto \mathcal{R}(\Theta) := W_L \circ \rho \circ W_{L-1} \circ \dots \circ \rho \circ W_1, \quad (3)$$

where $W_\ell(x) := A_\ell x + b_\ell$ and $\rho$ is applied component-wise. We refer to $A_\ell$ and $b_\ell$ as the weights and biases in the $\ell$-th layer.

Note that a parametrization $\Theta \in \Omega \subseteq \mathcal{P}$ uniquely induces a realization $\mathcal{R}(\Theta)$ in the realization space $\mathcal{R}(\Omega)$, while in general there can be multiple non-trivially different parametrizations with the same realization. To put it in mathematical terms, the realization map is not injective. Consider the basic counterexample

$$\Theta = \big((A_1, b_1), \dots, (A_{L-1}, b_{L-1}), (0, 0)\big) \quad \text{and} \quad \Gamma = \big((B_1, c_1), \dots, (B_{L-1}, c_{L-1}), (0, 0)\big) \quad (4)$$

from [34], where regardless of $A_\ell$, $B_\ell$, $b_\ell$, and $c_\ell$ both realizations coincide with $\mathcal{R}(\Theta) = \mathcal{R}(\Gamma) = 0$. However, it is well-known that the realization map is locally Lipschitz continuous, meaning that close¹ parametrizations in $\mathcal{P}_N$ induce realizations which are close in the uniform norm on compact sets, see e.g. [2, Lemma 14.6], [7, Theorem 4.2], and [34, Proposition 5.1].

¹On the finite dimensional vector space $\mathcal{P}_N$ all norms are equivalent and we take w.l.o.g. the maximum norm $\|\Theta\|_\infty$, i.e. the maximum of the absolute values of the entries of the $A_\ell$ and $b_\ell$.

We will shed light upon the inverse question. Given realizations $\mathcal{R}(\Gamma)$ and $\mathcal{R}(\Theta)$ that are close, do the parametrizations $\Gamma$ and $\Theta$ have to be close?
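The realization map (3) and the non-injectivity example (4) are straightforward to check numerically. Below is a minimal sketch (NumPy; the helper names `relu` and `realize` are ours, not the paper's):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def realize(theta, x):
    """Realization map R from (3): alternate affine maps W_l(x) = A_l x + b_l
    with the component-wise ReLU (no activation after the last layer)."""
    for ell, (A, b) in enumerate(theta):
        x = A @ x + b
        if ell < len(theta) - 1:
            x = relu(x)
    return x

rng = np.random.default_rng(0)
d, n1, D = 3, 5, 2

# Two parametrizations as in (4): arbitrary first layers, zero last layer.
theta = [(rng.normal(size=(n1, d)), rng.normal(size=n1)),
         (np.zeros((D, n1)), np.zeros(D))]
gamma = [(rng.normal(size=(n1, d)), rng.normal(size=n1)),
         (np.zeros((D, n1)), np.zeros(D))]

# Both realize the zero function although the parametrizations differ,
# so the realization map is not injective.
x = rng.normal(size=d)
assert np.allclose(realize(theta, x), 0.0)
assert np.allclose(realize(gamma, x), 0.0)
```

Local Lipschitz continuity of the realization map can be probed the same way, by perturbing the entries of `theta` and comparing the realizations on a compact grid.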
In an abstract setting we measure the proximity of realizations in the norm $\|\cdot\|$ of a Banach space $B$ with $\mathcal{R}(\mathcal{P}) \subseteq B$, while concrete Banach spaces of interest will be specified later. In view of the above counterexample we will, at the very least, need to allow for the reparametrization of one of the networks, i.e. we arrive at the following question.

Given $\mathcal{R}(\Gamma)$ and $\mathcal{R}(\Theta)$ that are close, does there exist a parametrization $\Phi$ with $\mathcal{R}(\Phi) = \mathcal{R}(\Theta)$ such that $\Gamma$ and $\Phi$ are close?

As we will see in Section 2, this question is fundamentally connected to understanding the redundancies and degeneracies of the way that neural networks are parametrized. By suitable regularization, i.e. considering a subspace $\Omega \subseteq \mathcal{P}_N$ of parametrizations, we can avoid these pathologies and establish a positive answer to the question above. For such a property the term inverse stability was introduced in [34], which constitutes the only other research conducted in this area, as far as we are aware.

Definition 1.1 (Inverse stability). Let $s, \alpha > 0$, $N \in \mathcal{A}_L$, and $\Omega \subseteq \mathcal{P}_N$. We say that the realization map is $(s, \alpha)$ inverse stable on $\Omega$ w.r.t. $\|\cdot\|$, if for all $\Gamma \in \Omega$ and $g \in \mathcal{R}(\Omega)$ there exists $\Phi \in \Omega$ with

$$\mathcal{R}(\Phi) = g \quad \text{and} \quad \|\Phi - \Gamma\|_\infty \le s \|g - \mathcal{R}(\Gamma)\|^\alpha. \quad (5)$$

In Section 2 we will see why inverse stability fails w.r.t. the uniform norm. Therefore, we consider a norm which takes into account not only the maximum error of the function values but also of the gradients.
In mathematical terms, we make use of the Sobolev norm $\|\cdot\|_{W^{1,\infty}(U)}$ (on some domain $U \subseteq \mathbb{R}^d$) defined for every (locally) Lipschitz continuous function $g \colon \mathbb{R}^d \to \mathbb{R}^D$ by $\|g\|_{W^{1,\infty}(U)} := \max\{\|g\|_{L^\infty(U)}, |g|_{W^{1,\infty}(U)}\}$ with the Sobolev semi-norm $|\cdot|_{W^{1,\infty}(U)}$ given by

$$|g|_{W^{1,\infty}(U)} := \|Dg\|_{L^\infty(U)} = \operatorname*{ess\,sup}_{x \in U} \|Dg(x)\|_\infty. \quad (6)$$

See [15] for further information on Sobolev norms, and [8] for further information on the derivative of ReLU networks.

1.2 Implications of inverse stability for neural network optimization

We proceed by demonstrating how inverse stability opens up new perspectives on the optimization problem which arises in neural network training. Specifically, consider a loss function $\mathcal{L} \colon C(\mathbb{R}^d, \mathbb{R}^D) \to [0, \infty)$ on the space of continuous functions. For illustration, we take the commonly used mean squared error (MSE) which, for training data $((x_i, y_i))_{i=1}^n \in (\mathbb{R}^d \times \mathbb{R}^D)^n$, is given by

$$\mathcal{L}(g) = \frac{1}{n} \sum_{i=1}^{n} \|g(x_i) - y_i\|_2^2, \qquad \text{for } g \in C(\mathbb{R}^d, \mathbb{R}^D). \quad (7)$$

Typically, the optimization problem is solved over some subspace of parametrizations $\Omega \subseteq \mathcal{P}_N$, i.e.

$$\min_{\Gamma \in \Omega} \mathcal{L}(\mathcal{R}(\Gamma)) = \min_{\Gamma \in \Omega} \frac{1}{n} \sum_{i=1}^{n} \|\mathcal{R}(\Gamma)(x_i) - y_i\|_2^2. \quad (8)$$

From an abstract point of view, by writing $g = \mathcal{R}(\Gamma) \in \mathcal{R}(\Omega)$, this is equivalent to the corresponding optimization problem over the space of realizations $\mathcal{R}(\Omega)$, i.e.

$$\min_{g \in \mathcal{R}(\Omega)} \mathcal{L}(g) = \min_{g \in \mathcal{R}(\Omega)} \frac{1}{n} \sum_{i=1}^{n} \|g(x_i) - y_i\|_2^2. \quad (9)$$

However, the loss landscape of the optimization problem (8) is only properly connected to the loss landscape of the optimization problem (9) if the realization map is inverse stable on $\Omega$.
Otherwise a realization $g \in \mathcal{R}(\mathcal{P}_N)$ can be arbitrarily close to a global minimum in the realization space while every parametrization $\Phi$ with $\mathcal{R}(\Phi) = g$ is far away from the corresponding global minimum in the parametrization space. Moreover, local minima of (8) in the parametrization space correspond to local minima of (9) in the realization space precisely when we have inverse stability.

Proposition 1.2 (Parametrization minimum ⇒ realization minimum). Let $N \in \mathcal{A}_L$, $\Omega \subseteq \mathcal{P}_N$, and let the realization map be $(s, \alpha)$ inverse stable on $\Omega$ w.r.t. $\|\cdot\|$. Let $\Gamma^* \in \Omega$ be a local minimum of $\mathcal{L} \circ \mathcal{R}$ on $\Omega$ with radius $r > 0$, i.e. for all $\Phi \in \Omega$ with $\|\Phi - \Gamma^*\|_\infty \le r$ it holds that

$$\mathcal{L}(\mathcal{R}(\Gamma^*)) \le \mathcal{L}(\mathcal{R}(\Phi)). \quad (10)$$

Then $\mathcal{R}(\Gamma^*)$ is a local minimum of $\mathcal{L}$ on $\mathcal{R}(\Omega)$ with radius $(r/s)^{1/\alpha}$, i.e. for all $g \in \mathcal{R}(\Omega)$ with $\|g - \mathcal{R}(\Gamma^*)\| \le (r/s)^{1/\alpha}$ it holds that

$$\mathcal{L}(\mathcal{R}(\Gamma^*)) \le \mathcal{L}(g). \quad (11)$$

See Appendix A.1.2 for a proof and Example A.1 for a counterexample in the case that inverse stability is not given. Note that in (9) we consider a problem with convex loss function but non-convex feasible set, see [34, Section 3.2]. This opens up new avenues of investigation using tools from functional analysis and allows utilizing recent results [19, 34] exploring the topological properties of neural network realization spaces.

As a concrete demonstration we provide with Theorem A.2 a strong result obtained on the realization space, which estimates the quality of a local minimum based on its radius and the approximation capabilities of the chosen architecture for a class of functions $S$.
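The argument behind Proposition 1.2 is a one-step application of Definition 1.1; the following paraphrased sketch (ours, not the appendix proof) shows where the radius $(r/s)^{1/\alpha}$ comes from:

```latex
% Sketch: let g \in R(\Omega) with \|g - R(\Gamma^*)\| \le (r/s)^{1/\alpha}.
% Inverse stability yields \Phi \in \Omega with R(\Phi) = g and
\|\Phi - \Gamma^*\|_\infty
  \;\le\; s\,\|g - \mathcal{R}(\Gamma^*)\|^{\alpha}
  \;\le\; s\,\bigl((r/s)^{1/\alpha}\bigr)^{\alpha}
  \;=\; r,
% so \Phi lies in the radius-r neighbourhood of \Gamma^*, and the local
% minimality (10) gives L(R(\Gamma^*)) \le L(R(\Phi)) = L(g), i.e. (11).
```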
Specifically let $C > 0$, let $\Lambda \colon B \to [0, \infty)$ be a quasi-convex regularizer, and define

$$S := \{f \in B : \Lambda(f) \le C\}. \quad (12)$$

We denote the sets of regularized parametrizations by

$$\Omega_N := \{\Phi \in \mathcal{P}_N : \Lambda(\mathcal{R}(\Phi)) \le C\} \quad (13)$$

and assume that the loss function $\mathcal{L}$ is convex and $c$-Lipschitz continuous on $S$. Note that virtually all relevant loss functions are convex and locally Lipschitz continuous on $C(\mathbb{R}^d, \mathbb{R}^D)$. Employing Proposition 1.2, inverse stability can then be used to derive the following result for the practically relevant parametrized problem, showing that for sufficiently large architectures local minima of a regularized neural network optimization problem are almost optimal.

Theorem 1.3 (Almost optimality of local parameter minima). Assume that $S$ is compact in the $\|\cdot\|$-closure of $\mathcal{R}(\mathcal{P})$ and that for every $N \in \mathcal{A}_L$ the realization map is $(s, \alpha)$ inverse stable on $\Omega_N$ w.r.t. $\|\cdot\|$. Then for all $\varepsilon, r > 0$ there exists $n(\varepsilon, r) \in \mathcal{A}_L$ such that for every $N \in \mathcal{A}_L$ with $N_1 \ge n_1(\varepsilon, r), \dots, N_{L-1} \ge n_{L-1}(\varepsilon, r)$ the following holds: every local minimum $\Gamma^*$ with radius at least $r$ of $\min_{\Gamma \in \Omega_N} \mathcal{L}(\mathcal{R}(\Gamma))$ satisfies

$$\mathcal{L}(\mathcal{R}(\Gamma^*)) \le \min_{\Gamma \in \Omega_N} \mathcal{L}(\mathcal{R}(\Gamma)) + \varepsilon. \quad (14)$$

See Appendix A.1.2 for a proof, and note that here it is important to have an inverse stability result where the parameters $(s, \alpha)$ do not depend on the size of the architecture, which we achieve for $L = 2$ and $B = W^{1,\infty}$. Suitable $\Lambda$ would be Besov norms, which constitute a common regularizer in image and signal processing. Moreover, note that the required size of the architecture in Theorem 1.3 can be quantified, if one has approximation rates for $S$.
In particular, this approach allows the use of approximation results in order to explain the success of neural network optimization and enables a combined study of these two aspects, which, to the best of our knowledge, has not been done before. Unlike in recent literature, our result needs no assumptions on the sample set (incorporated in the loss function, see (7)); in particular, we do not require "overparametrization" with respect to the sample size. Here the required size of the architecture only depends on the complexity of $S$, i.e. the class of functions one wants to approximate, the radius of the local minima of interest, the Lipschitz constant of the loss function, and the parameters of the inverse stability.

In the following we restrict ourselves to two-layer ReLU networks without biases, where we present a proof of $(4, 1/2)$ inverse stability w.r.t. the Sobolev semi-norm on a suitably regularized space of parametrizations. Both the regularizations as well as the stronger norm (compared to the uniform norm) will be shown to be necessary in Section 2. We now present, in an informal way, a collection of our main results. A short proof making the connection to the formal results can be found in Appendix A.1.2.

Corollary 1.4 (Inverse stability and implications - colloquial). Suppose we are given data $((x_i, y_i))_{i=1}^n \in (\mathbb{R}^d \times \mathbb{R}^D)^n$ and want to solve a typical minimization problem for ReLU networks with shallow architecture $N = (d, N_1, D)$, i.e.

$$\min_{\Gamma \in \mathcal{P}_N} \frac{1}{n} \sum_{i=1}^{n} \|\mathcal{R}(\Gamma)(x_i) - y_i\|_2^2. \quad (15)$$

First we augment the architecture to $\tilde{N} = (d + 2, N_1 + 1, D)$, while omitting the biases, and augment the samples to $\tilde{x}_i = (x_i^1, \dots, x_i^d, 1, -1)$. Moreover, we assume that the parametrizations

$$\Phi = \big(\big([a_1 | \dots | a_{N_1+1}]^T, 0\big), ([c_1 | \dots | c_{N_1+1}], 0)\big) \in \Omega \subseteq \mathcal{P}_{\tilde{N}} \quad (16)$$

are regularized such that

C.1 the network is balanced, i.e. $\|a_i\|_\infty = \|c_i\|_\infty$,
C.2 no non-zero weight vectors in the first layer are redundant, i.e. $a_i \nparallel a_j$, and
C.3 the last two coordinates of each weight vector $a_i$ are strictly positive.

Then for the new minimization problem

$$\min_{\Phi \in \Omega} \frac{1}{n} \sum_{i=1}^{n} \|\mathcal{R}(\Phi)(\tilde{x}_i) - y_i\|_2^2 \quad (17)$$

the following holds:

1. If $\Phi^*$ is a local minimum of (17) with radius $r$, then $\mathcal{R}(\Phi^*)$ is a local minimum of $\min_{g \in \mathcal{R}(\Omega)} \frac{1}{n} \sum_{i=1}^{n} \|g(\tilde{x}_i) - y_i\|_2^2$ with radius at least $\frac{r^2}{16}$ w.r.t. $|\cdot|_{W^{1,\infty}}$.

2. The global minimum of (17) is at least as good as the global minimum of (15), i.e.

$$\min_{\Phi \in \Omega} \frac{1}{n} \sum_{i=1}^{n} \|\mathcal{R}(\Phi)(\tilde{x}_i) - y_i\|_2^2 \le \min_{\Gamma \in \mathcal{P}_N} \frac{1}{n} \sum_{i=1}^{n} \|\mathcal{R}(\Gamma)(x_i) - y_i\|_2^2. \quad (18)$$

3. By further regularizing (17) in the sense of Theorem 1.3, we can estimate the quality of its local minima.

This argument is not limited to the MSE loss function but works for any loss function based on evaluating the realization. The omission of bias weights is standard in neural network optimization literature [11, 13, 22, 24]. While this severely limits the functions that can be realized with a given architecture, it is sufficient to augment the problem by one dimension in order to recover the full range of functions that can be learned [1]. Here we augment by two dimensions, so that the third regularization condition C.3 can be fulfilled without losing range.
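The one-dimensional augmentation that recovers biases can be checked directly: since $\rho(\langle a, x\rangle + b) = \rho(\langle (a, b), (x, 1)\rangle)$, a bias-free network on inputs padded with a constant 1 realizes the same function. A minimal sketch (NumPy; the variable names are ours):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

rng = np.random.default_rng(1)
d, m = 3, 4
A = rng.normal(size=(m, d))
b = rng.normal(size=m)
C = rng.normal(size=(1, m))

def with_bias(x):
    # Shallow network WITH first-layer biases.
    return C @ relu(A @ x + b)

# Bias-free network on inputs padded with a constant 1: the identity
# rho(<a, x> + b) = rho(<(a, b), (x, 1)>) absorbs the bias into a weight.
A_aug = np.hstack([A, b[:, None]])

def bias_free(x_aug):
    return C @ relu(A_aug @ x_aug)

x = rng.normal(size=d)
assert np.allclose(with_bias(x), bias_free(np.append(x, 1.0)))
```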
Moreover, note that, for simplicity of presentation, the regularization assumptions stated above are stricter than necessary, and possible relaxations are discussed in Section 3.

2 Obstacles to inverse stability - degeneracies of ReLU parametrizations

In the remainder of this paper we focus on shallow ReLU networks without biases and define the corresponding space of parametrizations with architecture $N = (d, m, D)$ as $\mathcal{N}_N := \mathbb{R}^{m \times d} \times \mathbb{R}^{D \times m}$. The realization map² $\mathcal{R}$ is, for every $\Theta = (A, C) = \big([a_1 | \dots | a_m]^T, [c_1 | \dots | c_m]\big) \in \mathcal{N}_N$, given by

$$\mathbb{R}^d \ni x \mapsto \mathcal{R}(\Theta)(x) = C\rho(Ax) = \sum_{i=1}^{m} c_i \rho(\langle a_i, x \rangle). \quad (19)$$

²This is a slight abuse of notation, justified by the fact that $\mathcal{R}$ acts the same on $\mathcal{P}_N$ with zero biases $b_1, b_2$ and weights $A_1 = A$ and $A_2 = C$.

Note that each function $x \mapsto c_i \rho(\langle a_i, x \rangle)$ represents a so-called ridge function which is zero on the half-space $\{x \in \mathbb{R}^d : \langle a_i, x \rangle \le 0\}$ and linear with constant derivative $c_i a_i^T \in \mathbb{R}^D \times \mathbb{R}^d$ on the other half-space. Thus, the $a_i$ are the normal vectors of the separating hyperplanes $\{x \in \mathbb{R}^d : \langle a_i, x \rangle = 0\}$ and consequently we refer to the weight vectors $a_i$ also as the directions of $\Theta$. Moreover, for $\Theta \in \mathcal{N}_N$ it holds that $\mathcal{R}(\Theta)(0) = 0$ and, as long as the domain of interest $U \subseteq \mathbb{R}^d$ contains the origin, the Sobolev norm $\|\cdot\|_{W^{1,\infty}(U)}$ is equivalent to its semi-norm, since

$$\|\mathcal{R}(\Theta)\|_{L^\infty(U)} \le \sqrt{d}\,\operatorname{diam}(U)\, |\mathcal{R}(\Theta)|_{W^{1,\infty}}, \quad (20)$$

see also inequalities of Poincaré-Friedrichs type [14, Subsection 5.8.1].

Figure 1: The figure shows $g_k$ for $k = 1, 2$.
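One way to see an estimate of the form (20) is to integrate the derivative along the segment from the origin; the following is our sketch, not the paper's appendix argument:

```latex
% Sketch: g := R(\Theta) is Lipschitz with g(0) = 0, and 0, x \in U.
\|g(x)\|_{\infty}
  = \Bigl\| \int_0^1 Dg(tx)\, x \,\mathrm{d}t \Bigr\|_{\infty}
  \le \operatorname*{ess\,sup}_{y \in U} \|Dg(y)\|_{\infty}\, \|x\|_{1}
  \le |g|_{W^{1,\infty}}\, \sqrt{d}\, \|x\|_{2}
  \le \sqrt{d}\,\operatorname{diam}(U)\, |g|_{W^{1,\infty}}.
% Here \|Dg(y)x\|_\infty \le \max_{i,j}|Dg(y)_{ij}|\,\|x\|_1 (row-wise
% Hoelder) and \|x\|_1 \le \sqrt{d}\,\|x\|_2 \le \sqrt{d}\,\mathrm{diam}(U).
```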
Therefore, in the rest of the paper we will only consider the Sobolev semi-norm³

$$|\mathcal{R}(\Theta)|_{W^{1,\infty}(U)} = \operatorname*{ess\,sup}_{x \in U} \Big\| \sum_{i \in [m] : \langle a_i, x \rangle > 0} c_i a_i^T \Big\|_\infty. \quad (21)$$

³For $m \in \mathbb{N}$ we abbreviate $[m] := \{1, \dots, m\}$.

In (21) one can see that in our setting $|\cdot|_{W^{1,\infty}(U)}$ is independent of $U$ (as long as $U$ contains a neighbourhood of the origin) and will thus be abbreviated by $|\cdot|_{W^{1,\infty}}$.

2.1 Failure of inverse stability w.r.t. uniform norm

All proofs for this section can be found in Appendix A.2.2. We start by showing that inverse stability fails w.r.t. the uniform norm. This example is adapted from [34, Theorem 5.2] and represents, to the best of our knowledge, the only degeneracy which has already been observed before.

Example 2.1 (Failure due to exploding gradient). Let $\Gamma := (0, 0) \in \mathcal{N}_{(2,2,1)}$ and $g_k \in \mathcal{R}(\mathcal{N}_{(2,2,1)})$ be given by (see Figure 1)

$$g_k(x) := k\rho(\langle (k, 0), x \rangle) - k\rho(\langle (k, -\tfrac{1}{k^2}), x \rangle), \quad k \in \mathbb{N}. \quad (22)$$

Then for every sequence $(\Phi_k)_{k \in \mathbb{N}} \subseteq \mathcal{N}_{(2,2,1)}$ with $\mathcal{R}(\Phi_k) = g_k$ it holds that

$$\lim_{k \to \infty} \|\mathcal{R}(\Phi_k) - \mathcal{R}(\Gamma)\|_{L^\infty((-1,1)^2)} = 0 \quad \text{and} \quad \lim_{k \to \infty} \|\Phi_k - \Gamma\|_\infty = \infty. \quad (23)$$

In particular, note that inverse stability fails here even for a non-degenerate parametrization of the zero function $\Gamma = (0, 0)$. However, for this type of counterexample the magnitude of the gradient of $\mathcal{R}(\Phi_k)$ needs to go to infinity, which is our motivation for looking at inverse stability w.r.t. $|\cdot|_{W^{1,\infty}}$.

2.2 Failure of inverse stability w.r.t. Sobolev norm

In this section we present four degenerate cases where inverse stability fails w.r.t. $|\cdot|_{W^{1,\infty}}$. This collection of counterexamples is complete in the sense that we can establish inverse stability under assumptions which are designed to exclude these four pathologies.

Example 2.2 (Failure due to complete unbalancedness). Let $r > 0$, $\Gamma := \big((r, 0), 0\big) \in \mathcal{N}_{(2,1,1)}$ and $g_k \in \mathcal{R}(\mathcal{N}_{(2,1,1)})$ be given by (see Figure 2)

$$g_k(x) = \tfrac{1}{k}\rho(\langle (0, 1), x \rangle), \quad k \in \mathbb{N}. \quad (24)$$

Then for every $k \in \mathbb{N}$ and $\Phi_k \in \mathcal{N}_{(2,1,1)}$ with $\mathcal{R}(\Phi_k) = g_k$ it holds that

$$|\mathcal{R}(\Phi_k) - \mathcal{R}(\Gamma)|_{W^{1,\infty}} = \tfrac{1}{k} \quad \text{and} \quad \|\Phi_k - \Gamma\|_\infty \ge r. \quad (25)$$

This is a very simple example of a degenerate parametrization of the zero function, since $\mathcal{R}(\Gamma) = 0$ regardless of the choice of $r$. The issue here is that we can have a weight pair, i.e. $((r, 0), 0)$, where the product is independent of the value of one of the parameters. Note that in Example A.4 one can see a slightly more subtle version of this pathology by considering $\Gamma_k := \big((k, 0), \tfrac{1}{k^2}\big) \in \mathcal{N}_{(2,1,1)}$ instead. In that case one could still get an inverse stability estimate for each fixed $k$; the parameters of inverse stability $(s, \alpha)$ would however deteriorate with increasing $k$. In particular this demonstrates the need for some sort of balancedness of the parametrization, i.e. control over $\|c_i\|_\infty$ and $\|a_i\|_\infty$ individually relative to $\|c_i\|_\infty \|a_i\|_\infty$.

Figure 2: Shows $\mathcal{R}(\Gamma)$ ($r = 0.5$) and $g_3$.

Figure 3: Shows $\mathcal{R}(\Gamma)$ and $g_2$.

Inverse stability is also prevented by redundant directions as the following example illustrates.

Example 2.3 (Failure due to redundant directions).
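Example 2.1 can be probed numerically: the realizations $g_k$ vanish uniformly on $(-1,1)^2$ while their gradients blow up on a thin wedge, which is what forces exploding parameters. A small sketch (NumPy; the grid resolution and probe point are our choices):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def g(k, x):
    # g_k(x) = k*rho(<(k,0),x>) - k*rho(<(k,-1/k^2),x>), cf. (22)
    return k * relu(k * x[..., 0]) - k * relu(k * x[..., 0] - x[..., 1] / k**2)

s = np.linspace(-1.0, 1.0, 401)
grid = np.stack(np.meshgrid(s, s), axis=-1)

for k in [2, 4, 8, 16]:
    # Uniform norm on (-1,1)^2 decays (it is bounded by 2/k) ...
    sup = np.abs(g(k, grid)).max()
    # ... but in the wedge {0 < k^3 x_1 < x_2} only the first neuron is
    # active, so the partial derivative w.r.t. x_1 there equals k^2.
    x = np.array([1 / (2 * k**3), 1.0])
    h = 1e-7
    slope = (g(k, x + np.array([h, 0.0])) - g(k, x)) / h
    print(f"k={k:2d}  sup|g_k|={sup:.4f}  dg/dx1={slope:.1f}")
```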
Let

$$\Gamma := \left( \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}, (1, 1) \right) \in \mathcal{N}_{(2,2,1)} \quad (26)$$

and $g_k \in \mathcal{R}(\mathcal{N}_{(2,2,1)})$ be given by (see Figure 3)

$$g_k(x) := 2\rho(\langle (1, 0), x \rangle) + \tfrac{1}{k}\rho(\langle (0, 1), x \rangle), \quad k \in \mathbb{N}. \quad (27)$$

Then for every $k \in \mathbb{N}$ and $\Phi_k \in \mathcal{N}_{(2,2,1)}$ with $\mathcal{R}(\Phi_k) = g_k$ it holds that

$$|\mathcal{R}(\Phi_k) - \mathcal{R}(\Gamma)|_{W^{1,\infty}} = \tfrac{1}{k} \quad \text{and} \quad \|\Phi_k - \Gamma\|_\infty \ge 1. \quad (28)$$

The next example shows that not only redundant weight vectors can cause issues, but also weight vectors of opposite direction, as they would allow for a (balanced) degenerate parametrization of the zero function.

Example 2.4 (Failure due to opposite weight vectors 1). Let $a_i \in \mathbb{R}^d$, $i \in [m]$, be pairwise linearly independent with $\|a_i\|_\infty = 1$ and $\sum_{i=1}^{m} a_i = 0$. We define

$$\Gamma := \big([a_1 | \dots | a_m | -a_1 | \dots | -a_m]^T, (1, \dots, 1, -1, \dots, -1)\big) \in \mathcal{N}_{(d,2m,1)}. \quad (29)$$

Now let $v \in \mathbb{R}^d$ with $\|v\|_\infty = 1$ be linearly independent to each $a_i$, $i \in [m]$, and let $g_k \in \mathcal{R}(\mathcal{N}_{(d,2m,1)})$ be given by (see Figure 4)

$$g_k(x) = \tfrac{1}{k}\rho(\langle v, x \rangle), \quad k \in \mathbb{N}. \quad (30)$$

Then there exists a constant $C > 0$ such that for every $k \in \mathbb{N}$ and every $\Phi_k \in \mathcal{N}_{(d,2m,1)}$ with $\mathcal{R}(\Phi_k) = g_k$ it holds that

$$|\mathcal{R}(\Phi_k) - \mathcal{R}(\Gamma)|_{W^{1,\infty}} = \tfrac{1}{k} \quad \text{and} \quad \|\Phi_k - \Gamma\|_\infty \ge C. \quad (31)$$

Thus we will need an assumption which prevents each individual $\Gamma$ in our restricted set from having pairwise linearly dependent weight vectors, i.e. coinciding hyperplanes of non-differentiability.
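The degenerate zero parametrization in Example 2.4 can be checked concretely with the vectors from the caption of Figure 4: since $\rho(t) - \rho(-t) = t$, the neuron pairs $(a_i, 1)$ and $(-a_i, -1)$ sum to the linear map $x \mapsto \langle \sum_i a_i, x \rangle = 0$. A small sketch (NumPy; ours):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

# Weight vectors from Figure 4: pairwise linearly independent, summing to 0.
a = np.array([[1.0, -0.5], [-1.0, -0.5], [0.0, 1.0]])
A = np.vstack([a, -a])                  # directions a_i and -a_i
c = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

def R(x):
    # rho(t) - rho(-t) = t, so R(x) = <sum_i a_i, x> = 0 for every x,
    # although the (balanced) parametrization is far from zero.
    return c @ relu(A @ x)

rng = np.random.default_rng(2)
for _ in range(100):
    assert abs(R(rng.normal(size=2))) < 1e-9
```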
This, however, does not suffice, as is demonstrated by the next example, which shows that the relation between the hyperplanes of the two realizations matters.

Example 2.5 (Failure due to opposite weight vectors 2). We define the weight vectors

$$a_1^k = (k, k, \tfrac{1}{k}), \quad a_2^k = (-k, k, \tfrac{1}{k}), \quad a_3^k = (0, -\sqrt{2}k, \tfrac{1}{\sqrt{2}k}), \quad c^k = (k, k, \sqrt{2}k) \quad (32)$$

and consider the parametrizations (see Figure 5)

$$\Gamma_k := \big([a_1^k | a_2^k | a_3^k]^T, c^k\big) \in \mathcal{N}_{(3,3,1)}, \qquad \Theta_k := \big([-a_1^k | -a_2^k | -a_3^k]^T, c^k\big) \in \mathcal{N}_{(3,3,1)}. \quad (33)$$

Then for every $k \in \mathbb{N}$ and every $\Phi_k \in \mathcal{N}_{(3,3,1)}$ with $\mathcal{R}(\Phi_k) = \mathcal{R}(\Theta_k)$ it holds that

$$|\mathcal{R}(\Phi_k) - \mathcal{R}(\Gamma_k)|_{W^{1,\infty}} = 3 \quad \text{and} \quad \|\Phi_k - \Gamma_k\|_\infty \ge k. \quad (34)$$

Note that $\Gamma$ and $\Theta$ need to have multiple exactly opposite weight vectors which add to something small (compared to the size of the individual vectors), but not zero, since otherwise reparametrization would be possible (see Lemma A.5).

Figure 4: Shows $\mathcal{R}(\Gamma)$ and $g_3$ ($a_1 = (1, -\tfrac{1}{2})$, $a_2 = (-1, -\tfrac{1}{2})$, $a_3 = (0, 1)$, $v = (1, 0)$).

Figure 5: Shows the weight vectors of $\Theta_2$ (grey) and $\Gamma_2$ (black).

3 Inverse stability for two-layer ReLU Networks

We now establish an inverse stability result using assumptions designed to exclude the pathologies from the previous section. First we present a rather technical theorem for output dimension one which considers a parametrization $\Gamma$ in the unrestricted parametrization space $\mathcal{N}_N$ and a function $g$ in the corresponding function space $\mathcal{R}(\mathcal{N}_N)$.
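Our reading of the garbled display (32) can be sanity-checked numerically: with these vectors $\sum_i c_i^k a_i^k = (0, 0, 3)$, and since $\rho(-t) = \rho(t) - t$ the two realizations differ exactly by the linear function $-3x_3$, matching the seminorm distance 3 in (34). A sketch (NumPy; it assumes our reconstruction of (32)):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def nets(k):
    # Weight vectors as reconstructed from (32).
    a = np.array([[k, k, 1 / k],
                  [-k, k, 1 / k],
                  [0.0, -np.sqrt(2) * k, 1 / (np.sqrt(2) * k)]])
    c = np.array([k, k, np.sqrt(2) * k])
    return a, c

k = 2
a, c = nets(k)
print(c @ a)  # sum_i c_i a_i^T, which here is (0, 0, 3) up to rounding

rng = np.random.default_rng(3)
for _ in range(100):
    x = rng.normal(size=3)
    R_gamma = c @ relu(a @ x)    # realization of Gamma_k
    R_theta = c @ relu(-a @ x)   # realization of Theta_k
    # rho(-t) = rho(t) - t, so the difference is -<sum_i c_i a_i, x> = -3 x_3.
    assert np.isclose(R_theta - R_gamma, -3 * x[2])
```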
The aim is to use assumptions which are as weak as possible, while allowing us to find a parametrization $\Phi$ of $g$ whose distance to $\Gamma$ can be bounded relative to $|g - \mathcal{R}(\Gamma)|_{W^{1,\infty}}$. We then continue by defining a restricted parametrization space $\mathcal{N}_N^*$, for which we get uniform inverse stability (meaning that we get the same estimate for every $\Gamma \in \mathcal{N}_N^*$).

Theorem 3.1 (Inverse stability at $\Gamma \in \mathcal{N}_N$). Let $d, m \in \mathbb{N}$, $N := (d, m, 1)$, $\beta \in [0, \infty)$, let $\Gamma = \big([a_1^\Gamma | \dots | a_m^\Gamma]^T, c^\Gamma\big) \in \mathcal{N}_N$, $g \in \mathcal{R}(\mathcal{N}_N)$, and let $I^\Gamma := \{i \in [m] : a_i^\Gamma \ne 0\}$. Assume that the following conditions are satisfied:

C.1 It holds for all $i \in [m]$ with $\|c_i^\Gamma a_i^\Gamma\|_\infty \le 2|g - \mathcal{R}(\Gamma)|_{W^{1,\infty}}$ that $|c_i^\Gamma|, \|a_i^\Gamma\|_\infty \le \beta$.

C.2 It holds for all $i, j \in I^\Gamma$ with $i \ne j$ that $\frac{a_j^\Gamma}{\|a_j^\Gamma\|_\infty} \ne \frac{a_i^\Gamma}{\|a_i^\Gamma\|_\infty}$.

C.3 There exists a parametrization $\Theta = \big([a_1^\Theta | \dots | a_m^\Theta]^T, c^\Theta\big) \in \mathcal{N}_N$ such that $\mathcal{R}(\Theta) = g$ and

(a) it holds for all $i, j \in I^\Gamma$ with $i \ne j$ that $\frac{a_j^\Gamma}{\|a_j^\Gamma\|_\infty} \ne -\frac{a_i^\Gamma}{\|a_i^\Gamma\|_\infty}$, and for all $i, j \in I^\Theta$ with $i \ne j$ that $\frac{a_j^\Theta}{\|a_j^\Theta\|_\infty} \ne -\frac{a_i^\Theta}{\|a_i^\Theta\|_\infty}$,

(b) it holds for all $i \in I^\Gamma$, $j \in I^\Theta$ that $\frac{a_i^\Gamma}{\|a_i^\Gamma\|_\infty} \ne -\frac{a_j^\Theta}{\|a_j^\Theta\|_\infty}$,

where $I^\Theta := \{i \in [m] : a_i^\Theta \ne 0\}$.

Then there exists a parametrization $\Phi \in \mathcal{N}_N$ with

$$\mathcal{R}(\Phi) = g \quad \text{and} \quad \|\Phi - \Gamma\|_\infty \le \beta + 2|g - \mathcal{R}(\Gamma)|_{W^{1,\infty}}^{\frac{1}{2}}. \quad (35)$$

The proof can be found in Appendix A.3.2. Note that each of the conditions in the theorem above corresponds directly to one of the pathologies in Section 2.2. Condition C.1, which deals with unbalancedness, only imposes a restriction on the weight pairs whose product is small compared to the distance of $\mathcal{R}(\Gamma)$ and $g$. As can be guessed from Example 2.2 and seen in the proof of Theorem 3.1, such a balancedness assumption is in fact only needed to deal with degenerate cases, where $\mathcal{R}(\Gamma)$ and $g$ have parts with mismatching directions of negligible magnitude. Otherwise a matching reparametrization is always possible. Note that a balanced $\Gamma$ (i.e. $|c_i^\Gamma| = \|a_i^\Gamma\|_\infty$) satisfies Condition C.1 with $\beta = (2|g - \mathcal{R}(\Gamma)|_{W^{1,\infty}})^{1/2}$.

It is also possible to relax the balancedness assumption by only requiring $|c_i^\Gamma|$ and $\|a_i^\Gamma\|_\infty$ to be close to $\|c_i^\Gamma a_i^\Gamma\|_\infty^{1/2}$, which would still give a similar estimate but with a worse exponent.
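Because the ReLU is positively homogeneous, any neuron with $c_i a_i \ne 0$ can be rescaled to a balanced one without changing the realization. A minimal sketch for output dimension one (NumPy; the function names are ours):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def realize(A, c, x):
    # Shallow bias-free ReLU network, output dimension one, cf. (19).
    return c @ relu(A @ x)

def balance(A, c):
    """Rescale each neuron so that |c_i| = ||a_i||_inf, using positive
    homogeneity rho(s*t) = s*rho(t) for s >= 0; the realization is
    unchanged. Neurons with c_i = 0 or a_i = 0 are left as they are."""
    A, c = A.astype(float).copy(), c.astype(float).copy()
    for i in range(A.shape[0]):
        na, nc = np.abs(A[i]).max(), abs(c[i])
        if na > 0 and nc > 0:
            s = np.sqrt(nc / na)   # afterwards ||a_i||_inf = |c_i| = sqrt(na*nc)
            A[i] *= s
            c[i] /= s
    return A, c

rng = np.random.default_rng(4)
A, c = rng.normal(size=(5, 3)), rng.normal(size=5)
Ab, cb = balance(A, c)
x = rng.normal(size=3)
assert np.isclose(realize(A, c, x), realize(Ab, cb, x))
assert np.allclose(np.abs(Ab).max(axis=1), np.abs(cb))
```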
In order to see that requiring balancedness does not restrict the space of realizations, observe that the ReLU is positively homogeneous (i.e. ρ(λx) = λρ(x) for all λ ≥ 0, x ∈ R). Thus balancedness can always be achieved simply by rescaling.

Condition C.2 requires Γ to have no redundant directions, the necessity of which is demonstrated by Example 2.3. Note that prohibiting redundant directions does not restrict the space of realizations, see (87) in the appendix for details. From a practical point of view, enforcing this condition could be achieved by a regularization term using a barrier function. Alternatively one could employ a non-standard approach of combining such redundant neurons by changing one of them according to (87) and either setting the other one to zero or removing it entirely4.

From a theoretical perspective the first two conditions are rather mild, in the sense that they only restrict the space of parametrizations and not the corresponding space of realizations. Specifically we can define the restricted parametrization space

N′_{(d,m,D)} := {Γ ∈ N_{(d,m,D)} : ‖c_i^Γ‖_∞ = ‖a_i^Γ‖_∞ for all i ∈ [m] and Γ satisfies C.2}  (36)

for which we have R(N′_N) = R(N_N). Note that the above definition as well as the following definition and theorem are for networks with arbitrary output dimension, as the balancedness condition makes this extension rather straightforward.

In order to satisfy Conditions C.3a and C.3b we need to restrict the parametrization space in a way which also restricts the corresponding space of realizations. One possibility to do so is the following approach, which also incorporates the previous restrictions as well as the transition to networks without biases.

Definition 3.2 (Restricted parametrization space).
Let N = (d, m, D) ∈ N^3. We define

N*_N := {Γ ∈ N′_N : (a_i^Γ)_{d−1}, (a_i^Γ)_d > 0 for all i ∈ [m]}.  (37)

While we no longer have R(N*_N) = R(N_N), Lemma A.6 shows that for every Θ ∈ P_{(d,m,D)} there exists Γ ∈ N*_{(d+2,m+1,D)} such that for all x ∈ R^d it holds that

R(Γ)(x_1, …, x_d, 1, −1) = R(Θ)(x_1, …, x_d).  (38)

In particular, this means that for any optimization problem over an unrestricted parametrization space P_{(d,m,D)}, there is a corresponding optimization problem over the parametrization space N*_{(d+2,m+1,D)} whose solution is at least as good (see Corollary 1.4). Our main result now states that for such a restricted parametrization space we have uniform (4, 1/2) inverse stability w.r.t. |·|_{W^{1,∞}}, a proof of which can be found in Appendix A.3.2.

Theorem 3.3 (Inverse stability on N*_N). Let N ∈ N^3. For all Γ ∈ N*_N and g ∈ R(N*_N) there exists a parametrization Φ ∈ N*_N with

R(Φ) = g  and  ‖Φ − Γ‖_∞ ≤ 4|g − R(Γ)|_{W^{1,∞}}^{1/2}.  (39)

4 Outlook

This contribution investigates the potential insights which may be gained from studying the optimization problem over the space of realizations, as well as the difficulties encountered when trying to connect it to the parametrized problem. While Theorem 1.3 and Theorem 3.3 offer some compelling preliminary answers, there are multiple ways in which they can be extended.

To obtain our inverse stability result for shallow ReLU networks we studied sums of ridge functions. Extending this result to deep ReLU networks requires understanding their behaviour under composition. In particular, we have ridge functions which vanish on some half space, i.e.
colloquially speaking, each neuron may "discard half the information" it receives from the previous layer. This introduces a new type of degeneracy, which one will have to deal with.

Another interesting direction is an extension to inverse stability w.r.t. some weaker norm like ‖·‖_{L^∞} or a fractional Sobolev norm under stronger restrictions on the space of parametrizations (see Lemma A.7 for a simple approach using very strong restrictions).

Lastly, note that Theorem 1.3 is not specific to the ReLU activation function and thus also incentivizes the study of inverse stability for other activation functions.

From an applied point of view, Conditions C.1-C.3 motivate the implementation of corresponding regularization (i.e. penalizing unbalancedness and redundancy in the sense of parallel weight vectors) in state-of-the-art networks, in order to explore whether ensuring inverse stability leads to improved performance in practice. Note that there already are results using, e.g., cosine similarity as a regularizer to prevent parallel weight vectors [4, 35], as well as approaches, called Sobolev Training, reporting better generalization and data-efficiency by employing a Sobolev-norm-based loss [12].

4This could be of interest in the design of dynamic network architectures [26, 28, 40] and is also closely related to the co-adaptation of neurons, which dropout was invented to counteract [21].

Acknowledgment

The research of JB and DE was supported by the Austrian Science Fund (FWF) under grants I3403-N32 and P 30148. The authors would like to thank Pavol Harár for helpful comments.

References

[1] Z. Allen-Zhu, Y. Li, and Z. Song. A Convergence Theory for Deep Learning via Over-Parameterization. arXiv:1811.03962, 2018.

[2] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[3] S. Arora, R. Ge, B. Neyshabur, and Y.
Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263, 2018.

[4] N. Bansal, X. Chen, and Z. Wang. Can we gain more from orthogonality regularizations in training deep networks? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4261–4271. Curran Associates, Inc., 2018.

[5] P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv:1706.08498, 2017.

[6] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv:1703.02930, 2017.

[7] J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.03062, 2018.

[8] J. Berner, D. Elbrächter, P. Grohs, and A. Jentzen. Towards a regularity theory for ReLU networks – chain rule and global error estimates. arXiv:1905.04992, 2019.

[9] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. arXiv:1705.01714, 2017.

[10] M. Burger and A. Neubauer. Error Bounds for Approximation with Neural Networks. Journal of Approximation Theory, 112(2):235–250, 2001.

[11] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

[12] W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. In Advances in Neural Information Processing Systems, pages 4278–4287, 2017.

[13] S. S.
Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804, 2018.

[14] L. C. Evans. Partial Differential Equations (second edition). Graduate Studies in Mathematics. American Mathematical Society, 2010.

[15] L. C. Evans and R. F. Gariepy. Measure Theory and Fine Properties of Functions, Revised Edition. Textbooks in Mathematics. CRC Press, 2015.

[16] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.

[17] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. arXiv:1712.06541, 2017.

[18] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv:1412.6544, 2014.

[19] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks. arXiv:1905.01208, 2019.

[20] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep ReLU neural networks in W^{s,p} norms. arXiv:1902.07896, 2019.

[21] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

[22] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[23] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.

[24] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.

[25] Y. Li and Y. Yuan.
Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.

[26] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. arXiv:1806.09055, 2018.

[27] S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

[28] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Chapter 15 – Evolving deep neural networks. In R. Kozma, C. Alippi, Y. Choe, and F. C. Morabito, editors, Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312. Academic Press, 2019.

[29] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

[30] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 2603–2612. JMLR.org, 2017.

[31] J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 2798–2806. JMLR.org, 2017.

[32] D. Perekrestenko, P. Grohs, D. Elbrächter, and H. Bölcskei. The universal approximation power of finite-width deep ReLU networks. arXiv:1806.01528, 2018.

[33] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. arXiv:1709.05289, 2017.

[34] P. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of functions generated by neural networks of fixed size.
arXiv:1806.08459, 2018.

[35] P. Rodríguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus, and X. Roca. Regularizing CNNs with locally constrained decorrelations. arXiv:1611.01967, 2016.

[36] I. Safran and O. Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.

[37] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.

[38] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.

[39] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

[40] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.