{"title": "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 3036, "page_last": 3046, "abstract": "Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.", "full_text": "On the Global Convergence of Gradient Descent for\nOver-parameterized Models using Optimal Transport\n\nL\u00e9na\u00efc Chizat\n\nFrancis Bach\n\nINRIA, ENS, PSL Research University\n\nINRIA, ENS, PSL Research University\n\nParis, France\n\nlenaic.chizat@inria.fr\n\nParis, France\n\nfrancis.bach@inria.fr\n\nAbstract\n\nMany tasks in machine learning and signal processing can be solved by minimizing\na convex function of a measure. This includes sparse spikes deconvolution or\ntraining a neural network with a single hidden layer. For these problems, we study\na simple minimization method: the unknown measure is discretized into a mixture\nof particles and a continuous-time gradient descent is performed on their weights\nand positions. This is an idealization of the usual way to train neural networks\nwith a large hidden layer. 
We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.

1 Introduction

A classical task in machine learning and signal processing is to search for an element in a Hilbert space F that minimizes a smooth, convex loss function R : F → R+ and that is a linear combination of a few elements from a large given parameterized set {φ(θ)}_{θ∈Θ} ⊂ F. A general formulation of this problem is to describe the linear combination through an unknown signed measure µ on the parameter space and to solve for

J* = min_{µ∈M(Θ)} J(µ),  where  J(µ) := R(∫ φ dµ) + G(µ),   (1)

where M(Θ) is the set of signed measures on the parameter space Θ and G : M(Θ) → R is an optional convex regularizer, typically the total variation norm when sparse solutions are preferred. In this paper, we consider the infinite-dimensional case where the parameter space Θ is a domain of R^d and θ ↦ φ(θ) is differentiable. This framework covers:

• Training neural networks with a single hidden layer, where the goal is to select, within a specific class, a function that maps features in R^{d−1} to labels in R, from the observation of a joint distribution of features and labels.
This corresponds to F being the space of square-integrable real-valued functions on R^{d−1}, R being, e.g., the quadratic or the logistic loss function, and φ(θ) : x ↦ σ(Σ_{i=1}^{d−1} θ_i x_i + θ_d), with an activation function σ : R → R. Common choices are the sigmoid function or the rectified linear unit [18, 14]; see more details in Section 4.2.

• Sparse spikes deconvolution, where one attempts to recover a signal which is a mixture of impulses on Θ given a noisy and filtered observation y (a square-integrable function on Θ). This corresponds to F being the space of square-integrable real-valued functions on R^d, defining φ(θ) : x ↦ ψ(x − θ) the translations of the filter impulse response ψ, and R(f) = (1/2λ)‖f − y‖²_{L²}, for some λ > 0 that depends on the estimated noise level. Solving (1) then allows to reconstruct the mixture of impulses with some guarantees [12, 13].

• Low-rank tensor decomposition [16] and recovering mixture models from sketches [26]; see [6] for a detailed list of other applications. For example, with symmetric matrices, F = R^{d×d} and Φ(θ) = θθ^T, we recover low-rank matrix decompositions [15].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1.1 Review of optimization methods and previous work

While (1) is a convex problem, finding approximate minimizers is hard as the variable is infinite-dimensional. Several lines of work provide optimization methods, but with strong limitations.

Conditional gradient / Frank-Wolfe. This approach tackles a variant of (1) where the regularization term is replaced by an upper bound on the total variation norm; the associated constraint set is the convex hull of all Diracs and negatives of Diracs at elements θ ∈ Θ, and is thus adapted to conditional gradient algorithms [19].
At each iteration, one adds a new particle by solving a linear minimization problem over the constraint set (which corresponds to finding a particle θ ∈ Θ), and then updates the weights. The resulting iterates are sparse and there is a guaranteed sublinear convergence rate of the objective function to its minimum. However, the linear minimization subroutine is hard to perform in general: it is for instance NP-hard for neural networks with homogeneous activations [4]. One thus generally resorts to space gridding (in low dimension) or to approximate steps, akin to boosting [36]. The practical behavior is improved with non-convex updates [6, 7] reminiscent of the flow studied below.

Semidefinite hierarchy. Another approach is to parameterize the unknown measure by its sequence of moments. The space of such sequences is characterized by a hierarchy of SDP-representable necessary conditions. This approach concerns a large class of generalized moment problems [22] and can be adapted to deal with special instances of (1) [9]. It is however restricted to φ which are combinations of few polynomial moments, and its complexity explodes exponentially with the dimension d. For d ≥ 2, convergence to a global minimizer is only guaranteed asymptotically, similarly to the results of the present paper.

Particle gradient descent. A third approach, which exploits the differentiability of φ, consists in discretizing the unknown measure µ as a mixture of m particles parameterized by their positions and weights. This corresponds to the finite-dimensional problem

min_{w∈R^m, θ∈Θ^m} J_m(w, θ)  where  J_m(w, θ) := J((1/m) Σ_{i=1}^m w_i δ_{θ_i}),   (2)

which can then be solved by classical gradient descent-based algorithms.
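The finite-dimensional problem (2) can be sketched in a few lines. The following toy instance is ours, not the paper's experimental setup: the feature map Φ(u) = (u1² − u2², 2u1u2) (complex squaring, a positively 2-homogeneous map), a quadratic loss R and no regularizer, with particles initialized on the unit circle and updated by plain gradient descent with the m-scaled step of Definition 2.2.

```python
import numpy as np

# Toy instance of (2), with our illustrative choices: particles u_i in R^2
# identified with complex numbers z_i, Phi(z) = z^2, R(f) = 0.5 |f - y|^2.
m = 8
angles = 2 * np.pi * np.arange(m) / m
z = np.exp(1j * angles)          # particles spread on the unit circle
y = 0.8 + 0.4j                   # target, reachable since z -> z^2 is onto C

def loss(z):
    g = np.mean(z**2) - y        # residual of f = (1/m) sum Phi(z_i)
    return 0.5 * abs(g) ** 2

lr = 0.05
loss0 = loss(z)
for _ in range(500):
    g = np.mean(z**2) - y
    # Gradient of Fm scaled by m; for Phi(z) = z^2 the steepest-descent
    # step reads z <- z - 2 * lr * conj(z) * g.
    z = z - 2 * lr * np.conj(z) * g

print(loss0, "->", loss(z))
```

Despite the non-convexity in the particle positions, this run drives the loss to (numerical) zero; the spread-out initialization is a finite-m analogue of the separation property required later in Section 3.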
This method is simple to implement and is widely used for the task of neural network training but, a priori, we may only hope to converge to local minima, since J_m is non-convex. Our goal is to show that this method also benefits from the convex structure of (1) and enjoys an asymptotic global optimality guarantee.

There is a recent literature on global optimality results for (2) in the specific task of training neural networks. It is known that in this context, J_m has fewer, or no, local minima in an over-parameterization regime, and that stochastic gradient descent (SGD) finds a global minimizer under restrictive assumptions [34, 35, 33, 23]; see [33] for an account of recent results. Our approach is not directly comparable to these works: it is more abstract and non-quantitative—we study an ideal dynamics that one can only hope to approximate—but also much more generic. Our objective, in the space of measures, has many local minima, but we build gradient flows that avoid them, relying mainly on the homogeneity properties of J_m (see [16, 20] for other uses of homogeneity in non-convex optimization). The novelty is to see (2) as a discretization of (1)—a point of view also present in [25] but not yet exploited for global optimality guarantees.

1.2 Organization of the paper and summary of contributions

Our goal is to explain when and why the non-convex particle gradient descent finds global minima. We do so by studying the many-particle limit m → ∞ of the gradient flow of J_m. More specifically:

• In Section 2, we introduce a more general class of problems and study the many-particle limit of the associated particle gradient flow.
This limit is characterized as a Wasserstein gradient flow (Theorem 2.6), an object which is a by-product of optimal transport theory.

• In Section 3, under assumptions on φ and the initialization, we prove that if this Wasserstein gradient flow converges, then the limit is a global minimizer of J. Under the same conditions, it follows that if (w^(m)(t), θ^(m)(t))_{t≥0} are gradient flows for J_m suitably initialized, then

lim_{m,t→∞} J(µ_{m,t}) = J*  where  µ_{m,t} = (1/m) Σ_{i=1}^m w_i^(m)(t) δ_{θ_i^(m)(t)}.

• Two different settings that leverage the structure of φ are treated: the 2-homogeneous case and the partially 1-homogeneous case. In Section 4, we apply these results to sparse deconvolution and to training neural networks with a single hidden layer, with sigmoid or ReLU activation function. In each case, our result prescribes conditions on the initialization pattern.

• We perform simple numerical experiments which indicate that this asymptotic regime is already at play for small values of m, even for high-dimensional problems. The method behaves incomparably better than simply optimizing the weights of a very large set of fixed particles.

Our focus on qualitative results might be surprising for an optimization paper, but we believe that this is an insightful first step given the hardness and the generality of the problem. We suggest understanding our result as a first consistency principle for a practical and commonly used non-convex optimization method. While we focus on the idealized setting of a continuous-time gradient flow with exact gradients, this is expected to reflect the behavior of first-order descent algorithms, as they are known to approximate the former: see [31] for (accelerated) gradient descent and [21, Thm. 2.1] for SGD.

Notation.
Scalar products and norms are denoted by · and |·| respectively in R^d, and by ⟨·,·⟩ and ‖·‖ in the Hilbert space F. Norms of linear operators are also denoted by ‖·‖. The differential of a function f at a point x is denoted df_x. We write M(R^d) for the set of finite signed Borel measures on R^d, δ_x is a Dirac mass at a point x, and P_2(R^d) is the set of probability measures endowed with the Wasserstein distance W_2 (see Appendix A).

Recent related work. Several independent works [24, 28, 32] have studied the many-particle limit of training a neural network with a single large hidden layer and a quadratic loss R. Their main focus is on quantifying the convergence of SGD or noisy SGD to the limit trajectory, which is precisely a mean-field limit in this case. Since in our approach this limit is mostly an intermediate step necessary to state our global convergence theorems, it is not studied extensively for itself. These papers thus provide a solid complement to Section 2.4 (a difference is that we do not assume that R is quadratic nor that V is differentiable). Also, [24] proves a quantitative global convergence result for noisy SGD to an approximate minimizer: we stress that our results are of a different nature, as they rely on homogeneity and not on the mixing effect of noise.

2 Particle gradient flows and many-particle limit

2.1 Main problem and assumptions

From now on, we consider the following class of problems on the space of non-negative finite measures on a domain Ω ⊂ R^d which, as explained below, is more general than (1):

F* = min_{µ∈M+(Ω)} F(µ)  where  F(µ) = R(∫ Φ dµ) + ∫ V dµ,   (3)

and we make the following assumptions.

Assumptions 2.1.
F is a separable Hilbert space, Ω ⊂ R^d is the closure of a convex open set, and

(i) (smooth loss) R : F → R+ is differentiable, with a differential dR that is Lipschitz on bounded sets and bounded on sublevel sets,

(ii) (basic regularity) Φ : Ω → F is (Fréchet) differentiable, V : Ω → R+ is semiconvex¹, and

(iii) (locally Lipschitz derivatives with sublinear growth) there exists a family (Q_r)_{r>0} of nested nonempty closed convex subsets of Ω such that:
(a) {u ∈ Ω ; dist(u, Q_r) ≤ r'} ⊂ Q_{r+r'} for all r, r' > 0,
(b) Φ and V are bounded and dΦ is Lipschitz on each Q_r, and
(c) there exist C_1, C_2 > 0 such that sup_{u∈Q_r} (‖dΦ_u‖ + ‖∂V(u)‖) ≤ C_1 + C_2 r for all r > 0, where ‖∂V(u)‖ stands for the maximal norm of an element of ∂V(u).

Assumption 2.1-(iii) reduces to classical local Lipschitzness and growth assumptions on dΦ and ∂V if the nested sets (Q_r)_r are the balls of radius r, but unbounded sets Q_r are also allowed. These sets are a technical tool, used later to confine the gradient flows in areas where gradients are well-controlled.

By convention, we set F(µ) = ∞ if µ is not concentrated on Ω. Also, the integral ∫ Φ dµ is a Bochner integral [10, App. E6]. It yields a well-defined value in F whenever Φ is measurable and ∫ ‖Φ‖ d|µ| < ∞. Otherwise, we also set F(µ) = ∞ by convention.

Recovering (1) through lifting. It is shown in Appendix A.2 that, for a class of admissible regularizers G containing the total variation norm, problem (1) admits an equivalent formulation as (3). Indeed, consider the lifted domain Ω = R × Θ, the function Φ(w, θ) = wφ(θ) and V(w, θ) = |w|. Then J* equals F* and, given a minimizer of one of the problems, one can easily build minimizers for the other. This equivalent lifted formulation removes the asymmetry between weight and position—weight becomes just another coordinate of a particle's position. This is the right point of view for our purpose, and this is why F is our central object of study in the following.

Homogeneity. The functions Φ and V obtained through the lifting share the property of being positively 1-homogeneous in the variable w. A function f between vector spaces is said to be positively p-homogeneous when, for all λ > 0 and all arguments x, it holds that f(λx) = λ^p f(x). This property is central to our global convergence results (but is not needed throughout Section 2).

2.2 Particle gradient flow

We first consider an initial measure which is a mixture of particles—an atomic measure—and define the initial object in our construction: the particle gradient flow. For a number m ∈ N of particles and a vector u ∈ Ω^m of positions, this is the gradient flow of

F_m(u) := F((1/m) Σ_{i=1}^m δ_{u_i}) = R((1/m) Σ_{i=1}^m Φ(u_i)) + (1/m) Σ_{i=1}^m V(u_i),   (4)

or, more precisely, its subgradient flow, because V can be non-smooth.
We recall that a subgradient of a (possibly non-convex) function f : R^d → (−∞, +∞] at a point u_0 ∈ R^d is a p ∈ R^d satisfying f(u) ≥ f(u_0) + p · (u − u_0) + o(|u − u_0|) for all u ∈ R^d. The set of subgradients at u is a closed convex set, called the subdifferential of f at u, denoted ∂f(u) [27].

Definition 2.2 (Particle gradient flow). A gradient flow for the functional F_m is an absolutely continuous² path u : R+ → Ω^m which satisfies u'(t) ∈ −m ∂F_m(u(t)) for almost every t ≥ 0.

This definition uses a subgradient scaled by m, which is the subgradient relative to the scalar product on (R^d)^m scaled by 1/m: this normalization amounts to assigning a mass 1/m to each particle and is convenient for taking the many-particle limit m → ∞. We now state basic properties of this object.

Proposition 2.3. For any initialization u(0) ∈ Ω^m, there exists a unique gradient flow u : R+ → Ω^m for F_m. Moreover, for almost every t > 0, it holds that (d/ds) F_m(u(s))|_{s=t} = −|u'(t)|², and the velocity of the i-th particle is given by u_i'(t) = v_t(u_i(t)), where, for u ∈ Ω and µ_{m,t} := (1/m) Σ_{i=1}^m δ_{u_i(t)},

v_t(u) = ṽ_t(u) − proj_{∂V(u)}(ṽ_t(u))  with  ṽ_t(u) = −[⟨R'(∫ Φ dµ_{m,t}), ∂_j Φ(u)⟩]_{j=1}^d.   (5)

¹A function f : R^d → R is semiconvex, or λ-convex, if f + λ|·|² is convex, for some λ ∈ R. On a compact domain, any smooth function is semiconvex.
²An absolutely continuous function x : R → R^d is almost everywhere differentiable and satisfies x(t) − x(s) = ∫_s^t x'(r) dr for all s < t.

The expression of the velocity involves a projection because gradient flows select subgradients of minimal norm [29].
We have denoted by R'(f) ∈ F the gradient of R at f ∈ F, and by ∂_j Φ(u) ∈ F the differential dΦ_u applied to the j-th vector of the canonical basis of R^d. Note that [ṽ_t(u_i)]_{i=1}^m is (minus) the gradient of the first term in (4): when V is differentiable, we have v_t(u) = ṽ_t(u) − ∇V(u) and we recover the classical gradient of (4). When V is non-smooth, this gradient flow can be understood as a continuous-time version of the forward-backward minimization algorithm [11].

2.3 Wasserstein gradient flow

The fact that the velocity of each particle can be expressed as the evaluation of a velocity field (Eq. (5)) makes it easy, at least formally, to generalize the particle gradient flow to arbitrary measure-valued initializations—not just atomic ones. On the one hand, the evolution of a time-dependent measure (µ_t)_t under the action of instantaneous velocity fields (v_t)_{t≥0} can be formalized by a conservation of mass equation, known as the continuity equation, which reads ∂_t µ_t = −div(v_t µ_t), where div is the divergence operator³ (see Appendix B). On the other hand, there is a direct link between the velocity field (5) and the functional F. The differential of F evaluated at µ ∈ M(Ω) is represented by the function F'(µ) : Ω → R defined as

F'(µ)(u) := ⟨R'(∫ Φ dµ), Φ(u)⟩ + V(u).

Thus v_t is simply a field of (minus) subgradients of F'(µ_{m,t})—it is in fact the field of minimal-norm subgradients. We write this relation v_t ∈ −∂F'(µ_{m,t}). The set ∂F' is called the Wasserstein subdifferential of F, as it can be interpreted as the subdifferential of F relative to the Wasserstein metric on P_2(Ω) (see Appendix B.2.1).
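For smooth V, the link between the particle gradient and the velocity field can be checked numerically: minus the m-scaled partial derivative of F_m at u_i equals minus the derivative of u ↦ F'(µ_{m})(u) evaluated at u_i. The snippet below verifies this by finite differences on a toy instance of our own choosing (Gaussian-bump Φ on a 1-D grid, quadratic R, smooth V).

```python
import numpy as np

# Smooth toy instance (our choices): Omega = R, F = R^100, Phi(u) a Gaussian
# bump sampled on a grid, V(u) = 0.1 u^2, R(f) = 0.5 ||f - y||^2.
grid = np.linspace(-1.0, 1.0, 100)
y = np.sin(np.pi * grid)

def Phi(u):
    return np.exp(-((grid - u) ** 2) / 0.02)

def V(u):
    return 0.1 * u**2

u = np.array([-0.5, 0.1, 0.7])            # m particles
m = len(u)
f_mu = sum(Phi(ui) for ui in u) / m       # f_mu = integral of Phi d(mu_m)

def Fprime(v):
    """F'(mu)(v) = <R'(f_mu), Phi(v)> + V(v), with R'(f) = f - y here."""
    return np.dot(f_mu - y, Phi(v)) + V(v)

def Fm(u_vec):
    f = sum(Phi(ui) for ui in u_vec) / m
    return 0.5 * np.dot(f - y, f - y) + sum(V(ui) for ui in u_vec) / m

eps = 1e-6
for i in range(m):
    up, un = u.copy(), u.copy()
    up[i] += eps; un[i] -= eps
    v_particle = -m * (Fm(up) - Fm(un)) / (2 * eps)   # velocity from Fm
    v_field = -(Fprime(u[i] + eps) - Fprime(u[i] - eps)) / (2 * eps)
    assert abs(v_particle - v_field) < 1e-4
print("particle gradient matches the velocity field derived from F'(mu)")
```

This is exactly why the particle flow can be rewritten as a flow of the empirical measure driven by (a subgradient of) F', which is the form that survives the many-particle limit.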
We thus expect that, for initializations with arbitrary probability distributions, the generalization of the gradient flow coincides with the following object.

Definition 2.4 (Wasserstein gradient flow). A Wasserstein gradient flow for the functional F on a time interval [0, T[ is an absolutely continuous path (µ_t)_{t∈[0,T[} in P_2(Ω) that satisfies, distributionally on [0, T[ × Ω,

∂_t µ_t = −div(v_t µ_t)  where  v_t ∈ −∂F'(µ_t).   (6)

This is a proper generalization of Definition 2.2 since, whenever (u(t))_{t≥0} is a particle gradient flow for F_m, then t ↦ µ_{m,t} := (1/m) Σ_{i=1}^m δ_{u_i(t)} is a Wasserstein gradient flow for F in the sense of Definition 2.4 (see Proposition B.1). By leveraging the abstract theory of gradient flows developed in [3], we show in Appendix B.2.1 that these Wasserstein gradient flows are well-defined.

Proposition 2.5 (Existence and uniqueness). Under Assumptions 2.1, if µ_0 ∈ P_2(Ω) is concentrated on a set Q_{r_0} ⊂ Ω, then there exists a unique Wasserstein gradient flow (µ_t)_{t≥0} for F starting from µ_0. It satisfies the continuity equation with the velocity field defined in (5) (with µ_t in place of µ_{m,t}).

Note that the condition on the initialization is automatically satisfied in Proposition 2.3, because there the initial measure has a finite discrete support: it is thus contained in any Q_r for r > 0 large enough.

2.4 Many-particle limit

We now characterize the many-particle limit of classical gradient flows, under Assumptions 2.1.

Theorem 2.6 (Many-particle limit). Consider a sequence (t ↦ u_m(t))_{m∈N} of classical gradient flows for F_m initialized in a set Q_{r_0} ⊂ Ω.
If µ_{m,0} converges to some µ_0 ∈ P_2(Ω) for the Wasserstein distance W_2, then (µ_{m,t})_t converges, as m → ∞, to the unique Wasserstein gradient flow of F starting from µ_0.

Given a measure µ_0 ∈ P_2(Q_{r_0}), an example for the sequence u_m(0) is u_m(0) = (u_1, ..., u_m), where u_1, u_2, ..., u_m are independent samples distributed according to µ_0. By the law of large numbers for empirical distributions, the sequence of empirical distributions µ_{m,0} = (1/m) Σ_{i=1}^m δ_{u_i} converges (almost surely, for W_2) to µ_0. In particular, our proof of Theorem 2.6 gives an alternative proof of the existence claim in Proposition 2.5 (the latter remains necessary for the uniqueness of the limit).

³For a smooth vector field E = (E_i)_{i=1}^d : R^d → R^d, its divergence is given by div(E) = Σ_{i=1}^d ∂E_i/∂x_i.

3 Convergence to global minimizers

3.1 General idea

As can be seen from Definition 2.4, a probability measure µ ∈ P_2(Ω) is a stationary point of a Wasserstein gradient flow if and only if 0 ∈ ∂F'(µ)(u) for µ-a.e. u ∈ Ω. It is proved in [25] that these stationary points are, in some cases, optimal over probabilities that have a smaller support. However, they are not in general global minimizers of F over M+(Ω), even when R is convex. Such global minimizers are indeed characterized as follows.

Proposition 3.1 (Minimizers). Assume that R is convex. A measure µ ∈ M+(Ω) such that F(µ) < ∞ minimizes F on M+(Ω) if and only if F'(µ) ≥ 0 and F'(µ)(u) = 0 for µ-a.e.
u ∈ Ω.

Despite these strong differences between stationarity and global optimality, we show in this section that Wasserstein gradient flows converge to global minimizers, under two main conditions:

• On the structure: Φ and V must share a homogeneity direction (see Section 2.1 for the definition of homogeneity), and

• On the initialization: the support of the initialization of the Wasserstein gradient flow satisfies a "separation" property. This property is preserved throughout the dynamics and, combined with homogeneity, allows the flow to escape from neighborhoods of non-optimal points.

We turn these general ideas into concrete statements for two cases of interest, which exhibit different structures and behaviors: (i) when Φ and V are positively 2-homogeneous and (ii) when Φ and V are positively 1-homogeneous with respect to one variable.

3.2 The 2-homogeneous case

In the 2-homogeneous case a rich structure emerges, where the (d−1)-dimensional sphere S^{d−1} ⊂ R^d plays a special role. This covers the case of the lifted problems of Section 2.1 when φ is 1-homogeneous, and neural networks with ReLU activation functions.

Assumptions 3.2. The domain is Ω = R^d with d ≥ 2, Φ is differentiable with dΦ locally Lipschitz, V is semiconvex, and V and Φ are both positively 2-homogeneous.
Moreover,

(i) (smooth convex loss) The loss R is convex and differentiable, with a differential dR Lipschitz on bounded sets and bounded on sublevel sets,

(ii) (Sard-type regularity) For all f ∈ F, the set of regular values⁴ of θ ∈ S^{d−1} ↦ ⟨f, Φ(θ)⟩ + V(θ) is dense in its range (it is in fact sufficient that this holds for functions f of the form f = R'(∫ Φ dµ) for some µ ∈ M+(Ω)).

Taking the balls of radius r > 0 as the family (Q_r)_{r>0}, these assumptions imply Assumptions 2.1. We believe that Assumption 3.2-(ii) is not of practical importance: it is only used to avoid some pathological cases in the proof of Theorem 3.3. By applying the Morse-Sard lemma [1], it is in any case fulfilled if the function in question is d − 1 times continuously differentiable. We now state our first global convergence result. It involves a condition on the initialization, a separation property, that can only be satisfied in the many-particle limit. In an ambient space Ω, we say that a set C separates the sets A and B if any continuous path in Ω with endpoints in A and B intersects C.

Theorem 3.3. Under Assumptions 3.2, let (µ_t)_{t≥0} be a Wasserstein gradient flow of F such that, for some 0 < r_a < r_b, the support of µ_0 is contained in B(0, r_b) and separates the spheres r_a S^{d−1} and r_b S^{d−1}. If (µ_t)_t converges to µ_∞ in W_2, then µ_∞ is a global minimizer of F over M+(Ω). In particular, if (u_m(t))_{m∈N, t≥0} is a sequence of classical gradient flows initialized in B(0, r_b) such that µ_{m,0} converges weakly to µ_0, then (limits can be interchanged)

lim_{t,m→∞} F(µ_{m,t}) = min_{µ∈M+(Ω)} F(µ).

A proof and stronger statements are presented in Appendix C.
There, we give a criterion for Wasserstein gradient flows to escape neighborhoods of non-optimal measures—also valid in the finite-particle setting—and then show that it is always satisfied by the flow defined above. We also weaken the assumption that µ_t converges: we only need a certain projection of µ_t to converge weakly. Finally, the fact that the limits in m and t can be interchanged is not anecdotal: it shows that the convergence is not conditioned on a relative speed of growth of both parameters.

This result might be easier to understand by drawing an informal distinction between (i) the structural assumptions, which are instrumental, and (ii) the technical conditions, which have a limited practical interest. The initialization and the homogeneity assumptions are of the first kind. The Sard-type regularity is in contrast a purely technical condition: it is generally hard to check, and known counter-examples involve artificial constructions such as the Cantor function [37]. Similarly, when there is compactness, a gradient flow that does not converge is an unexpected (in some sense adversarial) behavior; see a counter-example in [2]. We were however not able to exclude this possibility under interesting assumptions (see a discussion in Appendix C.5).

⁴For a function g : Θ → R, a regular value is a real number α in the range of g such that g^{−1}(α) is included in an open set where g is differentiable and where dg does not vanish.

3.3 The partially 1-homogeneous case

Similar results hold in the partially 1-homogeneous setting, which covers the lifted problems of Section 2.1 when φ is bounded (e.g., sparse deconvolution and neural networks with sigmoid activation).

Assumptions 3.4.
The domain is Ω = R × Θ with Θ ⊂ R^{d−1}, Φ(w, θ) = w · φ(θ) and V(w, θ) = |w| Ṽ(θ), where φ and Ṽ are bounded and differentiable with Lipschitz differentials. Moreover,

(i) (smooth convex loss) The loss R is convex and differentiable, with a differential dR Lipschitz on bounded sets and bounded on sublevel sets,

(ii) (Sard-type regularity) For all f ∈ F, the set of regular values of g_f : θ ∈ Θ ↦ ⟨f, φ(θ)⟩ + Ṽ(θ) is dense in its range, and

(iii) (boundary conditions) The function φ behaves nicely at the boundary of the domain: either
(a) Θ = R^{d−1} and, for all f ∈ F, θ ∈ S^{d−2} ↦ g_f(rθ) converges, uniformly in C¹(S^{d−2}) as r → ∞, to a function satisfying the Sard-type regularity, or
(b) Θ is the closure of a bounded open convex set and, for all f ∈ F, g_f satisfies Neumann boundary conditions (i.e., for all θ ∈ ∂Θ, d(g_f)_θ(n_θ) = 0, where n_θ ∈ R^{d−1} is the normal to ∂Θ at θ).

With the family of nested sets Q_r := [−r, r] × Θ, r > 0, these assumptions imply Assumptions 2.1. The following theorem mirrors the statement of Theorem 3.3, but with a different condition on the initialization. The remarks after Theorem 3.3 also apply here.

Theorem 3.5. Under Assumptions 3.4, let (µ_t)_{t≥0} be a Wasserstein gradient flow of F such that, for some r_0 > 0, the support of µ_0 is contained in [−r_0, r_0] × Θ and separates {−r_0} × Θ from {r_0} × Θ. If (µ_t)_t converges to µ_∞ in W_2, then µ_∞ is a global minimizer of F over M+(Ω).
In particular, if (u_m(t))_{m∈N, t≥0} is a sequence of classical gradient flows initialized in [−r_0, r_0] × Θ such that µ_{m,0} converges to µ_0 in W_2, then (limits can be interchanged)

lim_{t,m→∞} F(µ_{m,t}) = min_{µ∈M+(Ω)} F(µ).

4 Case studies and numerical illustrations

In this section, we apply the previous abstract statements to specific examples and show on synthetic experiments that the particle complexity needed to reach global optimality is very favorable.

4.1 Sparse deconvolution

For sparse deconvolution, it is typical to consider a signal y ∈ F := L²(Θ) on the d-torus Θ = R^d/Z^d. The loss function is R(f) = (1/2λ)‖y − f‖²_{L²} for some λ > 0, a parameter that increases with the noise level, and the regularization is V(w, θ) = |w|. Consider a filter impulse response ψ : Θ → R and let Φ(w, θ) : x ↦ w · ψ(x − θ). The object sought after is a signed measure on Θ, which is obtained from a probability measure on R × Θ by applying an operator defined by h₁(µ)(B) = ∫_R w dµ(w, B) for all measurable B ⊂ Θ. We show in Appendix D that Theorem 3.5 applies.

Proposition 4.1 (Sparse deconvolution). Assume that the filter impulse response ψ is min{2, d} times continuously differentiable, and that the support of µ_0 contains {0} × Θ.
If the projection (h₁(µ_t))_t of the Wasserstein gradient flow of F weakly converges to ν ∈ M(Θ), then ν is a global minimizer of

min_{µ ∈ M(Θ)} (1/2λ) ‖y − ∫ ψ dµ‖²_{L²} + |µ|(Θ).

We show an example of such a reconstruction on the 1-torus in Figure 1, where the ground truth consists of m_0 = 5 weighted spikes, ψ is an ideal low-pass filter (a Dirichlet kernel of order 7) and y is a noisy observation of the filtered spikes. The particle gradient flow is integrated with the forward-backward algorithm [11] and the particles are initialized on a uniform grid on {0} × Θ.

Figure 1: Particle gradient flow for sparse deconvolution on the 1-torus (horizontal axis shows positions, vertical axis shows weights). Failure to find a minimizer with 6 particles, success with 10 and 100 particles (an animated plot of this particle gradient flow can be found in Appendix D.5).

4.2 Neural networks with a single hidden layer

We consider a joint distribution of features and labels ρ ∈ P(R^{d−2} × R) and ρ_x ∈ P(R^{d−2}) the marginal distribution of features. The loss is the expected risk R(f) = ∫ ℓ(f(x), y) dρ(x, y) defined on F = L²(ρ_x), where ℓ : R × R → R_+ is either the squared loss or the logistic loss. Also, we set Φ(w, θ) : x ↦ wσ(∑_{i=1}^{d−2} θ_i x_i + θ_{d−1}) for an activation function σ : R → R. Depending on the choice of σ, we face two different situations.

Sigmoid activation. If σ is a sigmoid, say σ(s) = (1 + e^{−s})^{−1}, then Theorem 3.5 applies with domain Θ = R^{d−1}. The natural (optional) regularization term is V(w, θ) = |w|, which amounts to penalizing the ℓ¹ norm of the weights.

Proposition 4.2 (Sigmoid activation).
Assume that ρ_x has finite moments up to order min{4, 2d − 2}, that the support of µ_0 is {0} × Θ and that boundary condition 3.4-(iii)-(a) holds. If the Wasserstein gradient flow of F converges in W_2 to µ_∞, then µ_∞ is a global minimizer of F.

Note that we have to explicitly assume boundary condition 3.4-(iii)-(a) because the Sard-type regularity at infinity cannot be checked a priori (this technical detail is discussed in Appendix D.3).

ReLU activation. The activation function σ(s) = max{0, s} is positively 1-homogeneous: this makes Φ 2-homogeneous and corresponds, at a formal level, to the setting of Theorem 3.3. An admissible choice of regularizer here would be the (semi-convex) function V(w, θ) = |w| · |θ| [4]. However, as shown in Appendix D.4, the differential dΦ has discontinuities: this altogether prevents the definition of gradient flows, even in the finite-particle regime.

Still, a statement holds for a different parameterization of the same class of functions, which makes Φ differentiable. To see this, consider a domain Θ which is the disjoint union of two copies of R^d. On the first copy, define Φ(θ) : x ↦ σ(∑_{i=1}^{d−1} s(θ_i)x_i + s(θ_d)), where s(θ_i) = θ_i|θ_i| is the signed square function. On the second copy, Φ has the same definition but with a minus sign. This trick yields the same expressive power as classical ReLU networks. In practice, it corresponds to simply putting, say, random signs in front of the activation. The regularizer here can be V(θ) = |θ|².

Proposition 4.3 (ReLU activation). Assume that ρ_x ∈ P(R^{d−1}) has finite second moments, that the support of µ_0 is r_0 S^{d−1} for some r_0 > 0 (on both copies of R^d) and that the Sard-type regularity Assumption 3.2-(ii) holds. If the Wasserstein gradient flow of F converges in W_2 to µ_∞, then µ_∞ is a global minimizer of F.

We display in Figure 2 particle gradient flows for training a neural network with a single hidden layer and ReLU activation in the classical (non-differentiable) parameterization, with d = 2 (no regularization). Features are normally distributed, and the ground truth labels are generated with a similar network with m_0 = 4 neurons. The particle gradient flow is "integrated" with mini-batch SGD and the particles are initialized on a small centered sphere.

Figure 2: Training a neural network with ReLU activation. Failure with 5 particles (a.k.a. neurons), success with 10 and 100 particles. We show the trajectory of |w(t)| · θ(t) ∈ R² for each particle (an animated plot of this particle gradient flow can be found in Appendix D.5).

4.3 Empirical particle-complexity

Since our convergence results are non-quantitative, one might argue that similar, and much simpler to prove, asymptotic results hold for the method of distributing particles on the whole of Θ and simply optimizing the weights, which is a convex problem. Yet, the comparison of the particle-complexity shown in Figure 3 stands strongly in favor of particle gradient flows.
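For concreteness, the ReLU training experiment of Figure 2 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used for the experiments: full-batch gradient descent replaces mini-batch SGD, and the function names (`relu`, `gradient_step`), sample count, learning rate, and iteration budget are our own arbitrary choices. The model is the particle discretization f(x) = (1/m) Σᵢ wᵢ σ(⟨θᵢ, x⟩), with m > m₀ particles initialized on a small centered sphere.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gradient_step(w, theta, X, y, lr):
    """One full-batch gradient step on the squared loss
    (1/2n) sum_j (f(x_j) - y_j)^2 for the particle network
    f(x) = (1/m) sum_i w_i * relu(<theta_i, x>)."""
    n, m = X.shape[0], w.shape[0]
    pre = X @ theta.T                       # (n, m) pre-activations
    act = relu(pre)
    res = act @ w / m - y                   # (n,) residuals
    grad_w = act.T @ res / (m * n)          # gradient w.r.t. particle weights
    mask = (pre > 0).astype(float)          # ReLU subgradient
    grad_theta = (mask * res[:, None] * w[None, :]).T @ X / (m * n)
    return w - lr * grad_w, theta - lr * grad_theta

rng = np.random.default_rng(0)
n, d, m0, m = 256, 2, 4, 50                 # m > m0: slight over-parameterization
X = rng.standard_normal((n, d))             # normally distributed features
theta_star = rng.standard_normal((m0, d))   # ground-truth network with m0 neurons
y = relu(X @ theta_star.T).sum(axis=1)
theta = rng.standard_normal((m, d))         # particles on a small centered sphere
theta *= 0.1 / np.linalg.norm(theta, axis=1, keepdims=True)
w = np.ones(m)
init_loss = 0.5 * np.mean((relu(X @ theta.T) @ w / m - y) ** 2)
for _ in range(3000):
    w, theta = gradient_step(w, theta, X, y, lr=0.5)
loss = 0.5 * np.mean((relu(X @ theta.T) @ w / m - y) ** 2)
```

The 1/m normalization in the prediction is what connects this finite sum to the measure formulation: as m grows, the empirical measure of the particles (wᵢ, θᵢ) approximates µ and the discrete objective approximates F(µ).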
While exponential particle-complexity is unavoidable for the convex approach, we observed on several synthetic problems that particle gradient descent only needs a slight over-parameterization m > m_0 to find global minimizers within optimization error (see details in Appendix D.5).

(a) Sparse deconvolution (d = 1) (b) ReLU activation (d = 100) (c) Sigmoid activation (d = 100)

Figure 3: Comparison of particle-complexity for particle gradient flow and convex minimization on a fixed grid: excess loss at convergence vs. number of particles. The simplest minimizer has m_0 particles.

5 Conclusion

We have established asymptotic global optimality properties for a family of non-convex gradient flows. These results were enabled by the study of a Wasserstein gradient flow: this object simplifies the handling of many-particle regimes, analogously to a mean-field limit. The particle-complexity needed to reach global optimality turns out to be very favorable on synthetic numerical problems. This confirms the relevance of our qualitative results and calls for quantitative ones that would further exploit the properties of such particle gradient flows. Neural networks with multiple layers are also an interesting avenue for future research.

Acknowledgments

We acknowledge support from grants from Région Ile-de-France and the European Research Council (grant SEQUOIA 724063).

References

[1] Ralph Abraham and Joel Robbin. Transversal mappings and flows. WA Benjamin New York, 1967.

[2] Pierre-Antoine Absil, Robert Mahony, and Benjamin Andrews.
Convergence of the iterates of\ndescent methods for analytic cost functions. SIAM Journal on Optimization, 16(2):531\u2013547,\n2005.\n\n[3] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savar\u00e9. Gradient \ufb02ows: in metric spaces and in\n\nthe space of probability measures. Springer Science & Business Media, 2008.\n\n[4] Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of\n\nMachine Learning Research, 18(19):1\u201353, 2017.\n\n[5] Adrien Blanchet and J\u00e9r\u00f4me Bolte. A family of functional inequalities: \u0141ojasiewicz inequalities\nand displacement convex functions. Journal of Functional Analysis, 275(7):1650\u20131673, 2018.\n\n[6] Nicholas Boyd, Geoffrey Schiebinger, and Benjamin Recht. The alternating descent conditional\ngradient method for sparse inverse problems. SIAM Journal on Optimization, 27(2):616\u2013639,\n2017.\n\n[7] Kristian Bredies and Hanna Katriina Pikkarainen. Inverse problems in spaces of measures.\n\nESAIM: Control, Optimisation and Calculus of Variations, 19(1):190\u2013218, 2013.\n\n[8] Felix E. Browder. Fixed point theory and nonlinear problems. Proc. Sym. Pure. Math, 39:49\u201388,\n\n1983.\n\n[9] Paul Catala, Vincent Duval, and Gabriel Peyr\u00e9. A low-rank approach to off-the-grid sparse\n\ndeconvolution. Journal of Physics: Conference Series, 904(1):012015, 2017.\n\n[10] Donald L. Cohn. Measure theory, volume 165. Springer, 1980.\n\n[11] Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal\nprocessing. In Fixed-point algorithms for inverse problems in science and engineering, pages\n185\u2013212. Springer, 2011.\n\n[12] Yohann De Castro and Fabrice Gamboa. Exact reconstruction using Beurling minimal extrapo-\n\nlation. Journal of Mathematical Analysis and applications, 395(1):336\u2013354, 2012.\n\n[13] Vincent Duval and Gabriel Peyr\u00e9. 
Exact support recovery for sparse spikes deconvolution. Foundations of Computational Mathematics, 15(5):1315–1355, 2015.

[14] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[15] Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems 30, 2017.

[16] Benjamin D. Haeffele and René Vidal. Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7331–7339, 2017.

[17] Daniel Hauer and José Mazón. Kurdyka-Łojasiewicz-Simon inequality for gradient flows in metric spaces. arXiv preprint arXiv:1707.03129, 2017.

[18] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1994.

[19] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

[20] Michel Journée, Francis Bach, Pierre-Antoine Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

[21] Harold Kushner and G. George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.

[22] Jean-Bernard Lasserre. Moments, positive polynomials and their applications, volume 1. World Scientific, 2010.

[23] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.

[24] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks.
Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

[25] Atsushi Nitanda and Taiji Suzuki. Stochastic particle gradient descent for infinite ensembles. arXiv preprint arXiv:1712.05438, 2017.

[26] Clarice Poon, Nicolas Keriven, and Gabriel Peyré. A dual certificates analysis of compressive off-the-grid recovery. arXiv preprint arXiv:1802.08464, 2018.

[27] Ralph T. Rockafellar. Convex Analysis. Princeton University Press, 1997.

[28] Grant M. Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.

[29] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, 2015.

[30] Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.

[31] Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d'Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, pages 1109–1118, 2017.

[32] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.

[33] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 2018.

[34] Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

[35] Luca Venturi, Afonso Bandeira, and Joan Bruna. Spurious valleys in two-layer neural network optimization landscapes. arXiv preprint arXiv:1802.06384, 2018.

[36] Chu Wang, Yingfei Wang, Robert Schapire, et al.
Functional Frank-Wolfe boosting for general loss functions. arXiv preprint arXiv:1510.02558, 2015.

[37] Hassler Whitney. A function not constant on a connected set of critical points. Duke Mathematical Journal, 1(4):514–517, 1935.