{"title": "Semi-flat minima and saddle points by embedding neural networks to overparameterization", "book": "Advances in Neural Information Processing Systems", "page_first": 13868, "page_last": 13876, "abstract": "We theoretically study the landscape of the training error for neural networks in overparameterized cases. We consider three basic methods for embedding a network into a wider one with more hidden units, and discuss whether a minimum point of the narrower network gives a minimum or saddle point of the wider one. Our results show that the networks with smooth and ReLU activation have different partially flat landscapes around the embedded point. We also relate these results to a difference of their generalization abilities in overparameterized realization.", "full_text": "Semi-flat minima and saddle points by embedding neural networks to overparameterization

Kenji Fukumizu†,‡, Shoichiro Yamaguchi‡, Yoh-ichi Mototake†, Mirai Tanaka†

†The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan
{fukumizu, mototake, mirai}@ism.ac.jp
‡Preferred Networks, Inc., Chiyoda-ku, Tokyo 100-0004, Japan
guguchi@preferred.jp

Abstract

We theoretically study the landscape of the training error for neural networks in overparameterized cases. We consider three basic methods for embedding a network into a wider one with more hidden units, and discuss whether a minimum point of the narrower network gives a minimum or saddle point of the wider one. Our results show that the networks with smooth and ReLU activation have different partially flat landscapes around the embedded point. We also relate these results to a difference of their generalization abilities in overparameterized realization.

1 Introduction

Deep neural networks (DNNs) have been applied to many problems with remarkable successes.
On the theoretical understanding of DNNs, however, many problems remain unsolved. Among others, local minima are an important issue in the learning of DNNs: the existence of many local minima is naturally expected from their strong nonlinearity, while it is also observed that, with a large network and stochastic gradient descent, training of DNNs may avoid this issue [8, 9]. For a better understanding of learning, it is essential to clarify the landscape of the training error.

This paper focuses on the error landscape in overparameterized situations, where the number of units is surplus for realizing a function. This naturally occurs when a large network architecture is employed, and has recently been discussed in connection with optimization and generalization of neural networks ([14, 2, 1] to list a few). To formulate overparameterization rigorously, this paper introduces three basic methods, unit replication, inactive units, and inactive propagation, for embedding a network into a network with more units in some layer. We especially investigate the landscape of the training error around the embedded point when we embed a minimizer of the error for a smaller model.

A relevant topic is flat minima [6, 7], which attract much attention in the literature. Such flatness of minima is often observed empirically and is connected to generalization performance [3, 8]. There are also works on how to define flatness appropriately and on its relation to generalization [15, 17]. Different from these works, this paper shows that some embeddings cause semi-flat minima, at which a lower-dimensional affine subset of the parameter space gives a constant value of the error (see Sec. A). We will also discuss the difference between smooth activation and the Rectified Linear Unit (ReLU); at a semi-flat minimum obtained by embedding a network of zero training error, the ReLU networks have more flat directions.
Using PAC-Bayes arguments [11], we relate this to the difference of generalization bounds between ReLU and smooth networks in overparameterized situations.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This paper extends [4], in which the three embedding methods are discussed and some conditions on minimum points are shown. However, that paper is limited to three-layer networks of smooth activation with one-dimensional output, and only the addition of one hidden unit is discussed. The current paper covers a much more general class of networks, including ReLU activation and an arbitrary number of layers, and discusses the difference based on the activation functions as well as a link to generalization.

The main contributions of this paper are summarized as follows.
• Three methods of embedding are introduced for general J-layer networks as basic constructions of overparameterized realization of a function (Sec. 2).
• For smooth activation, the unit replication method embeds a minimum to a saddle point under some assumptions (Theorem 5).
• It is shown theoretically that, for ReLU activation, a minimum is always embedded as a minimum by the method of inactive units. The surplus parameters correspond to a flat subset of the training error (Theorem 9). The unit replication gives only saddles under mild conditions (Theorem 10).
• When a network attains zero training error, the embedding by inactive units gives semi-flat minima in both activation models. The ReLU networks give flatter minima in the overparameterized realization, which suggests better generalization through the PAC-Bayes bounds (Sec. 5.2).

All the proofs of the technical results are given in Supplements.

2 Neural network and its embedding to a wider model

We discuss J-layer, fully connected neural networks with an activation function φ(z; w), where z is the input to a unit and w is a parameter vector. The output of the i-th unit U_i^q in the q-th layer is recursively defined by z_i^q = φ(z^{q−1}; w_i^q), where w_i^q is the weight between U_i^q and the (q−1)-th layer. The activation function φ(z; w) is any nonlinear function, which often takes the form φ(w_wgtᵀz − w_bias) with w = (w_wgt, w_bias); typical examples are the sigmoidal function φ(z; w) = tanh(w_wgtᵀz − w_bias) and ReLU φ(z; w) = max{w_wgtᵀz − w_bias, 0}. This paper assumes that there is w^(0) such that φ(x; w^(0)) = 0 for any x. Focusing on the q-th layer, with the sizes of the other layers fixed, the set of networks having H units in the q-th layer is denoted by N_H. With a parameter θ^(H) = (W_0, w_1, …, w_H, v_1, …, v_H, V_0), the function f^(H)_{θ^(H)} of N_H is defined by

    f^(H)_{θ^(H)}(x) := f^(H)(x; θ^(H)) = ψ( Σ_{j=1}^H v_j φ(x; w_j, W_0); V_0 ),        (1)

where φ(x; w_j, W_0) is the output of U_j^q with a summarized parameter W_0 of the previous layers, and ψ(z^{q+1}; V_0) is all the parts after z^{q+1} with parameter V_0. Note that v_j is the connection weight from the unit U_j^q to the units in the (q+1)-th layer (we omit the bias term for simplicity). The numbers of units in the (q−1)-th and (q+1)-th layers are denoted by D and M, respectively.

Embedding of a network refers to a map associating a narrower network in N_{H0} (H0 < H) with a network of a specific parameter in a wider model N_H that realizes the same function, keeping the other layers unchanged.
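As a minimal numerical illustration of this definition, the following sketch (a hypothetical three-layer tanh specialization of Eq. (1) with linear output, written in NumPy; the sizes are arbitrary choices, not from the paper) builds a wider network that realizes exactly the same function as a narrower one by copying one hidden unit and splitting its output weight, the simplest instance of the constructions given in Sec. 2.1:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(X, W, V):
    """Three-layer net with tanh units and linear output, a special case of Eq. (1):
    f(x) = sum_j v_j * tanh(w_wgt_j . x - w_bias_j).  W: (H, D+1), rows (w_wgt, w_bias)."""
    Xt = np.hstack([X, -np.ones((len(X), 1))])   # append -1 so w^T x~ = w_wgt.x - w_bias
    return np.tanh(Xt @ W.T) @ V                 # shape (n, M)

D, M, H0, H = 3, 2, 4, 6
X = rng.standard_normal((5, D))
U = rng.standard_normal((H0, D + 1))             # narrower-net hidden weights u_i
Z = rng.standard_normal((H0, M))                 # narrower-net output weights zeta_i
y_narrow = f(X, U, Z)

# Replicate the last hidden unit H - H0 + 1 times and split its output weight
# zeta_{H0} into pieces lam_j * zeta_{H0} with sum_j lam_j = 1.
k = H - H0 + 1
d = rng.standard_normal(k)
lam = 1.0 / k + d - d.mean()                     # any vector summing to 1 works
W_wide = np.vstack([U[:-1], np.repeat(U[-1:], k, axis=0)])
V_wide = np.vstack([Z[:-1], np.outer(lam, Z[-1])])

assert np.allclose(f(X, W_wide, V_wide), y_narrow)
print("the wider network realizes the narrower one exactly")
```

The equality holds for every input, not just the sampled ones, since the replicated units all compute the same feature and their output weights sum back to the original one.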
For clarity, we use (ζ_i, u_i) instead of (v_j, w_j) for the parameter θ^(H0) of N_{H0}:

    f^(H0)_{θ^(H0)}(x) := f^(H0)(x; θ^(H0)) = ψ( Σ_{i=1}^{H0} ζ_i φ(x; u_i, W_0); V_0 ).        (2)

We consider minima and stationary points of the empirical risk (or training error)

    L_H(θ^(H)) := Σ_{ν=1}^n ℓ(y_ν, f^(H)(x_ν; θ^(H))),        (3)

where ℓ(y, f) is a loss function measuring the discrepancy between a teacher y and the network output f, and (x_1, y_1), …, (x_n, y_n) are given training data. Typical examples of ℓ(y, f) include the squared error ‖y − f‖²/2 and the logistic loss −y log f − (1 − y) log(1 − f) for y ∈ {0, 1} and f ∈ (0, 1). In the sequel, we assume the second-order differentiability of ℓ(y, f) with respect to f for each y.

2.1 Three embedding methods of a network

To formulate overparameterization, we introduce three basic methods for embedding f^(H0)_{θ^(H0)} into N_H so that it realizes exactly the same function as f^(H0)_{θ^(H0)}. See Table 1 and Figure 1 for the definitions.

(I) Unit replication: We fix a unit, say the H0-th unit U^q_{H0} in N_{H0}, and replicate it. Simply, θ^(H) has H − H0 + 1 copies of u_{H0}, and divides the weight ζ_{H0} among v_{H0}, …, v_H, keeping the other parts unchanged. The choice of the unit u_i (1 ≤ i ≤ H0) to replicate is arbitrary, and a different choice defines a different network; we use u_{H0} for simplicity. The parameters v_{H0}, …, v_H constitute an (H − H0) × M dimensional affine subspace, denoted by Π_repl(θ^(H0)), in the parameter space of N_H.

(II) Inactive units: This embedding uses the special weight w^(0) to make the surplus units inactive. The set of parameters is denoted by Π_iu(θ^(H0)), which is of (H − H0) × M dimension.

(III) Inactive propagation: This embedding cuts off the weights to the (q+1)-th layer for the surplus part; the weights w_j of the surplus units are arbitrary. The set of parameters is denoted by Π_ip(θ^(H0)), which is of (H − H0) × D dimension.

Figure 1: Embedding of a narrower network to a wider one. (I) Unit replication, (II) Inactive units, (III) Inactive propagation.

Table 1: Three methods of embedding

    Unit replication Π_repl(θ^(H0)):    w_i = u_i, v_i = ζ_i (1 ≤ i ≤ H0 − 1);  w_{H0} = ··· = w_H = u_{H0};  v_{H0} + ··· + v_H = ζ_{H0}.
    Inactive units Π_iu(θ^(H0)):        w_i = u_i, v_i = ζ_i (1 ≤ i ≤ H0);  w_{H0+1} = ··· = w_H = w^(0);  v_{H0+1}, …, v_H: arbitrary.
    Inactive propagation Π_ip(θ^(H0)):  w_i = u_i, v_i = ζ_i (1 ≤ i ≤ H0);  w_{H0+1}, …, w_H: arbitrary;  v_{H0+1} = ··· = v_H = 0.

All the above embeddings give the same function as the narrower network.

Proposition 1. For any θ^(H) ∈ Π_repl(θ^(H0)) ∪ Π_iu(θ^(H0)) ∪ Π_ip(θ^(H0)), we have f^(H)_{θ^(H)} = f^(H0)_{θ^(H0)}.

It is important to note that a network is not uniquely embedded in a wider model, in contrast to fixed-basis models such as the polynomial model. This unidentifiability has been clarified for three-layer networks [10, 16]; in fact, for three-layer networks of tanh activation, [16] shows that the three methods essentially cover all possible embeddings. For three-layer networks with one-dimensional output and smooth activation, [4] shows that this unidentifiable embedding causes minima or saddle points. The current paper extends this result to general networks with ReLU as well as smooth activation.

3 Embedding of smooth networks

This section assumes the second-order differentiability of φ(x; w) with respect to w.
The case of ReLU will be discussed in Sec. 4. Let θ*^(H0) be a stationary point of L_{H0}, i.e., ∂L_{H0}(θ*^(H0))/∂θ^(H0) = 0. We are interested in whether the embedding in Sec. 2 also gives a stationary point of L_H. More importantly, we wish to know if a minimum of L_{H0} is embedded to a minimum of L_H. A network can be embedded by any combination of the three methods, but we consider their effects separately for simplicity. The definitions of minimum, saddle point, and related notions are given in Sec. A.

3.1 Stationary properties of embedding

To discuss the stationarity for the case (I) unit replication, we need to restrict Π_repl(θ^(H0)) to a subset. For θ^(H0), define θ_λ^(H) for every λ = (λ_{H0}, …, λ_H) ∈ ℝ^{H−H0+1} with Σ_{j=H0}^H λ_j = 1 by

    w_i = u_i,                      v_i = ζ_i           (1 ≤ i ≤ H0 − 1),
    w_{H0} = ··· = w_H = u_{H0},    v_j = λ_j ζ_{H0}    (H0 ≤ j ≤ H).        (4)

Obviously, θ_λ^(H) ∈ Π_repl(θ^(H0)), so that f^(H)(x; θ_λ^(H)) = f^(H0)(x; θ^(H0)). The next theorem tells that a stationary point of N_{H0} is embedded to an (H − H0)-dimensional stationary subset of N_H.

Theorem 2. Let θ*^(H0) be a stationary point of L_{H0}. Then, for any λ = (λ_{H0}, …, λ_H) with Σ_{j=H0}^H λ_j = 1, the point θ_λ^(H) defined by Eq. (4) is a stationary point of L_H.

The basic idea of the proof is to separate the subset of parameters (v_{H0}, w_{H0}, …, v_H, w_H) into a copy of (ζ_{H0}, u_{H0}) and the remaining ones; the latter do not contribute to changing the function f^(H)_{θ^(H)} at θ_λ^(H). We will see this reparameterization in detail in Sec. 3.2.

It is easy to see that the inactive units or inactive propagation does not generally embed a stationary point to a stationary one (see also Theorems 2 and 4 in [4]). The details are given in Sec. C.

3.2 Embedding of a minimum point in the case of smooth networks

We next consider the embedding θ_λ^(H) of a minimum point θ*^(H0) of L_{H0}. In the sequel, for readability, we discuss three-layer models (J = 3) with linear output units. Note, however, that for general J the derivatives and Hessian of L_H with respect to the other parameters are exactly the same as those of L_{H0} for the corresponding parameters; we omit the full description here. The two models are simply given by

    N_H:  f^(H)(x; θ^(H)) = Σ_{j=1}^H v_j φ(x; w_j)    and    N_{H0}:  f^(H0)(x; θ^(H0)) = Σ_{i=1}^{H0} ζ_i φ(x; u_i).        (5)

To simplify the Hessian for unit replication, we introduce a new parameterization of N_H. Let λ ∈ ℝ^{H−H0+1} be fixed such that λ_{H0} + ··· + λ_H = 1 and λ_j ≠ 0. For such λ, take an (H − H0) × (H − H0 + 1) matrix A = (α_{cj}) (H0+1 ≤ c ≤ H, H0 ≤ j ≤ H) that satisfies the two conditions:

    (A1) ( 1_{H−H0+1}ᵀ ; A ) is invertible, where 1_d = (1, …, 1)ᵀ ∈ ℝ^d;
    (A2) Σ_{j=H0}^H α_{cj} λ_j = 0 for any H0+1 ≤ c ≤ H.

To find such A, take A = (a_{H0+1}, …, a_H)ᵀ so that a_cᵀλ = 0. Then, if Σ_{c=H0+1}^H s_c a_c = 1_{H−H0+1} for some scalars s_c, taking the inner product with λ causes a contradiction.

Given such λ and A = (α_{cj}), define a bijective linear transform from (v_{H0}, …, v_H; w_{H0}, …, w_H) to (a, ξ_{H0+1}, …, ξ_H; b, η_{H0+1}, …, η_H) by

    v_j = λ_j a + Σ_{c=H0+1}^H λ_j α_{cj} ξ_c    and    w_j = b + Σ_{c=H0+1}^H α_{cj} η_c        (H0 ≤ j ≤ H).        (6)

The parameter b serves as the direction that makes all the replicated hidden units behave equally, and the (η_j) define the remaining H − H0 directions that differentiate them. The parameter b thus essentially plays the role of u_{H0} of N_{H0}; also, a works as ζ_{H0} when all w_j are equal. The next lemma confirms this role of (a, b) and shows that the directions η_c and ξ_c do not change the function f^(H) at θ_λ^(H).

Lemma 3. Let θ^(H0) be any parameter of N_{H0}, and θ_λ^(H) be its embedding defined by Eq. (4). Then,

    ∂f^(H)(x; θ^(H))/∂a |_{θ^(H)=θ_λ^(H)} = ∂f^(H0)(x; θ^(H0))/∂ζ_{H0},    ∂f^(H)(x; θ^(H))/∂ξ_c |_{θ^(H)=θ_λ^(H)} = 0,
    ∂f^(H)(x; θ^(H))/∂b |_{θ^(H)=θ_λ^(H)} = ∂f^(H0)(x; θ^(H0))/∂u_{H0},    ∂f^(H)(x; θ^(H))/∂η_c |_{θ^(H)=θ_λ^(H)} = 0.        (7)

From Lemma 3, the Hessian takes a simple form.

Lemma 4. Let λ and A be as above. Suppose θ*^(H0) is a stationary point of L_{H0} and θ_λ^(H) is its embedding defined by Eq. (4). Then, the Hessian matrix of L_H with respect to ω = (a, b, ξ_{H0+1}, …, ξ_H, η_{H0+1}, …, η_H) at θ^(H) = θ_λ^(H) is given by

    ∂²L_H(θ_λ^(H))/∂ω∂ω =
        ( ∂²L_{H0}(θ*^(H0))/∂ζ_{H0}∂ζ_{H0}   ∂²L_{H0}(θ*^(H0))/∂ζ_{H0}∂u_{H0}   O    O  )
        ( ∂²L_{H0}(θ*^(H0))/∂u_{H0}∂ζ_{H0}   ∂²L_{H0}(θ*^(H0))/∂u_{H0}∂u_{H0}   O    O  )
        ( O                                   O                                   O    F̃ )
        ( O                                   O                                   F̃ᵀ   G̃ ),        (8)

where the four block rows and columns correspond to a, b, ξ_c, and η_c, respectively. The lower-right block G̃ := (∂²L_H(θ_λ^(H))/∂η_c∂η_d)_{cd} is a symmetric matrix of size (H − H0)D × (H − H0)D, given by (AΛAᵀ) ⊗ G with Λ = Diag(λ_{H0}, …, λ_H) and

    G := Σ_{ν=1}^n (∂ℓ(y_ν, f^(H0)(x_ν; θ*^(H0)))/∂z)ᵀ ζ_{H0*} · ∂²φ(x_ν; u_{H0*})/∂u_{H0}∂u_{H0};

and F̃ := (∂²L_H(θ_λ^(H))/∂ξ_c∂η_d)_{cd} is of size (H − H0)M × (H − H0)D, given by (AΛAᵀ) ⊗ F with

    F := Σ_{ν=1}^n ∂ℓ(y_ν, f^(H0)(x_ν; θ*^(H0)))/∂z · (∂φ(x_ν; u_{H0*})/∂u_{H0})ᵀ.

Lemma 4 shows that, under the reparametrization, the Hessian at the embedded stationary point θ_λ^(H) contains the Hessian of L_{H0} in the (a, b) block, and that the cross blocks between (a, b) and (ξ_c, η_d) are zero. Note that the ξ-ξ block is also zero, which is important in the proof of Theorem 5.

Theorem 5. Consider a three-layer network given by Eq. (5). Suppose that the output dimension M is greater than 1 and θ*^(H0) is a minimum of L_{H0}. Let the matrices G, F and the parameter θ_λ^(H) be used in the same meaning as in Lemma 4 (unit replication). If either of the conditions

    (i) G is positive or negative definite, and F ≠ O,
    (ii) G has positive and negative eigenvalues,

holds, then for any λ with Σ_{j=H0}^H λ_j = 1 and λ_j ≠ 0, the point θ_λ^(H) is a saddle point of L_H.

Theorem 5 is easily proved from Lemma 4: from the form of the lower-right four blocks of Eq. (8), the Hessian has positive and negative eigenvalues if G̃ is positive (or negative) definite and F̃ ≠ O. See Sec. D.3 in Supplements for a complete proof. The assumption M ≥ 2 is necessary for condition (i) to hold. In fact, [4] discussed the case of M = 1, in which F = O is derived; that paper also gave a sufficient condition for the embedded point θ_λ^(H) to be a local minimum when G is positive (or negative) definite. See Sec. E for more details on the special case of M = 1.

Suppose that θ*^(H0) attains zero training error. Then, θ_λ^(H) can never be a saddle point but is a global minimum, so situation (ii) can never happen. In that case, if G is invertible, it must be positive definite and F = O. We will discuss this case further in Sec. 5.1.

4 Semi-flat minima by embedding of ReLU networks

This section discusses networks with ReLU activation, whose special shape leads to different results. Let ϕ(t) be the ReLU function ϕ(t) = max{t, 0}, which is used very often in DNNs to prevent vanishing gradients [12, 5]. The activation is given by φ(x; w) = ϕ(wᵀx̃) with wᵀx̃ := w_wgtᵀx − w_bias. It is important to note that the ReLU function satisfies positive homogeneity, i.e., ϕ(αt) = αϕ(t) for any α ≥ 0.
This causes special properties of φ: (a) φ(x; rw) = rφ(x; w) for any r ≥ 0; (b) ∂φ(x; w)/∂w |_{w=rw*} = ∂φ(x; w)/∂w |_{w=w*} if r > 0 and w*ᵀx̃ ≠ 0; and (c) ∂²φ(x; w)/∂w∂w = 0 if wᵀx̃ ≠ 0.

From the positive homogeneity, an effective parameterization needs some normalization of v_j or w_j. However, this paper uses the redundant parameterization. In our theoretical arguments, no problem is caused by the redundancy, while it gives additional flat directions in the parameter space.

4.1 Embeddings of ReLU networks

Reflecting the above special properties, we introduce modified versions of the embeddings of θ*^(H0).

(I)R Unit replication: Fix U^q_{H0}, and take γ = (γ_{H0}, …, γ_H) ∈ ℝ^{H−H0+1} and β = (β_{H0}, …, β_H) such that β_j > 0 (H0 ≤ ∀j ≤ H) and Σ_{j=H0}^H γ_j β_j = 1. Define θ_{γ,β}^(H) by

    w_i = u_i,           v_i = ζ_i           (1 ≤ i ≤ H0 − 1),
    w_j = β_j u_{H0},    v_j = γ_j ζ_{H0}    (H0 ≤ j ≤ H).        (9)

(II)R Inactive units: Define a parameter θ̂^(H) by

    w_i = u_i,   v_i = ζ_i   (1 ≤ i ≤ H0),
    w_j such that w_jᵀx̃_ν < 0 (∀ν, H0+1 ≤ j ≤ H),   v_j: arbitrary (H0+1 ≤ j ≤ H).        (10)

Note that the definition (II)R is different from the smooth activation case. The last condition is easily satisfied if w_bias is large. Note also that φ(x_ν; w_j) = 0 for each ν, but φ(x; w_j) ≢ 0 in general. Since a small change of w_j (H0+1 ≤ j ≤ H) does not alter φ(x_ν; w_j) = 0, the function L_H is locally constant in v_j and w_j (H0+1 ≤ j ≤ H) at θ̂^(H).
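A small sketch (NumPy, with illustrative sizes; not the paper's code) makes this local constancy concrete: surplus units built as in Eq. (10) never activate on the training inputs, so the training-set outputs, and hence L_H, are unchanged by small perturbations of the surplus (v_j, w_j):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu_net(X, W, V):
    """f(x) = sum_j v_j * max(w_wgt_j . x - w_bias_j, 0).  W rows are (w_wgt, w_bias)."""
    Xt = np.hstack([X, -np.ones((len(X), 1))])
    return np.maximum(Xt @ W.T, 0.0) @ V

D, M, H0, E, n = 3, 2, 4, 5, 10          # E surplus units, n training points
X = rng.standard_normal((n, D))
U = rng.standard_normal((H0, D + 1))
Z = rng.standard_normal((H0, M))
y0 = relu_net(X, U, Z)

# Eq. (10): surplus units with w_j^T x~_nu < 0 on every training input
# (small incoming weights, large bias), and arbitrary outgoing v_j.
Ws = np.hstack([0.1 * rng.standard_normal((E, D)),
                5.0 + rng.random((E, 1))])          # bias far exceeds |w_wgt . x|
Vs = rng.standard_normal((E, M))
Xt = np.hstack([X, -np.ones((n, 1))])
assert np.all(Xt @ Ws.T < 0)                        # all surplus units inactive

# Outputs at the training inputs are unchanged, and remain unchanged under
# small perturbations of the surplus (v_j, w_j): an exactly flat region of L_H.
assert np.allclose(relu_net(X, np.vstack([U, Ws]), np.vstack([Z, Vs])), y0)
for _ in range(20):
    Wp = Ws + 0.05 * rng.standard_normal(Ws.shape)
    Vp = rng.standard_normal((E, M))
    assert np.allclose(relu_net(X, np.vstack([U, Wp]), np.vstack([Z, Vp])), y0)
print("surplus inactive ReLU units give an exactly flat neighborhood")
```

The flatness is exact, not approximate: an inactive ReLU unit outputs exactly zero, so the embedded loss value is reproduced bit for bit.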
This is a clear difference from the smooth case, where changing w_j from w^(0) may produce a different function.

(III)R Inactive propagation: The inactive propagation is exactly the same as in the smooth activation case. The embedded point is denoted by θ̃^(H).

The following proposition is obvious from the definitions.

Proposition 6. For the unit replication and inactive propagation, we have f^(H)_{θ_{γ,β}^(H)} = f^(H)_{θ̃^(H)} = f^(H0)_{θ*^(H0)}.

We see that there are some other flat directions in addition to those of the general case. In the embedding by inactive units, L_H keeps the same value as long as the condition w_jᵀx̃_ν ≤ 0 is maintained. Assume ‖x_ν‖ ≤ 1 without loss of generality, and fix a constant K > 1. Define ŵ_{j,wgt} = 0 and ŵ_{j,bias} = 2K for H0+1 ≤ j ≤ H. Since w_jᵀx̃_ν ≤ ‖w_{j,wgt}‖ − w_{j,bias} ≤ 0 for w_j ∈ B_K := {w_j | ‖w_{j,wgt}‖ ≤ K and K ≤ w_{j,bias} ≤ 3K} and any v_j (H0+1 ≤ j ≤ H), we have the following result, showing that an (H − H0) × (M + D) dimensional affine subset at θ̂^(H) gives the same values at the x_ν.

Proposition 7. Assume ‖x_ν‖ ≤ 1 (∀ν). If (v_i, w_i) = (ζ_{i*}, u_{i*}) (1 ≤ i ≤ H0) and (v_j, w_j) ∈ ℝ^M × B_K (H0+1 ≤ j ≤ H), then for any ν = 1, …, n,

    f^(H)(x_ν; θ^(H)) = f^(H0)(x_ν; θ*^(H0)).

Next, for the unit replication of ReLU networks, the piecewise linearity of ReLU causes additional flat directions. To see this, for a fixed (γ, β) with Σ_j γ_j β_j = 1, we introduce a parametrization in a similar manner to the smooth case. Let A = (α_{cj}) be an (H − H0) × (H − H0 + 1) matrix such that Σ_{j=H0}^H α_{cj} γ_j β_j = 0 (∀c) and ( 1_{H−H0+1}ᵀ ; A ) is invertible. Fix such A and define (a, ξ_{H0+1}, …, ξ_H; b, η_{H0+1}, …, η_H) by Eq. (6). The next proposition shows that a small change of (η_j)_{j=H0+1}^H does not alter the value L_H(θ^(H)) = L_{H0}(θ*^(H0)). Let B_δ^η(θ^(H)) denote the intersection of the ball of radius δ > 0 at θ^(H) and the affine subspace spanned by η_{H0+1}, …, η_H at θ^(H).

Proposition 8. Let {x_ν}_{ν=1}^n be any data set, θ*^(H0) be any parameter of the ReLU network N_{H0}, and θ_{γ,β}^(H) be defined by Eq. (9). Assume that u_{H0*}ᵀx_ν ≠ 0 for all ν. Then, there is δ > 0 such that

    f^(H)(x_ν; θ^(H)) = f^(H0)(x_ν; θ*^(H0))        (∀θ^(H) ∈ B_δ^η(θ_{γ,β}^(H)), ∀ν = 1, …, n).

See Sec. F.1 for the proof. The situation u_{H0*}ᵀx_ν ≠ 0 easily occurs in practice (Fig. 2(a)).

4.2 Embedding a local minimum of ReLU networks

We first consider the embedding of a minimum by inactive units. Let θ̂^(H) be an embedding of θ*^(H0) by Eq. (10). From Proposition 7, L_H(θ^(H)) does not depend on (v_j, w_j)_{j=H0+1}^H around θ̂^(H), but takes the same value as L_{H0}(θ^(H0)) with θ^(H0) = (v_i, w_i)_{i=1}^{H0}. We thus have the following theorem.

Theorem 9. Assume that θ*^(H0) is a minimum of L_{H0}. Then, the embedded point θ̂^(H) defined by Eq. (10) (inactive units) is a minimum of L_H.

Theorem 9 and Proposition 7 imply that there is an (H − H0) × (M + D) dimensional affine subset consisting of local minima, and in those directions L_H is flat.

Next, we consider the embedding by unit replication, which needs a further restriction on γ and β. Let θ^(H0) be a parameter of N_{H0}, and let γ = (γ_j)_{j=H0}^H satisfy Σ_{j=H0}^H γ_j > 0. Define θ_γ^(H) by replacing w_j = β_j u_{H0} in Eq. (9) with w_j = u_{H0} / Σ_{k=H0}^H γ_k (H0 ≤ j ≤ H). If we assume u_{H0*}ᵀx_ν ≠ 0 (∀ν), the function L_H is differentiable in η_c and ξ_c, and for the same reason as in Theorem 5 the derivatives are zero. Restricting the function to those directions around θ_γ^(H), and using the fact that ∂²φ(x_ν; u_{H0})/∂u_{H0}∂u_{H0} = 0, we can see that the Hessian has the form

    ( O    F̃ )
    ( F̃ᵀ   O ),

which has a positive and a negative eigenvalue unless F = O. This yields the following theorem (see Sec. F.2 for a complete proof).

Theorem 10. Suppose that θ*^(H0) is a minimum point of L_{H0}. Assume that u_{H0*}ᵀx_ν ≠ 0 for any ν = 1, …, n, and that F ≠ O, where F is given in Lemma 4. Then, for any γ ∈ ℝ^{H−H0+1} such that Σ_{j=H0}^H γ_j > 0, the embedded parameter θ_γ^(H) is a saddle point of L_H.

5 Discussions

5.1 Minimum of zero error

In using a very large network with more parameters than the data size, the training error may reach zero. Assume ℓ(y, z) ≥ 0 and that a narrower model attains L_{H0}(θ*^(H0)) = 0 without redundant units, i.e., any deletion of a unit increases the training error. We investigate overparameterized realization of such a global minimum by embedding into a wider network N_H. Note that by any of the methods the embedded parameter is a minimum. This causes special local properties at the embedded point. For simplicity, we assume three-layer networks and ‖x_ν‖ ≤ 1 (∀ν).

First, consider the unit replication for smooth activation. As discussed in the last part of Sec. 3.2, the Hessian takes the form

    Smooth:  ∇²L_H(θ_λ^(H)) = ( ∇²L_{H0}(θ*^(H0))   O   O )
                               ( O                    O   O )
                               ( O                    O   G̃ ),        (11)

with blocks corresponding to (θ^(H0), ξ_c, η_c), where G̃ is non-negative definite. It is not difficult to see (Sec. G.2.2) that, in the case of inactive units, the lower-right four blocks take the form ( O O ; O S ). The case of inactive propagation is similar.

For ReLU activation, assume for simplicity that θ*^(H0) is a differentiable point of L_{H0}. From Proposition 7, the Hessian at the embedding θ̂^(H) by inactive units is given by

    ReLU:  ∇²L_H(θ̂^(H)) = ( ∇²L_{H0}(θ*^(H0))   O )
                            ( O                    O ),        (12)

with blocks corresponding to (θ^(H0), (v_j, w_j)). Similarly to the smooth case, the Hessian for the unit replication θ_γ^(H) takes the same form as Eq. (12).

5.2 Generalization error bounds of embedded networks

Based on the results in Sec. 5.1, we here compare the embeddings of ReLU and smooth activation. The results suggest that the ReLU networks can have an advantage in generalization error when zero training error is realized by some type of overparameterized model.

Suppose that the smooth model N_{H0,s} and the ReLU model N_{H0,r} attain zero training error without redundant units. They are embedded by the method of inactive units into N_{Hs} and N_{Hr}, respectively, so that Hs − H0,s = Hr − H0,r (=: E) (the same number of surplus units). The dimensionalities of the parameters of N_{H0,s} and N_{H0,r} are denoted by d⁰_sm and d⁰_rl, respectively.

The major difference between the local properties in Eqs. (11) and (12) is the existence of the matrix S or G̃ in the smooth case. The ReLU network has a flat error surface L_H in both the w_j and v_j directions. In this sense, the embedded minimum is flatter in the ReLU network.
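This contrast can be observed directly. The sketch below (a three-layer toy model in NumPy; the sizes and the squared-error loss are illustrative choices, not the paper's experimental setup) perturbs only the surplus weights w_j of an inactive-units embedding: the ReLU loss is exactly constant, while the tanh loss moves. Note that the point used is a generic parameter rather than a trained zero-error minimum, so the tanh change here is first order; at a minimum it would be the second-order effect captured by S.

```python
import numpy as np

rng = np.random.default_rng(2)

D, M, H0, E, n = 3, 2, 4, 3, 8          # E surplus units, n data points
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, M))
Xt = np.hstack([X, -np.ones((n, 1))])   # x~ = (x, -1): w^T x~ = w_wgt.x - w_bias

def loss(act, W, V):
    """Squared-error training loss of a three-layer net with activation `act`."""
    return 0.5 * np.sum((act(Xt @ W.T) @ V - Y) ** 2)

relu = lambda t: np.maximum(t, 0.0)
U = rng.standard_normal((H0, D + 1))    # shared non-surplus parameters
Z = rng.standard_normal((H0, M))
V = np.vstack([Z, rng.standard_normal((E, M))])   # surplus v_j arbitrary

# inactive surplus units: w^(0) = 0 for tanh; off-data units as in Eq. (10) for ReLU
W_tanh = np.vstack([U, np.zeros((E, D + 1))])
W_relu = np.vstack([U, np.hstack([np.zeros((E, D)), 5.0 * np.ones((E, 1))])])
L_tanh, L_relu = loss(np.tanh, W_tanh, V), loss(relu, W_relu, V)

d_tanh = d_relu = 0.0
for _ in range(50):                     # perturb only the surplus rows of W
    dW = np.zeros((H0 + E, D + 1))
    dW[H0:] = 0.05 * rng.standard_normal((E, D + 1))
    d_tanh = max(d_tanh, abs(loss(np.tanh, W_tanh + dW, V) - L_tanh))
    d_relu = max(d_relu, abs(loss(relu, W_relu + dW, V) - L_relu))

assert d_relu == 0.0                    # exactly flat: the surplus units stay inactive
assert d_tanh > 1e-6                    # the tanh surplus directions are not flat
print(f"max loss change along surplus w-directions: tanh {d_tanh:.3e}, ReLU {d_relu}")
```

The exact zero on the ReLU side is what makes the flat factors of the posterior Q cheap in the PAC-Bayes comparison developed next.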
We relate this difference of\nsemi-\ufb02atness to the generalization ability of the networks through the PAC-Bayes bounds, which has\nbeen already used for discussing deep learning [13]. Our motivation here is to consider the difference\nof the activation functions. We give a summary here and defer the details in Sec. G, Supplements.\nLet D be a probability distribution of (x, y) and LH (\u03b8(H)) := ED[(cid:96)(y, f (x; \u03b8(H)))] be the gener-\nalization error (or risk). Training data (x1, y1), . . . , (xn, yn) are i.i.d. sample with distribution D.\nThen, with a trained parameter \u02c6\u03b8, the PAC-Bayes bound tells\n\nrl, respectively.\n\nsm and d0\n\nLH ( \u02c6\u03b8) (cid:47) 1\nn\n\nLH ( \u02c6\u03b8) + 2\n\n2(KL(Q||P ) + ln 2\u03b4\nn )\n\nn \u2212 1\n\n,\n\n(13)\n\nwhere P is a prior distribution which does not depend on the training data, and Q is any distribution\nsuch that it distributes on parameters that do not change the value of LH so much from LH ( \u02c6\u03b8).\nWe focus on the embedding by inactive units here. See Sec. G.2.3, Supplements, for the other\ncases. The essential factor of the PAC-Bayes bound is the KL-divergence KL(Q||P ), which is to\nbe small. We use different choices of P and Q for the smooth and ReLU networks (see Sec. G for\ndetails). For the smooth networks, Psm is a non-informative normal distribution N (0, \u03c32Idsm ) with\n\n7\n\n\f(a)\n\n(b)\n\nsm,0, \u03c4 2H\u22121\n\nsm) \u00d7 N ( \u02c6\u03b8(H)\n\nFigure 2: (a) Data and \ufb01tting by N5 with ReLU. (b) Ratio of generalization errors of NH and NH0.\nsm,2, \u03c4 2S\u22121) with \u03c4 (cid:28) 1, where\n\u03c3 (cid:29) 1, and Qsm is N ( \u02c6\u03b8(H)\nj=H0+1. Hsm :=\nthe decomposition corresponds to the components \u03b8(H0), (vj)H\n\u22072LH0(\u03b8(H0)\n) \u00d7\nN (0, \u03c32Id1) \u00d7 UnifBE\n, where d1 =\nE \u00d7 M is dim(vj)H\n\n\u2217,sm) is the Hessian. For ReLU, based on Proposition 7, Prl is given by N (0, \u03c32Id0\n\nj=H0+1. 
For these choices, the major difference of the bounds is the term d_1 log(σ²/τ²) in the KL divergence for the smooth model. We can argue that, in realizing perfect fitting to the training data with an overparameterized network, the ReLU network achieves a better upper bound on generalization than the smooth network when the numbers of surplus units are the same.

Numerical experiments. We made experiments on the generalization errors of networks with ReLU and tanh activation in overparameterization. The input and output dimensions are 1. Training data of size 10 are generated by N_1 (one hidden unit) for the respective models, with additive noise ε ∼ N(0, 10⁻²) in the output. We first trained three-layer networks with each activation to achieve zero training error (< 10⁻²⁹ in squared error) with the minimum number of hidden units (H_0 = 5 in both models). See Figure 2(a) for an example of fitting by the ReLU network. We used the method of inactive units for embedding into N_H, and perturbed the whole parameter with N(0, ρ²), where ρ = 0.01 × ‖θ^{(H_0)}_*‖. The code is available in Supplements. Figure 2(b) shows the ratio of the generalization errors (average and standard error over 1000 trials) of N_H to N_{H_0} as H increases. We can see that, as more surplus units are added, the generalization errors increase for the tanh networks, while the ReLU networks show no such increase. This accords with the theoretical considerations in Sec. 5.2: adding surplus units with tanh activation makes sharp directions, which degrade the generalization.

5.3 Additional remarks

Regularization.
In training of a large network, one often regularizes parameters based on a norm such as ℓ_2 or ℓ_1. Consider, for example, the inactive method of embedding for tanh or ReLU by setting v_j = 0 and w_j = 0 (H_0 + 1 ≤ j ≤ H). Then the norm of the embedded parameter is smaller than that of unit replication. This implies that if norm regularization is applied during training, the embeddings by inactive units and inactive propagation tend to be promoted in overparameterized realization.

Abundance of semi-flat minima in ReLU networks. Theorems 9 and 10 discuss three-layer models for simplicity, but they can be easily extended to networks with any number of layers. Given a minimum of L_{H_0}, it can be embedded into a wider network by making units inactive in any layer. Thus, in a very large (deep and wide) network with overparameterization, there are many affine subsets of parameters realizing the same function, which consist of semi-flat minima of the training error.

6 Conclusions

For a better theoretical understanding of the error landscape, this paper has discussed three methods for embedding a network into a wider model, and studied overparameterized realization of a function and its local properties. From the difference of the properties between smooth and ReLU networks, our results suggest that ReLU may have an advantage in realizing zero errors with better generalization. The current analysis reveals some nontrivial geometry of the error landscape, and its implications for the dynamics of learning are among important future works.

References

[1] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR, abs/1811.04918, 2018. URL http://arxiv.org/abs/1811.04918.

[2] S. Arora, N. Cohen, and E. Hazan.
On the optimization of deep networks: Implicit acceleration by overparameterization. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 244–253, 2018.

[3] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. T. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. CoRR, abs/1611.01838, 2017.

[4] K. Fukumizu and S. Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000.

[5] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In G. Gordon, D. Dunson, and M. Dudík, editors, Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011.

[6] S. Hochreiter and J. Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, pages 529–536. MIT Press, 1995.

[7] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997. doi: 10.1162/neco.1997.9.1.1.

[8] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2017.

[9] B. Kleinberg, Y. Li, and Y. Yuan. An alternative view: When does SGD escape local minima? In Proceedings of the 35th International Conference on Machine Learning, pages 2698–2707, 2018.

[10] V. Kůrková and P. C. Kainen. Functionally equivalent feedforward neural networks. Neural Computation, 6(3):543–558, 1994. doi: 10.1162/neco.1994.6.3.543.

[11] D. A. McAllester. Some PAC-Bayesian theorems.
Machine Learning, 37(3):355–363, Dec 1999.

[12] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML 2010, pages 807–814, 2010.

[13] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems 30, pages 5947–5956, 2017.

[14] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 2603–2612. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305890.3305950.

[15] A. Rangamani, N. H. Nguyen, A. Kumar, D. Phan, S. H. Chin, and T. D. Tran. A scale invariant flatness measure for deep network minima. arXiv:1902.02434 [stat.ML], Feb 2019.

[16] H. J. Sussmann. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5(4):589–593, 1992.

[17] Y. Tsuzuku, I. Sato, and M. Sugiyama. Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis. arXiv:1901.04653, Jan 2019.