{"title": "Globally Optimal Training of Generalized Polynomial Neural Networks with Nonlinear Spectral Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 1687, "page_last": 1695, "abstract": "The optimization problem behind neural networks is highly non-convex. Training with stochastic gradient descent and variants requires careful parameter tuning and provides no guarantee to achieve the global optimum. In contrast we show under quite weak assumptions on the data that a particular class of feedforward neural networks can be trained globally optimal with a linear convergence rate. Up to our knowledge this is the first practically feasible method which achieves such a guarantee. While the method can in principle be applied to deep networks, we restrict ourselves for simplicity in this paper to one- and two hidden layer networks. Our experiments confirms that these models are already rich enough to achieve good performance on a series of real-world datasets.", "full_text": "Globally Optimal Training of Generalized\n\nPolynomial Neural Networks with Nonlinear\n\nSpectral Methods\n\nA. Gautier, Q. Nguyen and M. Hein\n\nDepartment of Mathematics and Computer Science\n\nSaarland Informatics Campus, Saarland University, Germany\n\nAbstract\n\nThe optimization problem behind neural networks is highly non-convex.\nTraining with stochastic gradient descent and variants requires careful\nparameter tuning and provides no guarantee to achieve the global optimum.\nIn contrast we show under quite weak assumptions on the data that a\nparticular class of feedforward neural networks can be trained globally\noptimal with a linear convergence rate with our nonlinear spectral method.\nUp to our knowledge this is the \ufb01rst practically feasible method which\nachieves such a guarantee. While the method can in principle be applied to\ndeep networks, we restrict ourselves for simplicity in this paper to one and\ntwo hidden layer networks. 
Our experiments confirm that these models are rich enough to achieve good performance on a series of real-world datasets.

1 Introduction

Deep learning [13, 16] is currently the state-of-the-art machine learning technique in many application areas such as computer vision or natural language processing. While the theoretical foundations of neural networks have been explored in depth, see e.g. [1], the understanding of the success of training deep neural networks is currently a very active research area [5, 6, 9]. On the other hand, the parameter search for stochastic gradient descent and variants such as Adagrad and Adam can be quite tedious, and there is no guarantee of convergence to the global optimum. In particular, the problem is in general NP-hard even for a single hidden layer, see [17] and references therein. This implies that one has to impose certain conditions on the problem in order to achieve global optimality efficiently.

A recent line of research has directly tackled the optimization problem of neural networks and provided either certain guarantees [2, 15] in terms of the global optimum or proved convergence to the global optimum directly [8, 11]. The latter two papers are, to our knowledge, the first results which provide a globally optimal algorithm for training neural networks. While providing many interesting insights on the relationship between structured matrix factorization and the training of neural networks, Haeffele and Vidal admit themselves in their paper [8] that their results are "challenging to apply in practice". Janzamin et al. [11] use a tensor approach and propose a globally optimal algorithm for a feedforward neural network with one hidden layer and squared loss. However, their approach requires the computation of the score function tensor, which uses the density of the data-generating measure. This density is unknown and difficult to estimate for high-dimensional feature spaces. Moreover, one has to check certain non-degeneracy conditions of the tensor decomposition to get the global optimality guarantee.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In contrast, our nonlinear spectral method just requires that the data is nonnegative, which is true for all sorts of count data such as images, word frequencies, etc. The condition which guarantees global optimality depends only on the parameters of the network architecture and boils down to the computation of the spectral radius of a small nonnegative matrix; it can be checked without running the algorithm. Moreover, the nonlinear spectral method has a linear convergence rate, so the globally optimal training of the network is very fast. The two main changes compared to the standard setting are that we require nonnegativity of the network weights and that we minimize a modified objective function, namely the sum of the loss and the negative total sum of the outputs. While this model is non-standard, we show in first experimental results that the resulting classifier is still expressive enough to create complex decision boundaries, and we achieve competitive performance on some UCI datasets. As the nonlinear spectral method requires some non-standard techniques, we use the main part of the paper to develop the key steps necessary for the proof. Some proofs of intermediate results are moved to the supplementary material.

2 Main result

In this section we present the algorithm together with the main theorem providing the convergence guarantee. We limit the presentation to one-hidden-layer networks to improve the readability of the paper. Our approach can be generalized to feedforward networks of arbitrary depth. 
In particular, we present in Section 4.1 results for two hidden layers.

We consider in this paper multi-class classification, where d is the dimension of the feature space and K is the number of classes. We use the negative cross-entropy loss, defined for a label y ∈ [K] := {1, . . . , K} and a classifier f : R^d → R^K as

  L(y, f(x)) = −log( e^{f_y(x)} / Σ_{j=1}^K e^{f_j(x)} ) = −f_y(x) + log( Σ_{j=1}^K e^{f_j(x)} ).

The function class we are using is a feedforward neural network with one hidden layer of n_1 hidden units. As activation functions we use real powers in the form of a generalized polynomial, that is, for α ∈ R^{n1} with α_i ≥ 1, i ∈ [n_1], we define

  f_r(x) = f_r(w, u)(x) = Σ_{l=1}^{n1} w_{rl} ( Σ_{m=1}^d u_{lm} x_m )^{α_l},    (1)

where R_+ = {x ∈ R | x ≥ 0} and w ∈ R^{K×n1}_+, u ∈ R^{n1×d}_+ are the parameters of the network which we optimize. The function class in (1) can be seen as a generalized polynomial in the sense that the powers do not have to be integers. Polynomial neural networks have recently been analyzed in [15]. Please note that a ReLU activation function makes no sense in our setting, as we require the data as well as the weights to be nonnegative. Even though nonnegativity of the weights is a strong constraint, one can model quite complex decision boundaries (see Figure 1, where we show the outcome of our method for a toy dataset in R^2).

[Figure 1: Classification decision boundaries in R^2. (Best viewed in colors.)]

In order to simplify the notation we write w = (w_1, . . . , w_K) for the K output units w_i ∈ R^{n1}_+, i = 1, . . . , K. All output units and the hidden layer are normalized. We optimize over the set

  S_+ = { (w, u) ∈ R^{K×n1}_+ × R^{n1×d}_+ | ‖u‖_{pu} = ρ_u, ‖w_i‖_{pw} = ρ_w, ∀ i = 1, . . . , K }.

We also introduce S_{++}, where one replaces R_+ with R_{++} = {t ∈ R | t > 0}. The final optimization problem we are going to solve is given as

  max_{(w,u) ∈ S_+} Φ(w, u)  with
  Φ(w, u) = (1/n) Σ_{i=1}^n [ −L(y_i, f(w, u)(x_i)) + Σ_{r=1}^K f_r(w, u)(x_i) ] + ε ( Σ_{r=1}^K Σ_{l=1}^{n1} w_{rl} + Σ_{l=1}^{n1} Σ_{m=1}^d u_{lm} ),    (2)

where (x_i, y_i) ∈ R^d_+ × [K], i = 1, . . . , n, is the training data. Note that this is a maximization problem and thus we use minus the loss in the objective, so that we are effectively minimizing the loss. The reason to write this as a maximization problem is that our nonlinear spectral method is inspired by the theory of (sub)-homogeneous nonlinear eigenproblems on convex cones [14], which has its origin in the Perron-Frobenius theory for nonnegative matrices. In fact our work is motivated by the closely related Perron-Frobenius theory for multi-homogeneous problems developed in [7]. This is also the reason why we have nonnegative weights: we work on the positive orthant, which is a convex cone. Note that ε > 0 in the objective can be chosen arbitrarily small and is added for technical reasons.

In order to state our main theorem we need some additional notation. For p ∈ (1, ∞), we let p' = p/(p − 1) be the Hölder conjugate of p, and ψ_p(x) = sign(x)|x|^{p−1}. We apply ψ_p to scalars and vectors, in which case the function is applied componentwise. For a square matrix A we denote its spectral radius by ρ(A). Finally, we write ∇_{w_i}Φ(w, u) (resp. ∇_u Φ(w, u)) to denote the gradient of Φ with respect to w_i (resp. u) at (w, u). 
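As a concrete reference, the function class (1) and the objective (2) can be written out directly. The following is a purely illustrative numpy sketch (the shapes and the default value of ε are our choices, not taken from the paper):

```python
import numpy as np

def f(w, u, x, alpha):
    """Network output (1): f_r(x) = sum_l w[r, l] * (u[l, :] @ x) ** alpha[l]."""
    hidden = (u @ x) ** alpha          # hidden layer activations, shape (n1,)
    return w @ hidden                  # outputs f_1(x), ..., f_K(x), shape (K,)

def objective(w, u, X, y, alpha, eps=1e-4):
    """Objective Phi in (2): average of (-loss + sum of outputs) plus eps * (sum of all weights)."""
    total = 0.0
    for xi, yi in zip(X, y):
        out = f(w, u, xi, alpha)
        loss = -out[yi] + np.log(np.sum(np.exp(out)))   # negative cross-entropy loss
        total += -loss + np.sum(out)
    return total / len(X) + eps * (w.sum() + u.sum())
```

Since the data and all weights are nonnegative, every output f_r(x) is nonnegative, which is what the sketch relies on.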
The mapping

  G_Φ(w, u) = ( ρ_w ψ_{p'_w}(∇_{w_1}Φ(w, u)) / ‖ψ_{p'_w}(∇_{w_1}Φ(w, u))‖_{pw}, . . . , ρ_w ψ_{p'_w}(∇_{w_K}Φ(w, u)) / ‖ψ_{p'_w}(∇_{w_K}Φ(w, u))‖_{pw}, ρ_u ψ_{p'_u}(∇_u Φ(w, u)) / ‖ψ_{p'_u}(∇_u Φ(w, u))‖_{pu} )    (3)

defines a sequence converging to the global optimum of (2). Indeed, we prove:

Theorem 1. Let {x_i, y_i}_{i=1}^n ⊂ R^d_+ × [K], p_w, p_u ∈ (1, ∞), ρ_w, ρ_u > 0, n_1 ∈ N and α ∈ R^{n1} with α_i ≥ 1 for every i ∈ [n_1]. Define ρ_x, ξ_1, ξ_2 > 0 as

  ρ_x = max_{i∈[n]} ‖x_i‖_1,  ξ_1 = ρ_w Σ_{l=1}^{n1} (ρ_u ρ_x)^{α_l},  ξ_2 = ρ_w Σ_{l=1}^{n1} α_l (ρ_u ρ_x)^{α_l},

and let A ∈ R^{(K+1)×(K+1)}_+ be defined as

  A_{l,m} = 4(p'_w − 1)ξ_1,  A_{l,K+1} = 2(p'_w − 1)(2ξ_2 + ‖α‖_∞),  ∀ m, l ∈ [K],
  A_{K+1,m} = 2(p'_u − 1)(2ξ_1 + 1),  A_{K+1,K+1} = 2(p'_u − 1)(2ξ_2 + ‖α‖_∞ − 1).

If the spectral radius ρ(A) of A satisfies ρ(A) < 1, then (2) has a unique global maximizer (w*, u*) ∈ S_{++}. Moreover, for every (w^0, u^0) ∈ S_{++}, there exists R > 0 such that

  lim_{k→∞} (w^k, u^k) = (w*, u*)  and  ‖(w^k, u^k) − (w*, u*)‖_∞ ≤ R ρ(A)^k  ∀ k ∈ N,

where (w^{k+1}, u^{k+1}) = G_Φ(w^k, u^k) for every k ∈ N.

Note that for a given model (number of hidden units n_1, choice of α, p_w, p_u, ρ_u, ρ_w) one can easily check whether the convergence guarantee to the global optimum holds by computing the spectral radius of a square matrix of size K + 1. As our bounds for the matrix A are very conservative, the "effective" spectral radius is typically much smaller, so that we have very fast convergence in only a few iterations; see Section 5 for a discussion. 
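The condition of Theorem 1 can be verified numerically before training. A minimal sketch of the check (the parameter values below are purely illustrative, not taken from the paper):

```python
import numpy as np

# Illustrative model parameters; only K, n1, alpha, p_w, p_u, rho_w, rho_u
# and rho_x = max_i ||x_i||_1 enter the condition.
K, n1 = 2, 3
alpha = np.array([1.0, 1.5, 2.0])
p_w, p_u = 40.0, 40.0
rho_w, rho_u, rho_x = 0.5, 0.5, 1.0

pw_c = p_w / (p_w - 1.0)   # Hoelder conjugates p'_w, p'_u
pu_c = p_u / (p_u - 1.0)

xi1 = rho_w * np.sum((rho_u * rho_x) ** alpha)
xi2 = rho_w * np.sum(alpha * (rho_u * rho_x) ** alpha)

# Matrix A of Theorem 1 (size (K+1) x (K+1)).
A = np.empty((K + 1, K + 1))
A[:K, :K] = 4.0 * (pw_c - 1.0) * xi1
A[:K, K] = 2.0 * (pw_c - 1.0) * (2.0 * xi2 + alpha.max())
A[K, :K] = 2.0 * (pu_c - 1.0) * (2.0 * xi1 + 1.0)
A[K, K] = 2.0 * (pu_c - 1.0) * (2.0 * xi2 + alpha.max() - 1.0)

spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
print("rho(A) =", spectral_radius, "guarantee holds:", spectral_radius < 1.0)
```

With large p_w, p_u the conjugates p'_w, p'_u approach 1 and the entries of A shrink, which is the mechanism behind the bounds in (4) below.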
To our knowledge this is the first practically feasible algorithm to achieve global optimality for a non-trivial neural network model. Additionally, compared to stochastic gradient descent, there is no free parameter in the algorithm, so no careful tuning of the learning rate is required. The reader might wonder why we add the second term in the objective, where we sum over all outputs. The reason is that we need the gradient of G_Φ to be strictly positive in S_+; this is also why we add the third term for arbitrarily small ε > 0. In Section 5 we show that this model achieves competitive results on a few UCI datasets.

Choice of α: It turns out that in order to get a non-trivial classifier one has to choose α_1, . . . , α_{n1} ≥ 1 so that α_i ≠ α_j for every i, j ∈ [n_1] with i ≠ j. The reason for this lies in certain invariance properties of the network. Suppose that we use a permutation invariant componentwise activation function σ, that is σ(Px) = Pσ(x) for any permutation matrix P, and suppose that A, B are globally optimal weight matrices for a one-hidden-layer architecture. Then for any permutation matrix P,

  Aσ(Bx) = AP^T Pσ(Bx) = AP^T σ(PBx),

which implies that A' = AP^T and B' = PB yield the same function and thus are also globally optimal. In our setting we know that the global optimum is unique, and thus it has to hold that A = AP^T and B = PB for all permutation matrices P. This implies that both A and B have rank one and thus lead to trivial classifiers. This is the reason why one has to use a different α_l for every unit.

Dependence of ρ(A) on the model parameters: Let Q, Q̃ ∈ R^{m×m}_+ and assume 0 ≤ Q_{i,j} ≤ Q̃_{i,j} for every i, j ∈ [m]; then ρ(Q) ≤ ρ(Q̃), see Corollary 3.30 in [3]. It follows that ρ(A) in Theorem 1 is increasing w.r.t. ρ_u, ρ_w, ρ_x and the number of hidden units n_1. Moreover, ρ(A) is decreasing w.r.t. p_u, p_w, and in particular, we note that for any fixed architecture (n_1, α, ρ_u, ρ_w) it is always possible to find p_u, p_w large enough so that ρ(A) < 1. Indeed, we know from the Collatz-Wielandt formula (Theorem 8.1.26 in [10]) that ρ(A) = ρ(A^T) ≤ max_{i∈[K+1]} (A^T v)_i / v_i for any v ∈ R^{K+1}_{++}. We use this to derive lower bounds on p_u, p_w that ensure ρ(A) < 1. Let v = (p_w − 1, . . . , p_w − 1, p_u − 1); then (A^T v)_i < v_i for every i ∈ [K + 1] guarantees ρ(A) < 1 and is equivalent to

  p_w > 4(K + 1)ξ_1 + 3  and  p_u > 2(K + 1)(‖α‖_∞ + 2ξ_2) − 1,    (4)

where ξ_1, ξ_2 are defined as in Theorem 1. However, we think that our current bounds are sub-optimal, so that this choice is quite conservative. Finally, we note that the constant R in Theorem 1 can be explicitly computed when running the algorithm (see Theorem 3).

Proof strategy: The following main part of the paper is devoted to the proof of the algorithm. For that we need some further notation. We introduce the sets

  V_+ = R^{K×n1}_+ × R^{n1×d}_+,  V_{++} = R^{K×n1}_{++} × R^{n1×d}_{++},
  B_+ = { (w, u) ∈ V_+ | ‖u‖_{pu} ≤ ρ_u, ‖w_i‖_{pw} ≤ ρ_w, ∀ i = 1, . . . , K },

and similarly we define B_{++}, replacing V_+ by V_{++} in the definition. The high-level idea of the proof is that we first show that the global maximum of our optimization problem in (2) is attained in the "interior" of S_+, that is, in S_{++}. Moreover, we prove that any critical point of (2) in S_{++} is a fixed point of the mapping G_Φ. Then we proceed to show that there exists a unique fixed point of G_Φ in S_{++} and thus there is a unique critical point of (2) in S_{++}. 
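The Collatz-Wielandt inequality behind the bounds in (4) is easy to check numerically. A small sketch with a random nonnegative matrix (illustrative only; any positive vector v gives an upper bound on the spectral radius):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))                      # a nonnegative matrix
v = rng.random(4) + 0.1                     # an arbitrary positive vector

rho = np.max(np.abs(np.linalg.eigvals(A)))  # spectral radius rho(A)
bound = np.max((A.T @ v) / v)               # Collatz-Wielandt upper bound

# rho(A) = rho(A^T) <= max_i (A^T v)_i / v_i for any positive v
assert rho <= bound + 1e-12
```

Choosing v = (p_w − 1, . . . , p_w − 1, p_u − 1) and forcing the bound below 1 is exactly how (4) is derived.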
As the global maximizer of (2) exists and is attained in the interior, this fixed point has to be the global maximizer.

Finally, the proof of the fact that G_Φ has a unique fixed point follows by noting that G_Φ maps B_{++} into B_{++} and that B_{++} is a complete metric space with respect to the Thompson metric. We provide a characterization of the Lipschitz constant of G_Φ and in turn derive conditions under which G_Φ is a contraction. Finally, the application of the Banach fixed point theorem yields the uniqueness of the fixed point of G_Φ and the linear convergence rate to the global optimum of (2). In Section 4 we show the application of the established framework to our neural networks.

3 From the optimization problem to fixed point theory

Lemma 1. Let Φ : V → R be differentiable. If ∇Φ(w, u) ∈ V_{++} for every (w, u) ∈ S_+, then the global maximum of Φ on S_+ is attained in S_{++}.

We now identify critical points of the objective Φ in S_{++} with fixed points of G_Φ in S_{++}.

Lemma 2. Let Φ : V → R be differentiable. If ∇Φ(w, u) ∈ V_{++} for all (w, u) ∈ S_{++}, then (w*, u*) is a critical point of Φ in S_{++} if and only if it is a fixed point of G_Φ.

Our goal is to apply the Banach fixed point theorem to G_Φ : B_{++} → S_{++} ⊂ B_{++}. We recall this theorem for the convenience of the reader.

Theorem 2 (Banach fixed point theorem, e.g. [12]). Let (X, d) be a complete metric space with a mapping T : X → X such that d(T(x), T(y)) ≤ q d(x, y) for q ∈ [0, 1) and all x, y ∈ X. Then T has a unique fixed point x* in X, that is T(x*) = x*, and the sequence defined as x_{n+1} = T(x_n) with x_0 ∈ X converges, lim_{n→∞} x_n = x*, with linear convergence rate

  d(x_n, x*) ≤ ( q^n / (1 − q) ) d(x_1, x_0).

So, we need to endow B_{++} with a metric μ so that (B_{++}, μ) is a complete metric space. A popular metric for the study of nonlinear eigenvalue problems on the positive orthant is the so-called Thompson metric d : R^m_{++} × R^m_{++} → R_+ [18], defined as

  d(z, z̃) = ‖ln(z) − ln(z̃)‖_∞  where  ln(z) = (ln(z_1), . . . , ln(z_m)).

Using the known facts that (R^n_{++}, d) is a complete metric space and that its topology coincides with the norm topology (see e.g. Corollary 2.5.6 and Proposition 2.5.2 in [14]), we prove:

Lemma 3. For p ∈ (1, ∞) and ρ > 0, ({z ∈ R^n_{++} | ‖z‖_p ≤ ρ}, d) is a complete metric space.

Now, the idea is to see B_{++} as a product of such metric spaces. For i = 1, . . . , K, let B^i_{++} = {w_i ∈ R^{n1}_{++} | ‖w_i‖_{pw} ≤ ρ_w} and d_i(w_i, w̃_i) = γ_i ‖ln(w_i) − ln(w̃_i)‖_∞ for some constant γ_i > 0. Furthermore, let B^{K+1}_{++} = {u ∈ R^{n1×d}_{++} | ‖u‖_{pu} ≤ ρ_u} and d_{K+1}(u, ũ) = γ_{K+1} ‖ln(u) − ln(ũ)‖_∞. Then (B^i_{++}, d_i) is a complete metric space for every i ∈ [K + 1] and B_{++} = B^1_{++} × . . . × B^K_{++} × B^{K+1}_{++}. It follows that (B_{++}, μ) is a complete metric space with μ : B_{++} × B_{++} → R_+ defined as

  μ((w, u), (w̃, ũ)) = Σ_{i=1}^K γ_i ‖ln(w_i) − ln(w̃_i)‖_∞ + γ_{K+1} ‖ln(u) − ln(ũ)‖_∞.

The motivation for introducing the weights γ_1, . . . , γ_{K+1} > 0 is given by the next theorem. We provide a characterization of the Lipschitz constant of a mapping F : B_{++} → B_{++} with respect to μ; this Lipschitz constant can then be minimized by a smart choice of γ. For i ∈ [K], a, j ∈ [n_1], b ∈ [d], we write F_{w_{i,j}} and F_{u_{ab}} to denote the components of F, so that F = (F_{w_{1,1}}, . . . , F_{w_{1,n1}}, F_{w_{2,1}}, . . . , F_{w_{K,n1}}, F_{u_{11}}, . . . , F_{u_{n1 d}}).

Lemma 4. Suppose that F ∈ C^1(B_{++}, V_{++}) and A ∈ R^{(K+1)×(K+1)}_+ satisfies

  ⟨|∇_{w_k} F_{w_{i,j}}(w, u)|, w_k⟩ ≤ A_{i,k} F_{w_{i,j}}(w, u),  ⟨|∇_u F_{w_{i,j}}(w, u)|, u⟩ ≤ A_{i,K+1} F_{w_{i,j}}(w, u),
  ⟨|∇_{w_k} F_{u_{ab}}(w, u)|, w_k⟩ ≤ A_{K+1,k} F_{u_{ab}}(w, u),  ⟨|∇_u F_{u_{ab}}(w, u)|, u⟩ ≤ A_{K+1,K+1} F_{u_{ab}}(w, u)

for all i, k ∈ [K], a, j ∈ [n_1], b ∈ [d] and (w, u) ∈ B_{++}. Then, for every (w, u), (w̃, ũ) ∈ B_{++} it holds

  μ(F(w, u), F(w̃, ũ)) ≤ U μ((w, u), (w̃, ũ))  with  U = max_{k∈[K+1]} (A^T γ)_k / γ_k.

Note that, by the Collatz-Wielandt ratio for nonnegative matrices, the constant U in Lemma 4 is lower bounded by the spectral radius ρ(A) of A. Indeed, by Theorem 8.1.31 in [10], we know that if A^T has a positive eigenvector γ ∈ R^{K+1}_{++}, then

  max_{i∈[K+1]} (A^T γ)_i / γ_i = ρ(A) = min_{γ̃ ∈ R^{K+1}_{++}} max_{i∈[K+1]} (A^T γ̃)_i / γ̃_i.    (5)

Therefore, in order to obtain the minimal Lipschitz constant U in Lemma 4, we choose the weights of the metric μ to be the components of γ. A combination of Theorem 2, Lemma 4 and this observation implies the following result.

Theorem 3. Let Φ ∈ C^1(V, R) ∩ C^2(B_{++}, R) with ∇Φ(S_+) ⊂ V_{++}. Let G_Φ : B_{++} → B_{++} be defined as in (3). Suppose that there exists a matrix A ∈ R^{(K+1)×(K+1)}_+ such that G_Φ and A satisfy the assumptions of Lemma 4 and A^T has a positive eigenvector γ ∈ R^{K+1}_{++}. If ρ(A) < 1, then Φ has a unique critical point (w*, u*) in S_{++}, which is the global maximum of the optimization problem (2). Moreover, the sequence ((w^k, u^k))_k defined for any (w^0, u^0) ∈ S_{++} as (w^{k+1}, u^{k+1}) = G_Φ(w^k, u^k), k ∈ N, satisfies lim_{k→∞} (w^k, u^k) = (w*, u*) and

  ‖(w^k, u^k) − (w*, u*)‖_∞ ≤ ρ(A)^k μ((w^1, u^1), (w^0, u^0)) / ( (1 − ρ(A)) min{ γ_{K+1}/ρ_u, min_{t∈[K]} γ_t/ρ_w } )  ∀ k ∈ N,

where the weights in the definition of μ are the entries of γ.

4 Application to Neural Networks

In the previous sections we have outlined the proof of our main result for a general objective function satisfying certain properties. The purpose of this section is to prove that these properties hold for our optimization problem for neural networks. We recall our objective function from (2),

  Φ(w, u) = (1/n) Σ_{i=1}^n [ −L(y_i, f(w, u)(x_i)) + Σ_{r=1}^K f_r(w, u)(x_i) ] + ε ( Σ_{r=1}^K Σ_{l=1}^{n1} w_{rl} + Σ_{l=1}^{n1} Σ_{m=1}^d u_{lm} ),

and the function class we are considering from (1),

  f_r(x) = f_r(w, u)(x) = Σ_{l=1}^{n1} w_{rl} ( Σ_{m=1}^d u_{lm} x_m )^{α_l}.

The arbitrarily small ε in the objective is needed to make the gradient strictly positive on the boundary of V_+. We note that the assumption α_i ≥ 1 for every i ∈ [n_1] is crucial in the following lemma in order to guarantee that ∇Φ is well defined on S_+.

Lemma 5. 
Let Φ be defined as in (2); then ∇Φ(w, u) is strictly positive for any (w, u) ∈ S_+.

Next, we derive the matrix A ∈ R^{(K+1)×(K+1)} needed in order to apply Theorem 3 to G_Φ with Φ defined in (2). As discussed in its proof, the matrix A given in the following theorem has a smaller spectral radius than that of Theorem 1. To express this matrix, we consider Ψ^α_{p,q} : R^{n1}_{++} × R_{++} → R_{++}, defined for p, q ∈ (1, ∞) and α ∈ R^{n1}_{++} as

  Ψ^α_{p,q}(δ, t) = ( [ Σ_{l∈J} (δ_l t^{α_l})^{pq/(q − α̲p)} ]^{1 − α̲p/q} + max_{j∈J^c} (δ_j t^{α_j})^p )^{1/p},    (6)

where J = {l ∈ [n_1] | α_l p ≤ q}, J^c = {l ∈ [n_1] | α_l p > q} and α̲ = min_{l∈J} α_l.

Theorem 4. Let Φ be defined as above and G_Φ be as in (3). Set C_w = ρ_w Ψ^α_{p'_w, p_u}(1, ρ_u ρ_x), C_u = ρ_w Ψ^α_{p'_w, p_u}(α, ρ_u ρ_x) and ρ_x = max_{i∈[n]} ‖x_i‖_{p'_u}. Then A and G_Φ satisfy all assumptions of Lemma 4 with

  A = 2 diag(p'_w − 1, . . . , p'_w − 1, p'_u − 1) ( Q_{w,w}  Q_{w,u} ; Q_{u,w}  Q_{u,u} ),

where Q_{w,w} ∈ R^{K×K}_{++}, Q_{w,u} ∈ R^{K×1}_{++}, Q_{u,w} ∈ R^{1×K}_{++} and Q_{u,u} ∈ R_{++} are defined as

  Q_{w,w} = 2C_w 11^T,  Q_{w,u} = (2C_u + ‖α‖_∞)1,
  Q_{u,w} = (2C_w + 1)1^T,  Q_{u,u} = 2C_u + ‖α‖_∞ − 1.

In the supplementary material, we prove that Ψ^α_{p,q}(δ, t) ≤ Σ_{l=1}^{n1} δ_l t^{α_l}, which yields the weaker bounds ξ_1, ξ_2 given in Theorem 1. In particular, this observation combined with Theorems 3 and 4 implies Theorem 1.

4.1 Neural networks with two hidden layers

We show how to extend our framework to neural networks with two hidden layers; the general case will be considered in future work. We briefly explain the major changes. 
Let n_1, n_2 ∈ N and α ∈ R^{n1}_{++}, β ∈ R^{n2}_{++} with α_i, β_j ≥ 1 for all i ∈ [n_1], j ∈ [n_2]. Our function class is

  f_r(x) = f_r(w, v, u)(x) = Σ_{l=1}^{n2} w_{rl} ( Σ_{m=1}^{n1} v_{lm} ( Σ_{s=1}^d u_{ms} x_s )^{α_m} )^{β_l},

and the optimization problem becomes

  max_{(w,v,u) ∈ S_+} Φ(w, v, u),    (7)

where V_+ = R^{K×n2}_+ × R^{n2×n1}_+ × R^{n1×d}_+, S_+ = { (w_1, . . . , w_K, v, u) ∈ V_+ | ‖w_i‖_{pw} = ρ_w, ‖v‖_{pv} = ρ_v, ‖u‖_{pu} = ρ_u } and

  Φ(w, v, u) = (1/n) Σ_{i=1}^n [ −L(y_i, f(x_i)) + Σ_{r=1}^K f_r(x_i) ] + ε ( Σ_{r=1}^K Σ_{l=1}^{n2} w_{rl} + Σ_{l=1}^{n2} Σ_{m=1}^{n1} v_{lm} + Σ_{m=1}^{n1} Σ_{s=1}^d u_{ms} ).

The map G_Φ : S_{++} → S_{++}, where S_{++} = {z ∈ S_+ | z > 0} and G_Φ = (G^Φ_{w_1}, . . . , G^Φ_{w_K}, G^Φ_v, G^Φ_u), becomes

  G^Φ_{w_i}(w, v, u) = ρ_w ψ_{p'_w}(∇_{w_i}Φ(w, v, u)) / ‖ψ_{p'_w}(∇_{w_i}Φ(w, v, u))‖_{pw}  ∀ i ∈ [K],    (8)

and

  G^Φ_v(w, v, u) = ρ_v ψ_{p'_v}(∇_v Φ(w, v, u)) / ‖ψ_{p'_v}(∇_v Φ(w, v, u))‖_{pv},
  G^Φ_u(w, v, u) = ρ_u ψ_{p'_u}(∇_u Φ(w, v, u)) / ‖ψ_{p'_u}(∇_u Φ(w, v, u))‖_{pu}.

We have the following equivalent of Theorem 1 for two hidden layers.

Theorem 5. Let {x_i, y_i}_{i=1}^n ⊂ R^d_+ × [K], p_w, p_v, p_u ∈ (1, ∞), ρ_w, ρ_v, ρ_u > 0, n_1, n_2 ∈ N and α ∈ R^{n1}_{++}, β ∈ R^{n2}_{++} with α_i, β_j ≥ 1 for all i ∈ [n_1], j ∈ [n_2]. Let ρ_x = max_{i∈[n]} ‖x_i‖_{p'_u}, θ = ρ_v Ψ^α_{p'_v, p_u}(1, ρ_u ρ_x), C_w = ρ_w Ψ^β_{p'_w, p_v}(1, θ), C_v = ρ_w Ψ^β_{p'_w, p_v}(β, θ), C_u = ‖α‖_∞ C_v, and define A ∈ R^{(K+2)×(K+2)}_{++} as

  A_{m,l} = 4(p'_w − 1)C_w,  ∀ m, l ∈ [K],
  A_{m,K+1} = 2(p'_w − 1)(2C_v + ‖β‖_∞),  A_{m,K+2} = 2(p'_w − 1)(2C_u + ‖α‖_∞‖β‖_∞),
  A_{K+1,l} = 2(p'_v − 1)(2C_w + 1),  A_{K+1,K+1} = 2(p'_v − 1)(2C_v + ‖β‖_∞ − 1),
  A_{K+1,K+2} = 2(p'_v − 1)(2C_u + ‖α‖_∞‖β‖_∞),
  A_{K+2,l} = 2(p'_u − 1)(2C_w + 1),  A_{K+2,K+1} = 2(p'_u − 1)(2C_v + ‖β‖_∞),
  A_{K+2,K+2} = 2(p'_u − 1)(2C_u + ‖α‖_∞‖β‖_∞ − 1).

If ρ(A) < 1, then (7) has a unique global maximizer (w*, v*, u*) ∈ S_{++}. Moreover, for every (w^0, v^0, u^0) ∈ S_{++}, there exists R > 0 such that

  lim_{k→∞} (w^k, v^k, u^k) = (w*, v*, u*)  and  ‖(w^k, v^k, u^k) − (w*, v*, u*)‖_∞ ≤ R ρ(A)^k  ∀ k ∈ N,

where (w^{k+1}, v^{k+1}, u^{k+1}) = G_Φ(w^k, v^k, u^k) for every k ∈ N and G_Φ is defined as in (8).

As in the one-hidden-layer case, for any fixed architecture ρ_w, ρ_v, ρ_u > 0, n_1, n_2 ∈ N and α ∈ R^{n1}_{++}, β ∈ R^{n2}_{++} with α_i, β_j ≥ 1 for all i ∈ [n_1], j ∈ [n_2], it is possible to derive lower bounds on p_w, p_v, p_u that guarantee ρ(A) < 1 in Theorem 5. Indeed, it holds

  C_w ≤ ζ_1 = ρ_w Σ_{j=1}^{n2} [ ρ_v Σ_{l=1}^{n1} (ρ_u ρ̃_x)^{α_l} ]^{β_j}  and  C_v ≤ ζ_2 = ρ_w Σ_{j=1}^{n2} β_j [ ρ_v Σ_{l=1}^{n1} (ρ_u ρ̃_x)^{α_l} ]^{β_j},

with ρ̃_x = max_{i∈[n]} ‖x_i‖_1. Hence, the two-hidden-layer equivalent of (4) becomes

  p_w > 4(K+2)ζ_1 + 5,  p_v > 2(K+2)[2ζ_2 + ‖β‖_∞] − 1,  p_u > 2(K+2)‖α‖_∞(2ζ_2 + ‖β‖_∞) − 1.    (9)

5 Experiments

Table 1: Test accuracy on UCI datasets.

  Dataset   | NLSM1 | NLSM2 | ReLU1 | ReLU2 | SVM
  Cancer    | 95.7  | 96.4  | 95.7  | 93.6  | 96.4
  Iris      | 100   | 90.0  | 100   | 93.3  | 96.7
  Banknote  | 100   | 97.1  | 100   | 97.8  | 96.4
  Blood     | 77.3  | 76.0  | 76.0  | 76.0  | 76.7
  Haberman  | 72.1  | 75.4  | 70.5  | 72.1  | 75.4
  Seeds     | 95.2  | 88.1  | 90.5  | 92.9  | 90.5
  Pima      | 79.2  | 79.9  | 76.6  | 79.2  | 80.5

[Figure 2: Training score (left) w.r.t. the optimal score p* and test error (right) of NLSM1 and Batch-SGD with different step-sizes.]

The shown experiments should be seen as a proof of concept. We do not yet have a good understanding of how one should pick the parameters of our model to achieve good performance. 
However, the other papers which have up to now discussed global optimality for neural networks [11, 8] have not included any results on real datasets. Thus, to our knowledge, we show for the first time a globally optimal algorithm for neural networks that leads to non-trivial classification results.

Nonlinear Spectral Method for 1 hidden layer
  Input: model n_1 ∈ N, p_w, p_u ∈ (1, ∞), ρ_w, ρ_u > 0, α_1, . . . , α_{n1} ≥ 1, ε > 0 such that the matrix A of Theorem 1 satisfies ρ(A) < 1; accuracy τ > 0 and (w^0, u^0) ∈ S_{++}.
  1: Let (w^1, u^1) = G_Φ(w^0, u^0) and compute R as in Theorem 3
  2: Repeat
  3:   (w^{k+1}, u^{k+1}) = G_Φ(w^k, u^k)
  4:   k ← k + 1
  5: Until k ≥ ln(τ/R) / ln(ρ(A))
  Output: (w^k, u^k) fulfills ‖(w^k, u^k) − (w*, u*)‖_∞ < τ.
Here G_Φ is defined as in (3). The method for two hidden layers is similar: consider G_Φ as in (8) instead of (3) and assume that the model satisfies Theorem 5.

We test our methods on several low-dimensional UCI datasets and denote our algorithms as NLSM1 (one hidden layer) and NLSM2 (two hidden layers). We choose the parameters of our model out of 100 randomly generated combinations of (n_1, α, ρ_w, ρ_u) ∈ [2, 20] × [1, 4] × (0, 1]^2 (respectively (n_1, n_2, α, β, ρ_w, ρ_v, ρ_u) ∈ [2, 10]^2 × [1, 4]^2 × (0, 1]^3) and pick the best one based on 5-fold cross-validation error. We use Equation (4) (resp. Equation (9)) to choose p_u, p_w (resp. p_u, p_v, p_w) so that every generated model satisfies the conditions of Theorem 1 (resp. Theorem 5), i.e. ρ(A) < 1. 
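The update G_Φ in (3) is simple to implement once the gradient of Φ is available. Below is a minimal sketch of one iteration, assuming a user-supplied gradient oracle grad_phi (a hypothetical helper, standing in for the closed-form gradient of (2)) that returns (∇_w Φ, ∇_u Φ) with strictly positive entries:

```python
import numpy as np

def psi(x, p):
    """psi_p(x) = sign(x) * |x|**(p - 1), applied componentwise."""
    return np.sign(x) * np.abs(x) ** (p - 1.0)

def nlsm_step(w, u, grad_phi, p_w, p_u, rho_w, rho_u):
    """One application of the mapping G_Phi in (3).

    grad_phi(w, u) -> (gw, gu) is assumed to return the gradients of Phi
    w.r.t. w (shape K x n1) and u (shape n1 x d), strictly positive on S_++.
    """
    pw_c = p_w / (p_w - 1.0)           # Hoelder conjugate p'_w
    pu_c = p_u / (p_u - 1.0)           # Hoelder conjugate p'_u
    gw, gu = grad_phi(w, u)
    w_new = np.empty_like(w)
    for i in range(w.shape[0]):        # normalize each output unit separately
        z = psi(gw[i], pw_c)
        w_new[i] = rho_w * z / np.linalg.norm(z.ravel(), ord=p_w)
    z = psi(gu, pu_c)
    u_new = rho_u * z / np.linalg.norm(z.ravel(), ord=p_u)
    return w_new, u_new
```

By construction every iterate stays on the spheres ‖w_i‖_{pw} = ρ_w and ‖u‖_{pu} = ρ_u, i.e. in S_{++}, which is exactly the invariance the convergence proof relies on.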
Thus, global optimality is guaranteed in all our experiments. For comparison, we use the nonlinear RBF-kernel SVM and implement two versions of Rectified Linear Unit networks: one with one hidden layer (ReLU1) and one with two hidden layers (ReLU2). To train the ReLU networks, we use a stochastic gradient descent method which minimizes the sum of the logistic loss and an L2 regularization term over the weight matrices to avoid over-fitting. All parameters of each method are jointly cross-validated. More precisely, for ReLU the number of hidden units takes values from 2 to 20, and the step-sizes and regularizers are taken in {10^{-6}, 10^{-5}, . . . , 10^2} and {0, 10^{-4}, 10^{-3}, . . . , 10^4} respectively. For the SVM, the hyperparameter C and the parameter γ of the radial basis function kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖^2) are taken from {2^{-5}, 2^{-4}, . . . , 2^{20}} and {2^{-15}, 2^{-14}, . . . , 2^3} respectively. Note that the ReLU networks allow negative weights while our models do not. The results presented in Table 1 show that overall our nonlinear spectral methods achieve slightly worse performance than the kernel SVM while being competitive with, or slightly better than, the ReLU networks. Notably, in the case of Cancer, Haberman and Pima, NLSM2 outperforms all the other models. For Iris and Banknote, we note that without any constraints ReLU1 can easily find an architecture which achieves zero test error, while this is difficult for our models as we impose constraints on the architecture in order to prove global optimality.

We compare our algorithms with Batch-SGD in order to optimize (2), with the batch size being 5% of the training data and a fixed step-size selected between 10^{-2} and 10^2. At each iteration of our spectral method and each epoch of Batch-SGD, we compute the objective and the test error of each method and show the results in Figure 2. 
One can see that our method is much faster than the SGD variants and exhibits a linear convergence rate. We noted in our experiments that, as α is large and our data lies in [0, 1], all units in the network tend to have small values, which makes the whole objective function relatively small. Thus, a relatively large change in (w, u) may cause only a small change in the objective function, while the performance may vary significantly since the distance in parameter space is large. In other words, a small change in the objective may have been caused by a large change in the parameter space and can therefore strongly influence the performance, which explains the behavior of the SGD variants in Figure 2.
The magnitude of the entries of the matrix A in Theorems 1 and 5 grows with the number of hidden units, and thus the spectral radius ρ(A) also increases with this number. As we expect the number of required hidden units to grow with the dimension of the dataset, we have limited ourselves in the experiments to low-dimensional datasets. However, these bounds are likely not tight, so there may be room for improvement in terms of the dependency on the number of hidden units.

Acknowledgment
The authors acknowledge support by the ERC starting grant NOLEPRO 307793.

References
[1] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, New York, 1999.
[2] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In ICML, 2014.
[3] A. Berman and R. J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. SIAM, Philadelphia, 1994.
[4] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Mass., 1999.
[5] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[6] A. Daniely, R. Frostig, and Y. Singer.
Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity, 2016. arXiv:1602.05897v1.
[7] A. Gautier, F. Tudisco, and M. Hein. The Perron-Frobenius Theorem for Multi-Homogeneous Maps. In preparation, 2016.
[8] B. D. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond, 2015. arXiv:1506.07540v1.
[9] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, 2016.
[10] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, second edition, 2013.
[11] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods, 2015. arXiv:1506.08473v3.
[12] W. A. Kirk and M. A. Khamsi. An Introduction to Metric Spaces and Fixed Point Theory. John Wiley, New York, 2001.
[13] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521, 2015.
[14] B. Lemmens and R. D. Nussbaum. Nonlinear Perron-Frobenius Theory. Cambridge University Press, New York, 2012.
[15] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In NIPS, pages 855–863, 2014.
[16] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[17] J. Šíma. Training a single sigmoidal neuron is hard. Neural Computation, 14:2709–2728, 2002.
[18] A. C. Thompson.
On certain contraction mappings in a partially ordered vector space. Proceedings of the American Mathematical Society, 14:438–443, 1963.