{"title": "Lipschitz regularity of deep neural networks: analysis and efficient estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 3835, "page_last": 3844, "abstract": "Deep neural networks are notorious for being sensitive to small well-chosen perturbations, and estimating the regularity of such architectures is of utmost importance for safe and robust practical applications. In this paper, we investigate one of the key characteristics to assess the regularity of such methods: the Lipschitz constant of deep learning architectures. First, we show that, even for two layer neural networks, the exact computation of this quantity is NP-hard and state-of-art methods may significantly overestimate it. Then, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. We provide a power method algorithm working with automatic differentiation, allowing efficient computations even on large convolutions. Second, for sequential neural networks, we propose an improved algorithm named SeqLip that takes advantage of the linear computation graph to split the computation per pair of consecutive layers. Third we propose heuristics on SeqLip in order to tackle very large networks. Our experiments show that SeqLip can significantly improve on the existing upper bounds. Finally, we provide an implementation of AutoLip in the PyTorch environment that may be used to better estimate the robustness of a given neural network to small perturbations or regularize it using more precise Lipschitz estimations. 
These results also hint at the difficulty of estimating the Lipschitz constant of deep networks.", "full_text": "Lipschitz regularity of deep neural networks:\n\nanalysis and efficient estimation\n\nKevin Scaman\n\nHuawei Noah’s Ark Lab\n\nkevin.scaman@huawei.com\n\nAladin Virmaux\n\nHuawei Noah’s Ark Lab\n\naladin.virmaux@huawei.com\n\nAbstract\n\nDeep neural networks are notorious for being sensitive to small well-chosen perturbations, and estimating the regularity of such architectures is of utmost importance for safe and robust practical applications. In this paper, we investigate one of the key characteristics to assess the regularity of such methods: the Lipschitz constant of deep learning architectures. First, we show that, even for two-layer neural networks, the exact computation of this quantity is NP-hard and state-of-the-art methods may significantly overestimate it. Then, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. We provide a power method algorithm working with automatic differentiation, allowing efficient computations even on large convolutions. Second, for sequential neural networks, we propose an improved algorithm named SeqLip that takes advantage of the linear computation graph to split the computation per pair of consecutive layers. Third, we propose heuristics on SeqLip in order to tackle very large networks. Our experiments show that SeqLip can significantly improve on the existing upper bounds. 
Finally, we provide an implementation of AutoLip in the PyTorch environment that may be used to better estimate the robustness of a given neural network to small perturbations or regularize it using more precise Lipschitz estimations.\n\n1\n\nIntroduction\n\nDeep neural networks made a striking entrance in machine learning and quickly became state-of-the-art algorithms in many tasks such as computer vision [1, 2, 3, 4], speech recognition and generation [5, 6] or natural language processing [7, 8].\nHowever, deep neural networks are known for being very sensitive to their input, and adversarial examples provide a good illustration of their lack of robustness [9, 10]. Indeed, a well-chosen small perturbation of the input image can mislead a neural network and significantly decrease its classification accuracy. One metric to assess the robustness of neural networks to small perturbations is the Lipschitz constant (see Definition 1), which upper bounds the relationship between input perturbation and output variation for a given distance. For generative models, the recent Wasserstein GAN [11] improved the training stability of GANs by reformulating the optimization problem as a minimization of the Wasserstein distance between the real and generated distributions [12]. However, this method relies on an efficient way of constraining the Lipschitz constant of the critic, which was only partially addressed in the original paper and was the object of several follow-up works [13, 14].\nRecently, Lipschitz continuity was used to improve the state of the art in several deep learning topics: (1) for robust learning, avoiding adversarial attacks was achieved in [15] by constraining local Lipschitz constants in neural networks. (2) For generative models, using spectral normalization on each layer allowed [13] to successfully train a GAN on the ILSVRC2012 dataset. 
(3) In deep learning theory, novel generalization bounds critically rely on the Lipschitz constant of the neural network [16, 17, 18].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nTo the best of our knowledge, the first upper bound on the Lipschitz constant of a neural network was described in [9, Section 4.3], as the product of the spectral norms of linear layers (a special case of our generic algorithm, see Proposition 1). More recently, the Lipschitz constant of scatter networks was analyzed in [19]. Unfortunately, this analysis does not extend to more general architectures.\nOur aim in this paper is to provide a rigorous and practice-oriented study of how the Lipschitz constants of neural networks and automatically differentiable functions may be estimated. We first precisely define the notion of Lipschitz constant of vector-valued functions in Section 2, and then show in Section 3 that its estimation is NP-hard, even for 2-layer Multi-Layer Perceptrons (MLPs). In Section 4, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. Moreover, we show how the Lipschitz constant of most neural network layers may be computed efficiently using automatic differentiation algorithms [20] and libraries such as PyTorch [21]. Notably, we extend the power method to convolution layers using automatic differentiation to speed up the computations. In Section 6, we provide a theoretical analysis of AutoLip in the case of sequential neural networks, and show that the upper bound may lose a multiplicative factor per activation layer, which may significantly downgrade the estimation quality of AutoLip and lead to a very large and unrealistic upper bound. 
In order to prevent this, we propose an improved algorithm called SeqLip in the case of sequential neural networks, and show in Section 7 that SeqLip may significantly improve on AutoLip. Finally, we discuss the proposed algorithms on the AlexNet [1] neural network for computer vision.1\n\n2 Background and notations\nIn the following, we denote as ⟨x, y⟩ and ‖x‖_2 the scalar product and L2-norm of the Hilbert space ℝ^n, x ⊙ y the coordinate-wise product of x and y, and f ∘ g the composition between the functions f : ℝ^k → ℝ^m and g : ℝ^n → ℝ^k. For any differentiable function f : ℝ^n → ℝ^m and any point x ∈ ℝ^n, we will denote as D_x f ∈ ℝ^{m×n} the differential operator of f at x, also called the Jacobian matrix. Note that, in the case of real-valued functions (i.e. m = 1), the gradient of f is the transpose of the differential operator: ∇f(x) = (D_x f)^⊤. Finally, diag_{n,m}(x) ∈ ℝ^{n×m} is the rectangular matrix with x ∈ ℝ^{min{n,m}} along the diagonal and 0 outside of it. When unambiguous, we will use the notation diag(x) instead of diag_{n,m}(x). All proofs are available as supplemental material.\nDefinition 1. A function f : ℝ^n → ℝ^m is called Lipschitz continuous if there exists a constant L such that\n\n∀x, y ∈ ℝ^n, ‖f(x) − f(y)‖_2 ≤ L‖x − y‖_2.\n\nThe smallest L for which the previous inequality is true is called the Lipschitz constant of f and will be denoted L(f).\n\nFor locally Lipschitz functions (i.e. functions whose restriction to some neighborhood around any point is Lipschitz), the Lipschitz constant may be computed using the differential operator.\nTheorem 1 (Rademacher [22, Theorem 3.1.6]). If f : ℝ^n → ℝ^m is a locally Lipschitz continuous function, then f is differentiable almost everywhere. 
Moreover, if f is Lipschitz continuous, then\n\nL(f) = sup_{x∈ℝ^n} ‖D_x f‖_2,  (1)\n\nwhere ‖M‖_2 = sup_{x : ‖x‖=1} ‖Mx‖_2 is the operator norm of the matrix M ∈ ℝ^{m×n}.\nIn particular, if f is real valued (i.e. m = 1), its Lipschitz constant is the maximum norm of its gradient on its domain: L(f) = sup_x ‖∇f(x)‖_2. Note that the supremum in Theorem 1 is a slight abuse of notation, since the differential D_x f is defined almost everywhere in ℝ^n, except for a set of Lebesgue measure zero.\n\n1The code used in this paper is available at https://github.com/avirmaux/lipEstimation.\n\n3 Exact Lipschitz computation is NP-hard\n\nIn this section, we show that the exact computation of the Lipschitz constant of neural networks is NP-hard, hence motivating the need for good approximation algorithms. More precisely, upper bounds are in this case more valuable, as they ensure that the variation of the function, when subject to an input perturbation, remains small. A neural network is, in essence, a succession of linear operators and non-linear activation functions. The most simplistic model of neural network is the Multi-Layer Perceptron (MLP) as defined below.\nDefinition 2 (MLP). A K-layer Multi-Layer Perceptron f_MLP : ℝ^n → ℝ^m is the function\n\nf_MLP(x) = T_K ∘ ρ_{K−1} ∘ ··· ∘ ρ_1 ∘ T_1(x),\n\nwhere T_k : x ↦ M_k x + b_k is an affine function and ρ_k : x ↦ (g_k(x_i))_{i∈⟦1,n_k⟧} is a non-linear activation function.\n\nMany standard deep network architectures (e.g. CNNs) follow, to some extent, the MLP structure. It turns out that even for 2-layer MLPs, the computation of the Lipschitz constant is NP-hard.\nProblem 1 (LIP-CST). 
LIP-CST is the decision problem associated with the exact computation of the Lipschitz constant of a 2-layer MLP with ReLU activation layers.\n\nInput: Two matrices M_1 ∈ ℝ^{l×n} and M_2 ∈ ℝ^{m×l}, and a constant ℓ ≥ 0.\nQuestion: Let f = M_2 ∘ ρ ∘ M_1, where ρ(x) = max{0, x} is the ReLU activation function. Is the Lipschitz constant L(f) ≤ ℓ?\n\nTheorem 2 shows that, even for extremely simple neural networks, exact Lipschitz computation is not achievable in polynomial time (assuming that P ≠ NP). The proof of Theorem 2 is available in the supplemental material.\nTheorem 2. Problem 1 is NP-hard.\n\nTheorem 2 relies on a reduction from the NP-hard problem of quadratic concave minimization on a hypercube, by considering well-chosen matrices M_1 and M_2.\n\n4 AutoLip: a Lipschitz upper bound through automatic differentiation\n\nEfficient implementations of backpropagation in modern deep learning libraries such as PyTorch [21] or TensorFlow [23] rely on the concept of automatic differentiation [24, 20]. Simply put, automatic differentiation is a principled approach to the computation of gradients and differential operators of functions resulting from K successive operations.\nDefinition 3. A function f : ℝ^n → 
ℝ^m is computable in K operations if it is the result of K simple functions in the following way: there exist functions (θ_1, ..., θ_K) of the input x and functions (g_1, ..., g_K), where g_k is a function of (θ_i)_{i≤k−1}, such that\n\nθ_0(x) = x, ∀k ∈ ⟦1, K⟧, θ_k(x) = g_k(x, θ_1(x), ..., θ_{k−1}(x)), θ_K(x) = f(x).  (2)\n\nFigure 1: Example of a computation graph for f_ω(x) = ln(1 + e^{x/2}) + |x/2 − ω sin(x)|. The nodes are θ_0 = x, θ_1 = θ_0/2 (via g_1), θ_2 = ω (constant), θ_3 = sin(θ_0) (via g_3), θ_4 = θ_1 − θ_2 θ_3 (via g_4), θ_5 = ln(1 + e^{θ_1}) (via g_5), θ_6 = |θ_4| (via g_6) and θ_7 = θ_5 + θ_6 (via g_7).\n\nAlgorithm 1 AutoLip\nInput: function f : ℝ^n → ℝ^m and its computation graph (g_1, ..., g_K)\nOutput: upper bound on the Lipschitz constant: L̂_AL ≥ L(f)\n1: Z = {(z_0, ..., z_K) : ∀k ∈ ⟦0, K⟧, θ_k is constant ⇒ z_k = θ_k(0)}\n2: L_0 ← 1\n3: for k = 1 to K do\n4:   L_k ← max_{z∈Z} Σ_{i=0}^{k−1} ‖∂_i g_k(z)‖_2 L_i\n5: end for\n6: return L̂_AL = L_K\n\nWe assume that these operations are all locally Lipschitz continuous, and that their partial derivatives ∂_i g_k(x) can be computed and efficiently maximized. This assumption is discussed in Section 5 for the main operations used in neural networks. When the function is real valued (i.e. m = 1), the backpropagation algorithm makes it possible to compute its gradient efficiently, in time proportional to the number of operations K [25]. For the computation of the Lipschitz constant L(f), a forward propagation through the computation graph is sufficient. 
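As an illustration, the forward pass of Alg. 1 can be hand-unrolled on the computation graph of Figure 1. The following minimal sketch is ours (the function name autolip_fw and the explicit per-node bounds are illustrative, not part of the released lipEstimation code):

```python
# Hand-unrolled AutoLip forward pass (Alg. 1) on the graph of Figure 1,
#   f_w(x) = ln(1 + e^(x/2)) + |x/2 - w*sin(x)|,  with w >= 0.
# Each node k keeps an upper bound L[k] on the Lipschitz constant of
# x -> theta_k(x): the sum over the node's inputs i of
# sup|d g_k / d theta_i| * L[i].

def autolip_fw(w):
    L = [0.0] * 8
    L[0] = 1.0              # theta_0 = x
    L[1] = 0.5 * L[0]       # theta_1 = theta_0 / 2
    L[2] = 0.0              # theta_2 = w, constant w.r.t. x
    L[3] = 1.0 * L[0]       # theta_3 = sin(theta_0), |cos| <= 1
    L[4] = L[1] + w * L[3]  # theta_4 = theta_1 - theta_2 * theta_3
    L[5] = 1.0 * L[1]       # theta_5 = ln(1 + e^theta_1), derivative in (0, 1)
    L[6] = 1.0 * L[4]       # theta_6 = |theta_4|, absolute value is 1-Lipschitz
    L[7] = L[5] + L[6]      # theta_7 = theta_5 + theta_6
    return L[7]             # equals 1 + w
```

For any ω ≥ 0, this forward pass returns 1 + ω, the bound derived in Eq. (5) below.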
More specifically, the chain rule immediately implies\n\nD_x θ_k = Σ_{i=0}^{k−1} ∂_i g_k(θ_0(x), ..., θ_{k−1}(x)) D_x θ_i,  (3)\n\nand taking the norm and then maximizing over all possible values of θ_i(x) leads to the AutoLip algorithm described in Alg. 1. This algorithm is an extension of the well-known product of operator norms for MLPs (see e.g. [13]) to any function computable in K operations.\nProposition 1. For any MLP (see Definition 2) with 1-Lipschitz activation functions (e.g. ReLU, Leaky ReLU, SoftPlus, Tanh, Sigmoid, ArcTan or Softsign), the AutoLip upper bound becomes\n\nL̂_AL = Π_{k=1}^{K} ‖M_k‖_2.\n\nNote that, when an intermediate function θ_k does not depend on x, it is not necessary to take a maximum over all possible values of θ_k(x). To this end, we define the set of feasible intermediate values as\n\nZ = {(z_0, ..., z_K) : ∀k ∈ ⟦0, K⟧, θ_k is constant ⇒ z_k = θ_k(0)},  (4)\n\nand only maximize partial derivatives over this set. In practice, this is equivalent to removing branches of the computation graph that are not reachable from node 0 and replacing them by constant values. To illustrate this definition, consider a simple matrix product operation f(x) = Wx. One possible computation graph for f is θ_0 = x, θ_1 = W and θ_2 = g_2(θ_0, θ_1) = θ_1 θ_0. While the quadratic function g_2 is not Lipschitz continuous, its derivative w.r.t. θ_0 is bounded: ∂_0 g_2(θ_0, θ_1) = θ_1 = W. Since θ_1 is constant with respect to x, the set Z fixes z_1 = θ_1(0) = W, and the algorithm returns the exact Lipschitz constant L̂_AL = L(f) = ‖W‖_2.\nExample. We consider the graph depicted in Figure 1. Since θ_2 is a constant w.r.t. x, we can replace it by its value ω in all other nodes. 
Then, the AutoLip algorithm runs as follows:\n\nL̂_AL = L_7 = L_5 + L_6 = L_1 + L_4 = 2L_1 + ωL_3 = 1 + ω.  (5)\n\nNote that, in this example, the Lipschitz upper bound L̂_AL matches the exact Lipschitz constant L(f_ω) = 1 + ω.\n\n5 Lipschitz constants of typical neural network layers\nLinear and convolution layers. The Lipschitz constant of an affine function f : x ↦ Mx + b, where M ∈ ℝ^{m×n} and b ∈ ℝ^m, is the largest singular value of its associated matrix M, which may be computed efficiently, up to a given precision, using the power method [26]. In the case of convolutions, the associated matrix may be difficult to access and high dimensional, hence making the direct use of the power method impractical. To circumvent this difficulty, we extend the power method to any affine function on which automatic differentiation can be used (e.g. linear or convolution layers of neural networks), by noting that the only matrix multiplication of the power method, M^⊤Mx, can be computed by differentiating a well-chosen function.\nLemma 1. Let M ∈ ℝ^{m×n}, b ∈ ℝ^m and f : x ↦ Mx + b be an affine function. Then, for all x ∈ ℝ^n, we have\n\nM^⊤Mx = ∇g(x), where g(x) = (1/2)‖f(x) − f(0)‖_2^2.\n\nProof. By definition, g(x) = (1/2)‖Mx‖_2^2, and differentiating this equation leads to the desired result.\n\nAlgorithm 2 AutoGrad compliant power method\nInput: affine function f : ℝ^n → ℝ^m, number of iterations N\nOutput: approximation of the Lipschitz constant L(f)\n1: v ← random unit vector\n2: for k = 1 to N do\n3:   v ← ∇g(v) where g(x) = (1/2)‖f(x) − f(0)‖_2^2\n4:   λ ← ‖v‖_2\n5:   v ← v/λ\n6: end for\n7: return L(f) ≈ ‖f(v) − f(0)‖_2\n\nThe full algorithm is described in Alg. 2. Note that this algorithm is fully compliant with any dynamic graph deep learning library such as PyTorch. 
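As a concrete illustration of Lemma 1 and Alg. 2, the following self-contained sketch runs the power method using only calls to f and to the gradient of g(x) = (1/2)‖f(x) − f(0)‖²; here the gradient M^⊤(f(x) − f(0)) is written in closed form for a small explicit matrix, whereas in PyTorch that line would be produced by autograd. All helper names are ours:

```python
import math

def power_method_affine(f, grad_g, dim, n_iter=200):
    # Alg. 2 sketch: estimate the largest singular value (the Lipschitz
    # constant) of an affine map f(x) = Mx + b, using only f and
    # grad_g(x) = M^T M x, the gradient of g(x) = 0.5*||f(x) - f(0)||^2.
    v = [1.0] * dim                                # any vector with a component
    for _ in range(n_iter):                        # along the top singular vector
        v = grad_g(v)                              # v <- M^T M v
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    f0 = f([0.0] * dim)
    fv = f(v)
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(fv, f0)))  # ||Mv||_2

# Toy affine map f(x) = Mx + b with M = diag(3, 1): Lipschitz constant 3.
M = [[3.0, 0.0], [0.0, 1.0]]
b = [0.5, -0.2]

def f(x):
    return [sum(M[i][j] * x[j] for j in range(2)) + b[i] for i in range(2)]

def grad_g(x):
    # Closed form standing in for autograd: M^T (f(x) - f(0)) = M^T M x.
    f0 = f([0.0, 0.0])
    diff = [a - c for a, c in zip(f(x), f0)]
    return [sum(M[i][j] * diff[i] for i in range(2)) for j in range(2)]
```

Calling power_method_affine(f, grad_g, 2) converges to 3, the top singular value of M, without ever forming M inside the iteration.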
The gradient of the squared norm may be computed through autograd, and the gradient of L(f) may be computed the same way without any additional programming effort. Note that the gradients w.r.t. M may also be computed with the closed-form formula ∇_M σ = u_1 v_1^⊤, where u_1 and v_1 are respectively the left and right singular vectors of M associated to the singular value σ [27]. The same algorithm may be straightforwardly iterated to compute the k largest singular values.\n\nOther layers. Most activation functions such as ReLU, Leaky ReLU, SoftPlus, Tanh, Sigmoid, ArcTan or Softsign, as well as max-pooling, have a Lipschitz constant equal to 1. Other common neural network layers such as dropout, batch normalization and other pooling methods all have simple and explicit Lipschitz constants. We refer the reader to e.g. [28] for more information on this subject.\n\n6 Sequential neural networks\n\nDespite its generality, AutoLip may be subject to large errors due to the multiplication of smaller errors at each iteration of the algorithm. In this section, we improve on the AutoLip upper bound by a more refined analysis of deep learning architectures in the case of MLPs. More specifically, the Lipschitz constant of an MLP has an explicit formula using Theorem 1 and the chain rule:\n\nL(f_MLP) = sup_{x∈ℝ^n} ‖M_K diag(g′_{K−1}(θ_{K−1})) M_{K−1} ··· M_2 diag(g′_1(θ_1)) M_1‖_2,  (6)\n\nwhere θ_k = T_k ∘ ρ_{k−1} ∘ ··· ∘ ρ_1 ∘ T_1(x) is the intermediate output after k linear layers.\nConsidering Proposition 1 and Eq. (6), the equality L̂_AL = L(f_MLP) only takes place if all activation layers diag(g′_k(θ_k)) map the first singular vector of M_k to the first singular vector of M_{k+1}, by the Cauchy-Schwarz inequality. 
However, differential operators of activation layers, being diagonal matrices, can only have a limited effect on input vectors, and in practice, first singular vectors will tend to misalign, leading to a drop in the Lipschitz constant of the MLP. This is the intuition behind SeqLip, an improved algorithm for Lipschitz constant estimation for MLPs.\n\n6.1 SeqLip, an improved algorithm for MLPs\n\nIn Eq. (6), the diagonal matrices diag(g′_k(θ_k)) are difficult to evaluate, as they may depend on the input value x and previous layers. Fortunately, as stated in Section 5, most major activation functions are 1-Lipschitz. More specifically, these activation functions have a derivative g′_k(x) ∈ [0, 1]. Hence, we may replace the supremum on the input vector x by a supremum over all possible values:\n\nL(f_MLP) ≤ max_{∀i, σ_i∈[0,1]^{n_i}} ‖M_K diag(σ_{K−1}) ··· diag(σ_1) M_1‖_2,  (7)\n\nwhere σ_i corresponds to all possible derivatives of the activation gate. Solving the right-hand side of Eq. 
(7) is still a hard problem, and the high dimensionality of the search space σ ∈ [0,1]^{Σ_i n_i} makes purely combinatorial approaches prohibitive even for small neural networks. In order to decrease the complexity of the problem, we split the operator norm in K − 1 parts using the SVD decomposition of each matrix M_i = U_i Σ_i V_i^⊤ and the submultiplicativity of the operator norm:\n\nL(f_MLP) ≤ max_{∀i, σ_i∈[0,1]^{n_i}} ‖Σ_K V_K^⊤ diag(σ_{K−1}) U_{K−1} Σ_{K−1} V_{K−1}^⊤ ··· diag(σ_1) U_1 Σ_1‖_2\n\n≤ Π_{i=1}^{K−1} max_{σ_i∈[0,1]^{n_i}} ‖Σ̃_{i+1} V_{i+1}^⊤ diag(σ_i) U_i Σ̃_i‖_2,\n\nwhere Σ̃_i = Σ_i if i ∈ {1, K} and Σ̃_i = Σ_i^{1/2} otherwise. Each activation layer can now be solved independently, leading to the SeqLip upper bound:\n\nL̂_SL = Π_{i=1}^{K−1} max_{σ_i∈[0,1]^{n_i}} ‖Σ̃_{i+1} V_{i+1}^⊤ diag(σ_i) U_i Σ̃_i‖_2.  (8)\n\nWhen the activation layers are ReLU and the inner layers are small (n_i ≤ 20), the gradients are g′_k ∈ {0, 1} and we may explore the entire search space σ_i ∈ {0,1}^{n_i} using a brute-force combinatorial approach. Otherwise, a gradient ascent may be used, computing gradients via the power method described in Alg. 2. In our experiments, we call this heuristic Greedy SeqLip, and we verified that the incurred error is at most 1% whenever the exact optimum is computable. 
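For a 2-layer ReLU network, where the chain contains a single activation layer, the brute-force search amounts to evaluating the right-hand side of Eq. (7) over all gate patterns σ ∈ {0,1}^l. A plain-Python sketch with illustrative names (the actual implementation additionally uses the SVD splitting of Eq. (8) and autograd-based spectral norms):

```python
import itertools, math

def matmul(A, B):
    # Dense matrix product on nested lists.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def spectral_norm(A, n_iter=200):
    # Largest singular value via power iteration on A^T A.
    m, n = len(A), len(A[0])
    v = [1.0] * n
    for _ in range(n_iter):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]  # A^T u
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:            # A maps everything to zero
            return 0.0
        v = [c / norm for c in w]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    return math.sqrt(sum(c * c for c in u))

def seqlip_2layer(M1, M2):
    # max over ReLU gate patterns sigma in {0,1}^l of ||M2 diag(sigma) M1||_2.
    l = len(M1)
    best = 0.0
    for sigma in itertools.product([0.0, 1.0], repeat=l):
        D = [[sigma[i] if i == j else 0.0 for j in range(l)] for i in range(l)]
        best = max(best, spectral_norm(matmul(matmul(M2, D), M1)))
    return best
```

On a toy pair M1 = diag(2, 1), M2 = (0 1), whose first singular vectors are fully misaligned, the AutoLip product ‖M2‖‖M1‖ gives 2 while the gate search gives 1, matching the true Lipschitz constant of x ↦ ReLU(x_2).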
Finally, when the dimension of the layer is too large to compute a whole SVD, we perform a low-rank approximation of the matrix M_i by retaining the first E eigenvectors (E = 200 in our experiments).\n\n6.2 Theoretical analysis of SeqLip\n\nIn order to better understand how SeqLip may improve on AutoLip, we now consider a simple setting in which all linear layers have a large difference between their first and second singular values. For simplicity, we also assume that activation functions have a derivative g′_k(x) ∈ [0, 1], although the following results easily generalize as long as the derivative remains bounded. Then, the following theorem holds.\nTheorem 3. Let M_k be the matrix associated to the k-th linear layer, u_k (resp. v_k) its first left (resp. right) singular vector, and r_k = s_{k,2}/s_{k,1} the ratio between its second and first singular values. Then, we have\n\nL̂_SL ≤ L̂_AL Π_{k=1}^{K−1} √( (1 − r_k − r_{k+1}) max_{σ∈[0,1]^{n_k}} ⟨σ ⊙ v_{k+1}, u_k⟩² + r_k + r_{k+1} + r_k r_{k+1} ).\n\nNote that max_{σ∈[0,1]^{n_k}} ⟨σ ⊙ v_{k+1}, u_k⟩² ≤ 1 and, when the ratios r_k are negligible, then\n\nL̂_SL ≤ L̂_AL Π_{k=1}^{K−1} max_{σ∈[0,1]^{n_k}} |⟨σ ⊙ v_{k+1}, u_k⟩|.  (9)\n\nIntuitively, each activation layer may align u_k to v_{k+1} only to a certain extent. Moreover, when the two singular vectors u_k and v_{k+1} are not too similar, this quantity can be substantially smaller than 1. To illustrate this idea, we now show that max_{σ∈[0,1]^{n_k}} |⟨σ ⊙ v_{k+1}, u_k⟩| is of the order of 1/π if the two vectors are randomly chosen on the unit sphere.\nLemma 2. Let u, v ∈ ℝ^n be two independent random vectors taken uniformly on the unit sphere S^{n−1} = {x ∈ ℝ^n : ‖x‖_2 = 1}. 
Then we have\n\nmax_{σ∈[0,1]^n} |⟨σ ⊙ u, v⟩| → 1/π as n → +∞, almost surely.\n\nIntuitively, when the ratios between the second and first singular values are sufficiently small, each activation layer decreases the Lipschitz constant by a factor 1/π and\n\nL̂_SL ≈ L̂_AL / π^{K−1}.  (10)\n\nFor example, for K = 5 linear layers, we have π^{K−1} ≈ 100 and a large improvement may be expected for SeqLip compared to AutoLip. Of course, in a more realistic setting, the eigenvectors of different layers are not independent and, more importantly, the ratio between the second and first eigenvalues may not be sufficiently small. However, this simple setting provides us with the best improvement one can hope for, and our experiments in Section 7 show that at least part of the suboptimality of AutoLip is due to the misalignment of eigenvectors.\n\n7 Experiments\n\nAs stated in Theorem 2, computing the Lipschitz constant is an NP-hard problem. However, in low dimension (e.g. d ≤ 5), optimizing the problem in Eq. (1) can be performed efficiently using a simple grid search. This provides a baseline to compare the different estimation algorithms. In high dimension, grid search is intractable and we consider several other estimation methods: (1) grid search for Eq. (1), (2) simulated annealing for Eq. (1), (3) product of Frobenius norms of linear layers [13], (4) product of spectral norms [13] (equivalent to AutoLip in the case of MLPs). Note that, for MLPs with ReLU activations, first-order optimization methods such as SGD are not usable because the function to optimize in Eq. (1) is piecewise constant. Methods (1) and (2) return lower bounds, while (3) and (4) return upper bounds on the Lipschitz constant.\n\nIdeal scenario. 
We first show the improvement of SeqLip over AutoLip in an ideal setting where inner layers have a low eigenvalue ratio r_k and uncorrelated leading eigenvectors. To do so, we construct an MLP with weight matrices M_i = U_i diag(λ) V_i^⊤ such that U_i, V_i are random orthogonal matrices and λ_1 = 1, λ_{i>1} = r, where r ∈ [0, 1] is the ratio between the second and first eigenvalue. Figure 2 shows the decrease of SeqLip as the number of layers of the MLP increases (each layer has 100 neurons). The theoretical limit is tight for small eigenvalue ratios. Note that the AutoLip upper bound is always 1 since, by construction of the network, all layers have a spectral radius equal to one.\n\nFigure 2: SeqLip in the ideal scenario.\n\nMLP. We construct a 2-dimensional dataset from a Gaussian process with RBF kernel, mean 0 and variance 1. We use 15000 generated points as a synthetic dataset. An example of such a dataset may be seen in Figure 3. We train MLPs of several depths, with 20 neurons at each layer, on the synthetic dataset with MSE loss and ReLU activations. Note that in all simulations, greedy SeqLip is within a 0.01% error compared to SeqLip, which justifies its usage in higher dimension.\n\nFigure 3: Synthetic function used to train MLPs.\n\n# layers | Frobenius | AutoLip | SeqLip | Greedy SeqLip | Dataset | Annealing | Grid Search\n4 | 648.2 | 33.04 | 21.47 | 21.47 | 4.36 | 4.55 | 6.56\n5 | 4283.1 | 134.4 | 72.87 | 72.87 | 6.77 | 5.8 | 7.1\n7 | 22341 | 294.6 | 130.2 | 130.2 | 5.4 | 5.27 | 6.51\n10 | 7343800 | 19248.2 | 2463.44 | 2463.36 | 10.04 | 5.77 | 17.1\n(Frobenius, AutoLip, SeqLip and Greedy SeqLip are upper bounds; Dataset, Annealing and Grid Search are lower bounds.)\n\nFigure 4: AutoLip and SeqLip for MLPs of various sizes.\n\nFirst, since the dimension is low (d = 2), grid search returns a very good approximation of the Lipschitz constant, while simulated annealing is suboptimal, probably due to the presence of local maxima. 
For upper bounds, SeqLip outperforms its competitors, reducing the gap between upper bounds and, in this case, the true Lipschitz constant computed using grid search.\n\nCNN. We construct simple CNNs with an increasing number of layers that we train independently on the MNIST dataset [29]. The details of the structure of the CNNs are given in the supplementary material. SeqLip improves by a factor of 5 the upper bound given by AutoLip for the CNN with 10 layers. Note that the lower bound obtained with simulated annealing is probably too low, as shown in the previous experiments.\n\n# layers | AutoLip | Greedy SeqLip | Ratio | Dataset | Annealing\n4 | 174 | 86 | 2 | 12.64 | 25.5\n5 | 790.1 | 335 | 2.4 | 16.79 | 22.2\n7 | 12141 | 3629 | 3.3 | 31.22 | 43.6\n10 | 4.5 × 10^6 | 8.2 × 10^5 | 5.4 | 38.26 | 107.8\n(AutoLip and Greedy SeqLip are upper bounds; Dataset and Annealing are lower bounds.)\n\nFigure 5: AutoLip and SeqLip for MNIST-CNNs of various sizes.\n\nAlexNet. AlexNet [1] is one of the first successes of deep learning in computer vision. The AutoLip algorithm finds that the Lipschitz constant is upper bounded by 3.62 × 10^7, which remains extremely large and probably well above the true Lipschitz constant. As for the experiment on CNNs, we use the 200 highest singular values of each linear layer for Greedy SeqLip. We obtain 5.45 × 10^6 as an upper bound approximation, which remains large despite its 6-fold improvement over AutoLip. Note that we do not get the same results as [9, Section 4.3] as we did not use the same weights.\n\n8 Conclusion\n\nIn this paper, we studied the Lipschitz regularity of neural networks. We first showed that exact computation of the Lipschitz constant is an NP-hard problem. We then provided a generic upper bound called AutoLip for the Lipschitz constant of any automatically differentiable function. 
In doing so, we introduced an algorithm to compute singular values of affine operators, such as convolutions, in a very efficient way using the autograd mechanism. We finally proposed a refinement of the previous method for MLPs, called SeqLip, and showed how this algorithm can improve on AutoLip theoretically and in applications, sometimes improving the AutoLip upper bound by up to a factor of 8. While the AutoLip and SeqLip upper bounds remain extremely large for neural networks of the computer vision literature (e.g. AlexNet, see Section 7), it is yet an open question whether these values are close to the true Lipschitz constant or substantially overestimate it.\n\nAcknowledgements\n\nThe authors thank the whole team at Huawei Paris and in particular Igor Colin, Moez Draief, Sylvain Robbiano and Albert Thomas for useful discussions and feedback.\n\nReferences\n[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.\n\n[2] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.\n\n[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.\n\n[4] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely connected convolutional networks. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.\n\n[5] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772, 2014.\n\n[6] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. In SSW, page 125. ISCA, 2016.\n\n[7] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.\n\n[8] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.\n\n[9] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.\n\n[10] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.\n\n[11] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML, pages 214–223, 2017.\n\n[12] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.\n\n[13] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 
Spectral normalization for generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.\n\n[14] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769\u20135779, 2017.\n\n[15] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.\n\n[16] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669\u2013695, December 2004.\n\n[17] Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241\u20136250, 2017.\n\n[18] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5949\u20135958, 2017.\n\n[19] R. Balan, M. K. Singh, and D. Zou. Lipschitz properties for deep convolutional networks. To appear in Contemporary Mathematics, 2018.\n\n[20] Louis B. Rall. Automatic Differentiation: Techniques and Applications, volume 120 of Lecture Notes in Computer Science. Springer, Berlin, 1981.\n\n[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 
2017.\n\n[22] Herbert Federer. Geometric Measure Theory. Classics in Mathematics. Springer-Verlag Berlin Heidelberg, 1969.\n\n[23] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Man\u00e9, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Vi\u00e9gas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.\n\n[24] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, volume 105. SIAM, 2008.\n\n[25] Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master\u2019s thesis (in Finnish), University of Helsinki, pages 6\u20137, 1970.\n\n[26] R. v. Mises and Hilda Pollaczek-Geiringer. Praktische Verfahren der Gleichungsaufl\u00f6sung [Practical methods of equation solving]. ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift f\u00fcr Angewandte Mathematik und Mechanik, 9(1):58\u201377, 1929.\n\n[27] Jan R. Magnus. On differentiating eigenvalues and eigenvectors. Econometric Theory, 1(2):179\u2013191, 1985.\n\n[28] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.\n\n[29] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.", "award": [], "sourceid": 1901, "authors": [{"given_name": "Aladin", "family_name": "Virmaux", "institution": "Huawei"}, {"given_name": "Kevin", "family_name": "Scaman", "institution": "Huawei Technologies, Noah's Ark"}]}