{"title": "Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 15390, "page_last": 15402, "abstract": "Lipschitz constraints under L2 norm on deep neural networks are useful for provable adversarial robustness bounds, stable training, and Wasserstein distance estimation. While heuristic approaches such as the gradient penalty have seen much practical success, it is challenging to achieve similar practical performance while provably enforcing a Lipschitz constraint. In principle, one can design Lipschitz constrained architectures using the composition property of Lipschitz functions, but Anil et al. recently identified a key obstacle to this approach: gradient norm attenuation. They showed how to circumvent this problem in the case of fully connected networks by designing each layer to be gradient norm preserving. We extend their approach to train scalable, expressive, provably Lipschitz convolutional networks. In particular, we present the Block Convolution Orthogonal Parameterization (BCOP), an expressive parameterization of orthogonal convolution operations. We show that even though the space of orthogonal convolutions is disconnected, the largest connected component of BCOP with 2n channels can represent arbitrary BCOP convolutions over n channels. Our BCOP parameterization allows us to train large convolutional networks with provable Lipschitz bounds. 
Empirically, we find that it is competitive with existing approaches to provable adversarial robustness and Wasserstein distance estimation.", "full_text": "Preventing Gradient Attenuation in\n\nLipschitz Constrained Convolutional Networks\n\nQiyang Li\u2217, Saminul Haque\u2217, Cem Anil, James Lucas, Roger Grosse, J\u00f6rn-Henrik Jacobsen\n\n{jlucas, rgrosse}@cs.toronto.edu\n\nj.jacobsen@vectorinstitute.ai\n\nUniversity of Toronto, Vector Institute\n\n{qiyang.li, saminul.haque, cem.anil}@mail.utoronto.ca\n\nAbstract\n\nLipschitz constraints under L2 norm on deep neural networks are useful for prov-\nable adversarial robustness bounds, stable training, and Wasserstein distance esti-\nmation. While heuristic approaches such as the gradient penalty have seen much\npractical success, it is challenging to achieve similar practical performance while\nprovably enforcing a Lipschitz constraint. In principle, one can design Lipschitz\nconstrained architectures using the composition property of Lipschitz functions,\nbut Anil et al. [2] recently identi\ufb01ed a key obstacle to this approach: gradient\nnorm attenuation. They showed how to circumvent this problem in the case of\nfully connected networks by designing each layer to be gradient norm preserving.\nWe extend their approach to train scalable, expressive, provably Lipschitz convo-\nlutional networks. In particular, we present the Block Convolution Orthogonal\nParameterization (BCOP), an expressive parameterization of orthogonal convolu-\ntion operations. We show that even though the space of orthogonal convolutions is\ndisconnected, the largest connected component of BCOP with 2n channels can rep-\nresent arbitrary BCOP convolutions over n channels. Our BCOP parameterization\nallows us to train large convolutional networks with provable Lipschitz bounds.\nEmpirically, we \ufb01nd that it is competitive with existing approaches to provable\nadversarial robustness and Wasserstein distance estimation. 
²

1 Introduction

There has been much interest in training neural networks with known upper bounds on their Lipschitz constants under the L2 norm³. Enforcing Lipschitz constraints can provide provable robustness against adversarial examples [48], improve generalization bounds [46], and enable Wasserstein distance estimation [2, 3, 22]. Heuristic methods for enforcing Lipschitz constraints, such as the gradient penalty [22] and spectral norm regularization [52], have seen much practical success, but provide no guarantees about the Lipschitz constant. It remains challenging to achieve similar practical success while provably satisfying a Lipschitz constraint.

∗Equal contributions
²Code is available at: github.com/ColinQiyangLi/LConvNet
³Unless specified otherwise, we refer to the Lipschitz constant as the Lipschitz constant under the L2 norm.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In principle, one can design provably Lipschitz-constrained architectures by imposing a Lipschitz constraint on each layer; the Lipschitz bound for the network is then the product of the bounds for each layer. Anil et al. [2] identified a key difficulty with this approach: because a layer with a Lipschitz bound of 1 can only reduce the norm of the gradient during backpropagation, each step of backprop gradually attenuates the gradient norm, resulting in a much smaller Jacobian for the network's function than is theoretically allowed. We refer to this problem as gradient norm attenuation. They showed that Lipschitz-constrained ReLU networks were prevented from using their full nonlinear capacity due to the need to prevent gradient norm attenuation. To counteract this problem, they introduced gradient norm preserving (GNP) architectures, where each layer preserves the gradient norm. 
For fully connected layers, this involved constraining the weight matrix to be orthogonal and using a GNP activation function called GroupSort. Unfortunately, the approach of Anil et al. [2] only applies to fully connected networks, leaving open the question of how to constrain the Lipschitz constants of convolutional networks.

As many state-of-the-art deep learning applications rely on convolutional networks, there have been numerous attempts to tightly enforce Lipschitz constants of convolutional networks. However, the existing techniques either hinder representational power or make optimization difficult. Cisse et al. [12], Tsuzuku et al. [48], and Qian and Wegman [40] provide loose bounds on the Lipschitz constant that can limit the parameterizable region. Gouk et al. [21] obtain a tight bound on the Lipschitz constant, but tend to lose expressive power during training due to vanishing singular values. The approach of Sedghi et al. [45] is computationally intractable for larger networks.

In this work, we introduce convolutional GNP networks with an efficient parameterization of orthogonal convolutions, adapting the construction algorithm from Xiao et al. [51]. This parameterization avoids the loose Lipschitz bounds and the computational intractability observed in the aforementioned approaches. Furthermore, we provide a theoretical analysis that demonstrates the disconnectedness of the space of orthogonal convolutions, and show how our parameterization alleviates the optimization challenge engendered by this disconnectedness.

We evaluate our GNP networks in two situations where expressive Lipschitz-constrained networks are of central importance. The first is provable norm-bounded adversarial robustness: the task of classification together with certifying that the network's classification will not change under any norm-bounded perturbation. 
Due to the tight Lipschitz properties, the constructed GNP networks can easily give non-trivial lower bounds on the robustness of the network's classification. We demonstrate that our method outperforms the state-of-the-art in provable deterministic robustness under the L2 metric on MNIST and CIFAR-10. The other application is Wasserstein distance estimation, which can be rephrased as a maximization over 1-Lipschitz functions, allowing our Lipschitz-constrained networks to be directly applied to this problem. Moreover, the restriction to GNP we impose is not necessarily a hindrance, as Gemici et al. [19] show that the optimal 1-Lipschitz function is also GNP almost everywhere. We demonstrate that our GNP convolutional networks can obtain tighter Wasserstein distance estimates than competing architectures.

2 Background

2.1 Lipschitz Functions under L2 Norm

In this work, we focus on Lipschitz functions with respect to the L2 norm. We say a function f : Rn → Rm is l-Lipschitz if and only if

‖f(x1) − f(x2)‖2 ≤ l‖x1 − x2‖2,  ∀x1, x2 ∈ Rn.   (1)

We denote by Lip(f) the smallest l for which f is l-Lipschitz, and call it the Lipschitz constant of f. For two Lipschitz continuous functions f and g, the following property holds:

Lip(f ◦ g) ≤ Lip(f) Lip(g).   (2)

The most basic neural network design consists of a composition of linear transformations and non-linear activation functions. The property above (Equation 2) allows one to upper-bound the Lipschitz constant of a network by the product of the Lipschitz constants of each layer. 
However, as modern neural networks tend to possess many layers, the resultant upper bound is likely to be very loose, and constraining it increases the risk of diminishing the usable capacity of the Lipschitz-constrained network.

2.2 Gradient Norm Preservation (GNP)

Let y = f(x) be 1-Lipschitz, and let L be a loss function. The norm of the gradient after backpropagating through a 1-Lipschitz function is no larger than the norm of the gradient before doing so:

‖∇x L‖2 = ‖(∇y L)(∇x f)‖2 ≤ ‖∇y L‖2 ‖∇x f‖2 ≤ Lip(f) ‖∇y L‖2 ≤ ‖∇y L‖2.

As a consequence of this relation, the gradient norm will likely be attenuated during backprop if no special measures are taken. One way to fix the gradient norm attenuation problem is to enforce each layer to be gradient norm preserving (GNP). Formally, f : Rn ↦ Rm is GNP if and only if its input-output Jacobian, J ∈ Rm×n, satisfies the following property:

‖Jᵀ g‖2 = ‖g‖2,  ∀g ∈ G,

where G ⊆ Rm defines the possible values that the gradient vector g can take. Note that when m = n, this condition is equivalent to orthogonality of J. In this work, we consider a slightly stricter definition where G = Rm, because this allows us to directly compose two GNP (strict) functions without reasoning about their corresponding G. For the rest of the paper, unless specified otherwise, a GNP function refers to this stricter definition.

Based on the definition of GNP, we can deduce that GNP functions are 1-Lipschitz in the 2-norm. Since the composition of GNP functions is also GNP, one can design a GNP network by stacking GNP building blocks. 
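The GroupSort activation of Anil et al. [2] (with group size 2, also called MaxMin) is GNP because its Jacobian is a permutation matrix. The NumPy sketch below is our own illustration, not the paper's released code: it checks numerically that backpropagating through an orthogonal layer followed by MaxMin preserves the gradient norm exactly.

```python
import numpy as np

def maxmin(x):
    """GroupSort with group size 2 (MaxMin): sort each consecutive pair.

    Its Jacobian is a permutation matrix, so it is gradient norm preserving."""
    a, b = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = np.maximum(a, b)
    out[1::2] = np.minimum(a, b)
    return out

def maxmin_vjp(x, g):
    """Vector-Jacobian product of maxmin at x: permute g the same way x was sorted."""
    a, b = x[0::2], x[1::2]
    swap = a < b  # pairs whose entries were swapped by the sort
    ga, gb = g[0::2], g[1::2]
    out = np.empty_like(g)
    out[0::2] = np.where(swap, gb, ga)
    out[1::2] = np.where(swap, ga, gb)
    return out

rng = np.random.default_rng(0)
n = 8
W, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal weight
x = rng.standard_normal(n)
g = rng.standard_normal(n)                        # upstream gradient

h = W @ x
y = maxmin(h)                 # forward pass: y = maxmin(Wx)
gx = W.T @ maxmin_vjp(h, g)   # backward pass: gradient w.r.t. x
assert np.isclose(np.linalg.norm(gx), np.linalg.norm(g))  # norm preserved exactly
```

Stacking such blocks yields a GNP network by the composition argument above.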
Another favourable condition that GNP networks exhibit is dynamical isometry [51, 37, 38] (where the entire distribution of singular values of the input-output Jacobian is close to 1), which has been shown to improve training speed and stability.

2.3 Provable Norm-bounded Adversarial Robustness

We consider a classifier f with T classes that takes in an input x and produces a logit for each of the classes: f(x) = [y1 y2 ··· yT]. An input data point x with label t ∈ {1, 2, ···, T} is provably robustly classified by f under a perturbation norm of ε if

argmax_i f(x + δ)_i = t,  ∀δ : ‖δ‖2 ≤ ε.

The margin of the prediction for x is given by Mf(x) = max(0, yt − max_{i≠t} yi). If f is l-Lipschitz, we can certify that f is robust with respect to x if √2 · lε < Mf(x) (see Appendix P for the proof).

2.4 Wasserstein Distance Estimation

The Wasserstein distance is a distance metric between two probability distributions [39]. The Kantorovich-Rubinstein formulation of the Wasserstein distance expresses it as a maximization problem over 1-Lipschitz functions [3]:

W(P1, P2) = sup_{f : Lip(f) ≤ 1} ( E_{x∼P1(x)}[f(x)] − E_{x∼P2(x)}[f(x)] ).   (3)

In the Wasserstein GAN architecture, Arjovsky et al. [3] proposed to parametrize the scalar-valued function f using a Lipschitz-constrained network, which serves as the discriminator that estimates the Wasserstein distance between the generator and data distributions. One important property to note is that the optimal scalar function f is GNP almost everywhere (see Corollary 1 in Gemici et al. [19]). Naturally, this property favours an optimization approach that searches over GNP functions. Indeed, Anil et al. 
[2] found that GNP networks can achieve tighter lower bounds compared to non-GNP networks.

3 Orthogonal Convolution Kernels

The most crucial step in building a GNP convolutional network is constructing the GNP convolution itself. Since a convolution operator is a linear operator, making the convolution kernel GNP is equivalent to making its corresponding linear operator orthogonal. While there are numerous methods for orthogonalizing arbitrary linear operators, it is not immediately clear how to do this for convolutions, especially while preserving kernel size. We first summarize the orthogonal convolution representations from Kautsky and Turcajová [28] and Xiao et al. [51] (Section 3.1). Then, we analyze the topology of the space of orthogonal convolution kernels and demonstrate that the space is disconnected (with at least O(n²) connected components for a 2 × 2 2-D convolution layer), which is problematic for gradient-based optimization methods because they are confined to one component (Section 3.2). Fortunately, this problem can be fixed by increasing the number of channels: we demonstrate that a single connected component of the space of orthogonal convolutions with 2n channels can represent any orthogonal convolution with n channels (Section 3.3).

Figure 1: Visualization of a 1-D orthogonal convolution, [P  I − P], applied to a 1-D input tensor v ∈ R^{2×3} with a length of 3 and channel size of 2. P ∈ R^{2×2} here is the orthogonal projection onto the x-axis, which makes I − P the complementary projection onto the y-axis. 
Each cell of v corresponds to one of the three spatial locations, and the vector contained within it represents the vector along the channel dimension at that spatial location.

3.1 Constructing Orthogonal Convolutions

To begin analysing orthogonal convolution kernels, we must first understand the symmetric projector, which is a fundamental building block of orthogonal convolutions. An n × n matrix P is defined to be a symmetric projector if P = P² = Pᵀ. Geometrically, a symmetric projector P represents an orthogonal projection onto the range of P. From this geometric interpretation, it is not hard to see that the space of projectors has n + 1 connected components, based on the rank of the projector (for a more rigorous treatment, see Remark 4.1 in Appendix K). For notational simplicity, we denote by P(n) the set of all n × n symmetric projectors and by P(n, k) the subset of all n × n symmetric projectors with rank k.

Now that the concept of symmetric projectors has been established, we consider how to construct 1-D convolutional kernels. As shown by Kautsky and Turcajová [28], all 1-D orthogonal convolution kernels with a kernel size K can be represented as:

W(H, P1:K−1) = H ⊡ [P1  (I − P1)] ⊡ ··· ⊡ [PK−1  (I − PK−1)]   (4)

where H ∈ O(n) is an n × n orthogonal matrix, Pi ∈ P(n), and ⊡ represents block convolution, which is convolution using matrix multiplication rather than scalar multiplication:

[X1 X2 ··· Xp] ⊡ [Y1 Y2 ··· Yq] = [Z1 Z2 ··· Zp+q−1]

with Zi = Σi′ Xi′ Yi−i′, where the out-of-range elements are all zero (e.g., X<1 = 0, X>p = 0, Y<1 = 0, Y>q = 0). Unlike regular convolution, block convolution does not commute, since matrix multiplication does not commute. 
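To make these building blocks concrete, the following NumPy sketch (our own illustration; `projector`, `block_conv`, and `apply_conv` are illustrative names, not from the paper's code) constructs a random symmetric projector, verifies P = P² = Pᵀ, and checks both the composition property of block convolution and the norm preservation of the size-2 kernel [P  I − P] under circular boundary conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 3, 8  # channels, spatial length (circular boundary conditions)

def projector(n, k, rng):
    """Random n x n symmetric projector of rank k: P = U U^T with U orthonormal."""
    U, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return U @ U.T

def block_conv(X, Y):
    """Block convolution of two lists of n x n matrices (the box operator of Eq. 4)."""
    Z = [np.zeros((n, n)) for _ in range(len(X) + len(Y) - 1)]
    for i, Xi in enumerate(X):
        for j, Yj in enumerate(Y):
            Z[i + j] = Z[i + j] + Xi @ Yj
    return Z

def apply_conv(X, v):
    """Apply kernel X (a list of n x n blocks) to v in R^{n x S}, circularly."""
    out = np.zeros_like(v)
    for i, Xi in enumerate(X):
        out += Xi @ np.roll(v, -i, axis=1)
    return out

P = projector(n, 1, rng)
assert np.allclose(P, P @ P) and np.allclose(P, P.T)  # symmetric projector

X = [P, np.eye(n) - P]        # size-2 orthogonal kernel [P, I - P]
Q = projector(n, 2, rng)
Y = [Q, np.eye(n) - Q]
v = rng.standard_normal((n, S))

# composition property: X * (Y * v) == (X box Y) * v
assert np.allclose(apply_conv(X, apply_conv(Y, v)), apply_conv(block_conv(X, Y), v))
# the kernel [P, I - P] induces an orthogonal (norm-preserving) operator
assert np.isclose(np.linalg.norm(apply_conv(X, v)), np.linalg.norm(v))
```

The second assertion is exactly the decomposition used in Equation 4: a long orthogonal kernel is a block convolution of size-2 orthogonal kernels.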
One important property of block convolution is that it corresponds to composition of the kernel operators. That is, X ∗ (Y ∗ v) = (X ⊡ Y) ∗ v, where A ∗ v represents the tensor resulting from applying convolution A to v. This composition property allows us to decompose the representation (Equation 4) into applications of orthogonal convolutions with kernel size 2 (Figure 1 demonstrates the effect of one such application) along with a channel-wise orthogonal transformation (H).

Xiao et al. [51] extended the 1-D representation to the 2-D case using alternating applications of orthogonal convolutions of size 2:

W(H, P1:K−1, Q1:K−1) = H ⊡ [P1; I − P1] ⊡ [Q1  I − Q1] ⊡ ··· ⊡ [PK−1; I − PK−1] ⊡ [QK−1  I − QK−1]   (5)

where [Pi; I − Pi] denotes a vertical (2 × 1) kernel, [Qi  I − Qi] a horizontal (1 × 2) kernel, X ⊡ Y (block convolution) is defined similarly to the 1-D case with Zij = Σi′ Σj′ Xi′,j′ Yi−i′,j−j′, and Pi, Qi ∈ P(n). Unlike in 1-D, we discovered that this 2-D representation can only represent a subset of 2-D orthogonal convolutions (see Appendix O for an example). However, we do not know whether simple modifications to this parameterization would result in a complete representation of all 2-D orthogonal convolutions (see Appendix O for details on the open question).

3.2 Topology of the Orthogonal Convolution Space

Before utilizing this space of orthogonal convolutions, we would like to analyze some of its fundamental properties. 
Algorithm 1: Block Convolution Orthogonal Parameterization (BCOP)

Input: co × ci unconstrained matrix O; ci × ⌊ci/2⌋ unconstrained matrices Mi, Ni for i from 1 to K − 1; assuming ci ≥ co
Result: orthogonal convolution kernel W ∈ R^{co×ci×K×K}
H ← Orthogonalize(O);   ▷ any differentiable orthogonalization procedure (e.g., Björck [5])
Initialize W as a 1 × 1 convolution with W[0, 0] = H;
for i from 1 to K − 1 do
    RP, RQ ← Orthogonalize(Mi), Orthogonalize(Ni);
    P, Q ← RP RPᵀ, RQ RQᵀ;   ▷ construct symmetric projectors with half of the full rank
    W ← W ⊡ [P; I − P] ⊡ [Q  I − Q];
end
Output: W

Since P(n) has n + 1 connected components and orthogonal convolutions are constructed out of many projectors, it is to be expected that there are many connected components in the space of orthogonal convolutions. Indeed, we see the first result in 1-D (Theorem 1).

Theorem 1 (Connected Components of 1-D Orthogonal Convolution). The 1-D orthogonal convolution space is compact and has 2(K − 1)n + 2 connected components, where K is the kernel size and n is the number of channels.

In 2-D, we analyze the case of kernel size 2 (2 × 2 kernels) and show that the number of connected components grows at least quadratically with respect to the channel size:

Theorem 2 (Connected Components of 2-D Orthogonal Convolution with K = 2). The 2-D orthogonal convolution space with a kernel size of 2 × 2 has at least 2(n + 1)² connected components, where n is the number of channels.

The disconnectedness of the space of orthogonal convolutions imposes an intrinsic difficulty in optimizing over orthogonal convolution kernels, as gradient-based optimizers are confined to their initial connected component. 
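Algorithm 1 above can be sketched in NumPy as follows. This is a simplified, unofficial sketch: it uses QR in place of the differentiable Björck orthogonalization, assumes co = ci = n for brevity, and the helper names (`bcop`, `block_conv2d`) are ours. The final check verifies orthogonality of the resulting circular convolution through its Fourier symbols: the operator is orthogonal exactly when every frequency-domain matrix of the kernel is unitary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, S = 4, 3, 8  # channels, kernel size, spatial size (circular padding)
I = np.eye(n)

def orthogonalize(M):
    # stand-in for the differentiable Bjorck orthogonalization used in the paper
    Q, _ = np.linalg.qr(M)
    return Q

def block_conv2d(X, Y):
    """2-D block convolution: X, Y are (h, w) grids of n x n matrices."""
    hx, wx, hy, wy = X.shape[0], X.shape[1], Y.shape[0], Y.shape[1]
    Z = np.zeros((hx + hy - 1, wx + wy - 1, n, n))
    for a in range(hx):
        for b in range(wx):
            for c in range(hy):
                for d in range(wy):
                    Z[a + c, b + d] += X[a, b] @ Y[c, d]
    return Z

def bcop(O, Ms, Ns):
    H = orthogonalize(O)
    W = H[np.newaxis, np.newaxis]                  # 1 x 1 kernel
    for M, N in zip(Ms, Ns):
        U, V = orthogonalize(M), orthogonalize(N)
        P, Q = U @ U.T, V @ V.T                    # rank n//2 symmetric projectors
        vert = np.stack([P, I - P])[:, np.newaxis]  # 2 x 1 kernel [P; I - P]
        horz = np.stack([Q, I - Q])[np.newaxis]     # 1 x 2 kernel [Q, I - Q]
        W = block_conv2d(block_conv2d(W, vert), horz)
    return W                                        # shape (K, K, n, n)

W = bcop(rng.standard_normal((n, n)),
         [rng.standard_normal((n, n // 2)) for _ in range(K - 1)],
         [rng.standard_normal((n, n // 2)) for _ in range(K - 1)])

# orthogonality check: every Fourier symbol of the kernel must be unitary
w = np.fft.fft2(np.pad(W, ((0, S - K), (0, S - K), (0, 0), (0, 0))), axes=(0, 1))
for u in range(S):
    for v in range(S):
        assert np.allclose(w[u, v].conj().T @ w[u, v], I, atol=1e-8)
```

The check passes because each factor in Equation 5 has a unitary symbol (e.g., P + (I − P)e^{−2πiu/S} is unitary for any symmetric projector P), and products of unitary matrices are unitary.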
We refer readers to Appendix K for the proof of Theorem 1 and Appendix M for the proof of Theorem 2.

3.3 Block Convolution Orthogonal Parameterization (BCOP)

To remedy the disconnectedness issue, we show the following:

Theorem 3 (BCOP Construction with Auxiliary Dimension). For any convolution C = W(H, P1:K−1, Q1:K−1) with n input and output channels and Pi, Qi ∈ P(n), there exists a convolution C′ = W(H′, P′1:K−1, Q′1:K−1) with 2n input and output channels, constructed from only rank-n projectors (P′i, Q′i ∈ P(2n, n)), such that C′(x)1:n = C(x1:n). That is, the first n channels of the output are the same with respect to the first n channels of the input under both convolutions.

The idea behind this result is that some projectors in P(2n, n) may use their first n dimensions to represent P(n) and then use the latter n dimensions in a trivial capacity so that the total rank is n (see Appendix N for the detailed proof).

Theorem 3 implies that all connected components of orthogonal convolutions constructed by W with n channels can be equivalently represented in a single connected component of convolutions constructed by W with 2n channels, by only using projectors of rank n. (This comes at the cost of requiring 4 times as many parameters.)

This result motivates us to parameterize the connected subspace of orthogonal convolutions defined by W(H, P̃1:K−1, Q̃1:K−1), where P̃i ∈ P(n, ⌊n/2⌋) and Q̃i ∈ P(n, ⌊n/2⌋). We refer to this method as the Block Convolution Orthogonal Parameterization (BCOP). The procedure for BCOP is summarized in Algorithm 1 (see Appendix H for implementation details).

4 Related Work

Reshaped Kernel Method (RK)  This method reshapes a convolution kernel with dimensions (co, ci, k, k) into a (co, k²ci) matrix. 
The Lipschitz constant (or spectral norm) of a convolution operator is bounded by a constant factor of the spectral norm of its reshaped matrix [12, 48, 40], which enables bounding the convolution operator's Lipschitz constant by bounding that of the reshaped matrix. However, this upper bound can be conservative, causing a bias towards convolution operators with low Lipschitz constants and limiting the method's expressive power. In this work, we strictly enforce orthogonality of the reshaped matrix rather than softly constraining it via regularization, as done in Cisse et al. [12]. We refer to this variant as reshaped kernel orthogonalization, or RKO.

One-Sided Spectral Normalization (OSSN)  This variant of spectral normalization [36] scales the kernel so that the spectral norm of the convolution operator is at most 1 [21]. This is a projection under the matrix 2-norm but not the Frobenius norm. It is notable because Euclidean steepest descent with this projection, as in constrained gradient-based optimization, has no guarantee of converging to the correct solution (see an explicit example and further analysis in Appendix A). In practice, we found that projecting during the forward pass (as in Miyato et al. [36]) yields better performance than projecting after each gradient update.

Singular Value Clipping and Masking (SVCM)  Unlike spectral normalization, singular value clipping is a valid projection under the Frobenius norm. Sedghi et al. [45] demonstrate a method to perform an approximation of the optimal projection onto the orthogonal kernel space. 
Unfortunately, this method needs many expensive iterations to enforce the Lipschitz constraint tightly, making it computationally intractable for training large networks with provable Lipschitz constraints.

Comparison to BCOP  The run-times of OSSN and SVCM depend on the input's spatial dimensions, which prohibits scalability (see Appendix C for a time complexity analysis). RKO does not guarantee an exact Lipschitz constant, which may cause a loss in expressive power. Additionally, none of these methods guarantees gradient norm preservation. BCOP avoids all of the issues above.

4.1 Provable Adversarial Robustness

Certifying the adversarial robustness of a network subject to norm-ball perturbations is difficult. Exact certification methods using mixed-integer linear programming or SMT solvers scale poorly with the complexity of the network [27, 10]. Cohen et al. [15] and Salman et al. [43] use an estimated smoothed classifier to achieve very high provable robustness with high confidence. In this work, we are primarily interested in providing deterministic provable robustness guarantees.

Recent work has focused on guiding the training of networks so that they are easier to verify or certify (providing a lower bound on provable robustness) [49, 50, 18, 17, 23]. For example, Xiao et al. [50] encourage weight sparsity and perform network pruning to speed up the exact verification process for ReLU networks. Wong et al. [49] optimize the network directly towards a robustness lower bound using a dual optimization formulation.

Alternatively, rather than modifying the optimization objective to incentivize robust classification, one can train networks to have a small global Lipschitz constant, which allows an easy way to certify robustness via the output margin. Cohen et al. 
[14] deploy spectral norm regularization on the weight matrices of a fully connected network to constrain the Lipschitz constant and certify the robustness of the network at test time. Tsuzuku et al. [48] estimate an upper bound on the network's Lipschitz constant and train the network to maximize the output margin using a modified softmax objective based on the estimated Lipschitz constant. In contrast to these approaches, Anil et al. [2] train fully connected networks that have a known Lipschitz constant by enforcing gradient norm preservation. Our work extends this idea to convolutional networks.

5 Experiments

The primary point of interest for the BCOP method (Section 3.3) is its expressiveness compared against other common approaches to parameterizing Lipschitz-constrained convolutions (Section 4). To study this, we perform an ablation study on two tasks using these architectures. The first task is provably robust image classification on two datasets (MNIST [31] and CIFAR-10 [30])⁴. We find that our method outperformed the other Lipschitz-constrained convolutions under the same architectures, as well as the state-of-the-art on this task (Section 5.2). The second task is 1-Wasserstein distance estimation of GANs, where our method also outperformed the other competing Lipschitz convolutions under the same architecture (Section 5.3).

⁴We only claim this in the deterministic case, as recent approaches achieve much higher probabilistic provable robustness [15, 43].

5.1 Network Architectures and Training Details

A benefit of training GNP networks is that we enjoy the property of dynamical isometry, which inherently affords greater training stability, thereby reducing the need for common techniques that would otherwise be difficult to incorporate into a GNP network. 
For example, if a 1-Lipschitz residual connection is to maintain GNP, the residual block must be an identity function with a constant bias (see an informal justification in Appendix D.1). Also, batch normalization involves scaling the layer's output, which is not necessarily 1-Lipschitz, let alone GNP. For these reasons, residual connections and batch normalization are not included in the model architecture. We also use cyclic padding in place of zero-padding, since zero-padded orthogonal convolutions must have kernel size 1 (see an informal proof in Appendix D.2). Finally, we use "invertible downsampling" [25] in place of striding and pooling to achieve spatial downsampling while maintaining the GNP property. The details of these architectural decisions are in Appendix D.

Because of these architectural constraints, we base our networks on architectures that do not involve residual connections. For provable robustness experiments, we use the "Small" and "Large" convolutional networks from Wong et al. [49]. For Wasserstein distance estimation, we use the fully convolutional critic from Radford et al. [41] (see Appendices E and F for details). Unless specified otherwise, each experiment is repeated 5 times with mean and standard deviation reported.

5.2 Provable Adversarial Robustness

5.2.1 Robustness Evaluation

For adversarial robustness evaluation, we use the L2-norm-constrained threat model [8], where the adversary is constrained to perturbations with L2 norm below ε. We refer to clean accuracy as the percentage of unperturbed examples that are correctly classified, and robust accuracy as the percentage of examples that are guaranteed to be correctly classified under the threat model. We use the margin of the model prediction to determine a lower bound on the robust accuracy (as described in Section 2.3). 
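The margin-based certificate from Section 2.3 can be computed directly from the logits. The following is a minimal sketch (our own helper, not from the released code) of that rule:

```python
import numpy as np

def certified_robust(logits, label, lipschitz, eps):
    """Certify robustness at radius eps via the margin rule of Section 2.3.

    An input is certified if sqrt(2) * lipschitz * eps < margin, where
    margin = max(0, y_t - max_{i != t} y_i)."""
    y = np.asarray(logits, dtype=float)
    others = np.delete(y, label)
    margin = max(0.0, y[label] - others.max())
    return np.sqrt(2.0) * lipschitz * eps < margin

# toy example with illustrative logits from a 1-Lipschitz classifier:
# margin is 3.0 - 0.2 = 2.8, while sqrt(2) * 1.58 is about 2.23
print(certified_robust([3.0, 0.2, -1.0], label=0, lipschitz=1.0, eps=1.58))
```

The certified robust accuracy reported below is the fraction of test examples for which such a check succeeds.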
We also evaluate the empirical robustness of our model under two gradient-based attacks and two decision-based attacks: (i) PGD attack with CW loss [34, 7], (ii) FGSM [47], (iii) Boundary attack (BA) [6], and (iv) Pointwise attack (PA) [44]. Specifically, the gradient-based attacks ((i) and (ii)) are run on the whole test dataset; the decision-based attacks ((iii) and (iv)) are run only on the first 100 test data points, since they are expensive to run.⁵

5.2.2 Comparison of Different Methods for Enforcing Spectral Norm of Convolution

We compare the performance of OSSN, RKO, SVCM, and BCOP on margin training for adversarial robustness on MNIST and CIFAR-10. To make the comparison fair, we ensure all the methods have a tight Lipschitz constraint of 1. For OSSN, we use 10 power iterations and keep a running vector for each convolution layer to estimate the spectral norm, performing the projection during every forward pass. For SVCM, we perform the singular value clipping projection with 50 iterations after every 100 gradient updates to ensure the Lipschitz bound is tight. For RKO, instead of using a regularization term to enforce orthogonality (as done in Cisse et al. [12]), we use Björck orthogonalization [5] on the reshaped matrix before scaling down the kernel. We train two different convolutional architectures with the four aforementioned methods of enforcing Lipschitz convolution layers on image classification tasks. To achieve large output margins, we use a first-order, multi-class hinge loss with a margin of 2.12 on MNIST and 0.7071 on CIFAR-10.

Our approach (BCOP) outperforms all competing methods across all architectures on both MNIST and CIFAR-10 (see Table 1 and Appendix I, Table 7). To understand the performance gap, we visualize the singular value distribution of a convolution layer before and after training in Figure 2. 
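Singular value distributions like those in Figure 2 can be computed exactly for circularly padded convolutions using the FFT-based characterization of Sedghi et al. [45]. A sketch, with our own function name:

```python
import numpy as np

def conv_singular_values(kernel, size):
    """All singular values of the circular convolution with `kernel`.

    kernel: (c_out, c_in, k, k); size: spatial input size (assumes size >= k).
    Following Sedghi et al. [45], the operator's singular values are the union
    of the singular values of the kernel's per-frequency 2-D Fourier transforms."""
    c_out, c_in, k, _ = kernel.shape
    padded = np.zeros((c_out, c_in, size, size))
    padded[:, :, :k, :k] = kernel
    transforms = np.fft.fft2(padded)                  # (c_out, c_in, size, size)
    # one (c_out x c_in) matrix per spatial frequency
    mats = transforms.transpose(2, 3, 0, 1).reshape(-1, c_out, c_in)
    return np.sort(np.linalg.svd(mats, compute_uv=False).ravel())

rng = np.random.default_rng(0)
svs = conv_singular_values(rng.standard_normal((2, 2, 3, 3)), size=8)
print(svs.max())  # the exact Lipschitz constant of this circular convolution
```

An orthogonal convolution (e.g., one produced by BCOP) has all of these singular values equal to 1, which is the flat spectrum visible for BCOP in Figure 2.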
We observe that OSSN and RKO push many singular values to 0, suggesting that the convolution layer is not fully utilizing the expressive power it is capable of. This observation is consistent with our hypothesis that these methods bias the convolution operators towards sub-optimal regions caused by the loose Lipschitz bound (for RKO) and improper projection (for OSSN). In contrast, SVCM's singular values started mostly near 0.5 and some of them were pushed up towards 1, which is consistent with the procedure being an optimal projection. By design, BCOP keeps all of its singular values at 1 throughout training, due to its gradient norm preservation and orthogonality. Thus, we empirically verify the downsides of the other methods and show that our proposed method enables maximally expressive Lipschitz constrained convolutional layers with guaranteed gradient norm preservation.

5We use foolbox [42] for the two decision-based methods.

Figure 2: Singular value distribution at initialization (blue) and at the end of training (orange) for the second layer of the "Large" baseline using different methods to enforce Lipschitz convolution.

Dataset               |              | BCOP         | OSSN         | RKO          | SVCM
----------------------|--------------|--------------|--------------|--------------|-------------
MNIST (ε = 1.58)      | Small Clean  | 97.54 ± 0.06 | 97.28 ± 0.08 | 96.86 ± 0.13 | 97.24 ± 0.09
                      | Small Robust | 45.84 ± 0.90 | 43.58 ± 0.44 | 42.95 ± 1.09 | 28.94 ± 1.58
                      | Large Clean  | 98.77 ± 0.05 | 98.44 ± 0.05 | 98.31 ± 0.03 | 97.93 ± 0.05
                      | Large Robust | 56.66 ± 0.23 | 55.18 ± 0.46 | 53.77 ± 1.02 | 38.00 ± 1.82
CIFAR-10 (ε = 36/255) | Small Clean  | 64.53 ± 0.30 | 61.77 ± 0.63 | 62.18 ± 0.66 | 62.39 ± 0.46
                      | Small Robust | 50.01 ± 0.21 | 47.46 ± 0.53 | 48.03 ± 0.54 | 47.59 ± 0.56
                      | Large Clean  | 72.41 ± 0.22 | 70.01 ± 0.26 | 67.51 ± 0.47 | 69.65 ± 0.38
                      | Large Robust | 58.72 ± 0.23 | 55.76 ± 0.16 | 53.64 ± 0.49 | 53.61 ± 0.51

Table 1: Clean and robust accuracy on MNIST and CIFAR-10 using different Lipschitz convolutions. The provable robust accuracy is evaluated at ε = 1.58 for MNIST and at ε = 36/255 for CIFAR-10.

Dataset               |        | BCOP-Large   | FC-3         | KW-Large | KW-Resnet
----------------------|--------|--------------|--------------|----------|----------
MNIST (ε = 1.58)      | Clean  | 98.77 ± 0.05 | 98.71 ± 0.02 | 88.12    | -
                      | Robust | 56.66 ± 0.23 | 54.46 ± 0.30 | 44.53    | -
CIFAR-10 (ε = 36/255) | Clean  | 72.41 ± 0.22 | 62.60 ± 0.39 | 59.76    | 61.20
                      | Robust | 58.72 ± 0.23 | 49.97 ± 0.35 | 50.60    | 51.96

Table 2: Comparison of our convolutional networks and the fully connected baseline of Anil et al. [2] (FC-3) against provably robust models from previous work. The numbers for KW-Large and KW-Resnet are obtained directly from Table 4 in the Appendix of Wong et al. [49].

5.2.3 State-of-the-art Comparison

To further demonstrate the expressive power of orthogonal convolutions, we compare our networks with models that achieve state-of-the-art deterministic provable adversarial robustness (Table 2 and Appendix J, Tables 8 and 9). We also evaluate the empirical robustness of our model against common attacks on CIFAR-10. Compared with Wong et al. [49], our approach reaches similar performance for the "Small" architecture and better performance for the "Large" architecture (Table 3).

5.3 Wasserstein Distance Estimation

In this section, we consider the problem of estimating the Wasserstein distance between two high-dimensional distributions using neural networks. Anil et al. [2] showed that, in the fully connected setting, ensuring gradient norm preservation is critical for obtaining tighter lower bounds on the Wasserstein distance.
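The quantity being maximized is the Kantorovich-Rubinstein dual objective: any 1-Lipschitz critic f yields a lower bound E_P[f] - E_Q[f] on the Wasserstein-1 distance. A toy sketch with a hand-picked 1-Lipschitz critic (the 1-D distributions and the critic here are purely illustrative; in the experiments this role is played by a Lipschitz constrained convolutional network):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two 1-D distributions of identical shape shifted by 2.0,
# so their true W1 distance is exactly 2.0.
p_samples = rng.normal(loc=2.0, scale=0.1, size=6400)
q_samples = rng.normal(loc=0.0, scale=0.1, size=6400)

def critic(x):
    # f(x) = x has Lipschitz constant 1.
    return x

# Kantorovich-Rubinstein duality: E_P[f] - E_Q[f] <= W1(P, Q)
# for every 1-Lipschitz f, so this estimate is a valid lower bound.
lower_bound = critic(p_samples).mean() - critic(q_samples).mean()
```

A more expressive 1-Lipschitz critic can only raise the estimate towards the true distance, which is why capacity wasted to gradient norm attenuation shows up directly as a looser bound.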
We observe the same phenomenon in the convolutional setting.

Entire test set |  KW   | BCOP
----------------|-------|-------------
Small Clean     | 54.39 | 64.53 ± 0.30
Small PGD       | 49.94 | 51.26 ± 0.17
Small FGSM      | 49.98 | 51.57 ± 0.18
Large Clean     | 60.14 | 72.41 ± 0.22
Large PGD       | 55.53 | 64.39 ± 0.26
Large FGSM      | 55.55 | 64.53 ± 0.25

First 100 (*)   |  KW   | BCOP
----------------|-------|-------------
Small Clean (*) | 63.00 | 74.20 ± 2.23
Small BA (*)    | 60.00 | 61.20 ± 2.99
Small PA (*)    | 63.00 | 74.00 ± 2.28
Large Clean (*) | 68.00 | 77.60 ± 1.74
Large BA (*)    | 64.00 | 71.20 ± 1.60
Large PA (*)    | 68.00 | 77.20 ± 1.60

Table 3: Comparison of our networks with Wong et al. [49] on the CIFAR-10 dataset. Top: evaluation on the entire CIFAR-10 test set. Bottom (*): evaluation on the first 100 test samples. The KW models [49] are taken directly from their official repository.

                  | OSSN        | RKO         | BCOP
------------------|-------------|-------------|------------
STL-10   MaxMin   | 7.39 ± 0.31 | 8.95 ± 0.12 | 9.91 ± 0.11
         ReLU     | 7.06 ± 0.72 | 7.82 ± 0.21 | 8.28 ± 0.19
CIFAR-10 MaxMin   | 3.29 ± 0.05 | 4.95 ± 0.08 | 5.34 ± 0.07
         ReLU     | 3.07 ± 0.12 | 4.20 ± 0.06 | 4.39 ± 0.07

Table 4: Comparison of different Lipschitz constrained architectures on the Wasserstein distance estimation task between the data and generator distributions of STL-10 and CIFAR-10 GANs. Each estimate is a strict lower bound (estimated using 6,400 pairs of randomly sampled real and generated image examples), so larger values indicate better performance.

We trained our networks to estimate the Wasserstein distance between the data and generator distributions of GANs6 [20] trained on RGB images from the STL-10 dataset [13] and the CIFAR-10 dataset [30] (resized to 64x64). After training the GANs, we froze the generator weights and trained Lipschitz constrained convolutional networks to estimate the Wasserstein distance. We adapted the fully convolutional discriminator model used by Radford et al.
[41] by removing all batch normalization layers and replacing all vanilla convolutional layers with the Lipschitz candidates (BCOP, RKO, and OSSN)7. We trained each model with ReLU or MaxMin activations [2]. The results are shown in Table 4. Baking gradient norm preservation into the architecture leads to significantly tighter lower bounds on the Wasserstein distance: the only architecture that is gradient norm preserving throughout (BCOP with MaxMin) gives the best estimate. Although OSSN is free to learn orthogonal kernels, it does not do so in practice, which limits its expressive power.

6 Conclusion and Future Work

We introduced convolutional GNP networks together with an efficient construction of orthogonal convolutions (BCOP) that overcomes common issues of Lipschitz constrained networks, such as loose Lipschitz bounds and gradient norm attenuation. In addition, we showed that the space of orthogonal convolutions has many connected components and demonstrated how the BCOP parameterization alleviates the optimization challenges caused by this disconnectedness. Our GNP networks outperform the state of the art for deterministic provable adversarial robustness under the L2 metric on both CIFAR-10 and MNIST, and obtain tighter Wasserstein distance estimates between high-dimensional distributions than competing approaches. Despite its effectiveness, our parameterization can express only a subspace of orthogonal convolutions. A complete parameterization of the orthogonal convolution space may enable training even more powerful GNP convolutional networks.
We presented potential directions to achieve this and leave the problem for future work.

6Note that any GAN variant could have been chosen here.
7We omit SVCM from this comparison due to its computational intractability.

Acknowledgements

We would like to thank Lechao Xiao, Arthur Rabinovich, Matt Koster, and Siqi Zhou for their valuable insights and feedback. We would like to thank Sherjil Ozair for spotting a bug in our code, whose fix improved our results. RG acknowledges support from the CIFAR Canadian AI Chairs program.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308-318. ACM, 2016.

[2] Cem Anil, James Lucas, and Roger Grosse. Sorting out Lipschitz function approximation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 291-301, Long Beach, California, USA, 09-15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/anil19a.html.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[4] Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334-334, 1997.

[5] Åke Björck and Clazett Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358-364, 1971.

[6] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyZI0GWCZ.

[7] Nicholas Carlini and David Wagner.
Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39-57. IEEE, 2017.

[8] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, and Aleksander Madry. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.

[9] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172-2180, 2016.

[10] Chih-Hong Cheng, Georg Nührenberg, and Harald Ruess. Maximum resilience of artificial neural networks. In International Symposium on Automated Technology for Verification and Analysis, pages 251-268. Springer, 2017.

[11] Artem Chernodub and Dimitri Nowicki. Norm-preserving orthogonal permutation linear unit activation functions (OPLU). arXiv preprint arXiv:1604.02313, 2016.

[12] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 854-863. JMLR.org, 2017.

[13] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215-223, 2011.

[14] Jeremy EJ Cohen, Todd Huster, and Ra Cohen. Universal Lipschitz approximation in bounded depth neural networks. arXiv preprint arXiv:1904.04861, 2019.

[15] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.

[16] David Cox and Nicolas Pinto.
Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Face and Gesture 2011, pages 8-15. IEEE, 2011.

[17] Francesco Croce and Matthias Hein. Provable robustness against all adversarial lp-perturbations for p ≥ 1. arXiv preprint arXiv:1905.11213, 2019.

[18] Francesco Croce, Maksym Andriushchenko, and Matthias Hein. Provable robustness of ReLU networks via maximization of linear regions. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2057-2066. PMLR, 16-18 Apr 2019. URL http://proceedings.mlr.press/v89/croce19a.html.

[19] Mevlana Gemici, Zeynep Akata, and Max Welling. Primal-Dual Wasserstein GAN. arXiv preprint arXiv:1805.09575, 2018.

[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[21] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael Cree. Regularisation of neural networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018.

[22] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.

[23] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266-2276, 2017.

[24] Todd Huster, Cho-Yu Jason Chiang, and Ritu Chadha. Limitations of the Lipschitz constant as a defense against adversarial examples. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 16-29.
Springer, 2018.

[25] Jörn-Henrik Jacobsen, Arnold W.M. Smeulders, and Edouard Oyallon. i-RevNet: Deep invertible networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJsjkMb0Z.

[26] Hyeonwoo Kang. pytorch-generative-model-collections, 2016. URL https://github.com/znxlwm/pytorch-generative-model-collections.

[27] Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97-117. Springer, 2017.

[28] Jaroslav Kautsky and Radka Turcajová. A matrix approach to discrete wavelets. In Wavelet Analysis and Its Applications, volume 5, pages 117-135. Elsevier, 1994.

[29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[30] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

[31] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[32] Stamatios Lefkimmiatis, John Paul Ward, and Michael Unser. Hessian Schatten-norm regularization for linear inverse problems. IEEE Transactions on Image Processing, 22(5):1873-1888, 2013.

[33] Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3794-3803, Long Beach, California, USA, 09-15 Jun 2019. PMLR.
URL http://proceedings.mlr.press/v97/lezcano-casado19a.html.

[34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

[35] John Milnor and James D Stasheff. Characteristic Classes (AM-76), volume 76, pages 55-57. Princeton University Press, 1974.

[36] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

[37] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pages 4785-4795, 2017.

[38] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1924-1932, Playa Blanca, Lanzarote, Canary Islands, 09-11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/pennington18a.html.

[39] Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.

[40] Haifeng Qian and Mark N. Wegman. L2-nonexpansive neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByxGSsR9FQ.

[41] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks.
arXiv preprint arXiv:1511.06434, 2015.

[42] Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox: A Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017. URL http://arxiv.org/abs/1707.04131.

[43] Hadi Salman, Greg Yang, Jerry Li, Pengchuan Zhang, Huan Zhang, Ilya Razenshteyn, and Sebastien Bubeck. Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584, 2019.

[44] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1EHOsC9tX.

[45] Hanie Sedghi, Vineet Gupta, and Philip M. Long. The singular values of convolutional layers. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJevYoA9Fm.

[46] Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265-4280.

[47] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[48] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, pages 6541-6550, 2018.

[49] Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, pages 8400-8409, 2018.

[50] Kai Y. Xiao, Vincent Tjeng, Nur Muhammad (Mahi) Shafiullah, and Aleksander Madry.
Training for faster adversarial robustness verification via inducing ReLU stability. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJfIVjAcKm.

[51] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5393-5402, Stockholmsmässan, Stockholm Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/xiao18a.html.

[52] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.