{"title": "Invariance-inducing regularization using worst-case transformations suffices to boost accuracy and spatial robustness", "book": "Advances in Neural Information Processing Systems", "page_first": 14785, "page_last": 14796, "abstract": "This work provides theoretical and empirical evidence that invariance-inducing regularizers can increase predictive accuracy for worst-case spatial transformations (spatial robustness). Evaluated on these adversarially transformed examples, standard and adversarial training with such regularizers achieves a relative error reduction of 20% for CIFAR-10 with the same computational budget. This even surpasses handcrafted spatial-equivariant networks. Furthermore, we observe for SVHN, known to have inherent variance in orientation, that robust training also improves standard accuracy on the test set. We prove that this no-trade-off phenomenon holds for adversarial examples from transformation groups.", "full_text": "Invariance-inducing regularization using worst-case\n\ntransformations suf\ufb01ces to boost accuracy and spatial robustness\n\nFanny Yang\u2020,\ufffd, Zuowen Wang\ufffd, Christina Heinze-Deml\ufffd\n\nStanford University\u2020, ETH Zurich\ufffd\n\n{fan.yang@stat.math.ethz.ch, wangzu@ethz.ch, heinzedeml@stat.math.ethz.ch}\n\nAbstract\n\nThis work provides theoretical and empirical evidence that invariance-inducing\nregularizers can increase predictive accuracy for worst-case spatial transformations\n(spatial robustness). Evaluated on these adversarially transformed examples, we\ndemonstrate that adding regularization on top of standard augmented or adversarial\ntraining reduces the relative robust error on CIFAR-10 by 20% with minimal\ncomputational overhead. Similar relative gains hold for SVHN and CIFAR-100.\nRegularized augmentation-based methods in fact even outperform handcrafted\nnetworks that were explicitly designed to be spatial-equivariant. 
Furthermore, we observe for SVHN, known to have inherent variance in orientation, that robust training also improves standard accuracy on the test set. We prove that this no-trade-off phenomenon holds for adversarial examples from transformation groups in the infinite data limit.\n\n1 Introduction\n\nAs deployment of machine learning systems in the real world has steadily increased over recent years, the trustworthiness of these systems has become a crucial requirement. This is particularly the case for safety-critical applications. For example, the vision system in a self-driving car should correctly classify an obstacle or human irrespective of their orientation. Besides being relevant from a security perspective, the ability to be invariant against small spatial transformations also helps to gauge interpretability and reliability of a model. If an image of a child rotated by 8° is classified as a trash can, can we really trust the system in the wild?\n\nAs neural networks have been shown to be expressive both theoretically [18, 4, 15] and empirically [47], in this work we study to what extent standard neural network predictors can be made invariant to small rotations and translations. In contrast to enforcing conventional invariance on entire group orbits, we weaken the goal to invariance on smaller so-called transformation sets. This requirement reflects the aim to be invariant to transformations that do not affect the labeling by a human. 
During test time we assess transformation set invariance by computing the prediction accuracy on the worst-case (adversarial) transformation in the (small) transformation set of each image in the test data. The higher this worst-case prediction accuracy of a model is, the more spatially robust we say it is. Importantly, we use the same terminology as in the very active field of adversarially robust learning [39, 29, 23, 33, 6, 26, 37, 38, 35, 43, 28], but we consider adversarial examples with respect to spatial instead of ℓp-transformations of an image.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nRecently, it was observed (see e.g. [11, 13, 34, 20, 14, 2, 10]) that worst-case prediction performance drops dramatically for neural network classifiers obtained using standard training, even for rather small transformation sets. In this context, we examine the effectiveness of regularization that explicitly encourages the predictor to be constant for transformed versions of the same image, which we refer to as being invariant on the transformation sets. Broadly speaking, there are two approaches to encourage invariance of neural network predictors. On the one hand, the relative simplicity of the mathematical model for rotations and translations has led to carefully hand-engineered architectures that incorporate spatial invariance directly [19, 24, 8, 27, 44, 42, 12, 40]. On the other hand, augmentation-based methods [3, 46] constitute an alternative approach to encourage desired invariances on transformation sets. Specifically, the idea is to augment the training data by a random or smartly chosen transformation of every image, for which the predictor output is enforced to be close to the output of the original image. 
The latter can be achieved by adding an invariance-inducing regularization term to the classification loss.\n\nWhile augmentation-based methods can be used out of the box whenever it is possible to generate samples in the transformation set of interest, it is unclear how they compare to architectures that are tuned for the particular type of transformation using prior knowledge. Studying robustness against spatial transformations in particular allows us to compare the robust performance of these two approaches, as spatial-equivariant networks have been somewhat successful in enforcing invariance. In contrast, this cannot be claimed for higher-dimensional ℓp-type perturbations. In the empirical sections of this paper, we hence want to explore the following questions:\n\n1. To what extent can augmentation- and regularization-based methods improve spatial robustness of common deep neural networks?\n\n2. How does augmentation-based invariance-inducing regularization perform in case of small spatial transformations compared to representative specialized architectures designed to achieve invariance against entire transformation groups?\n\nAs a justification for employing this form of invariance-inducing regularization, we prove in our theoretical Section 2 that when perturbations come from transformation groups, predictors that optimize the robust loss are in fact invariant on the set of transformed images. Although recent works show a fundamental trade-off between robust and standard accuracy in constructed ℓp perturbation settings [41, 48, 36], we additionally show that this is fundamentally different for spatial transformations due to their group structure.\n\nIn Section 4 we present our empirical findings and evaluate spatial robustness of various augmentation-based training methods for ResNet [16] architectures on SVHN [32], CIFAR-10 and CIFAR-100 [22] as described in Sec. 3. 
Across all datasets, we observe ∼20% relative adversarial error reduction for methods using invariance-inducing regularization compared to previous ones including standard adversarial training, with only negligible computational overhead. In fact, regularization can drastically reduce the required training time to reach a fixed robust accuracy (see Figure 2). Furthermore, we show that regularized augmentation-based methods outperform representative handcrafted networks that were explicitly designed for invariance against all group transformations.\n\n2 Theoretical results for invariance-inducing regularization\nIn this section, we first introduce our notion of transformation sets and formalize robustness against a small range of translations and rotations. We then prove that, on a population level, constraining or regularizing for transformation set invariance yields models that minimize the robust loss. Moreover, when the label distribution is constant on each transformation set, we show that the set of robust minimizers not only minimizes the natural loss but, under mild conditions on the distribution over the transformations, is even equivalent to the set of natural minimizers.\n\nAlthough the framework can be applied to general problems and transformation groups, we consider image classification for concreteness. In the following, X ∈ X ⊂ R^d is an image and Y ∈ R^p a one-hot vector for multiclass labels that follow a joint distribution P. The function f : R^d → R^p in a function space F (e.g. a deep neural network in our experiments) maps the input image to a logit vector that is then used for prediction via a softmax layer.\n\n2.1 Transformation sets\nInvariance with respect to spatial transformations is often thought of in terms of group equivariance of the representation and prediction. 
Instead of invariance with respect to all spatial transformations in a group, we impose a weaker requirement, that is, invariance against transformation sets, defined as follows. We denote by Gz a compact subset of images in the support of P that can be obtained by transformation of an image z ∈ X. Gz is called a transformation set. For example, in the case of rotations, the transformation set Gz corresponds to the set of observed images in a dataset that are different versions of the same image z, that can be obtained by small rotations of one another.\n\nBy the technical assumption on the space of real images that the sampling operator is bijective, the mapping z → Gz is bijective. We can hence define G, a set of transformation sets, by G = ∪_{z∈X} Gz for a given transformation group. Importantly, the bijectivity assumption also leads to Gz being disjoint for different images z ∈ X. The above definition is distribution dependent and G partitions the support X̃ of the distribution. More details on the aforementioned concepts and definitions can be found in Sec. A.1 in the Appendix.\n\nWe say that a function f is (transformation-)invariant if f(x) = f(x′) for all x, x′ ∈ U for all U ∈ G and denote the class of all such functions by V. Using this notation, fitting a model with high accuracy under worst-case "small" transformations of the input can be mathematically captured by the robust optimization formulation [5] of minimizing the robust loss\n\nLrob(f) := E_{X,Y} sup_{x′ ∈ G_X} ℓ(f(x′), Y)   (1)\n\nin some function space F. We call the solution of this problem the (spatially) robust minimizer. While adversarial training aims to optimize the empirical version of Eq. (1), the converged predictor might be far from the global population minimum, in particular in the case of nonconvex optimization landscapes encountered when training neural networks. 
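To make the robust loss (1) concrete, here is a minimal numerical sketch (not the paper's implementation; the 1-D "images", the linear two-class model and the finite transformation sets are invented for illustration) of evaluating the empirical robust loss when each set G_x is given as a finite list of transformed inputs:

```python
import numpy as np

def cross_entropy(logits, y):
    """Numerically stable cross-entropy of a logit vector for class index y."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[y])

def empirical_robust_loss(f, transformation_sets, labels):
    """Empirical version of L_rob(f) = E sup_{x' in G_X} loss(f(x'), Y):
    for every example, take the worst loss over its transformation set G_x."""
    worst = [max(cross_entropy(f(x), y) for x in G_x)
             for G_x, y in zip(transformation_sets, labels)]
    return sum(worst) / len(worst)

# Toy usage: a linear two-class "network" on 1-D inputs and one
# transformation set containing two transformed versions of an image.
f = lambda x: np.array([x.sum(), -x.sum()])
G = [[np.array([1.0]), np.array([0.5])]]
loss_value = empirical_robust_loss(f, G, [0])  # worst case is x' = 0.5
```

Adversarial training then descends on this quantity, whereas a standard loss would average `cross_entropy(f(x), y)` over the observed images only.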
Furthermore, we show in the following section that for robustness over transformation sets, constraining the model class to invariant functions leads to the same optimizer of the robust loss. These facts motivate invariance-inducing regularization, which we then show to exhibit improved robust test accuracy in practice.\n\n2.2 Regularization to encourage invariance\nFor any regularizer R, we define the corresponding constrained set of functions V(R) as\n\nV(R) := {f : R(f, x, y) = 0 ∀(x, y) ∈ supp(P)},\n\nwhere supp(P) denotes the support of P. When R(f, x, y) = sup_{x′ ∈ G_x} h(f(x), f(x′)) and h is a semimetric¹ on R^p, we have V(R) = V. We now consider constrained optimization problems of the form\n\nmin_{f∈F} E ℓ(f(X), Y)  s.t. f ∈ V(R),   (O1)\nmin_{f∈F} E sup_{x′ ∈ G_X} ℓ(f(x′), Y)  s.t. f ∈ V(R).   (O2)\n\nThe following theorem shows that (O1), (O2) are equivalent to (1) if the set of all invariant functions V is a subset of the function space F.\n\nTheorem 1. If V ⊆ F, all minimizers of the adversarial loss (1) are in V. If furthermore V(R) ⊆ V, any solution of the optimization problems (O1), (O2) minimizes the adversarial loss.\n\nThe proof of Theorem 1 can be found in the Appendix in Sec. A.2. Since exact projection onto the constrained set is in general not achievable for neural networks, an alternative method to induce invariance is to relax the constraints by only requiring f ∈ {f : R(f, x, y) ≤ ε ∀(x, y) ∈ supp(P)}. Using Lagrangian duality, (O1) and (O2) can then be rewritten in penalized form for some scalar λ > 0 as\n\nmin_{f∈F} Lnat(f; R, λ) := min_{f∈F} E ℓ(f(X), Y) + λ R(f, X, Y),   (2)\nmin_{f∈F} Lrob(f; R, λ) := min_{f∈F} E sup_{x′ ∈ G_X} ℓ(f(x′), Y) + λ R(f, X, Y).   (3)\n\nIn Sec. 
2.4 we discuss how ordinary adversarial training, and modified variants that have been proposed thereafter, can be viewed as special cases of Eqs. (2) and (3). On the other hand, the constrained regularization formulation corresponds to restricting the function space and is hence comparable with hand-crafted network architecture design as described in Sec. 3.1.\n\n¹The weaker notion of a semimetric satisfies almost all conditions for a metric without having to satisfy the triangle inequality.\n\n\f2.3 Trade-off between natural and robust accuracy\n\nEven though high robust accuracy (1) might be the main goal in some applications, one might wonder whether the robust minimizer exhibits lower accuracy on untransformed images (natural accuracy), defined as Lnat(f) := E_{X,Y} ℓ(f(X), Y) [41, 48]. In this section we address this question and identify the conditions for transformation set perturbations under which minimizing the robust loss does not lead to decreased natural accuracy. Notably, natural accuracy even increases under mild assumptions.\n\nOne reason why adversarial examples have attracted a lot of interest is because the prediction of a given classifier can change in a perturbation set in which all images appear the same to the human eye. Mathematically, in the case of transformation sets, the latter can be modeled by a property of the true distribution. Namely, it translates into the conditional distribution of Y given x, denoted by P_{Gx}, being constant for all x belonging to the same subset U ∈ G. In other words, Y is conditionally independent of X given GX, i.e. Y ⊥⊥ X | GX. Under this assumption the next theorem shows that there is no trade-off in natural accuracy for the transformation robust minimizer.\n\nTheorem 2 (Trade-off natural vs. robust accuracy). Under the assumption of Theorem 1 and if Y ⊥⊥ X | GX holds, the adversarial minimizer also minimizes the natural loss. 
If moreover, P_{Gz} has support Gz for every z ∈ X̃ and the loss ℓ is injective, then every minimizer of the natural loss also has to be invariant.\n\nAs a consequence, minimizing the constrained optimization problem (O1) could potentially help in finding the optimal solution to minimize standard test error. Practically, the assumption on the distribution of the transformation sets Gz corresponds to assuming non-zero inherent transformation variance in the natural distribution of the dataset. In practice, we indeed observe a boost in natural accuracy for robust invariance-inducing methods in Sec. 4 on SVHN, a commonly used benchmark dataset for spatial-equivariant networks for this reason.\n\nOne might wonder how this result relates to several recent publications such as [41, 48] that presented toy examples for which the ℓ∞ robust solution must have higher natural loss than the Bayes optimal solution even in the infinite data limit. On a fundamental level, ℓ∞ perturbation sets are of a different nature compared to transformation sets on generic distributions of X. In the distribution considered in [41, 48], there is no unique mapping from x ∈ X to a perturbation set and thus the conditional independence property does not hold in general.\n\n2.4 Different regularizers and practical implementation\n\nIn order to improve robustness against spatial transformations we consider different choices of R(f, x, y) in the regularized objectives (2) and (3) that we then compare empirically in Sec. 4. This allows us to view a number of variants of adversarial training in a unified framework. Broadly speaking, each approach listed below consists of first searching for an adversarial example according to some mechanism, which is then included in a regularizing function, often some weak notion of distance between the prediction at X and the new example. 
The following choices of regularizers involve the maximization of a regularizing function over the transformation set:\n\nRAT(f, X, Y) = sup_{x′ ∈ G_X} ℓ(f(x′), Y) − ℓ(f(X), Y)   (equivalent to [39, 26] for Lnat)\nRℓ2(f, X, Y) = sup_{x′ ∈ G_X} ‖f(X) − f(x′)‖²₂\nRKL(f, X, Y) = sup_{x′ ∈ G_X} DKL(f(x′), f(X))   (equivalent to [48] for Lnat)²\n\nwhere DKL is the KL divergence on the softmax of the (logit) vectors f ∈ R^p. In all cases we refer to the maximizer as an adversarial example that is found using defense mechanisms as discussed in Section 3.3. Note that for Rℓ2 and RKL the assumption V(R) ⊆ V in Theorem 1 is satisfied.\n\nInstead of performing a maximization of the regularizing function to find the adversarial example x′, we can also choose x′ in alternative ways. The following variants are explored in the paper, two of which are reminiscent of previous work:\n\nRALP(f, X, Y) = ‖f(x′) − f(X)‖²₂ with x′ = argmax_{u ∈ G_X} ℓ(f(u), Y)   (equivalent to [21])\nRKL-C(f, X, Y) = DKL(f(x′), f(X)) with x′ = argmax_{u ∈ G_X} ℓ(f(u), Y)\nRh-DA(f, X) = E_{x′ ∈ G_X} h(f(X), f(x′))   (similar to [17])\n\nThe last regularizer suggests using an additive penalty on top of data augmentation, with either one or even multiple random draws, where the penalty can be any of the above semimetrics h between f(X) and f(x′), such as the ℓ2 or DKL distance. Albeit suboptimal, the experimental results in Section 4 suggest that simply adding the additive regularization penalty on top of randomly drawn data matches general adversarial training in terms of robust prediction at a fraction of the computational cost. 
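As an illustration of this additive penalty, here is a hedged toy sketch (a NumPy stand-in, not the paper's TensorFlow code; h is taken to be the squared ℓ2 distance between logits) of the Rh-DA-regularized objective in the style of Eq. (2), for a single random draw x′ from the transformation set:

```python
import numpy as np

def cross_entropy(logits, y):
    """Numerically stable cross-entropy of a logit vector for class index y."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[y])

def regularized_augmentation_loss(f, x, x_aug, y, lam):
    """Penalized objective ce(f(x), y) + lam * h(f(x), f(x_aug)), with
    h = squared l2 distance between logits, i.e. an R_{h-DA}-style penalty
    for one random draw x_aug from the transformation set of x."""
    penalty = float(np.sum((f(x) - f(x_aug)) ** 2))
    return cross_entropy(f(x), y) + lam * penalty

# Toy usage with an identity "network" on 2-D logit inputs; x_aug plays
# the role of a randomly transformed version of x.
f = lambda x: x
x, x_aug = np.array([1.0, 0.0]), np.array([0.8, 0.0])
loss_plain = regularized_augmentation_loss(f, x, x_aug, y=0, lam=0.0)  # plain CE
loss_reg = regularized_augmentation_loss(f, x, x_aug, y=0, lam=1.0)
```

Setting lam = 0 recovers vanilla data augmentation, so the penalty term isolates exactly the invariance-inducing part of the objective.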
In addition, Theorem 2 suggests that even when the goal is to improve standard accuracy and one expects inherent variance of nuisance factors in the data distribution, it is likely helpful to use regularized data augmentation with Rh-DA instead of vanilla data augmentation. Empirically we observe this on the SVHN dataset in Section 4.\n\nAdversarial examples for spatial transformation sets Since GX is not a closed group and we do not even know whether the observation X lies at the boundary of GX or in the interior, we cannot solve the maximization constrained to GX in practice. However, for an appropriate choice of set S, we can instead minimize an upper bound of (1) which reads\n\nmin_{f∈F} E sup_{Δ ∈ S} ℓ(f(T(X, Δ)), Y) ≥ min_{f∈F} E sup_{x′ ∈ G_X} ℓ(f(x′), Y),   (4)\n\nwhere S is the set of transformations that we search over and T(X, Δ) denotes the transformed image with transformation Δ (see Sec. A.1 in the Appendix for an explicit construction of the transformation search set S). The left-hand side in (4) is hence what we aim to solve in practice, where the expectation is over the empirical joint distribution of X, Y. 
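A minimal sketch of approximating the inner maximization over Δ ∈ S by a finite search (the cyclic-shift "transformation" and the negative-logit loss below are placeholders; the paper searches over rotations and translations implemented with bilinear interpolation):

```python
import numpy as np

def transform(x, delta):
    """Placeholder for T(x, delta): cyclic shift of a 1-D signal.
    A real implementation would rotate/translate an image."""
    return np.roll(x, int(delta))

def worst_case_example(f, loss, x, y, deltas):
    """Approximate argmax_{delta in S} loss(f(T(x, delta)), y) over a
    finite set of candidate transformations (a grid, or k random draws
    as in a worst-of-k style defense)."""
    candidates = [transform(x, d) for d in deltas]
    losses = [loss(f(c), y) for c in candidates]
    return candidates[int(np.argmax(losses))]

# Toy usage: the "network" reads only the first coordinate, so shifting
# the signal away from position 0 raises the loss for class 0.
f = lambda x: np.array([x[0], -x[0]])
loss = lambda logits, y: -float(logits[y])
x_adv = worst_case_example(f, loss, np.array([3.0, 0.0, 0.0]), 0, [0, 1, 2])
```

The returned worst-case example can then be fed into the loss gradient (adversarial training) or into one of the regularizers above.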
The relaxation of GX to a range of transformations of X, that is {T(X, Δ) : Δ ∈ S}, is also used for the maximization within the regularizers. In Figure 1 one pair of example images is shown: the original image (panel (a)) is depicted along with a transformed version T(·, Δ) with Δ ∈ S (panel (b)) and the respective predictions by a standard neural network classifier.\n\n3 Experimental setup\nIn our experiments, we compare invariance-inducing regularization incorporated via various augmentation-based methods (as described in Section 2.4) used on standard networks and representative spatial equivariant networks trained using standard optimization procedures.\n\n3.1 Spatial equivariant networks\n\nWe compare the robust prediction accuracies from networks trained with the regularizers with three specialized architectures, designed to be equivariant against spatial transformations and translations: (a) G-ResNet44 (GRN) [8] using p4m convolutional layers (90 degree rotations, translations and mirror reflections) on CIFAR-10; (b) Equivariant Transformer Networks (ETN) [40], a generalization of Polar Transformer Networks (PTN) [12], on SVHN; and (c) Spatial Transformer Networks (STN) [19] on SVHN. A more comprehensive discussion of the literature on equivariant networks can be found in Sec. 5. We choose the architectures listed above based on availability of reproducible code and previously reported state-of-the-art standard accuracies on SVHN and CIFAR-10. We train GRN, STN and ETN using standard augmentation as described in Sec. 3.4 (std) and with random rotations in addition (std⋆). Out of curiosity we also trained a "two-stage" STN where we train the localization network separately in a supervised fashion. Specifically, we use a randomly transformed version of the training data, treating the transformation parameters as prediction targets. 
Details about the implementation and results can be found in Sec. B in the Appendix.\n\n3.2 Transformations\nThe transformations that we consider in Sec. 4 are small rotations (of up to 30°) and translations in two dimensions of up to 3 px, corresponding to approx. 10% of the image size.\n\nFigure 1: Example images and classifications by the Standard model. (a) An image that is correctly classified for most of the rotations in the considered grid. (b) One rotation for which the image shown in (a) is misclassified as "airplane".\n\nFor augmentation-based methods we need to generate such small transformations for a given test image. Although the definition of a transformation T(X, Δ) in the theoretical section using the corresponding continuous image functions is clean, we do not have access to the continuous function in practice since the mapping is in general not bijective. Instead, we use bilinear interpolation, as implemented in TensorFlow and in a differentiable version of a transformer [19] for first-order attack and defense methods.\n\nOn top of interpolation, rotation also creates edge artifacts at the boundaries, as the image is only sampled in a bounded set. The empty space that results from translating and rotating an image is filled with black pixels (constant padding) if not noted otherwise. Fig. 1 (b) shows an example. [11] additionally analyze a "black canvas" setting where the images are padded with zeros prior to applying the transformation, ensuring that no information is lost due to cropping. Their experiments show that the reduced accuracy of the models cannot be attributed to this effect. Since both versions yield similar results, we report results on the first version of pad and crop choices, having input images of the same size as the original.\n\n3.3 Attacks and defenses\n\nThe attacks and defenses we choose essentially follow the setup in [11]. 
The defense refers to the procedure at training time which aims to make the resulting model robust to adversarial examples. It generally differs from the (extensive) attack mechanism performed at evaluation time to assess the model's robustness, due to computational constraints.\n\nConsidered attacks First-order methods such as projected gradient descent that have proven to be most effective for ℓ∞ perturbations are not optimal for finding adversarial examples with respect to rotations and translations. In particular, our experiments confirm the observations reported in [11] that the strongest adversarial examples can be found through a grid search. For the grid search attack, the compact perturbation set S is discretized to find the transformation resulting in a misclassification with the largest loss ℓ. In contrast to the case of ℓ∞-adversarial examples, this method is computationally feasible for the 3-dimensional spatial parameters. We consider a default grid of 5 values per translation direction and 31 values for rotation, yielding 775 transformed examples that are evaluated for each Xi. We refer to the accuracy attained under this attack as grid accuracy.³\n\nConsidered defenses For the adversarial example which maximizes either the loss or the regularization function, we use the following defense mechanisms:\n\n• worst-of-k: At every iteration t, we sample k different perturbations for each image in the batch. The one resulting in the highest function value is used as the maximizer. Most of our experiments are conducted with k = 10, consistent with [11], as a higher k only improved performance minimally (see Table 5).\n\n• Spatial PGD: In analogy to common practice for ℓp adversarial training as in e.g. [39, 26], the S-PGD mechanism uses projected gradient descent with respect to the translation and rotation parameters, with projection onto the constrained set S of transformations. 
We consider 5 steps of PGD, starting from a random initialization, with step sizes of [0.03, 0.03, 0.3] (following [11]) for horizontal translation, vertical translation and rotation respectively. A discussion on the discrepancy between S-PGD as a defense and attack mechanism can be found in Section C.2.\n\n• Random: Data augmentation with a distinct random perturbation per image and iteration. This can be seen as the most naive "adversarial" example as it corresponds to worst-of-k with k = 1.\n\n³Since a finer grid of 7500 transformations showed only minor accuracy reductions for a subset of the experiments (summarized in Table 10), we chose the coarser grid for the entire set of experiments for faster computation.\n\n3.4 Training details\nThe experiments are conducted with deep neural networks as the function space F, and ℓ is the cross-entropy loss. In the main paper we consider the datasets SVHN [32] and CIFAR-10 [22]. For the non-specialized architectures, we train a ResNet-32 [16], implemented in TensorFlow [1]. For the Transformer networks STN and ETN we use a 3-layer CNN as localization network according to the default settings in the provided code of both networks for SVHN and rot-MNIST. In the Appendix we also report results for CIFAR-100 [22] using a ResNet-50 [16].\n\nWe train the baseline models with standard data augmentation: random left-right flips and random translations of ±4 px followed by normalization. Below we refer to the models trained in this fashion as "std". For the models trained with one of the defenses described in Sec. 3.3, we only apply random left-right flipping since translations are part of the adversarial search. The special case of data augmentation (with translations and rotations, i.e. 
the defense "random") without regularization is referred to as std⋆.\n\nFor optimization of the empirical training loss, we run standard minibatch SGD with a momentum term with parameter 0.9 and weight decay parameter 0.0002. We use an initial learning rate of 0.1 which is divided by 10 after half and three-quarters of the training steps. Independent of the defense method, we fix the number of iterations to 80000 for SVHN and CIFAR-10, and to 120000 for CIFAR-100. For comparability across all methods, the number of unique original images in each iteration is 64 in all cases. For the baselines std, std⋆ and adversarial training, we additionally trained with a conventional batch size of 128 and report the higher accuracy of both versions. For the regularized methods, the value of λ is chosen based on the test grid accuracy. All models are trained using a single GPU on a node equipped with an NVIDIA GeForce GTX 1080 Ti and two 10-core Xeon E5-2630v4 processors.\n\n4 Empirical Results\nWe now compare the natural test accuracy (standard accuracy on the test set, abbreviated as nat) and test grid accuracy (as defined in Sec. 3.3, abbreviated as rob) achieved by standard and regularized (adversarial) training techniques as well as the specialized spatial equivariant architectures described in Sec. 3.1. For clarity of presentation, we refer to our training procedures using the following defining factors: (a) Reg: refers to which regularizer was used (AT, ALP, ℓ2, KL, or KL-C as defined in Section 2.4); (b) batch: indicates whether the gradient of the loss is taken with respect to the adversarial examples (rob), natural examples (nat) or both (mix); and (c) def: the mechanism used to find the adversarial example, including random (rnd), worst-of-k (Wo-k) and spatial PGD (S-PGD) as described in Sec. 3.3. 
Thus, Reg (batch, def) corresponds to using Reg as the regularization function, the examples defined by batch in the gradient of the loss, and the defense mechanism def to find the augmented or adversarial examples.\n\nIn Table 1, we report results for a subset of the Reg (batch, def) combinations to facilitate comparisons. Tables with many more combinations can be found in Tables 4–8 in the Appendix. We report averages (standard deviations are contained in Tables 4–8) computed over five training runs with identical hyperparameter settings. We compare all methods by computing absolute and relative error reductions (the latter defined as the absolute error drop divided by the prior error). It is insightful to present both numbers since the absolute values vary drastically between datasets.\n\nEffectiveness of augmentation-based invariance-inducing regularization In Table 1 (top), the three leftmost columns represent unregularized methods, which all perform worse in grid accuracy than regularized methods, and the two rightmost columns represent adversarial examples with respect to the classification cross-entropy loss found via S-PGD. When considering the three regularizers (KL,\n\nFigure 2: Mean runtime for different methods on CIFAR-10. The connected points correspond to Wo-k defenses with k ∈ {1, 10, 20}. 
The exact numbers can be found in Table 6.\n\n\fTable 1: Mean accuracies of models for SVHN and CIFAR-10 trained with various forms of regularized adversarial training as well as standard augmentation techniques (top) and spatial equivariant networks (bottom). 
std⋆ denotes standard augmentation plus random rotations.

              std     std⋆    AT (rob,  KL (rob,  ℓ2 (rob,  ALP (rob,  KL-C (mix,  ALP (rob,
                              Wo-10)    Wo-10)    Wo-10)    Wo-10)     S-PGD)      S-PGD)
SVHN  (nat)   95.48   93.97   96.03     96.13     96.53     96.30      96.14       96.11
      (rob)   18.85   82.60   90.35     92.71     92.55     92.04      92.42       92.32
CIFAR (nat)   92.11   89.93   91.78     90.31     90.53     89.87      89.82       89.91
      (rob)    9.52   58.29   70.97     76.61     77.06     75.67      78.79       77.68

              GRN     GRN⋆    ETN       ETN⋆      STN       STN⋆
SVHN  (nat)   96.07   95.05   95.57     95.53     95.61     95.55
      (rob)   25.12   84.90   13.15     84.21     36.68     79.28
CIFAR (nat)   93.39   93.08
      (rob)   16.85   71.64

Table 2: Mean accuracies of models trained with various forms of regularized adversarial training. Left: all adversarial examples were found via Wo-10; right: unregularized (std⋆) and regularized data augmentation, where the optimum is bolded for each row.

              KL (nat,  ℓ2 (nat,  ALP (nat,  std⋆    ℓ2 (nat,  KL (nat,  ℓ2 (rob,  KL (rob,
              Wo-10)    Wo-10)    Wo-10)             rnd)      rnd)      rnd)      rnd)
SVHN  (nat)   96.00     96.05     96.39      93.97   96.34     96.16     96.09     96.23
      (rob)   92.27     92.16     91.98      82.60   90.51     90.69     90.48     90.92
CIFAR (nat)   90.63     88.32     88.55      89.93   87.80     89.19     88.75     89.43
      (rob)   77.18     75.64     75.06      58.29   71.60     73.32     71.49     73.32

ℓ2, ALP) with the same batch and def (here chosen to be "rob" and Wo-10), regularized adversarial training improves the grid accuracy from 70.97% to 77.06% on CIFAR-10 and from 90.35% to 92.71% on SVHN, corresponding to relative error reductions of 21% and 24%, respectively. The same can be observed when comparing data augmentation std⋆ with its regularized variants ℓ2(·, rnd), KL(·, rnd) in Table 2.
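As a quick arithmetic check, the relative error reductions quoted above can be recomputed directly from the accuracies in Table 1 (a minimal sketch; accuracies are in percent):

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative error reduction = absolute error drop / prior error,
    with error = 100 - accuracy (accuracies given in percent)."""
    prior_error = 100.0 - acc_before
    error_drop = acc_after - acc_before  # = prior error - new error
    return error_drop / prior_error

# CIFAR-10 grid accuracy: AT(rob, Wo-10) 70.97% -> l2(rob, Wo-10) 77.06%
print(f"{relative_error_reduction(70.97, 77.06):.0%}")  # 21%
# SVHN grid accuracy: AT(rob, Wo-10) 90.35% -> KL(rob, Wo-10) 92.71%
print(f"{relative_error_reduction(90.35, 92.71):.0%}")  # 24%
```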
Judging also from Table 5, S-PGD appears to be a more efficient defense mechanism than worst-of-k, even when k is raised to 20, at comparable computation time.

Computational considerations In Figure 2, we plot the grid accuracy vs. the runtime (in hours) for a subset of regularizers and defense mechanisms on CIFAR-10, for clarity of presentation. How much overhead is needed to obtain the reported gains? Comparing AT(rob, Wo-k) (green line) and ALP(rob, Wo-k) (red line) shows that significant improvements in grid accuracy can be achieved by regularization with only a small computational overhead. What if we make the defense stronger? While the leap in robust accuracy from Wo-1 (also referred to as rnd) to Wo-10 is quite large, increasing k to 20 only yields diminishing returns while requiring ∼3× more training time. This observation is illustrated for both the KL and ALP regularizers on CIFAR-10 in Table 6. Furthermore, for any fixed training time, regularized methods exhibit higher robust accuracies; the size of the gap varies with the particular choice of regularizer and defense mechanism.

Comparison with spatial equivariant networks Although the rotation-augmented G-ResNet44 obtains higher grid (SVHN: 84.9%, CIFAR-10: 71.64%) and natural accuracies (SVHN: 95.05%, CIFAR-10: 93.08%) than the rotation-augmented ResNet-32 on both SVHN (grid: 82.60%, nat: 93.97%) and CIFAR-10 (grid: 58.29%, nat: 89.93%), regularizing standard data augmentation (i.e., regularizers with "rnd", see Table 2 (right)) using both the ℓ2 distance and the KL divergence matches the G-ResNet44 on CIFAR-10 (ℓ2: 71.60%, KL: 73.32%) and surpasses it on SVHN in grid (ℓ2: 90.51%, KL: 90.69%) and natural accuracy, with a relative grid error reduction of ∼37%.
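For concreteness, the worst-of-k (Wo-k) defense compared in the runtime discussion above can be sketched as follows: sample k random transformation parameters, apply each to the input, and keep the copy with the highest loss. This is a hedged toy sketch, not the paper's implementation: `shift_only` (translations via `np.roll`, the sampled angle unused) and `toy_loss` stand in for the actual differentiable transformer and network loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def worst_of_k(x, y, loss_fn, transform_fn, k=10, max_angle=30.0, max_shift=3):
    """Worst-of-k spatial attack: sample k random (rotation, translation)
    parameters, apply each to x, and keep the candidate with the highest loss."""
    worst_x, worst_loss = x, loss_fn(x, y)
    for _ in range(k):
        angle = rng.uniform(-max_angle, max_angle)
        shift = tuple(rng.integers(-max_shift, max_shift + 1, size=2))
        x_t = transform_fn(x, angle, shift)
        loss = loss_fn(x_t, y)
        if loss > worst_loss:
            worst_x, worst_loss = x_t, loss
    return worst_x, worst_loss

# Toy stand-ins (hypothetical, for illustration only):
def shift_only(x, angle, shift):
    # translation-only transform; the rotation angle is ignored here
    return np.roll(x, shift, axis=(0, 1))

def toy_loss(x_t, x_ref):
    # l1 discrepancy against the clean image as a surrogate "loss"
    return float(np.abs(x_t - x_ref).mean())

x = np.zeros((8, 8))
x[3:5, 3:5] = 1.0  # a small synthetic "image"
x_adv, loss = worst_of_k(x, x, toy_loss, shift_only, k=10)
```

In the regularized variants, the returned worst-case example would additionally enter a penalty of the form λ·d(f(x), f(x_adv)) on top of the classification loss; Wo-1 corresponds to the rnd setting.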
The same phenomenon is observed for the augmented ETN and STN on SVHN.4 In conclusion, regularized augmentation-based methods match or outperform representative end-to-end networks handcrafted to be equivariant to spatial transformations.

Trade-off natural vs. adversarial accuracy SVHN is one of the main datasets (without artificial augmentation as in rot-MNIST [25]) where spatial equivariant networks have reported improvements in natural accuracy. This is due to the inherent orientation variance in the data. In our mathematical framework, this corresponds to the assumption in Theorem 2 that the distribution on the transformation sets has support Gz. Furthermore, as all numbers in SVHN have the same label irrespective of small rotations of at most 30 degrees, the first assumption in Theorem 2 is also fulfilled. Tables 1 and 2 confirm the statement of the theorem that improving robust accuracy need not hurt natural accuracy and may even improve it: for SVHN, adding regularization to samples obtained either via Wo-10 adversarial search or via random transformations (rnd) consistently improves not only robust but also standard accuracy.

Comparing the effects of different regularization parameters on test grid accuracy We study Tables 1 and 2 and attempt to disentangle the effects by varying only one parameter at a time. For example, we can observe that, computational cost aside, fixing the defense of any regularizer to Wo-10, the robust regularized loss Reg (rob, Wo-10) (i.e., Lrob(f; R)) does better (or not statistically significantly worse) than Reg (nat, Wo-10) (i.e., Lnat(f; R)). Furthermore, the KL regularizer generally performs better than ℓ2 across a large number of settings.

4 We had difficulty training both ETN and STN to more than 86% natural accuracy on CIFAR-10, even after an extensive learning rate and schedule search, so we do not report those numbers here.
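One relevant property of the two penalties is that, on the probability simplex, the KL divergence dominates the squared ℓ2 distance up to a constant: by Pinsker's inequality, DKL(p‖q) ≥ ½‖p−q‖₁² ≥ ½‖p−q‖₂². A quick numerical check of this bound on random simplex points:

```python
import numpy as np

def kl(p, q):
    """KL divergence between probability vectors (natural log)."""
    return float(np.sum(p * np.log(p / q)))

def sq_l2(p, q):
    """Squared l2 distance between probability vectors."""
    return float(np.sum((p - q) ** 2))

rng = np.random.default_rng(0)
for _ in range(1000):
    # Dirichlet samples are random points on the probability simplex
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    # Pinsker: KL(p||q) >= 0.5*||p-q||_1^2 >= 0.5*||p-q||_2^2
    assert kl(p, q) >= 0.5 * sq_l2(p, q)
```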
A possible explanation for the latter could be that DKL upper bounds the squared ℓ2 loss on the probability simplex and is hence more restrictive.

Choice of λ The optimal λ in terms of grid accuracy depends on the regularization method. However, the regularized predictors outperform unregularized methods over a large range of λ values (see Figures 4 and 5 in the Appendix), suggesting that effective values of λ are not difficult to find in practice.

There are many more interesting experiments that we have conducted for subsets of the defenses and datasets, illustrating different phenomena that we observe. For example, we have analyzed a finer grid for the grid-search attack and evaluated S-PGD as an attack mechanism. A detailed discussion of these experiments can be found in Sec. C.2.

5 Related work

Group equivariant networks There are, in general, two types of approaches to incorporate spatial invariance into the network. In one of the earlier works in the neural-network era, Spatial Transformer Networks [19] were introduced, which include a localization module that predicts transformation parameters, followed by a differentiable transformer that applies them. Later on, one line of work proposed multiple filters that are discrete group transformations of each other [24, 27, 8, 50, 44].
For continuous transformations, steerability-based [42, 9] and coordinate-transformation-based [12, 40] approaches have been suggested. Although these approaches have led to improved standard accuracy, it has not been rigorously studied whether, or by how much, they improve upon regular networks with respect to robust test accuracy.

Regularized training Using penalty regularization to encourage robustness and invariance when training neural networks has been studied in different contexts: for distributional robustness [17], domain generalization [30], ℓp adversarial training [31, 21, 48], robustness against simple transformations [7], and semi-supervised learning [49, 45]. These approaches are based on augmenting the training data either statically [17, 30, 7, 45], i.e., before fitting the model, or adaptively in the sense of adversarial training, with different augmented examples generated per training image in every iteration [21, 31, 48].

6 Conclusion

In this work, we have explored how regularized augmentation-based methods compare against specialized spatial equivariant networks in terms of robustness against small translations and rotations. Strikingly, even though augmentation can be applied to encourage any desired invariance, the regularized methods adapt well and perform similarly to or better than the specialized networks. Furthermore, we have introduced a theoretical framework incorporating many forms of regularization techniques that have been proposed in the literature. Both theoretically and empirically, we showed that for transformation invariances and under certain practical assumptions on the distribution, there is no trade-off between natural and adversarial accuracy, which stands in contrast to the debate around ℓp-perturbation sets. In summary, it is advantageous to replace unregularized with regularized training for both augmentation and adversarial defense methods.
With regard to the choice of the regularization parameter, we have seen that improvements can be obtained for a large range of λ values, indicating that this additional hyperparameter is not difficult to tune in practice. In future work, we aim to explore whether specialized architectures can be combined with regularized adversarial training to improve upon the best results reported in this work.

7 Acknowledgements

We thank Luzius Brogli for initial experiments, Nicolai Meinshausen and Armeen Taeb for valuable feedback on the manuscript, and Ludwig Schmidt for helpful discussions. FY was supported by the Institute for Theoretical Studies ETH Zurich, the Dr. Max Rössler and Walter Haefner Foundation, ETH Foundations of Data Science, and the Office of Naval Research Young Investigator Award N00014-19-1-2288. ZW was supported by the ETH Foundations of Data Science.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Michael A Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, and Anh Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. arXiv preprint arXiv:1811.11553, 2018.

[3] Henry S Baird. Document image defect models. In Structured Document Image Analysis, pages 546–556. Springer, 1992.

[4] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[5] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization, volume 28. Princeton University Press, 2009.

[6] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Proceedings of the IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.

[7] Gong Cheng, Junwei Han, Peicheng Zhou, and Dong Xu.
Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection. IEEE Transactions on Image Processing, 28(1):265–278, 2019.

[8] Taco Cohen and Max Welling. Group equivariant convolutional networks. In Proceedings of the International Conference on Machine Learning, pages 2990–2999, 2016.

[9] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In Proceedings of the International Conference on Learning Representations, 2018.

[10] Beranger Dumont, Simona Maggio, and Pablo Montalvo. Robustness of rotation-equivariant networks to adversarial perturbations. arXiv preprint arXiv:1802.06627, 2018.

[11] Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. Exploring the landscape of spatial robustness. In Proceedings of the International Conference on Machine Learning, 2019.

[12] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. In Proceedings of the International Conference on Learning Representations, 2018.

[13] A. Fawzi and P. Frossard. Manitest: Are classifiers really invariant? In British Machine Vision Conference (BMVC), 2015.

[14] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems, pages 7549–7561, 2018.

[15] Boris Hanin. Universal function approximation by deep neural nets with bounded width and relu activations. arXiv preprint arXiv:1708.02691, 2017.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness.
arXiv preprint arXiv:1710.11469, 2017.

[18] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[19] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial Transformer Networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[20] Can Kanbak, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Geometric robustness of deep networks: analysis and improvement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4441–4449, 2018.

[21] Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial Logit Pairing. arXiv preprint arXiv:1803.06373, 2018.

[22] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report 4, University of Toronto, 2009.

[23] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[24] Dmitry Laptev, Nikolay Savinov, Joachim M Buhmann, and Marc Pollefeys. TI-POOLING: Transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 289–297, 2016.

[25] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pages 473–480, 2007.

[26] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In Proceedings of the International Conference on Learning Representations, 2018.

[27] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks.
In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5067, 2017.

[28] Matthew Mirman, Timon Gehr, and Martin Vechev. Differentiable abstract interpretation for provably robust neural networks. In Proceedings of the International Conference on Machine Learning, pages 3575–3583, 2018.

[29] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

[30] Saeid Motiian, Marco Piccirilli, Donald A Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision, volume 2, page 3, 2017.

[31] Taesik Na, Jong Hwan Ko, and Saibal Mukhopadhyay. Cascade adversarial machine learning regularized with a unified embedding. In Proceedings of the International Conference on Learning Representations, 2018.

[32] Y. Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, page 5, 2011.

[33] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the ACM Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.

[34] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. Towards practical verification of machine learning: The case of computer vision systems. arXiv preprint arXiv:1712.01785, 2017.

[35] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples.
In Proceedings of the International Conference on Learning Representations, 2018.

[36] Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C. Duchi, and Percy Liang. Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032, 2019.

[37] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In Proceedings of the International Conference on Learning Representations, 2018.

[38] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In Proceedings of the International Conference on Learning Representations, 2018.

[39] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations, 2014.

[40] Kai Sheng Tai, Peter Bailis, and Gregory Valiant. Equivariant Transformer Networks. In Proceedings of the International Conference on Machine Learning, 2019.

[41] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In Proceedings of the International Conference on Learning Representations, 2019.

[42] Maurice Weiler, Fred A Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[43] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the International Conference on Machine Learning, pages 5283–5292, 2018.

[44] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow.
Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5028–5037, 2017.

[45] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.

[46] Larry S. Yaeger, Richard F. Lyon, and Brandyn J. Webb. Effective training of a neural network character classifier for word recognition. In Advances in Neural Information Processing Systems, pages 807–816, 1997.

[47] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations, 2017.

[48] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In Proceedings of the International Conference on Machine Learning, 2019.

[49] Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.

[50] Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 519–528, 2017.