NeurIPS 2020

GradAug: A New Regularization Method for Deep Neural Networks

Review 1

Summary and Contributions: The paper proposes a new regularization method called GradAug to better train deep neural networks. The main contributions are: 1. A multi-forward method is proposed that feeds differently augmented data to different sub-networks; the authors view the method as a gradient augmentation technique. 2. Extensive experiments on image classification, object detection, instance segmentation, adversarial attack and the low-data setting are conducted to demonstrate the effectiveness of the proposed method.

Strengths: 1. The authors propose to regularize neural nets by running forward and backward passes over a combination of multiple data augmentations. Different data augmentations go into different sub-networks, and a sub-network is sampled simply by keeping the first w (w\in[0,1]) fraction of the filters and output channels of each layer. The whole procedure is introduced as a special form of gradient augmentation. 2. The authors show improvements from the proposed method on many different tasks and datasets. Beyond the accuracy of recognition, detection and segmentation, model robustness to adversarial samples is also improved.
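
The width-based sub-network sampling described in point 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `kept_channels` and `slice_weight` are hypothetical, the weight is a plain 2D list (output channels by input channels), and rounding to the nearest channel count is an assumption.

```python
def kept_channels(total_channels, width_ratio):
    """Number of leading channels kept for a width ratio in (0, 1].

    Rounding to the nearest integer (with at least one channel kept)
    is an assumption made for this sketch.
    """
    if not 0.0 < width_ratio <= 1.0:
        raise ValueError("width_ratio must be in (0, 1]")
    return max(1, round(total_channels * width_ratio))


def slice_weight(weight, width_ratio, slice_inputs=True):
    """Return the sub-network view of a 2D weight (out_channels x in_channels).

    The first rows (filters) are always kept; the first columns (input
    channels) are kept too unless slice_inputs is False, which is the
    case for a first layer that reads the raw input.
    """
    k_out = kept_channels(len(weight), width_ratio)
    k_in = kept_channels(len(weight[0]), width_ratio) if slice_inputs else len(weight[0])
    return [row[:k_in] for row in weight[:k_out]]


# Toy 4x4 weight whose entries encode (row, column) as row * 10 + column.
full = [[r * 10 + c for c in range(4)] for r in range(4)]
half = slice_weight(full, 0.5)  # keeps the top-left 2x2 block
```

Because every sub-network is a leading slice of the same weights, a wider sub-network always contains the parameters of a narrower one.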

Weaknesses: 1. The training time and memory cost could increase several-fold. 2. Since training is more time-consuming, a comparison with other solutions, such as mimicking- or distillation-based methods, is suggested. 3. Only a very simple sub-network sampling strategy is considered. What about randomly choosing a sub-network each time, or keeping most of the network fixed and randomly choosing some portion? 4. Only a very simple data augmentation (different input training sizes) is considered in the sub-network training. What about other choices? ====== post rebuttal ======= I misunderstood some important points in the paper in my original review comments. 1. The difference between GradAug and GradAug+: I now understand that GradAug+ (GradAug with CutMix augmentation) does achieve SOTA results on several tasks. For example, for ImageNet classification, the 79.6% top-1 accuracy is SOTA as far as I know. 2. Experiments show that GradAug needs fewer epochs to converge to a good result, and this alleviates my concern about the time cost to some extent. 3. The idea also seems to work well in the stochastic depth setting.

Correctness: The method looks reasonable, but more experiments such as different sub-network sampling and data augmentation strategies are suggested.

Clarity: The paper is well written and easy to understand.

Relation to Prior Work: The related work is clearly introduced and compared against.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: After the rebuttal and discussion with other reviewers I have updated my score. However, I do point out several concerns of mine for which the authors could consider further validation. It's good that the authors performed the time/memory comparison in the rebuttal, as that was a significant concern of mine. My concerns mostly revolve around which other techniques this should be compared against. Given that this algorithm takes 3-4x the time compared with the baseline, I could, for example: 1. Train a much larger network and then use compression techniques to slim it to the same size. 2. Train an even larger network at a lower quantization, then use compression. 3. Train a larger network with competing regularization techniques (e.g. Mixup, which is still ~70% faster; see rebuttal), then use compression. Is it sufficiently true that this approach is in fact *not* aliasing the idea of training a larger network, then compressing it? Is it sufficiently true that the novelty of the approach (i.e. subnetwork learning) cannot be captured by training a larger network, then compressing? It would be nice to see these questions answered with convincing validation, and my score would certainly be higher if they were. This paper proposes a novel form of regularization dubbed 'GradAug'. GradAug consists of two components. The first is structured subsampling of neural networks, similar to dropout. The second is a 'self guided gradient augmentation' technique. GradAug iterates upon previous work such as dropout/zoneout/freezeout to improve the robustness of neural networks by training them as 'ensembles' of smaller subnetworks. Some insight is provided into the underlying operating mechanism behind GradAug. Validation is also performed, demonstrating improved test performance on CIFAR10/ImageNet. The general applicability of the technique is demonstrated by testing on other image tasks. Finally, it is demonstrated that GradAug improves the robustness of the network to adversarial attacks and image corruption.

Strengths: The empirical results shown by the approach are quite strong, and certainly demonstrate the possible validity of the approach. There is some amount of novelty in the approach; however, it builds upon previous work that considers a network as an ensemble of smaller subnetworks (i.e. dropout et al.).

Weaknesses: There remain several weaknesses in the paper, mostly centering around thoroughness of validation and inadequate comparison to other (semi-related) work. In particular, I found the explanation of how the two proposed components of GradAug work to be lacking. It is not sufficiently explained (e.g. in 3.2) *how* the structured subnetwork sampling, as well as the 'self guided gradient augmentation', aid in building a robust network. Although I agree the empirical results are impressive, I'm not sure how this approach works outside of imagining it as some sort of 'Dropout, but better' technique. From what I can tell, the 'self guided gradient augmentation' prefers networks which are robust (i.e. preferring little to no change in the output, given scaling/transformations according to Eq. 4) to certain types of transformations (e.g. scaling, rotation, translation etc.). This should improve robustness; however, the authors claim this is some sort of 'self guiding'. It's not exactly clear what is meant by this. The authors need to better explain exactly what the self guiding is hoped to achieve, and also to furnish evidence through empirical experiments, or perhaps proof/symbolic intuition. It is also not well explored how the structured subnetwork training improves upon Dropout, and how it aids in training and regularizing networks. The authors mention that 'a larger sub-network always share the representations of a smaller sub-network in a weights-sharing training fashion, so it can leverage the representations learned in smaller sub-networks.' What precisely is meant by this? I'm not sure I understand, or believe, that such an assertion is true. This is not elaborated upon in later sections or empirically in the validation. Two additional weaknesses of the paper concern its performance cost and its relation to network compression. First, there is no mention or evaluation of the performance cost of GradAug. From what I can tell, GradAug requires several forward passes for each step of training. How significantly does this affect memory/training time? This information would have helped in evaluating this work, as typical regularization methods are usually cheap to apply. Second, in the proposed approach, several forward passes are made using subnetworks to provide gradients for a single backward pass. This can be thought of as distilling the knowledge of a more expressive, powerful network into a smaller one. In this way the approach can be viewed as somewhat analogous to neural network compression. Although this analogy is not perfect, it possibly deserves further comparison. Overall, although this paper shows strong empirical results, it has not fully explored its proposed approach. For this reason, I don't believe it is ready for publication.

Correctness: Overall I agree with the central claim that GradAug improves performance and robustness as demonstrated. However the mechanism of action remains unclear, and is not well validated in the paper. Thus some of the claims regarding how GradAug improves robustness/generalization are not well validated.

Clarity: The paper is written 'ok'. Certainly some of it can be improved using clearer text. In particular I found the assertions and claims made in 3.2 difficult to understand. What does it mean for the 'disturbances' to be self guiding? I'm not sure what I should take away from this section as to the overall operating mechanism of GradAug as it is not clearly stated.

Relation to Prior Work: This is mostly done, however I have some concerns which I have elaborated upon in the weaknesses (see second to last paragraph).

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper proposes a new regularization method that leverages differently transformed inputs to regularize a set of sub-networks derived from the full network. The idea is that sub-networks should recognize transformed images as the same objects as the full network does. The authors analyze its effect from the gradient view and conduct thorough experiments to validate the idea. The method is demonstrated to outperform state-of-the-art methods on different tasks.

Strengths: 1. The idea of leveraging differently transformed images to regularize the sub-networks, thereby providing self-guided augmentation to the gradients, is novel and interesting. It is simple to implement and can be combined with other regularization schemes. Overall, the paper is well presented and the proposed idea is clearly conveyed. 2. The analysis of the gradient properties of different methods is reasonable and useful for understanding the differences between this work and other regularization techniques. The claim is also validated in the experiments. 3. The experimental results are very promising, based on a set of experiments across a range of tasks. Specifically, the authors demonstrate that state-of-the-art methods can hardly improve downstream tasks and are not effective in the low-data setting, while the proposed method is effective in these tasks.

Weaknesses: 1. The authors may need more experiments to show the effect of different transformations. Though in the supplementary the authors experimented with random rotation and the combination of random scale and rotation, more experiments on different models and larger datasets would make it more convincing. There might be a space issue, but I think this experiment should also be put in the main paper rather than the supplementary. 2. The claim on model robustness to adversarial attack may be too strong. FGSM is just one type of adversarial attack. To claim general model robustness, the authors may need more experiments with different adversarial attacks. While I understand that the proposed method is not focused on adversarial attack, it would still be more precise to claim robustness to the FGSM attack based on the experiments in the paper. 3. Although it is stated that the training procedure of the proposed GradAug is similar to that of a regular network, the training time for each epoch might be longer due to the sub-networks. It would be good to include such discussion and analysis. 4. In the ImageNet classification experiment, the images are randomly resized to one of {224, 192, 160, 128}. It is not clear why these particular resolutions are selected. How would the image resolution and the number of resolutions that the sub-networks can choose from affect the performance?
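
To make the question in point 4 concrete, the per-sub-network random resizing can be sketched as below. Only the resolution set {224, 192, 160, 128} comes from the review; the function name, the number of sub-networks, the minimum width, and the uniform sampling of the width ratio are all assumptions made for illustration.

```python
import random

# Resolutions quoted in the review for the ImageNet experiment.
RESOLUTIONS = (224, 192, 160, 128)


def sample_subnetwork_configs(num_subnetworks, min_width, rng):
    """Draw one (width ratio, input resolution) pair per sub-network.

    Uniform sampling of the width in [min_width, 1.0] is a hypothetical
    choice for this sketch, not a detail confirmed by the reviews.
    """
    configs = []
    for _ in range(num_subnetworks):
        configs.append({
            "width_ratio": rng.uniform(min_width, 1.0),
            "resolution": rng.choice(RESOLUTIONS),
        })
    return configs


# Hypothetical setting: three sub-networks, widths no smaller than 0.8.
configs = sample_subnetwork_configs(3, 0.8, random.Random(0))
```

An ablation of the kind the reviewer asks for would amount to varying `RESOLUTIONS` (both its values and its length) while holding everything else fixed.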

Correctness: The method is clear and straightforward and appears correct. However, as mentioned in the weakness section, the claim on model robustness to adversarial attack may be too strong without more experiments on other adversarial attack approaches. It would be more precise to claim the robustness to FGSM attack based on the experiments.
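
For reference, FGSM perturbs an input by a single step of size eps along the sign of the loss gradient with respect to that input. The sketch below applies it to a toy logistic-regression model rather than an image classifier; the model, its weights, and all function names are assumptions chosen only to keep the example self-contained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    # Returns -1, 0 or 1.
    return (v > 0) - (v < 0)


def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def fgsm(x, y, w, b, eps):
    """One FGSM step: x_adv = x + eps * sign(dLoss/dx).

    For the logistic model above, dLoss/dx = (p - y) * w.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Toy example with hypothetical weights.
w, b = [1.0, -2.0], 0.0
x, y = [0.5, 0.5], 1
x_adv = fgsm(x, y, w, b, eps=0.1)
```

Since FGSM takes only one gradient step, robustness to it does not by itself imply robustness to stronger iterative attacks, which is the point of the reviewer's remark.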

Clarity: The paper is clearly written and well structured, and the figures and algorithm pseudo-code help in the understanding.

Relation to Prior Work: The distinctions between this work and prior research are clearly discussed. In particular, the gradient flow analysis of different regularization techniques presented in Sec. 3.2 provides strong evidence on the differences.

Reproducibility: Yes

Additional Feedback: =====post-rebuttal=======: After reading through the rebuttal and considering the other reviewers' comments, I again feel this paper is a very good submission. I found the work to be clear and well reasoned, and the results to be impressive on various tasks. The most significant contribution would be the idea of regularizing the sub-networks with transformed input, which is new and interesting. In the rebuttal, the authors have run additional experiments demonstrating that (1) the training cost (memory and time) of the proposed GradAug is comparable with that of state-of-the-art regularization approaches, which is attributed to the faster convergence of GradAug; (2) GradAug is able to generalize to sub-networks generated by shrinking the full-network’s depth (R4’s suggestion), which also reveals the flexibility of the framework (another plus of GradAug). I agree that the strategy for generating the sub-networks is a good direction to explore further, and it probably deserves a thorough theoretical analysis and experimental investigation, which, in my opinion, is beyond the scope of this paper, whose purpose is to describe the general idea and framework. Overall, I think this is a decent research work in terms of the method and experimental evaluation. Developing regularization techniques from both the data and network aspects might be worth exploring further.

Review 4

Summary and Contributions: This paper proposes a new regularization method, GradAug. During training, GradAug trains sub-networks with various input transformations, and aggregates the losses generated by the full network and sub-networks.
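
The loss aggregation summarized above can be sketched as follows. This is one plausible reading, not the paper's confirmed formulation: it assumes the full network is trained with cross-entropy against the ground-truth label, and that each sub-network (fed a transformed input) is trained with a KL term against the full network's softmax output, in line with the 'soft labels' trick mentioned under Weaknesses. All function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])


def kl_divergence(target_probs, logits):
    """KL(target || softmax(logits))."""
    probs = softmax(logits)
    return sum(t * math.log(t / p) for t, p in zip(target_probs, probs) if t > 0)


def aggregated_loss(full_logits, sub_logits_list, label):
    """Full-network loss plus one soft-label term per sub-network.

    Assumption: the full network's output acts as a fixed soft target,
    so in a real framework it would be detached from the gradient graph.
    """
    soft_target = softmax(full_logits)
    loss = cross_entropy(full_logits, label)
    for sub_logits in sub_logits_list:
        loss += kl_divergence(soft_target, sub_logits)
    return loss


# Hypothetical logits: one full network and two sub-networks, 3 classes.
full_logits = [2.0, 0.5, -1.0]
sub_logits_list = [[1.5, 0.7, -0.5], [1.0, 1.0, 0.0]]
total = aggregated_loss(full_logits, sub_logits_list, label=0)
```

A single backward pass on the summed loss then accumulates gradients from the full network and every sub-network into the shared weights.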

Strengths: 1. This method significantly boosts performance on image classification and object detection tasks.

Weaknesses: 1. Similarity with slimmable networks [*]: Defining sub-networks using a width ratio is quite similar to slimmable networks, and many training tricks were also adopted from the slimmable network paper. Certainly, applying data augmentation to each sub-network is different, but this training scheme is not new. How critical are the two adopted training tricks (soft labels and smallest sub-network) to the performance of GradAug? How much does the accuracy on ImageNet degrade without these tricks? How about applying GradAug on top of a slimmable network? Would it work better or not? Conversely, can a GradAug-trained network be pruned as in slimmable networks? [*] Universally Slimmable Networks and Improved Training Techniques, ICCV 2019.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The sub-networks of GradAug are sampled in a manner similar to slimmable networks. How about utilizing stochastic depth [*], which randomly drops residual blocks to make depth-shrunk sub-networks, to sample sub-nets for GradAug? It would be great if the GradAug scheme could be generalized like this. [*] Deep Networks with Stochastic Depth, ECCV 2016. ======= Post-rebuttal comment ======== After reading the rebuttal and the other reviewers' comments, I am inclined to keep my original rating (6). I'm happy to see the authors conducting depth-shrunk sub-network experiments, which addresses one of my major concerns (generalization of the method). I think all the additional experiments in the rebuttal should be conducted on the ImageNet dataset and included in the final copy to make the paper more convincing and rigorous.