{"title": "Implicit Semantic Data Augmentation for Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 12635, "page_last": 12644, "abstract": "In this paper, we propose a novel implicit semantic data augmentation (ISDA) approach to complement traditional augmentation techniques like flipping, translation or rotation. Our work is motivated by the intriguing property that deep networks are surprisingly good at linearizing features, such that certain directions in the deep feature space correspond to meaningful semantic transformations, e.g., adding sunglasses or changing backgrounds. As a consequence, translating training samples along many semantic directions in the feature space can effectively augment the dataset to improve generalization. To implement this idea effectively and efficiently, we first perform an online estimate of the covariance matrix of deep features for each class, which captures the intra-class semantic variations. Then random vectors are drawn from a zero-mean normal distribution with the estimated covariance to augment the training data in that class. Importantly, instead of augmenting the samples explicitly, we can directly minimize an upper bound of the expected cross-entropy (CE) loss on the augmented training set, leading to a highly efficient algorithm. In fact, we show that the proposed ISDA amounts to minimizing a novel robust CE loss, which adds negligible extra computational cost to a normal training procedure. Despite its simplicity, ISDA consistently improves the generalization performance of popular deep models (ResNets and DenseNets) on a variety of datasets, e.g., CIFAR-10, CIFAR-100 and ImageNet. 
Code for reproducing our results is available at https://github.com/blackfeather-wang/ISDA-for-Deep-Networks.", "full_text": "Implicit Semantic Data Augmentation for Deep Networks\n\nYulin Wang1\u2217 Xuran Pan1\u2217 Shiji Song1 Hong Zhang2 Cheng Wu1 Gao Huang1\u2020\n\n1Department of Automation, Tsinghua University, Beijing, China\nBeijing National Research Center for Information Science and Technology (BNRist)\n2Baidu Inc., China\n\n{yulin.bh, fykalviny}@gmail.com, pxr18@mails.tsinghua.edu.cn,\n{shijis, wuc, gaohuang}@tsinghua.edu.cn\n\nAbstract\n\nIn this paper, we propose a novel implicit semantic data augmentation (ISDA) approach to complement traditional augmentation techniques like flipping, translation or rotation. Our work is motivated by the intriguing property that deep networks are surprisingly good at linearizing features, such that certain directions in the deep feature space correspond to meaningful semantic transformations, e.g., adding sunglasses or changing backgrounds. As a consequence, translating training samples along many semantic directions in the feature space can effectively augment the dataset to improve generalization. To implement this idea effectively and efficiently, we first perform an online estimate of the covariance matrix of deep features for each class, which captures the intra-class semantic variations. Then random vectors are drawn from a zero-mean normal distribution with the estimated covariance to augment the training data in that class. Importantly, instead of augmenting the samples explicitly, we can directly minimize an upper bound of the expected cross-entropy (CE) loss on the augmented training set, leading to a highly efficient algorithm. In fact, we show that the proposed ISDA amounts to minimizing a novel robust CE loss, which adds negligible extra computational cost to a normal training procedure. 
Despite its simplicity, ISDA consistently improves the generalization performance of popular deep models (ResNets and DenseNets) on a variety of datasets, e.g., CIFAR-10, CIFAR-100 and ImageNet. Code for reproducing our results is available at https://github.com/blackfeather-wang/ISDA-for-Deep-Networks.\n\n1 Introduction\n\nData augmentation is an effective technique to alleviate the overfitting problem in training deep networks [1, 2, 3, 4, 5]. In the context of image recognition, this usually corresponds to applying content-preserving transformations, e.g., cropping, horizontal mirroring, rotation and color jittering, to the input samples. Although effective, these augmentation techniques are not capable of performing semantic transformations, such as changing the background of an object or the texture of a foreground object. Recent work has shown that data augmentation can be more powerful if (class identity preserving) semantic transformations are allowed [6, 7, 8]. For example, by training a generative adversarial network (GAN) for each class in the training set, one could sample an infinite number of images from the generator. Unfortunately, this procedure is computationally intensive, because training generative models and running inference on them to obtain augmented samples are\n\n\u2217Equal contribution.\n\u2020Corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: An overview of ISDA. 
Inspired by the observation that certain directions in the feature space\ncorrespond to meaningful semantic transformations, we augment the training data semantically by\ntranslating their features along these semantic directions, without involving auxiliary deep networks.\nThe directions are obtained by sampling random vectors from a zero-mean normal distribution with\ndynamically estimated class-conditional covariance matrices. In addition, instead of performing\naugmentation explicitly, ISDA boils down to minimizing a closed-form upper-bound of the expected\ncross-entropy loss on the augmented training set, which makes our method highly ef\ufb01cient.\n\nboth nontrivial tasks. Moreover, due to the extra augmented data, the training procedure is also likely\nto be prolonged.\n\nIn this paper, we propose an implicit semantic data augmentation (ISDA) algorithm for training deep\nimage recognition networks. The ISDA is highly ef\ufb01cient as it does not require training/inferring\nauxiliary networks or explicitly generating extra training samples. Our approach is motivated by the\nintriguing observation made by recent work showing that the features deep in a network are usually\nlinearized [9, 10]. Speci\ufb01cally, there exist many semantic directions in the deep feature space, such\nthat translating a data sample in the feature space along one of these directions results in a feature\nrepresentation corresponding to another sample with the same class identity but different semantics.\nFor example, a certain direction corresponds to the semantic translation of \"make-bespectacled\".\nWhen the feature of a person, who does not wear glasses, is translated along this direction, the\nnew feature may correspond to the same person but with glasses (The new image can be explicitly\nreconstructed using proper algorithms as shown in [9]). 
Therefore, by searching for many such semantic directions, we can effectively augment the training set in a way complementary to traditional data augmentation techniques.\n\nHowever, explicitly finding semantic directions is not a trivial task and usually requires extensive human annotations [9]. In contrast, sampling directions randomly is efficient but may result in meaningless transformations. For example, it makes no sense to apply the \u201cmake-bespectacled\u201d transformation to the \u201ccar\u201d class. In this paper, we adopt a simple method that achieves a good balance between effectiveness and efficiency. Specifically, we perform an online estimate of the covariance matrix of the features for each class, which captures the intra-class variations. Then we sample directions from a zero-mean multivariate normal distribution with the estimated covariance, and apply them to the features of training samples in that class to augment the dataset. In this way, the chance of generating meaningless semantic transformations is significantly reduced.\n\nTo further improve efficiency, we derive a closed-form upper bound of the expected cross-entropy (CE) loss under the proposed data augmentation scheme. Therefore, instead of performing the augmentation procedure explicitly, we can directly minimize the upper bound, which is, in fact, a novel robust loss function. As there is no need to generate explicit data samples, we call our algorithm implicit semantic data augmentation (ISDA). Compared to existing semantic data augmentation algorithms, the proposed ISDA can be conveniently implemented on top of most deep models without introducing auxiliary models or noticeable extra computational cost.\n\nDespite its simplicity, the proposed ISDA algorithm is surprisingly effective, and complements existing non-semantic data augmentation techniques quite well. 
Extensive empirical evaluations on several competitive image classification benchmarks show that ISDA consistently improves the generalization performance of popular deep networks, especially when training data is scarce or powerful traditional augmentation techniques are already applied.\n\n2 Related Work\n\nIn this section, we briefly review existing research on related topics.\n\nData augmentation is a widely used technique to alleviate overfitting in training deep networks. For example, in image recognition tasks, data augmentation techniques like random flipping, mirroring and rotation are applied to enforce certain invariances in convolutional networks [4, 5, 3, 11]. Recently, automatic data augmentation techniques, e.g., AutoAugment [12], have been proposed to search for a better augmentation strategy among a large pool of candidates. Similar to our method, learning with marginalized corrupted features [13] can be viewed as an implicit data augmentation technique, but it is limited to simple linear models. Complementarily, recent research shows that semantic data augmentation techniques, which apply class identity preserving transformations (e.g., changing backgrounds of objects or varying visual angles) to the training data, are effective as well [14, 15, 6, 8]. This is usually achieved by generating extra semantically transformed training samples with specialized deep structures such as DAGAN [8], domain adaptation networks [15] or other GAN-based generators [14, 6]. Although effective, these approaches are nontrivial to implement and computationally expensive, due to the need to train generative models beforehand and to run inference on them during training.\n\nRobust loss function. As shown in this paper, ISDA amounts to minimizing a novel robust loss function. Therefore, we give a brief review of related work on this topic. Recently, several robust loss functions have been proposed for deep learning. 
For example, the Lq loss [16] is a noise-robust loss that balances the cross-entropy (CE) loss and the mean absolute error (MAE) loss, derived from the negative Box-Cox transformation. Focal loss [17] attaches high weights to a sparse set of hard examples to prevent the vast number of easy samples from dominating the training of the network. The idea of introducing a large margin into the CE loss has been proposed in [18, 19, 20]. In [21], the CE loss and the contrastive loss are combined to learn more discriminative features. From a similar perspective, center loss [22] simultaneously learns a center for the deep features of each class and penalizes the distances between the samples and their corresponding class centers in the feature space, enhancing intra-class compactness and inter-class separability.\n\nSemantic transformations in deep feature space. Our work is motivated by the fact that high-level representations learned by deep convolutional networks can potentially capture abstractions with semantics [23, 10]. In fact, translating deep features along certain directions has been shown to correspond to performing meaningful semantic transformations on the input images. For example, deep feature interpolation [9] leverages simple interpolations of deep features from pre-trained neural networks to achieve semantic image transformations. Variational autoencoder (VAE) and generative adversarial network (GAN) based methods [24, 25, 26] establish a latent representation corresponding to the abstractions of images, which can be manipulated to edit the semantics of images. 
Generally, these methods reveal that certain directions in the deep feature space correspond to\nmeaningful semantic transformations, and can be leveraged to perform semantic data augmentation.\n\n3 Method\n\nDeep networks are known to excel at forming high-level representations in the deep feature space\n[4, 5, 9, 27], where the semantic relations between samples can be captured by the relative positions\nof their features [10]. Previous work has demonstrated that translating features towards speci\ufb01c\ndirections corresponds to meaningful semantic transformations when the features are mapped to the\ninput space [9, 28, 10]. Based on this observation, we propose to directly augment the training data\nin the feature space, and integrate this procedure into the training of deep models.\n\nThe proposed implicit semantic data augmentation (ISDA) has two important components, i.e., online\nestimation of class-conditional covariance matrices and optimization with a robust loss function.\nThe \ufb01rst component aims to \ufb01nd a distribution from which we can sample meaningful semantic\ntransformation directions for data augmentation, while the second saves us from explicitly generating\na large amount of extra training data, leading to remarkable ef\ufb01ciency compared to existing data\naugmentation techniques.\n\n3.1 Semantic Transformations in Deep Feature Space\nAs aforementioned, certain directions in the deep feature space correspond to meaningful semantic\ntransformations like \u201cmake-bespectacled\u201d or \u2018change-view-angle\u2019. This motivates us to augment\nthe training set by applying such semantic transformations on deep features. However, manually\nsearching for semantic directions is infeasible for large scale problems. 
To address this problem, we propose to approximate the procedure by sampling random vectors from a normal distribution with zero mean and a covariance that is proportional to the intra-class covariance matrix, which captures the variance of samples in that class and is thus likely to contain rich semantic information.\n\nIntuitively, features for the person class may vary along the \u201cwear-glasses\u201d direction, while having nearly zero variance along the \u201chas-propeller\u201d direction, which only occurs for other classes like the plane class. We hope that directions corresponding to meaningful transformations for each class are well represented by the principal components of the covariance matrix of that class.\n\nConsider training a deep network $G$ with weights $\Theta$ on a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{1, \ldots, C\}$ is the label of the $i$-th sample $x_i$ over $C$ classes. Let the $A$-dimensional vector $a_i = [a_{i1}, \ldots, a_{iA}]^T = G(x_i, \Theta)$ denote the deep features of $x_i$ learned by $G$, and let $a_{ij}$ denote the $j$-th element of $a_i$.\n\nTo obtain semantic directions to augment $a_i$, we randomly sample vectors from a zero-mean multivariate normal distribution $\mathcal{N}(0, \Sigma_{y_i})$, where $\Sigma_{y_i}$ is the class-conditional covariance matrix estimated from the features of all the samples in class $y_i$. In our implementation, the covariance matrix is computed in an online fashion by aggregating statistics from all mini-batches. The online estimation algorithm is given in Section A of the supplementary material.\n\nDuring training, $C$ covariance matrices are computed, one for each class. The augmented feature $\tilde{a}_i$ is obtained by translating $a_i$ along a random direction sampled from $\mathcal{N}(0, \lambda\Sigma_{y_i})$. Equivalently, we have\n\n$\tilde{a}_i \sim \mathcal{N}(a_i, \lambda\Sigma_{y_i}), \quad (1)$\n\nwhere $\lambda$ is a positive coefficient that controls the strength of semantic data augmentation. 
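For illustration, the per-class statistics and the sampling step of Eq. (1) can be sketched in NumPy as follows. This is a minimal sketch with hypothetical dimensions and random stand-in features, not the paper's implementation; the exact online estimator used by ISDA is given in its Appendix A.

```python
import numpy as np

# Hypothetical dimensions: A-dim features, C classes (stand-ins, not the paper's sizes).
A, C = 4, 3
rng = np.random.default_rng(0)

# Running per-class statistics aggregated over mini-batches.
counts = np.zeros(C)
means = np.zeros((C, A))
covs = np.zeros((C, A, A))

def update_class_stats(features, labels):
    """Merge each class's batch mean/covariance into the running estimates
    (standard parallel-variance combination of two sample sets)."""
    for c in range(C):
        f = features[labels == c]
        if len(f) == 0:
            continue
        n_old, n_new = counts[c], len(f)
        mu_new = f.mean(axis=0)
        cov_new = np.cov(f, rowvar=False, bias=True) if n_new > 1 else np.zeros((A, A))
        n = n_old + n_new
        delta = mu_new - means[c]
        covs[c] = (n_old * covs[c] + n_new * cov_new) / n \
                  + (n_old * n_new / n**2) * np.outer(delta, delta)
        means[c] = (n_old * means[c] + n_new * mu_new) / n
        counts[c] = n

# One mini-batch of deep features a_i = G(x_i, Theta) (random stand-ins here).
feats = rng.normal(size=(32, A))
labels = rng.integers(0, C, size=32)
update_class_stats(feats, labels)

# Eq. (1): augmented feature  a~_i ~ N(a_i, lambda * Sigma_{y_i}).
lam = 0.5
a_i, y_i = feats[0], labels[0]
a_tilde = rng.multivariate_normal(a_i, lam * covs[y_i])
```

After a single batch the running estimates simply equal the batch statistics; subsequent batches are merged without storing past features.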
As the covariances are computed dynamically during training, the estimates in the first few epochs are not very informative, since the network is not yet well trained. To address this issue, we let $\lambda = (t/T) \times \lambda_0$ be a function of the current iteration $t$, so as to reduce the impact of the estimated covariances early in the training stage.\n\n3.2 Implicit Semantic Data Augmentation (ISDA)\n\nA naive method to implement ISDA is to explicitly augment each $a_i$ for $M$ times, forming an augmented feature set $\{(a_i^1, y_i), \ldots, (a_i^M, y_i)\}_{i=1}^{N}$ of size $MN$, where $a_i^k$ is the $k$-th copy of the augmented features for sample $x_i$. The networks are then trained by minimizing the cross-entropy (CE) loss:\n\n$\mathcal{L}_M(W, b, \Theta) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{M}\sum_{k=1}^{M} -\log\left(\frac{e^{w_{y_i}^T a_i^k + b_{y_i}}}{\sum_{j=1}^{C} e^{w_j^T a_i^k + b_j}}\right), \quad (2)$\n\nwhere $W = [w_1, \ldots, w_C]^T \in \mathbb{R}^{C \times A}$ and $b = [b_1, \ldots, b_C]^T \in \mathbb{R}^{C}$ are the weight matrix and bias vector of the final fully connected layer, respectively.\n\nObviously, the naive implementation is computationally inefficient when $M$ is large, as the feature set is enlarged by $M$ times. In the following, we consider the case where $M$ grows to infinity, and find that an easy-to-compute upper bound can be derived for the loss function, leading to a highly efficient implementation.\n\nUpper bound of the loss function. In the case $M \to \infty$, we are in fact considering the expectation of the CE loss under all possible augmented features. Specifically, $\mathcal{L}_\infty$ is given by\n\n$\mathcal{L}_\infty(W, b, \Theta|\Sigma) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\tilde{a}_i}\left[-\log\left(\frac{e^{w_{y_i}^T \tilde{a}_i + b_{y_i}}}{\sum_{j=1}^{C} e^{w_j^T \tilde{a}_i + b_j}}\right)\right]. \quad (3)$\n\nIf $\mathcal{L}_\infty$ could be computed efficiently, we could directly minimize it without explicitly sampling augmented features. However, Eq. (3) is difficult to compute in its exact form. 
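To make the cost of the explicit scheme concrete, the following sketch evaluates the naive loss of Eq. (2) by actually drawing $M$ augmented copies per sample. All shapes, the classifier $(W, b)$ and the covariance matrices are hypothetical random stand-ins, not the paper's setup; the point is that the work grows linearly with $M$.

```python
import numpy as np

rng = np.random.default_rng(1)
A, C, N = 4, 3, 8            # hypothetical sizes

# Random stand-ins for deep features, labels, classifier and per-class covariances.
feats = rng.normal(size=(N, A))
labels = rng.integers(0, C, size=N)
W, b = rng.normal(size=(C, A)), rng.normal(size=C)
Ls = rng.normal(size=(C, A, A))
Sigmas = Ls @ Ls.transpose(0, 2, 1)   # Sigma_c = L_c L_c^T is positive semi-definite

def ce(logits, y):
    """Numerically stable cross-entropy -log softmax(logits)[y]."""
    z = logits - logits.max()
    return -(z[y] - np.log(np.exp(z).sum()))

def naive_isda_loss(lam, M=500):
    """Eq. (2): average the CE loss over M explicit copies a~ ~ N(a, lam*Sigma_y)
    per training feature; cost scales linearly with M."""
    total = 0.0
    for a_i, y_i in zip(feats, labels):
        copies = rng.multivariate_normal(a_i, lam * Sigmas[y_i], size=M)
        total += np.mean([ce(W @ a + b, y_i) for a in copies])
    return total / N

loss_M = naive_isda_loss(lam=0.5)   # Monte Carlo estimate of L_infinity as M grows
```

With $\lambda = 0$ no augmentation happens and the naive loss collapses to the plain CE loss, which is a convenient sanity check for the sketch.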
Alternatively, we find that it is possible to derive an easy-to-compute upper bound for $\mathcal{L}_\infty$, as given by the following proposition.\n\nProposition 1. Suppose that $\tilde{a}_i \sim \mathcal{N}(a_i, \lambda\Sigma_{y_i})$. Then we have an upper bound of $\mathcal{L}_\infty$, given by\n\n$\mathcal{L}_\infty(W, b, \Theta|\Sigma) \le \frac{1}{N}\sum_{i=1}^{N} -\log\left(\frac{e^{w_{y_i}^T a_i + b_{y_i}}}{\sum_{j=1}^{C} e^{w_j^T a_i + b_j + \frac{\lambda}{2}(w_j - w_{y_i})^T \Sigma_{y_i}(w_j - w_{y_i})}}\right) \triangleq \overline{\mathcal{L}}_\infty. \quad (4)$\n\nProof. According to the definition of $\mathcal{L}_\infty$ in (3), we have\n\n$\mathcal{L}_\infty(W, b, \Theta|\Sigma) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\tilde{a}_i}\left[\log\left(\sum_{j=1}^{C} e^{(w_j^T - w_{y_i}^T)\tilde{a}_i + (b_j - b_{y_i})}\right)\right] \quad (5)$\n\n$\le \frac{1}{N}\sum_{i=1}^{N} \log\left(\sum_{j=1}^{C} \mathbb{E}_{\tilde{a}_i}\left[e^{(w_j^T - w_{y_i}^T)\tilde{a}_i + (b_j - b_{y_i})}\right]\right) \quad (6)$\n\n$= \frac{1}{N}\sum_{i=1}^{N} \log\left(\sum_{j=1}^{C} e^{(w_j^T - w_{y_i}^T)a_i + (b_j - b_{y_i}) + \frac{\lambda}{2}(w_j - w_{y_i})^T \Sigma_{y_i}(w_j - w_{y_i})}\right) \quad (7)$\n\n$= \overline{\mathcal{L}}_\infty. \quad (8)$\n\nIn the above, Inequality (6) follows from Jensen's inequality $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$, as the logarithmic function $\log(\cdot)$ is concave. Eq. (7) is obtained by leveraging the moment-generating function\n\n$\mathbb{E}[e^{tX}] = e^{t\mu + \frac{1}{2}\sigma^2 t^2}, \quad X \sim \mathcal{N}(\mu, \sigma^2),$\n\ndue to the fact that $(w_j^T - w_{y_i}^T)\tilde{a}_i + (b_j - b_{y_i})$ is a Gaussian random variable, i.e.,\n\n$(w_j^T - w_{y_i}^T)\tilde{a}_i + (b_j - b_{y_i}) \sim \mathcal{N}\left((w_j^T - w_{y_i}^T)a_i + (b_j - b_{y_i}),\ \lambda(w_j - w_{y_i})^T \Sigma_{y_i}(w_j - w_{y_i})\right).$\n\nEssentially, Proposition 1 provides a surrogate loss for our implicit data augmentation algorithm. Instead of minimizing the exact loss function $\mathcal{L}_\infty$, we can optimize its upper bound $\overline{\mathcal{L}}_\infty$ in a much more efficient way. Therefore, the proposed ISDA boils down to a novel robust loss function, which can be easily adopted by most deep models. 
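Numerically, the surrogate loss of Eq. (4) is just a cross-entropy over logits augmented with a class-dependent quadratic term. A minimal NumPy sketch (hypothetical shapes and random stand-ins for the features, classifier and covariances, not the paper's implementation) is:

```python
import numpy as np

rng = np.random.default_rng(2)
A, C, N = 4, 3, 8            # hypothetical sizes

# Random stand-ins for deep features, labels, classifier and per-class covariances.
feats = rng.normal(size=(N, A))
labels = rng.integers(0, C, size=N)
W, b = rng.normal(size=(C, A)), rng.normal(size=C)
Ls = rng.normal(size=(C, A, A))
Sigmas = Ls @ Ls.transpose(0, 2, 1)   # positive semi-definite Sigma_c
lam = 0.5

def isda_loss(feats, labels, W, b, Sigmas, lam):
    """Upper bound of Eq. (4): CE over logits shifted by the per-class term
    (lam/2) * (w_j - w_y)^T Sigma_y (w_j - w_y), which is zero for j = y."""
    total = 0.0
    for a, y in zip(feats, labels):
        logits = W @ a + b                  # w_j^T a + b_j
        diff = W - W[y]                     # rows: w_j - w_y (zero row at j = y)
        quad = 0.5 * lam * np.einsum('ja,ab,jb->j', diff, Sigmas[y], diff)
        z = logits + quad
        z -= z.max()                        # numerical stability
        total += -(z[y] - np.log(np.exp(z).sum()))
    return total / N

bound = isda_loss(feats, labels, W, b, Sigmas, lam)

def mc_loss(M=2000):
    """Monte Carlo estimate of the exact expected loss L_infinity of Eq. (3)."""
    total = 0.0
    for a, y in zip(feats, labels):
        copies = rng.multivariate_normal(a, lam * Sigmas[y], size=M)
        z = copies @ W.T + b
        z -= z.max(axis=1, keepdims=True)
        total += np.mean(np.log(np.exp(z).sum(axis=1)) - z[np.arange(M), y])
    return total / N
```

On random data one can check that the surrogate coincides with the standard CE loss at $\lambda = 0$ and stays above a Monte Carlo estimate of $\mathcal{L}_\infty$, as Proposition 1 states.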
In addition, we can observe that when $\lambda \to 0$, which means no features are augmented, $\overline{\mathcal{L}}_\infty$ reduces to the standard CE loss.\n\nAlgorithm 1 The ISDA Algorithm.\n1: Input: $\mathcal{D}$, $\lambda_0$\n2: Randomly initialize $W$, $b$ and $\Theta$\n3: for $t = 0$ to $T$ do\n4: Sample a mini-batch $\{x_i, y_i\}_{i=1}^{B}$ from $\mathcal{D}$\n5: Compute $a_i = G(x_i, \Theta)$\n6: Estimate the covariance matrices $\Sigma_1, \Sigma_2, \ldots, \Sigma_C$\n7: Compute $\overline{\mathcal{L}}_\infty$ according to Eq. (4)\n8: Update $W$, $b$, $\Theta$ with SGD\n9: end for\n10: Output: $W$, $b$ and $\Theta$\n\nIn summary, the proposed ISDA can simply be plugged into deep networks as a robust loss function, and efficiently optimized with the stochastic gradient descent (SGD) algorithm. We present the pseudo code of ISDA in Algorithm 1. Details of estimating covariance matrices and computing gradients are presented in Appendix A.\n\n4 Experiments\n\nIn this section, we empirically validate the proposed algorithm on several widely used image classification benchmarks, i.e., CIFAR-10, CIFAR-100 [1] and ImageNet [29]. We first evaluate the effectiveness of ISDA with different deep network architectures on these datasets. Second, we apply several recently proposed non-semantic image augmentation methods in addition to the standard baseline augmentation, and investigate the performance of ISDA. Third, we present comparisons with state-of-the-art robust loss functions and generator-based semantic data augmentation algorithms. Finally, ablation studies are conducted to examine the effectiveness of each component. We also visualize the augmented samples in the original input space with the aid of a generative network.\n\n4.1 Datasets and Baselines\n\nDatasets. We use three image recognition benchmarks in the experiments. 
(1) The two CIFAR datasets consist of 32x32 colored natural images, covering 10 classes for CIFAR-10 and 100 classes for CIFAR-100, with 50,000 images for training and 10,000 images for testing. In our experiments, we hold out 5,000 images from the training set as a validation set to search for the hyper-parameter $\lambda_0$. These samples are also used for training after the optimal $\lambda_0$ is selected, and the results on the test set are reported. Images are normalized with channel means and standard deviations for preprocessing. For the non-semantic data augmentation of the training set, we follow the standard operation in [30]: 4 pixels are padded at each side of the image, followed by a random 32x32 cropping combined with random horizontal flipping. (2) ImageNet is a 1,000-class dataset from ILSVRC 2012 [29], providing 1.2 million images for training and 50,000 images for validation. We adopt the same augmentation configurations as in [2, 4, 5].\n\nNon-semantic augmentation techniques. To study how ISDA complements traditional data augmentation methods, two state-of-the-art non-semantic augmentation techniques are applied, with and without ISDA. (1) Cutout [31] randomly masks out square regions of the input during training to regularize the model. (2) AutoAugment [32] automatically searches for the best augmentation policies to yield the highest validation accuracy on a target dataset. All hyper-parameters are the same as reported in the papers introducing them.\n\nTable 1: Evaluation of ISDA on CIFAR with different models. The average test error over the last 10 epochs is calculated in each experiment, and we report mean values and standard deviations in three independent experiments. 
The best results are bold-faced.\n\nMethod | Params | CIFAR-10 | CIFAR-100\nResNet-32 [4] | 0.5M | 7.39 \u00b1 0.10% | 31.20 \u00b1 0.41%\nResNet-32 + ISDA | 0.5M | 7.09 \u00b1 0.12% | 30.27 \u00b1 0.34%\nResNet-110 [4] | 1.7M | 6.76 \u00b1 0.34% | 28.67 \u00b1 0.44%\nResNet-110 + ISDA | 1.7M | 6.33 \u00b1 0.19% | 27.57 \u00b1 0.46%\nSE-ResNet-110 [33] | 1.7M | 6.14 \u00b1 0.17% | 27.30 \u00b1 0.03%\nSE-ResNet-110 + ISDA | 1.7M | 5.96 \u00b1 0.21% | 26.63 \u00b1 0.21%\nWide-ResNet-16-8 [34] | 11.0M | 4.25 \u00b1 0.18% | 20.24 \u00b1 0.27%\nWide-ResNet-16-8 + ISDA | 11.0M | 4.04 \u00b1 0.29% | 19.91 \u00b1 0.21%\nWide-ResNet-28-10 [34] | 36.5M | 3.82 \u00b1 0.15% | 18.53 \u00b1 0.07%\nWide-ResNet-28-10 + ISDA | 36.5M | 3.58 \u00b1 0.15% | 17.98 \u00b1 0.15%\nResNeXt-29, 8x64d [35] | 34.4M | 3.86 \u00b1 0.14% | 18.16 \u00b1 0.13%\nResNeXt-29, 8x64d + ISDA | 34.4M | 3.67 \u00b1 0.12% | 17.43 \u00b1 0.25%\nDenseNet-BC-100-12 [5] | 0.8M | 4.90 \u00b1 0.08% | 22.61 \u00b1 0.10%\nDenseNet-BC-100-12 + ISDA | 0.8M | 4.54 \u00b1 0.07% | 22.10 \u00b1 0.34%\nDenseNet-BC-190-40 [5] | 25.6M | 3.52% | 17.74%\nDenseNet-BC-190-40 + ISDA | 25.6M | 3.24% | 17.42%\n\nTable 2: Evaluation of ISDA with state-of-the-art non-semantic augmentation techniques. \u2018AA\u2019 refers to AutoAugment [32]. We report mean values and standard deviations in three independent experiments. 
The best results are bold-faced.\n\nDataset | Networks | Cutout [31] | Cutout + ISDA | AA [32] | AA + ISDA\nCIFAR-10 | Wide-ResNet-28-10 [34] | 2.99 \u00b1 0.06% | 2.83 \u00b1 0.04% | 2.65 \u00b1 0.07% | 2.56 \u00b1 0.01%\nCIFAR-10 | Shake-Shake (26, 2x32d) [36] | 3.16 \u00b1 0.09% | 2.93 \u00b1 0.03% | 2.89 \u00b1 0.09% | 2.68 \u00b1 0.12%\nCIFAR-10 | Shake-Shake (26, 2x112d) [36] | 2.36% | 2.25% | 2.01% | 1.82%\nCIFAR-100 | Wide-ResNet-28-10 [34] | 18.05 \u00b1 0.25% | 17.02 \u00b1 0.11% | 16.60 \u00b1 0.40% | 15.62 \u00b1 0.32%\nCIFAR-100 | Shake-Shake (26, 2x32d) [36] | 18.92 \u00b1 0.21% | 18.17 \u00b1 0.08% | 17.50 \u00b1 0.19% | 17.21 \u00b1 0.33%\nCIFAR-100 | Shake-Shake (26, 2x112d) [36] | 17.34 \u00b1 0.28% | 16.24 \u00b1 0.20% | 15.21 \u00b1 0.20% | 13.87 \u00b1 0.26%\n\nBaselines. Our method is compared to several baselines, including state-of-the-art robust loss functions and generator-based semantic data augmentation methods. (1) Dropout [37] is a widely used regularization approach that randomly mutes some neurons during training. (2) Large-margin softmax loss [18] introduces a large decision margin, measured by a cosine distance, into the standard CE loss. (3) Disturb label [38] is a regularization mechanism that randomly replaces a fraction of labels with incorrect ones in each iteration. (4) Focal loss [17] focuses on a sparse set of hard examples to prevent easy samples from dominating the training procedure. (5) Center loss [22] simultaneously learns a center of features for each class and minimizes the distances between the deep features and their corresponding class centers. (6) Lq loss [16] is a noise-robust loss function based on the negative Box-Cox transformation. (7) For generator-based semantic augmentation methods, we train several state-of-the-art GANs [39, 40, 41, 42], which are then used to generate extra training samples for data augmentation. For a fair comparison, all methods are implemented with the same training configurations whenever possible. 
Details for hyper-parameter settings are presented in Appendix B.\n\nTraining details. For deep networks, we implement ResNet, SE-ResNet, Wide-ResNet, ResNeXt, DenseNet and PyramidNet on CIFAR, and ResNet and ResNeXt on ImageNet. Detailed configurations for these models are given in Appendix B. The hyper-parameter $\lambda_0$ for ISDA is selected from the set {0.1, 0.25, 0.5, 0.75, 1} according to the performance on the validation set. On ImageNet, due to GPU memory limitations, we approximate the covariance matrices by their diagonals, i.e., the variance of each dimension of the features. The best hyper-parameter $\lambda_0$ is selected from {1, 2.5, 5, 7.5, 10}.\n\n4.2 Main Results\n\nTable 1 presents the performance of several state-of-the-art deep networks with and without ISDA. It can be observed that ISDA consistently improves the generalization performance of these models, especially with fewer training samples per class. On CIFAR-100, for relatively small models like ResNet-32 and ResNet-110, ISDA reduces test errors by about 1%, while for larger models like\n\nFigure 2: Curves of test errors on CIFAR-100 with Wide-ResNet (WRN), comparing WRN-28-10, WRN-28-10 + ISDA, WRN-28-10 + AA, and WRN-28-10 + AA + ISDA.\n\nFigure 3: Training and test errors of ResNet-152 on ImageNet, with and without ISDA.\n\nTable 3: Comparisons with the state-of-the-art methods. We report mean values and standard deviations of the test error in three independent experiments. 
Best results are bold-faced.\n\nMethod | ResNet-110 / CIFAR-10 | ResNet-110 / CIFAR-100 | Wide-ResNet-28-10 / CIFAR-10 | Wide-ResNet-28-10 / CIFAR-100\nLarge Margin [18] | 6.46 \u00b1 0.20% | 28.00 \u00b1 0.09% | 3.69 \u00b1 0.10% | 18.48 \u00b1 0.05%\nDisturb Label [38] | 6.61 \u00b1 0.04% | 28.46 \u00b1 0.32% | 3.91 \u00b1 0.10% | 18.56 \u00b1 0.22%\nFocal Loss [17] | 6.68 \u00b1 0.22% | 28.28 \u00b1 0.32% | 3.62 \u00b1 0.07% | 18.22 \u00b1 0.08%\nCenter Loss [22] | 6.38 \u00b1 0.20% | 27.85 \u00b1 0.10% | 3.76 \u00b1 0.05% | 18.50 \u00b1 0.25%\nLq Loss [16] | 6.69 \u00b1 0.07% | 28.78 \u00b1 0.35% | 3.78 \u00b1 0.08% | 18.43 \u00b1 0.37%\nWGAN [39] | 6.63 \u00b1 0.23% | - | 3.81 \u00b1 0.08% | -\nCGAN [40] | 6.56 \u00b1 0.14% | 28.25 \u00b1 0.36% | 3.84 \u00b1 0.07% | 18.79 \u00b1 0.08%\nACGAN [41] | 6.32 \u00b1 0.12% | 28.48 \u00b1 0.44% | 3.81 \u00b1 0.11% | 18.54 \u00b1 0.05%\ninfoGAN [42] | 6.59 \u00b1 0.12% | 27.64 \u00b1 0.14% | 3.81 \u00b1 0.05% | 18.44 \u00b1 0.10%\nBasic | 6.76 \u00b1 0.34% | 28.67 \u00b1 0.44% | - | -\nBasic + Dropout | 6.23 \u00b1 0.11% | 27.11 \u00b1 0.06% | 3.82 \u00b1 0.15% | 18.53 \u00b1 0.07%\nISDA | 6.33 \u00b1 0.19% | 27.57 \u00b1 0.46% | - | -\nISDA + Dropout | 5.98 \u00b1 0.20% | 26.35 \u00b1 0.30% | 3.58 \u00b1 0.15% | 17.98 \u00b1 0.15%\n\nWide-ResNet-28-10 and ResNeXt-29, 8x64d, our method outperforms the competitive baselines by nearly 0.7%. Compared to ResNets, DenseNets generally suffer less from overfitting due to their architecture design, and thus appear to benefit less from our algorithm.\n\nTable 2 shows experimental results with the recently proposed powerful traditional image augmentation methods (i.e., Cutout [31] and AutoAugment [32]). Interestingly, ISDA seems to be even more effective when these techniques are applied. For example, when applying AutoAugment, ISDA achieves performance gains of 1.34% and 0.98% on CIFAR-100 with the Shake-Shake (26, 2x112d) and the Wide-ResNet-28-10, respectively. Notice that these improvements are more significant than in the standard setting. 
A plausible explanation for this phenomenon is that non-semantic augmentation methods help to learn a better feature representation, which makes semantic transformations in the deep feature space more reliable. The curves of test errors during training on CIFAR-100 with Wide-ResNet-28-10 are presented in Figure 2. It is clear that ISDA achieves a significant improvement after the third learning rate drop, and shows even better performance after the fourth drop.\n\nTable 4 presents the performance of ISDA on the large-scale ImageNet dataset. It can be observed that ISDA reduces the Top-1 error rate by 0.54% for the ResNeXt-50 model. The training and test error curves of ResNet-152 are shown in Figure 3. Notably, ISDA achieves a slightly higher training error but a lower test error, indicating that ISDA performs effective regularization on deep networks.\n\nTable 4: Evaluation of ISDA on ImageNet.\n\nMethod | Top-1 | Top-5\nResNet-50 [4] | 23.58% | 6.92%\nResNet-50 + ISDA | 23.30% | 6.82%\nResNet-152 [4] | 21.65% | 6.01%\nResNet-152 + ISDA | 21.20% | 5.67%\nResNeXt-50, 32x4d [35] | 22.42% | 6.42%\nResNeXt-50, 32x4d + ISDA | 21.88% | 6.23%\n\n4.3 Comparison with Other Approaches\n\nWe compare ISDA with a number of competitive baselines described in Section 4.1, ranging from robust loss functions to semantic data augmentation algorithms based on generative models.\n\nFigure 4: Visualization results of semantically augmented images (columns: initial, restored, and augmented).\n\nThe results are summarized in Table 3, and the training curves are presented in Appendix D. One can observe that ISDA compares favorably with all the competitive baseline algorithms. 
With ResNet-110, the test errors of other robust loss functions are 6.38% and 27.85% on CIFAR-10 and CIFAR-100, respectively, while ISDA achieves 6.23% and 27.11%.

Among all GAN-based semantic augmentation methods, ACGAN gives the best performance, especially on CIFAR-10. However, these models generally suffer a performance reduction on CIFAR-100, which does not contain enough samples per class to learn a valid generator for each class. In contrast, ISDA shows consistent improvements on all the datasets. In addition, GAN-based methods require additional computation to train the generators and introduce significant overhead to the training process. In comparison, ISDA not only leads to lower generalization error, but is also simpler and more efficient.

4.4 Visualization Results
To demonstrate that our method is able to generate meaningful semantically augmented samples, we introduce an approach to map the augmented features back to the pixel space to explicitly show the semantic changes of the images. Due to space limitations, we defer the detailed introduction of the mapping algorithm to Appendix C.

Figure 4 shows the visualization results. The first and second columns show the original images and the reconstructed images without any augmentation. The remaining columns present the images augmented by the proposed ISDA. It can be observed that ISDA is able to alter the semantics of images, e.g., backgrounds, viewing angles, the colors and types of cars, and skin color, which is not possible for traditional data augmentation techniques.

4.5 Ablation Study
To get a better understanding of the effectiveness of different components of ISDA, we conduct a series of ablation studies. Specifically, several variants are considered: (1) Identity matrix means replacing the covariance matrix Σc by the identity matrix. (2) Diagonal matrix means using only the diagonal elements of the covariance matrix Σc. (3) Single covariance matrix means using a global covariance matrix computed from the features of all classes. (4) Constant λ0 means using a constant λ0 without setting it as a function of the training iterations.

Table 5: The ablation study for ISDA.

| Setting | CIFAR-10 | CIFAR-100 |
|---|---|---|
| Basic | 3.82±0.15% | 18.58±0.10% |
| Identity matrix | 3.63±0.12% | 18.53±0.02% |
| Diagonal matrix | 3.70±0.15% | 18.23±0.02% |
| Single covariance matrix | 3.67±0.07% | 18.29±0.13% |
| Constant λ0 | 3.69±0.08% | 18.33±0.16% |
| ISDA | 3.58±0.15% | 17.98±0.15% |

Table 5 presents the ablation results. Adopting the identity matrix increases the test error by 0.05% on CIFAR-10 and nearly 0.56% on CIFAR-100. Using a single covariance matrix greatly degrades the generalization performance as well. The reason is likely that both of them fail to find proper directions in the deep feature space to perform meaningful semantic transformations. Adopting a diagonal matrix also hurts the performance, as it does not consider correlations between features.

5 Conclusion
In this paper, we proposed an efficient implicit semantic data augmentation algorithm (ISDA) to complement existing data augmentation techniques. Different from existing approaches that leverage generative models to augment the training set with semantically transformed samples, our approach is considerably more efficient and easier to implement.
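To make the ingredients discussed above concrete, the per-class statistics that ISDA maintains, the covariance variants examined in the ablation study, and the logit adjustment behind the robust-loss view can be sketched in a few lines of NumPy. This is an illustrative sketch under our own naming and with bias terms omitted, not the authors' implementation.

```python
import numpy as np

def update_class_stats(mean, cov, count, feature):
    """One Welford-style step of an online per-class mean/covariance estimate.

    Hypothetical helper names; the exact online update used in the paper
    may differ in detail (e.g., mini-batch rather than per-sample updates).
    """
    count += 1
    delta = feature - mean
    mean = mean + delta / count
    # Rank-1 update of the population covariance (scatter divided by count).
    cov = cov * (count - 1) / count + np.outer(delta, feature - mean) / count
    return mean, cov, count

def ablation_covariance(cov, mode):
    """The covariance variants compared in the ablation study."""
    if mode == "identity":      # (1) ignore feature statistics entirely
        return np.eye(cov.shape[0])
    if mode == "diagonal":      # (2) keep variances, drop feature correlations
        return np.diag(np.diag(cov))
    return cov                  # full per-class covariance, as in ISDA

def isda_adjusted_logits(z, W, y, Sigma_y, lam):
    """Shift logit j by (lam/2) * (w_j - w_y)^T Sigma_y (w_j - w_y).

    Sketch of the robust cross-entropy form: feeding the adjusted logits to a
    standard softmax CE corresponds to the upper bound on the augmented loss.
    """
    diffs = W - W[y]                                        # shape (C, d)
    shift = 0.5 * lam * np.einsum('cd,de,ce->c', diffs, Sigma_y, diffs)
    return z + shift                                        # shift[y] == 0
```

Because Σc is positive semi-definite, the shift is zero for the true class and non-negative for the others, so the adjusted objective is never easier than plain cross-entropy, which is consistent with the regularization effect observed on ImageNet.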
In fact, we showed that ISDA can be formulated as a novel robust loss function, which is compatible with any deep network trained with the cross-entropy loss. Extensive results on several competitive image classification datasets demonstrate the effectiveness and efficiency of the proposed algorithm.

Acknowledgments

Gao Huang is supported in part by the Beijing Academy of Artificial Intelligence (BAAI) under grant BAAI2019QN0106 and the Tencent AI Lab Rhino-Bird Focused Research Program under grant JR201914.

References

[1] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012, pp. 1097–1105.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[5] G. Huang, Z. Liu, G. Pleiss, L. Van Der Maaten, and K. Weinberger, “Convolutional networks with dense connectivity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[6] A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré, “Learning to compose domain-specific transformations for data augmentation,” in NeurIPS, 2017, pp. 3236–3246.
[7] C. Bowles, L. J. Chen, R. Guerrero, P. Bentley, R. N. Gunn, A. Hammers, D. A. Dickie, M. del C. Valdés Hernández, J. M. Wardlaw, and D. Rueckert, “Gan augmentation: Augmenting training data using generative adversarial networks,” CoRR, vol. abs/1810.10863, 2018.
[8] A. Antoniou, A. J. Storkey, and H. A. Edwards, “Data augmentation generative adversarial networks,” CoRR, vol. abs/1711.04340, 2018.
[9] P.
Upchurch, J. R. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Q. Weinberger, “Deep feature interpolation for image content changes,” in CVPR, 2017, pp. 6090–6099.
[10] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, “Better mixing via deep representations,” in ICML, 2013, pp. 552–560.
[11] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in NeurIPS, 2015, pp. 2377–2385.
[12] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” CoRR, vol. abs/1805.09501, 2018.
[13] L. Maaten, M. Chen, S. Tyree, and K. Weinberger, “Learning with marginalized corrupted features,” in ICML, 2013, pp. 410–418.
[14] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
[15] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in CVPR, 2017, pp. 3722–3731.
[16] Z. Zhang and M. R. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in NeurIPS, 2018.
[17] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in ICCV, 2017, pp. 2999–3007.
[18] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” in ICML, 2016.
[19] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li, “Soft-margin softmax for deep classification,” in ICONIP, 2017.
[20] X. Wang, S. Zhang, Z. Lei, S. Liu, X. Guo, and S. Z. Li, “Ensemble soft-margin softmax loss for image classification,” in IJCAI, 2018.
[21] Y. Sun, X.
Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in NeurIPS, 2014.
[22] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in ECCV, 2016, pp. 499–515.
[23] Y. Bengio et al., “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[24] Y. Choi, M.-J. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in CVPR, 2018, pp. 8789–8797.
[25] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017, pp. 2223–2232.
[26] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “Attgan: Facial attribute editing by only changing what you want,” CoRR, vol. abs/1711.10678, 2017.
[27] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NeurIPS, 2015, pp. 91–99.
[28] M. Li, W. Zuo, and D. Zhang, “Convolutional network for attribute-driven and identity-preserving human face generation,” CoRR, vol. abs/1608.06434, 2016.
[29] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
[30] A. G. Howard, “Some improvements on deep convolutional neural network based image classification,” CoRR, vol. abs/1312.5402, 2014.
[31] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
[32] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V.
Le, \u201cAutoaugment: Learning\n\naugmentation policies from data,\u201d in CVPR, 2019.\n\n[33] J. Hu, L. Shen, and G. Sun, \u201cSqueeze-and-excitation networks,\u201d in CVPR, 2018, pp. 7132\u20137141.\n\n[34] S. Zagoruyko and N. Komodakis, \u201cWide residual networks,\u201d in BMVC, 2017.\n\n[35] S. Xie, R. Girshick, P. Doll\u00e1r, Z. Tu, and K. He, \u201cAggregated residual transformations for deep\n\nneural networks,\u201d in CVPR, 2017, pp. 1492\u20131500.\n\n[36] X. Gastaldi, \u201cShake-shake regularization,\u201d arXiv preprint arXiv:1705.07485, 2017.\n\n[37] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, \u201cDropout: a\nsimple way to prevent neural networks from over\ufb01tting,\u201d Journal of Machine Learning Research,\nvol. 15, pp. 1929\u20131958, 2014.\n\n[38] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian, \u201cDisturblabel: Regularizing cnn on the loss\n\nlayer,\u201d in CVPR, 2016, pp. 4753\u20134762.\n\n[39] M. Arjovsky, S. Chintala, and L. Bottou, \u201cWasserstein gan,\u201d CoRR, vol. abs/1701.07875, 2017.\n\n[40] M. Mirza and S. Osindero, \u201cConditional generative adversarial nets,\u201d CoRR, vol. abs/1411.1784,\n\n2014.\n\n[41] A. Odena, C. Olah, and J. Shlens, \u201cConditional image synthesis with auxiliary classi\ufb01er gans,\u201d\n\nin ICML, 2017, pp. 2642\u20132651.\n\n[42] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, \u201cInfogan: Inter-\npretable representation learning by information maximizing generative adversarial nets,\u201d in\nNeurIPS, 2016, pp. 
2172–2180.