{"title": "Progressive Augmentation of GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 6249, "page_last": 6259, "abstract": "Training of Generative Adversarial Networks (GANs) is notoriously fragile, requiring to maintain a careful balance between the generator and the discriminator in order to perform well. To mitigate this issue we introduce a new regularization technique - progressive augmentation of GANs (PA-GAN). The key idea is to gradually increase the task difficulty of the discriminator by progressively augmenting its input or feature space, thus enabling continuous learning of the generator. We show that the proposed progressive augmentation preserves the original GAN objective, does not compromise the discriminator's optimality and encourages a healthy competition between the generator and discriminator, leading to the better-performing generator. We experimentally demonstrate the effectiveness of PA-GAN across different architectures and on multiple benchmarks for the image synthesis task, on average achieving 3 point improvement of the FID score.", "full_text": "Progressive Augmentation of GANs\n\nDan Zhang\n\nBosch Center for Arti\ufb01cial Intelligence\n\ndan.zhang2@bosch.com\n\nAnna Khoreva\n\nBosch Center for Arti\ufb01cial Intelligence\n\nanna.khoreva@bosch.com\n\nAbstract\n\nTraining of Generative Adversarial Networks (GANs) is notoriously fragile, re-\nquiring to maintain a careful balance between the generator and the discriminator\nin order to perform well. To mitigate this issue we introduce a new regularization\ntechnique - progressive augmentation of GANs (PA-GAN). The key idea is to gradu-\nally increase the task dif\ufb01culty of the discriminator by progressively augmenting its\ninput or feature space, thus enabling continuous learning of the generator. 
We show\nthat the proposed progressive augmentation preserves the original GAN objective,\ndoes not compromise the discriminator\u2019s optimality and encourages a healthy com-\npetition between the generator and discriminator, leading to the better-performing\ngenerator. We experimentally demonstrate the effectiveness of PA-GAN across\ndifferent architectures and on multiple benchmarks for the image synthesis task, on\naverage achieving \u223c 3 point improvement of the FID score.\n\n1\n\nIntroduction\n\nGenerative Adversarial Networks (GANs) [11] are a recent development in the \ufb01eld of deep learning,\nthat have attracted a lot of attention in the research community [27, 30, 2, 15]. The GAN framework\ncan be formulated as a competing game between the generator and the discriminator. Since both the\ngenerator and the discriminator are typically parameterized as deep convolutional neural networks\nwith millions of parameters, optimization is notoriously dif\ufb01cult in practice [2, 12, 24].\nThe dif\ufb01culty lies in maintaining a healthy competition between the generator and discriminator. A\ncommonly occurring problem arises when the discriminator overshoots, leading to escalated gradients\nand oscillatory GAN behaviour [23, 4]. Moreover, the supports of the data and model distributions\ntypically lie on low dimensional manifolds and are often disjoint [1]. Consequently, there exists\na nearly trivial discriminator that can perfectly distinguish real data samples from synthetic ones.\nOnce such a discriminator is produced, its loss quickly converges to zero and the gradients used for\nupdating parameters of the generator become useless. For improving the training stability of GANs\nregularization techniques [28, 12] can be used to constrain the learning of the discriminator. 
But as\nshown in [4, 18] they also impair the generator and lead to the performance degradation.\nIn this work we introduce a new regularization technique to alleviate this problem - progressive\naugmentation of GANs (PA-GAN) - that helps to control the behaviour of the discriminator and thus\nimprove the overall training.1 The key idea is to progressively augment the input of the discriminator\nnetwork or its intermediate feature layers with auxiliary random bits in order to gradually increase\nthe discrimination task dif\ufb01culty (see Fig. 1). In doing so, the discriminator can be prevented from\nbecoming over-con\ufb01dent, enabling continuous learning of the generator. As opposed to standard\naugmentation techniques (e.g. rotation, cropping, resizing), the proposed progressive augmentation\ndoes not directly modify the data samples or their features, but rather structurally appends to them.\nMoreover, it can also alter the input class. For instance, in the single-level augmentation the data\nsample or its features x are combined with a random bit s and both are provided to the discriminator.\n\n1https://github.com/boschresearch/PA-GAN\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a) Discriminator task\n\n(b) Input space augmentation\n\nFigure 1: Visualization of progressive augmentation. At level l = 0 (no augmentation) the\ndiscriminator D aims at classifying the samples xd and xg, respectively drawn from the data Pd and\ngenerative model Pg distributions, into true (green) and fake (blue). At single-level augmentation\n(l = 1) the class of the augmented sample is set based on the combination xd and xg with s, resulting\nin real and synthetic samples contained in both classes and leading to a harder task for D. With\neach extra augmentation level (l \u2192 l + 1) the decision boundary between two classes becomes more\ncomplex and the discrimination task dif\ufb01culty gradually increases. 
This prevents the discriminator\nfrom easily solving the task and thus leads to meaningful gradients for the generator updates.\n\nThe class of the augmented sample (x, s) is then set based on the combination x with s, resulting in\nreal and synthetic samples contained in both classes, see Fig. 1-(a). This presents a more challenging\ntask for the discriminator, as it needs to tell the real and synthetic samples apart plus additionally\nlearn how to separate (x, s) back into x and s and understand the association rule. We can further\nincrease the task dif\ufb01culty of the discriminator by progressively augmenting its input or feature space,\ngradually increasing the number of random bits during the course of training as depicted in Fig. 1-(b).\nWe prove that PA-GAN preserves the original GAN objective and, in contrast to prior work [1, 31, 30],\ndoes not bias the optimality of the discriminator (see Sec. 3.1). Aiming at minimum changes we\nfurther propose an integration of PA-GAN into existing GAN architectures (see Sec. 3.2) and\nexperimentally showcase its bene\ufb01ts (see Sec. 4.1). Structurally augmenting the input or its features\nand mapping them to higher dimensions not only challenges the discrimination task, but, in addition,\nwith each realization of the random bits alters the loss function landscape, potentially providing a\ndifferent path for the generator to approach the data distribution.\nOur technique is orthogonal to existing work, it can be successfully employed with other regularization\nstrategies [28, 12, 30, 32, 6] and different network architectures [24, 35], which we demonstrate in\nSec. 4.2. We experimentally show the effectiveness of PA-GAN for unsupervised image generation\ntasks on multiple benchmarks (Fashion-MNIST [34], CIFAR10 [17], CELEBA-HQ [15], and Tiny-\nImageNet [7]), on average improving the FID score around 3 points. 
For PA combination with\nSS-GAN [6] we achieve the best FID of 14.7 for the unsupervised setting on CIFAR10, which is on\npar with the results achieved by large scale BigGAN training [4] using label supervision.\n\n2 Related Work\n\nMany recent works have focused on improving the stability of GAN training and the overall visual\nquality of generated samples [28, 24, 35, 4]. The unstable behaviour of GANs is partly attributed\nto a dimensional mismatch or non-overlapping support between the real data and the generative\nmodel distributions [1], resulting in an almost trivial task for the discriminator. Once the performance\nof the discriminator is maxed out, it provides a non-informative signal to train the generator. To\navoid vanishing gradients, the original GAN paper [11] proposed to modify the min-max based GAN\nobjective to a non-saturating loss. However, even with such a re-formulation the generator updates\ntend to get worse over the course of training and optimization becomes massively unstable [1].\nPrior approaches tried to mitigate this issue by using heuristics to weaken the discriminator, e.g.\ndecreasing its learning rate, adding label noise or directly modifying the data samples. [30] proposed\na one-sided label smoothing to smoothen the classi\ufb01cation boundary of the discriminator, thereby\npreventing it from being overly con\ufb01dent, but at the same time biasing its optimality. [1, 31] tried to\nensure a joint support of the data and model distributions to make the job of the discriminator harder\nby adding Gaussian noise to both generated and real samples. However, adding high-dimensional\nnoise introduces signi\ufb01cant variance in the parameter estimation, slowing down the training and\nrequiring multiple samples for counteraction [28]. 
Similarly, [29] proposed to blur the input samples and gradually remove the blurring effect during the course of training. These techniques perform direct modifications on the data samples.
Alternatively, several works focused on regularizing the discriminator. [12] proposed to add a soft penalty on the gradient norm which ensures a 1-Lipschitz discriminator. Similarly, [28] added a zero-centered penalty on the weighted gradient norm of the discriminator, showing its equivalence to adding input noise. On the downside, regularizing the discriminator with the gradient penalty depends on the model distribution, which changes during training, and results in increased runtime due to the additional gradient norm computation [18]. Most recently, [4] also experimentally showed that the gradient penalty may lead to performance degradation, which corresponds to our observations as well (see Sec. 4.2). In addition to the gradient penalty, [4] also exploited the dropout regularization [32] on the final layer of the discriminator and reported a similar stabilizing effect. [24] proposed another way to stabilize the discriminator by normalizing its weights and limiting the spectral norm of each layer to constrain the Lipschitz constant. This normalization technique does not require intensive tuning of hyper-parameters and is computationally light. Moreover, [35] showed that spectral normalization is also beneficial for the generator, preventing the escalation of parameter magnitudes and avoiding unusual gradients.
Several methods have proposed to modify the GAN training methodology in order to further improve stability, e.g. by considering multiple discriminators [8], growing both the generator and discriminator networks progressively [15] or exploiting different learning rates for the discriminator and generator [13].
Another line of work resorts to objective function reformulation, e.g. by using the Pearson χ² divergence [22], the Wasserstein distance [2], or f-divergence [25].
In this work we introduce a novel and orthogonal way of regularizing GANs by progressively increasing the discriminator task difficulty. In contrast to other techniques, our method does not bias the optimality of the discriminator or alter the training samples. Furthermore, the proposed augmentation is complementary to prior work. It can be employed with different GAN architectures and combined with other regularization techniques (see Sec. 4).

3 Progressive Augmentation of GANs

3.1 Theoretical Framework of PA-GAN

The core idea behind GAN training [11] is to set up a competing game between two players, commonly termed the discriminator and the generator. The discriminator aims at distinguishing the samples x ∈ X respectively drawn from the data distribution P_d and the generative model distribution P_g, i.e. performing binary classification D : X → [0, 1].² The aim of the generator, on the other hand, is to make synthetic samples indistinguishable from data samples, challenging the discriminator. In this work, X represents a compact metric space such as the image space [-1, 1]^N of dimension N. Both P_d and P_g are defined on X. The model distribution P_g is induced by a function G that maps a random vector z ~ P_z to a synthetic data sample, i.e. x_g = G(z) ∈ X. Mathematically, the two-player game is formulated as

\min_G \max_D \; \mathbb{E}_{P_d}\{\log[D(x)]\} + \mathbb{E}_{P_g}\{\log[1 - D(x)]\}.   (1)

As proved in [11], the inner maximum equals, up to an additive constant, the Jensen-Shannon (JS) divergence between P_d and P_g, i.e., D_{JS}(P_d \| P_g). Therefore, the GAN training attempts to minimize the JS divergence between the model and data distributions.
Lemma 1. 
Let s ∈ {0, 1} denote a random bit with the uniform distribution P_s(s) = (δ[s] + δ[s - 1])/2, where δ[s] is the Kronecker delta. Associating s with x, two joint distributions of (x, s) are constructed as

P_{x,s}(x, s) \triangleq \frac{P_d(x)\delta[s] + P_g(x)\delta[s-1]}{2}, \qquad Q_{x,s}(x, s) \triangleq \frac{P_g(x)\delta[s] + P_d(x)\delta[s-1]}{2}.   (2)

Their JS divergence is equal to

D_{JS}(P_{x,s} \| Q_{x,s}) = D_{JS}(P_d \| P_g).   (3)

² D(x) aims to learn the probability of x being true or fake; however, it can also be regarded as the sigmoid response of classification with the cross-entropy loss.

Figure 2: PA-GAN overview. With each level of progressive augmentation l the dimensionality of s is enlarged from 1 to L, s = {s_1, s_2, . . . , s_L}. The task difficulty of the discriminator gradually increases as the length of s grows.

Taking (2) as the starting point and with s_l being a sequence of i.i.d. random bits of length l, the recursion constructing the paired joint distributions of (x, s_l)

P_{x,s_l}(x, s_l) \triangleq P_{x,s_{l-1}}(x, s_{l-1})\delta[s_l]/2 + Q_{x,s_{l-1}}(x, s_{l-1})\delta[s_l - 1]/2,
Q_{x,s_l}(x, s_l) \triangleq Q_{x,s_{l-1}}(x, s_{l-1})\delta[s_l]/2 + P_{x,s_{l-1}}(x, s_{l-1})\delta[s_l - 1]/2   (4)

results in a series of JS divergence equalities for l = 1, 2, . . . , L, i.e.,

D_{JS}(P_d \| P_g) = D_{JS}(P_{x,s_1} \| Q_{x,s_1}) = \cdots = D_{JS}(P_{x,s_L} \| Q_{x,s_L}).   (5)

Theorem 1. The min-max optimization problem of GANs [11] as given in (1) is equivalent, ∀ l ∈ {1, 2, . . . , L}, to

\min_G \max_D \; \mathbb{E}_{P_{x,s_l}}\{\log[D(x, s_l)]\} + \mathbb{E}_{Q_{x,s_l}}\{\log[1 - D(x, s_l)]\},   (6)

where the two joint distributions, i.e., P_{x,s_l} and Q_{x,s_l}, are defined in (4) and the function D maps (x, s_l) ∈ X × {0, 1}^l onto [0, 1]. For a fixed G, the optimal D is

D^*(x, s_l) = \frac{P_{x,s_l}(x, s_l)}{P_{x,s_l}(x, s_l) + Q_{x,s_l}(x, s_l)},   (7)

whose value equals P_d(x)/(P_d(x) + P_g(x)) whenever the XOR of the bits in s_l is 0 and P_g(x)/(P_d(x) + P_g(x)) otherwise, whereas the attained inner maximum equals, up to an additive constant, D_{JS}(P_{x,s_l} \| Q_{x,s_l}) = D_{JS}(P_d \| P_g) for l = 1, 2, . . . , L.
According to Theorem 1, solving (1) is interchangeable with solving (6). In fact, the former can be regarded as a corner case of the latter by taking l = 0 as the absence of the auxiliary bit vector s. As the length l of s increases, the input dimension of the discriminator grows accordingly. Furthermore, the two classes to be classified consist of both data and synthetic samples, as illustrated in Fig. 1-(a). Note that the mixture strategy of the distributions of two independent random variables in Lemma 1 can be extended to any generic random variables (see Sec. S.2.4 in the supp. material).
When solving (1), G and D are parameterized as deep neural networks and SGD (or its variants) is typically used for the optimization, updating their weights in an alternating or simultaneous manner, with no guarantees on global convergence. Theorem 1 provides a series of JS divergence estimation proxies by means of the auxiliary bit vector s that in practice can be exploited as a regularizer to improve the GAN training (see Sec. 4.1 for empirical evaluation). First, the number of possible combinations of the data samples with s_l grows exponentially with l, thus helping to prevent the discriminator from overfitting to the training set. Second, the task of the discriminator gradually becomes harder with the length l. The input dimensionality of D becomes larger and, as the label of (x, s_{l-1}) is altered based on the new random bit s_l, the decision boundary becomes more complicated (Fig. 1-b). 
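The claims of Lemma 1 and Theorem 1 are easy to sanity-check numerically on toy discrete distributions. The following sketch is our own illustration, not code from the paper (all function names are ours): it builds the augmented pairs P_{x,s} and Q_{x,s} by stacking the probabilities of (x, s=0) and (x, s=1), and verifies both the JS equalities (5) and the value attained by the optimal discriminator (7).

```python
import numpy as np

def js(p, q):
    # Jensen-Shannon divergence between two discrete distributions (natural log).
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def augment(p, q):
    # One augmentation level as in Lemma 1: (x, s=0) carries the first half
    # of the mass, (x, s=1) the second half, each scaled by P_s(s) = 1/2.
    p_aug = 0.5 * np.concatenate([p, q])
    q_aug = 0.5 * np.concatenate([q, p])
    return p_aug, q_aug

rng = np.random.default_rng(0)
pd_ = rng.dirichlet(np.ones(8))   # toy "data" distribution P_d
pg_ = rng.dirichlet(np.ones(8))   # toy "model" distribution P_g

base = js(pd_, pg_)
p, q = pd_, pg_
for level in range(3):                 # three progressive augmentation levels
    p, q = augment(p, q)
    assert np.isclose(js(p, q), base)  # eq. (5): JS divergence is preserved

# The optimal discriminator (7) attains -log(4) + 2*JS at every level.
d_star = p / (p + q)
inner_max = np.sum(p * np.log(d_star)) + np.sum(q * np.log(1.0 - d_star))
assert np.isclose(inner_max, -np.log(4.0) + 2.0 * base)
```

The recursion doubles the support at each level (here 8 → 64 outcomes after three levels) while leaving the divergence, and hence the underlying objective, untouched.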
Given that, progressively increasing l can be exploited during training to balance the game between the discriminator and generator whenever the former becomes too strong. Third, when the GAN training performance saturates at the current augmentation level, adding one random bit changes the landscape of the loss function and may further boost the learning.

3.2 Implementation of PA-GAN

The min-max problem in (6) shares the same structure as the original one in (1), thus we can exploit the standard GAN training for PA-GAN, see Fig. 2. The necessary change only concerns the discriminator. It involves 1) using the checksum principle as a new classification criterion, 2) incorporating s in addition to x as the network input and 3) enabling the progression of s during training.
Checksum principle. The conventional GAN discriminator assigns the TRUE (0) / FAKE (1) class label based on x being either a data or a synthetic sample. In contrast, the discriminator D in (6) requires s_l along with x to make the decision about the class label. Starting from l = 1, the two class distributions in (2) imply label-0 for (x_d, s = 0), (x_g, s = 1) and label-1 for (x_d, s = 1), (x_g, s = 0). The real samples are no longer always in the TRUE class, and the synthetic samples are no longer always in the FAKE class, see Fig. 1-(a). To detect the correct class we can use a simple checksum principle. Namely, let the data and synthetic samples respectively encode bit 0 and bit 1, and then associate the checksum 0 (1) of the pair (x, s) with TRUE (FAKE).³ For more than one bit, P_{x,s_l} and Q_{x,s_l} are recursively constructed according to (4). Based on the checksum principle for the single-bit case, we can recursively show its consistency for any bit sequence s_l of length l > 1. This is a desirable property for progression. 
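The checksum principle is simply the XOR parity of the origin bit and the augmentation bits. A minimal sketch (the helper name pa_label is ours) makes the labeling rule and its consistency under progression explicit:

```python
from functools import reduce
from operator import xor

def pa_label(origin_bit, aug_bits):
    # origin_bit: 0 for a real sample x_d, 1 for a synthetic sample x_g.
    # The discriminator's target label is the XOR checksum of the origin
    # bit and all augmentation bits: 0 -> TRUE class, 1 -> FAKE class.
    return reduce(xor, aug_bits, origin_bit)

# Single-level augmentation (l = 1) reproduces the rule of Fig. 1-(a):
assert pa_label(0, [0]) == 0   # (x_d, s=0) -> TRUE
assert pa_label(1, [1]) == 0   # (x_g, s=1) -> TRUE
assert pa_label(0, [1]) == 1   # (x_d, s=1) -> FAKE
assert pa_label(1, [0]) == 1   # (x_g, s=0) -> FAKE

# Consistency under progression: appending a new bit s_{l+1} either keeps
# the label (s_{l+1}=0) or flips it (s_{l+1}=1), matching the recursion (4).
for bits in ([0, 1], [1, 1, 0]):
    for origin in (0, 1):
        assert pa_label(origin, bits + [0]) == pa_label(origin, bits)
        assert pa_label(origin, bits + [1]) == 1 - pa_label(origin, bits)
```

Because the label depends on all bits jointly, a discriminator that ignores any single augmentation bit can do no better than chance on that component of the task.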
With the identi\ufb01ed checksum principle, we\nfurther discuss a way to integrate a sequence of random bits sl into the discriminator network in a\nprogressive manner.\nProgressive augmentation. With the aim of maximally reusing existing GAN architectures we\npropose two augmentation options. The \ufb01rst one is input space augmentation, where s is directly\nconcatenated with the sample x and both are fed as input to the discriminator network. The second\noption is feature space augmentation, where s is concatenated with the learned feature representations\nof x attained at intermediate hidden layers. For both cases, the way to concatenate s with x or its\nfeature maps is identical. Each entry sl creates one augmentation channel, which is replicated to\nmatch the spatial dimension of x or its feature maps. Depending on the augmentation space, either\nthe input layer or the hidden layer that further processes the feature maps will additionally take\ncare of the augmentation channels along with the original input. In both cases, the original layer\ncon\ufb01guration (kernel size, stride and padding type) remains the same except for its channel size being\nincreased by l. All the other layers of the discriminator remain unchanged. When a new augmentation\nlevel is reached, one extra input channel of the \ufb01lter is instantiated to process the bit l + 1.\nThese two ways of augmentation are bene\ufb01cial as they make the checksum computation more\nchallenging for the discriminator, i.e., making the discriminator unaware about the need of separating\nx and s from the concatenated input. We note that in order to take full advantage of the regularization\neffect of progressive augmentation, s needs to be involved in the decision making process of the\ndiscriminator either through input or feature space augmentation. 
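Concretely, the channel-wise concatenation described above can be sketched in a few lines (a minimal NumPy illustration of input space augmentation; function and variable names are ours, and a real implementation would operate on framework tensors inside the discriminator):

```python
import numpy as np

def augment_with_bits(x, bits):
    """Concatenate augmentation bits to a batch of images or feature maps.

    x    : array of shape (batch, channels, H, W)
    bits : array of shape (batch, l) with entries in {0, 1}
    Each bit is replicated into a constant H x W plane and appended as an
    extra channel, so only the first layer that consumes the result needs
    its input channel size increased by l.
    """
    b, _, h, w = x.shape
    planes = np.broadcast_to(bits[:, :, None, None], (b, bits.shape[1], h, w))
    return np.concatenate([x, planes.astype(x.dtype)], axis=1)

x = np.random.uniform(-1.0, 1.0, size=(4, 3, 8, 8)).astype(np.float32)
s = np.random.randint(0, 2, size=(4, 2))    # augmentation level l = 2
xa = augment_with_bits(x, s)
assert xa.shape == (4, 5, 8, 8)             # 3 image + 2 bit channels
assert np.all(xa[:, 3:, :, :] == s[:, :, None, None])
```

Feature space augmentation is the same operation applied to an intermediate feature map instead of x.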
Augmenting s with the output D(x) makes the task trivial, thereby disabling the regularization effect of the progressive augmentation.
In this work we only exploit s by concatenating it with either the input or the hidden layers of the network. However, it is also possible to combine it with other image augmentation strategies, e.g. using s as an indicator for the rotation angle, as in [6], or for the type of color augmentation imposed on the input x, encouraging D to learn the type through the checksum principle.
Progression scheduling. To schedule the progression we rely on the kernel inception distance (KID) introduced by [3] to decide if the performance of G at the current augmentation level saturates or even starts degrading (which typically happens when D starts overfitting or becomes too powerful). Specifically, after t discriminator iterations, we evaluate KID between synthetic samples and data samples drawn from the training set. If the current KID score improves upon the average of the two previous evaluations attained at the same augmentation level by less than 5%, the augmentation is leveled up, i.e. l → l + 1. To validate the effectiveness of this scheduling mechanism we exploit it for the learning rate adaptation as in [3] and compare it with progressive augmentation in the next section.

4 Experiments
Datasets: We consider four datasets: Fashion-MNIST [34], CIFAR10 [17], CELEBA-HQ (128 × 128) [15] and Tiny-ImageNet (a simplified version of ImageNet [7]), with the training set sizes equal to 60k, 50k, 27k and 100k and the test set sizes equal to 10k, 10k, 3k, and 10k, respectively. Note that we focus on unsupervised image generation and do not use class label information.
Networks: We employ SN DCGAN [24] and SA GAN [35], both using spectral normalization (SN) [24] in the discriminator for regularization. SA GAN exploits the ResNet architecture with a self-attention (SA) layer [35]. 
Its generator additionally adopts self-modulation BN (sBN) [5] together with SN. We exploit the implementations provided by [18, 35]. Following [24, 35], we train SN DCGAN and SA GAN [35] with the non-saturation (NS) and hinge loss, respectively.
Evaluation metrics: We use Fréchet inception distance (FID) [14] as the main evaluation metric. Additionally, we also report inception score (IS) [33] and kernel inception distance (KID) [3] in Sec. S7 of the supp. material. All measures are computed based on the same number of test data samples and synthetic samples, following the evaluation framework of [21, 18]. By default all reported numbers correspond to the median of five independent runs with 300k, 500k, 400k and 500k training iterations for Fashion-MNIST, CIFAR10, CELEBA-HQ, and Tiny-ImageNet, respectively.

³ By checksum we mean the XOR operation over a bit sequence.

Table 1: FID improvement of PA across different datasets and network architectures. We experiment with augmenting the input and feature spaces, see Sec. 4.1 for details.

  Method              PA      F-MNIST  CIFAR10  CELEBA-HQ  T-ImageNet  ΔPA
  SN DCGAN [24]       ✗       10.6     26.0     24.3       -
                      input   6.2      22.2     20.8       -           4.2
                      feat    6.2      22.6     18.8       -
  SA GAN (sBN) [35]   ✗       -        18.8     17.8       47.6
                      input   -        16.1     15.4       44.8        2.6
                      feat    -        16.3     15.8       44.7

Training details: We use a uniformly distributed noise vector z ∈ [-1, 1]^128, a mini-batch size of 64, and the Adam optimizer [16]. The two time-scale update rule (TTUR) [13] is considered when choosing the learning rates for D and G. For progression scheduling KID⁴ is evaluated using samples from the training set every t = 10k iterations, except for Tiny-ImageNet with t = 20k given its approximately 2× larger training set. More details are provided in Sec. S8 of the supp. 
material.

4.1 PA Across Different Architectures and Datasets

Table 1 gives an overview of the FID performance achieved with and without applying the proposed progressive augmentation (PA) across different datasets and networks. We observe a consistent improvement of the FID score achieved by PA with both input PA (input) and feature PA (feat) space augmentation (see the supp. material for augmentation details and an ablation study on the augmentation space). From SN DCGAN to the ResNet-based SA GAN the FID reduction remains approximately 3 points, showing that the gain achieved by PA is complementary to the improvement on the architecture side. In comparison to input space augmentation, augmenting intermediate-level features does not overly simplify the discriminator task and thus does not paralyse PA. In the case of SN DCGAN on CELEBA-HQ, it actually outperforms the input space augmentation. Overall, the stable performance gain of PA, independent of the augmentation space choice, showcases the high generalization quality of PA and its easy adaptation into different network designs.⁵
Lower FID values achieved by PA can be attributed mostly to the improved sample diversity. By looking at generated images in Fig. 3 (and Fig. S4 in the supp. material), we observe that PA increases the variation of samples while maintaining the same image fidelity. This is expected as PA, being a regularizer, does not modify the GAN architecture, as in PG-GAN [15] or BigGAN [4], to directly improve the visual quality. Specifically, Fig. 3 shows synthetic images produced by SN DCGAN and SA GAN with and without PA, on Fashion-MNIST and CELEBA-HQ. By polar interpolation between two samples z1 and z2, from left to right we observe the clothes/gender change. PA improves sample variation, maintaining representative clothes/gender attributes and achieving a smooth transition between samples (e.g. hair styles and facial expressions). 
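Since FID is the primary metric throughout, a minimal sketch of its computation may be useful. FID is the Fréchet distance between Gaussian fits of Inception features of real and generated images; the sketch below (our illustration, with helper names of our choosing) assumes the feature means and covariances are already given, whereas a real evaluation first extracts features with an Inception network:

```python
import numpy as np

def sqrtm_psd(a):
    # Matrix square root of a symmetric positive semi-definite matrix.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(mu1, cov1, mu2, cov2):
    # Frechet distance between two Gaussians:
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1^{1/2} C2 C1^{1/2})^{1/2})
    c1_half = sqrtm_psd(cov1)
    covmean = sqrtm_psd(c1_half @ cov2 @ c1_half)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

# Identical statistics give distance 0; shifting the mean raises it.
mu, cov = np.zeros(4), np.eye(4)
assert abs(fid(mu, cov, mu, cov)) < 1e-8
assert np.isclose(fid(mu, cov, mu + 1.0, cov), 4.0)
assert np.isclose(fid(mu, cov, mu, 4.0 * cov), 4.0)
```

The symmetric form (C1^{1/2} C2 C1^{1/2})^{1/2} is numerically convenient because it keeps every square root on a symmetric PSD matrix.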
For further evaluation, we also measure the diversity of generated samples with the MS-SSIM score [26]. We use 10k synthetic images generated with SA GAN on CELEBA-HQ. Employing PA reduces MS-SSIM from 0.283 to 0.266, while PG-GAN [15] achieves 0.283, and the MS-SSIM of 10k real samples is 0.263.
Comparison with SotA on Human Face Synthesis. Moving from low- to high-resolution human face synthesis, the recent work COCO-GAN [19] outperformed PG-GAN [15] on the CELEBA dataset [20] via conditional coordinating. At resolution 64 of CELEBA, PA improves the SA GAN FID from 4.11 to 3.35, being better than COCO-GAN, which achieves an FID of 4.0 and outperforms PG-GAN at resolution 128 (FID of 5.74 vs. 7.30). Thus we conclude that the quality of samples generated by PA is comparable to the quality of samples generated by the recent state-of-the-art models [19, 15] on human face synthesis.

⁴ FID is used as the primary metric, KID is chosen for scheduling to avoid over-optimizing towards FID.
⁵ We also experiment with using PA for WGAN-GP [2], improving FID from 25.0 to 23.9 on CIFAR10, see Sec. S.3.4 in the supp. material.

Figure 3: Synthetic images generated through latent space interpolation with and without using PA (panels: SN DCGAN without/with PA on F-MNIST, and SA GAN without/with PA on CELEBA-HQ). PA helps to improve variation across interpolated samples, i.e., no close-by images look alike.

Ablation Study. In Fig. 4 and Table 2 we present an ablation study on PA, comparing single-level augmentation (without progression) with progressive multi-level PA, showing the benefit of progression. From no augmentation to the first-level augmentation, the required number of iterations varies over the datasets and architectures (30k~70k). Generally the number of reached augmentation levels is less than 15. Fig. 
4 also shows that single-level augmentation already improves\nthe performance over the baseline SN DCGAN. However, the standard deviation of its FIDs across \ufb01ve\nindependent runs starts increasing at later iterations. By means of progression, we can counteract\nthis instability, while reaching a better FID result. Table 2 further compares augmentation at\ndifferent levels with and without continuing with progression. Both augmentation and progression are\nbene\ufb01cial, while progression alleviates the need of case dependent tuning of the augmentation level.\nAs a generic mechanism to monitor the GAN training, progression scheduling is usable not only for\naugmentation level-up, but also for other hyperparameter adaptations over iterations. Analogous to\n[3] here we test it for the learning rate adaptation. From Fig. 4, progression scheduling shows its\neffectiveness in assisting both the learning rate adaptation and PA for an improved FID performance.\nPA outperforms learning rate adaptation, i.e. median FID 22.2 vs. 24.0 across \ufb01ve independent runs.\nRegularization Effect of PA. Fig. 5 depicts the discriminator loss (D loss) and the generator loss (G\nloss) behaviour as well as the FID curves over iterations. It shows that the discriminator of SN DCGAN\nvery quickly becomes over-con\ufb01dent, providing a non-informative backpropagation signal to train\nthe generator and thus leading to the increase of the G loss. PA has a long lasting regularization\neffect on SN DCGAN by means of progression and helps to maintain a healthy competition between\nits discriminator and generator. Each rise of the D loss and drop of the G loss coincides with an\niteration at which the augmentation level increases, and then gradually reduces after the discriminator\ntimely adapts to the new bit. 
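The progression scheduling from Sec. 3.2 that triggers these level-ups can be sketched as a small stateful rule. This is our reading of the paper's 5% criterion, interpreted as "the fresh KID estimate improves on the average of the two previous evaluations at the current level by less than 5%"; the class and its names are ours:

```python
class ProgressionScheduler:
    """KID-based level-up rule (a sketch under the stated interpretation
    of the 5% saturation criterion)."""

    def __init__(self, rel_improvement=0.05):
        self.level = 0
        self.rel = rel_improvement
        self.history = []          # KID scores at the current level

    def step(self, kid):
        # Called every t discriminator iterations with a fresh KID estimate
        # (lower KID is better). Returns True when the level is increased.
        if len(self.history) >= 2:
            ref = (self.history[-1] + self.history[-2]) / 2.0
            if kid > (1.0 - self.rel) * ref:    # saturated or degrading
                self.level += 1                 # l -> l + 1
                self.history = []               # restart stats at new level
                return True
        self.history.append(kid)
        return False

sched = ProgressionScheduler()
# Steadily improving KID keeps the level; saturation triggers a level-up.
ups = [sched.step(k) for k in (0.30, 0.20, 0.19, 0.188)]
assert ups == [False, False, False, True] and scheded_level if False else True
```

Clearing the history on a level-up matters: the comparison should only ever involve evaluations attained at the same augmentation level.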
Observing the behaviour of the D and G losses, we conclude that both PA (input) and PA (feat) can effectively prevent the SN DCGAN discriminator from overfitting, alleviating the vanishing gradient issue and thus enabling continuous learning of the generator. At level-one augmentation, both PA (feat) and PA (input) start from a similar overfitting stage, i.e., (a) and (b) respectively at iteration 60k and 70k. Combining the bit s directly with high-level features eases the checksum computation. As a result, the D loss of PA (featN/8) reduces faster, but this makes its future task more difficult due to overfitting to the previous augmentation level. On the other hand, PA (input) lets the bits pass through all layers, and thus its adaptation to augmentation progression improves over iterations. In the end, both PA (feat) and PA (input) lead to a similar regularization effect and result in improved FID scores.

Figure 4: FID learning curves on SN DCGAN CIFAR10. The curves show the mean FID with one standard deviation across five random runs.

Table 2: Median FIDs of input space augmentation starting from the level l with and without progression on CIFAR10 with SN DCGAN.

  Augment. level l   without progression (✗)   with progression (✓)   ΔPA
  0                  26.0                      22.2                   3.8
  1                  23.8                      22.3                   1.5
  2                  23.6                      22.9                   0.7
  3                  23.5                      22.9                   0.6
  4                  23.5                      23.2                   0.3

Figure 5: Behaviour of the discriminator loss (D loss) and the generator loss (G loss) as well as FID changes over iterations, using SN DCGAN on CIFAR10; (a) discriminator (D) and generator (G) loss over iterations, (b) FID over iterations. PA acts as a stochastic regularizer, preventing the discriminator from becoming overconfident.

In Fig. 5 we also evaluate the Dropout [32] regularization applied on the fourth convolutional layer with the keep rate 0.7 (the best performing setting in our experiments). 
Both Dropout and PA resort to random variables for regularization. The former randomly removes features, while the latter augments them with additional random bits and adjusts the class label accordingly. In contrast to Dropout, PA has a stronger regularization effect and leads to faster convergence (a more rapid reduction of FID scores). In addition, we compare PA with the Reinit. baseline, where at each scheduled progression all weights are reinitialized with Xavier initialization [10]. Compared to PA, the Reinit. strategy leads to a longer adaptation time (the D loss decays much more slowly) and oscillatory GAN behaviour, resulting in dramatic fluctuations of FID scores over iterations.

4.2 Comparison and Combination with Other Regularizers

We further compare and combine PA with other regularization techniques, i.e., one-sided label smoothing [30], GP from [12], its zero-centered alternative GPzero-cent from [28], Dropout [32], and self-supervised GAN training via an auxiliary rotation loss (SS) [6].
One-sided label smoothing (Label smooth.) weakens the discriminator by smoothing its decision boundary, i.e., changing the positive labels from one to a smaller value. This is analogous to introducing label noise for the data samples, whereas PA alters the target labels based on the deterministic checksum principle. Benefiting from a smoothed decision boundary, Label smooth. slightly improves the performance of SN DCGAN (26.0 vs. 25.8), but underperforms in comparison to PA (input) (22.2) and PA (feat) (22.6). Applying PA on top of Label smooth. yields a similar reduction of the FID score (23.1 and 22.3 for input and feature space augmentation, respectively).
Both GP and GPzero-cent regularize the gradient norms to stabilize the GAN training.
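The two gradient penalties differ only in the target of the gradient-norm term. As a toy sketch, consider a made-up linear discriminator, so that the input gradient (and hence each penalty) is available in closed form; the weights and function names below are illustrative, not from either paper:

```python
import numpy as np

# Toy linear discriminator D(x) = w . x, whose input gradient is w itself.
w = np.array([0.6, 0.8, 1.2])

def grad_norm(x):
    return np.linalg.norm(w)          # ||grad_x D(x)|| for a linear D

def gp(x_hat):
    # GP [12]: push the gradient norm towards 1 (1-Lipschitz target),
    # evaluated on interpolates between real and synthetic samples.
    return (grad_norm(x_hat) - 1.0) ** 2

def gp_zero_centered(x_real):
    # GPzero-cent [28]: push the gradient norm towards 0 on data samples,
    # a closed-form stand-in for adding input noise.
    return grad_norm(x_real) ** 2

rng = np.random.default_rng(0)
x_real, x_fake = rng.standard_normal(3), rng.standard_normal(3)
eps = rng.random()
x_hat = eps * x_real + (1.0 - eps) * x_fake   # random interpolate for GP
penalty = gp(x_hat)
```

In an actual GAN, `grad_norm` would be obtained via automatic differentiation through the discriminator; the linear case only makes the one-centered vs. zero-centered distinction explicit.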
GP aims at a 1-Lipschitz discriminator, while GPzero-cent is a closed-form approximation of adding input noise. Table 3 shows that both are compatible with PA, but they degrade the performance of SN DCGAN, both on its own and in combination with PA. This effect has also been observed in [18, 4]: constraining the learning of the discriminator improves the GAN training stability, but at the cost of performance degradation. Note, however, that with PA the performance degradation is smaller.

Table 3: FID performance of PA, different regularization techniques and their combinations on CIFAR10, see Sec. 4.2 for details.

GAN                PA      ✗     -Label smooth. [30]  -GP [12]  -GPzero-cent [28]  -Dropout [32]  -SS [6]  ΔPA
SN DCGAN [24]      ✗       26.0  25.8                 26.7      26.5               22.1           −        −
                   input   22.2  23.1                 21.8      22.3               21.9           −        3.0
                   feat    22.6  22.3                 22.7      23.0               20.6           −        3.1
SA GAN (sBN) [35]  ✗       18.8  −                    17.8      17.8               16.2           15.7     −
                   input   16.1  −                    15.8      16.1               15.5           14.7     1.3
                   feat    16.3  −                    16.1      15.9               15.6           14.9     1.3
ΔPA                        3.1   3.1                  3.2       2.8                0.8            0.9      2.3

Dropout shares a common stochastic nature with PA, as illustrated in Fig. 5 and in the supp. material. We observe from Table 3 that Dropout and PA can both be exploited as effective regularizers. Dropout acts locally on the layer: the layer outputs are randomly and independently subsampled, thinning the network. In contrast, PA augments the input or the layer with extra channels containing random bits; these bits also change the class label of the input and thus alter the network decision process.
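This contrast can be made concrete in a few lines. Below is a minimal numpy sketch; the keep rate 0.7 follows the experiments above, while the shapes and variable names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
feat = rng.standard_normal((4, 8, 8, 64))   # a batch of D feature maps (N, H, W, C)
y = np.ones(4, dtype=np.int64)              # 1 = "real" for the whole batch

# Dropout: randomly zero (and rescale) existing features; shape and label unchanged.
keep = 0.7
mask = rng.random(feat.shape) < keep
feat_dropout = np.where(mask, feat / keep, 0.0)

# PA (feat): append a random-bit channel and flip the target label accordingly;
# the feature map grows by one channel and the label becomes the XOR checksum.
s = rng.integers(0, 2, size=4)              # one augmentation bit per sample
bit = np.broadcast_to(s.astype(feat.dtype)[:, None, None, None], (4, 8, 8, 1))
feat_pa = np.concatenate([feat, bit], axis=-1)
y_pa = y ^ s
```

Dropout thus perturbs only the representation, whereas PA additionally rewires the classification target, which is why the two act as complementary regularizers.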
Dropout helps to break up situations where a layer co-adapts to correct errors from prior layers, enabling the network to re-learn features of the constantly changing synthetic samples in a timely manner. PA instead regularizes the decision process of D, forcing D to comprehend the input together with the random bits for correct classification, and has a stronger regularization effect than Dropout, see Fig. 5 and the supp. material. Hence, the two play different roles. Their combination further improves the FID by ∼0.8 point on average, showing the complementarity of the two approaches. It is worth noting that Dropout is sensitive to the selection of the layer at which it is applied. In our experiments (see the supp. material) it performs best when applied at the fourth convolutional layer.
Self-supervised training (SS-GAN) in [6] regularizes the discriminator by encouraging it to solve an auxiliary image rotation prediction task. From the perspective of self-supervision, PA presents the discriminator with a checksum computation task, within which telling apart data and synthetic samples becomes a sub-task. The rotation prediction task was initially proposed in [9] and found useful for improving feature learning of convolutional networks, whereas the checksum principle is derived from Theorem 1. The combination of SS and PA is beneficial and achieves the best FID of 14.7 in the unsupervised setting on CIFAR10, which matches the score reported for the supervised case with large-scale BigGAN training [4].
Overall, we observe that PA is consistently beneficial when combined with other regularization techniques, independent of input or feature space augmentation. Additional improvement of the FID score can be obtained through careful selection of the augmentation space type.

5 Conclusion

In this work we have proposed progressive augmentation (PA) - a novel regularization method for GANs.
In contrast to standard data augmentation, our approach does not modify the training samples; instead it progressively augments them or their feature maps with auxiliary random bits and casts the discrimination task as a checksum computation. PA helps to entangle the discriminator and thus to avoid its early performance saturation. We have experimentally shown consistent performance improvements from employing PA-GAN across multiple benchmarks, and demonstrated that PA generalizes well across different network architectures and is complementary to other regularization techniques. Beyond generative modelling, as future work we are interested in exploiting PA for semi-supervised learning, generative latent modelling and transfer learning.

References

[1] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations (ICLR), 2017.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.

[3] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.

[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.

[5] Ting Chen, Mario Lucic, Neil Houlsby, and Sylvain Gelly. On self modulation for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2019.

[6] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised GANs via auxiliary rotation loss. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L.
Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[8] Ishan P. Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial networks. In International Conference on Learning Representations (ICLR), 2017.

[9] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.

[10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

[12] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS), 2017.

[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NIPS), 2017.

[14] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv:1511.05101, 2015.

[15] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.

[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
In International Conference on Learning Representations (ICLR), 2015.

[17] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[18] Karol Kurach, Mario Lučić, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The GAN landscape: Losses, architectures, regularization, and normalization. arXiv:1807.04720, 2018.

[19] Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, and Hwann-Tzong Chen. COCO-GAN: Generation by parts via conditional coordinating. In IEEE International Conference on Computer Vision (ICCV), 2019.

[20] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV), 2015.

[21] Mario Lučić, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems (NIPS), 2018.

[22] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang. Multi-class generative adversarial networks with the L2 loss function. arXiv:1611.04076, 2016.

[23] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.

[24] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018.

[25] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (NIPS), 2016.

[26] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs.
In International Conference on Machine Learning (ICML), 2017.

[27] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.

[28] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems (NIPS), 2017.

[29] Mehdi S. M. Sajjadi, Giambattista Parascandolo, Arash Mehrjou, and Bernhard Schölkopf. Tempered adversarial networks. In International Conference on Machine Learning (ICML), 2018.

[30] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NIPS), 2016.

[31] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. In International Conference on Learning Representations (ICLR), 2017.

[32] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 2014.

[33] Lucas Theis, Aaron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR), 2016.

[34] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.

[35] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks.
arXiv:1805.08318, 2018.