{"title": "Memory Replay GANs: Learning to Generate New Categories without Forgetting", "book": "Advances in Neural Information Processing Systems", "page_first": 5962, "page_last": 5972, "abstract": "Previous works on sequential learning address the problem of forgetting in discriminative models. In this paper we consider the case of generative models. In particular, we investigate generative adversarial networks (GANs) in the task of learning new categories in a sequential fashion. We first show that sequential fine tuning renders the network unable to properly generate images from previous categories (i.e. forgetting). Addressing this problem, we propose Memory Replay GANs (MeRGANs), a conditional GAN framework that integrates a memory replay generator. We study two methods to prevent forgetting by leveraging these replays, namely joint training with replay and replay alignment. Qualitative and quantitative experimental results in MNIST, SVHN and LSUN datasets show that our memory replay approach can generate competitive images while significantly mitigating the forgetting of previous categories.", "full_text": "Memory Replay GANs: learning to generate images from new categories without forgetting\n\nChenshen Wu, Luis Herranz, Xialei Liu, Yaxing Wang, Joost van de Weijer, Bogdan Raducanu\n\n{chenshen, lherranz, xialei, yaxing, joost, bogdan}@cvc.uab.es\n\nComputer Vision Center\n\nUniversitat Aut\u00f2noma de Barcelona, Spain\n\nAbstract\n\nPrevious works on sequential learning address the problem of forgetting in\ndiscriminative models. In this paper we consider the case of generative models. In\nparticular, we investigate generative adversarial networks (GANs) in the task of\nlearning new categories in a sequential fashion. We first show that sequential\nfine tuning renders the network unable to properly generate images from previous\ncategories (i.e. forgetting). 
Addressing this problem, we propose Memory Replay\nGANs (MeRGANs), a conditional GAN framework that integrates a memory replay\ngenerator. We study two methods to prevent forgetting by leveraging these replays,\nnamely joint training with replay and replay alignment. Qualitative and quantitative\nexperimental results on the MNIST, SVHN and LSUN datasets show that our memory\nreplay approach can generate competitive images while significantly mitigating the\nforgetting of previous categories.\n\n1 Introduction\n\nGenerative adversarial networks (GANs) [6] are a popular framework for image generation due to\ntheir capability to learn a mapping between a low-dimensional latent space and a complex distribution\nof interest, such as natural images. The approach is based on an adversarial game between a generator\nthat tries to generate good images and a discriminator that tries to discriminate between real training\nsamples and generated ones. The original framework has been improved with new architectures [21, 9]\nand more robust losses [2, 7, 16].\nGANs can be used to sample images by mapping a randomly sampled latent vector to an image. While this provides\ndiversity, there is little control over the semantic properties of what is being generated. Conditional\nGANs [18] enable the use of semantic conditions as inputs, so the semantic properties and the inherent\ndiversity can be decoupled. The simplest condition is just the category label, which allows controlling the\ncategory of the generated image [20].\nLike most machine learning problems, image generation models have been studied in the conventional\nsetting that assumes all training data is available at training time. This assumption can be unrealistic in\npractice, and modern neural networks face scenarios where tasks and data are not known in advance,\nrequiring them to continuously update their models upon the arrival of new data or new tasks. 
Unfortunately,\nneural networks suffer from severe degradation when they are updated in a sequential manner without\nrevisiting data from previous tasks (known as catastrophic forgetting [17]). Most strategies to prevent\nforgetting in neural networks rely on regularizing weights [4, 14] or activations [13], keeping a small\nset of exemplars from previous categories [22, 15], or memory replay mechanisms [23, 25, 10].\n\nThe code is available at https://github.com/WuChenshen/MeRGAN\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Baseline architectures: (a) joint training, (b) sequential fine tuning, (c) GAN with EWC [24].\n\nWhile previous works study forgetting in discriminative tasks, in this paper we focus on forgetting in\ngenerative models (GANs in particular) through the problem of generating images when categories\nare presented sequentially as disjoint tasks. The closest related work is [24], which adapts elastic weight\nconsolidation (EWC) [4] to GANs. In contrast, our method relies on memory replay, and we describe\ntwo approaches to prevent forgetting: joint retraining and replay alignment. The former includes\nreplayed samples in the training process, while the latter forces the replays of the\ncurrent generator to stay synchronized with those generated by an auxiliary generator (a snapshot taken before starting to\nlearn the new task). An advantage of studying forgetting in image generation is that the dynamics of\nforgetting and consolidation can be observed visually through the generated images themselves.\n\n2 Sequential learning in GANs\n\n2.1 Joint learning\n\nWe first introduce our conditional GAN framework in the non-sequential setting, where all categories\nare learned jointly. In particular, this first baseline is based on the AC-GAN framework [20] combined\nwith the WGAN-GP loss for robust training [7]. 
Using category labels as conditions, the task is to\nlearn from a training set S = {S1, ..., SM} to generate images given an image category c. Each set\nSc represents the training images for a particular category.\nThe framework consists of three components: generator, discriminator and classifier. The\ndiscriminator and classifier share all layers but the last ones (task-specific layers). The conditional generator is\nparametrized by \u03b8G and generates an image x\u0303 = G\u03b8G(z, c) given a latent vector z and a category c.\nIn our case the conditioning is implemented via conditional batch normalization [3], which dynamically\nswitches between sets of batch normalization parameters depending on c. Note that, in contrast to\nunconditional GANs, the latent vector is completely agnostic to the category, and the same latent\nvector can be used to generate images of different categories just by using a different c.\nSimilarly, the discriminator (parametrized by \u03b8D) tries to discern whether an input image x is real (i.e.\nfrom the training set) or generated, while the generator tries to fool it by generating more realistic\nimages. In addition, AC-GAN uses an auxiliary classifier C with parameters \u03b8C to predict the label\nc\u0303 = C\u03b8C(x), thus forcing the generator to generate images that can be classified in the same\nway as real images. This additional task improves the performance in the original task [20]. For\nconvenience we represent all the parameters in the conditional GAN as \u03b8 = (\u03b8G, \u03b8D, \u03b8C).\n\nDuring training, the network is trained to solve both the adversarial game (using the WGAN with\ngradient penalty loss [7]) and the classification task by alternating the optimization of the generator,\nand the discriminator and classifier. 
The generator optimizes the following problem:\n\n$$\\min_{\\theta^G} \\left( L^G_{GAN}(\\theta, S) + \\lambda_{CLS} L^G_{CLS}(\\theta, S) \\right) \\tag{1}$$\n\n$$L^G_{GAN}(\\theta, S) = -\\mathbb{E}_{z \\sim p_z, c \\sim p_c} \\left[ D_{\\theta^D}(G_{\\theta^G}(z, c)) \\right] \\tag{2}$$\n\n$$L^G_{CLS}(\\theta, S) = -\\mathbb{E}_{z \\sim p_z, c \\sim p_c} \\left[ y_c \\log C_{\\theta^C}(G_{\\theta^G}(z, c)) \\right] \\tag{3}$$\n\nwhere $L^G_{GAN}(\\theta, S)$ and $\\lambda_{CLS} L^G_{CLS}(\\theta, S)$ are the corresponding GAN and cross-entropy loss for\nclassification, respectively, S is the training set, $p_c = U\\{1, M\\}$, $p_z = N(0, 1)$ are the sampling\ndistributions (uniform and Gaussian, respectively), and $y_c$ is the one-hot encoding of c for computing\nthe cross-entropy. The GAN loss uses the WGAN formulation with gradient penalty. Similarly, the\noptimization problem in the discriminator and classifier is\n\n$$\\min_{\\theta^D, \\theta^C} \\left( L^D_{GAN}(\\theta, S) + \\lambda_{CLS} L^D_{CLS}(\\theta, S) \\right) \\tag{4}$$\n\n$$L^D_{GAN}(\\theta, S) = -\\mathbb{E}_{(x,c) \\sim S} \\left[ D_{\\theta^D}(x) \\right] + \\mathbb{E}_{z \\sim p_z, c \\sim p_c} \\left[ D_{\\theta^D}(G_{\\theta^G}(z, c)) \\right] + \\lambda_{GP} \\mathbb{E}_{x \\sim S, z \\sim p_z, c \\sim p_c, \\epsilon \\sim p_\\epsilon} \\left[ \\left( \\left\\| \\nabla D_{\\theta^D}(\\epsilon x + (1 - \\epsilon) G_{\\theta^G}(z, c)) \\right\\|_2 - 1 \\right)^2 \\right] \\tag{5}$$\n\n$$L^D_{CLS}(\\theta, S) = -\\mathbb{E}_{(x,c) \\sim S} \\left[ C_{\\theta^C}(G_{\\theta^G}(z, c)) \\right] \\tag{6}$$\n\nwhere $\\epsilon$ are parameters of the gradient penalty term, sampled as $p_\\epsilon = U(0, 1)$. The last term of\n$L^D_{GAN}$ is the gradient penalty.\n\n2.2 Sequential fine tuning\n\nNow we modify the previous framework to address the sequential learning scenario. We define a\nsequence of tasks T = (1, ..., M), each of them corresponding to learning to generate images from\na new training set St. For simplicity, we restrict each St to contain only images from a particular\ncategory c, i.e. 
t = c.\nThe joint training problem can be adapted easily to the sequential learning scenario as\n\n$$\\min_{\\theta^G_t} L^G_{GAN}(\\theta_t, S_t) \\tag{7}$$\n\n$$\\min_{\\theta^D_t} L^D_{GAN}(\\theta_t, S_t) \\tag{8}$$\n\nwhere $\\theta_t = (\\theta^G_t, \\theta^D_t)$ are the parameters during task t, which are initialized as $\\theta_t = \\theta_{t-1}$, i.e. the\ncurrent task t is learned immediately after finishing the previous task t \u2212 1. Note that there is no\nclassifier in this case since there is only data of the current category.\nUnfortunately, when the network adjusts its parameters via gradient descent to generate images of the new domain,\nthis drift away from the solution found for the previous task causes\ncatastrophic forgetting [17]. This has also been observed in GANs [24, 27] (shown later in Figures 3,\n5 and 7 in the experiments section).\n\n2.3 Preventing forgetting with Elastic Weight Consolidation\n\nCatastrophic forgetting can be alleviated using samples from previous tasks [22, 15] or different\ntypes of regularization that penalize large changes in parameters or activations [4, 13]. In\nparticular, the elastic weight consolidation (EWC) regularization [4] has been adapted to prevent\nforgetting in GANs [24] and included as an augmented objective when training the generator as\n\n$$\\min_{\\theta^G_t} L^G_{GAN}(\\theta_t, S_t) + \\frac{\\lambda_{EWC}}{2} \\sum_i F_{t-1,i} \\left( \\theta^G_{t,i} - \\theta^G_{t-1,i} \\right)^2 \\tag{9}$$\n\nwhere $F_{t-1,i}$ is the i-th diagonal entry of the Fisher information matrix, which roughly indicates how sensitive the parameter\n$\\theta^G_{t,i}$ is to forgetting, and $\\lambda_{EWC}$ is a hyperparameter. We will use this approach as a baseline.\n\n3 Memory replay generative adversarial networks\n\nRather than regularizing the parameters to prevent forgetting, we propose that the generator takes an\nactive role by replaying memories of previous tasks (via generative sampling), and using them during\nthe training of the current task to prevent forgetting. Our framework is extended with a replay generator,\nand we describe two different methods to leverage memory replays.\nThis replay mechanism (also known as pseudorehearsal [23]) resembles the role of the hippocampus\nin replaying memories during memory consolidation [5], and has been used to prevent forgetting in\nclassifiers [10, 25], but to our knowledge has not been used to prevent forgetting in image generation.\nNote also that image generation is a generative task and typically more complex than classification.\n\n3.1 Joint retraining with replayed samples\n\nOur first method to leverage memory replays creates an extended dataset $S'_t = S_t \\cup \\bigcup_{c \\in \\{1, ..., t-1\\}} \\tilde{S}_c$\nthat contains both real training data for the current task and memory replays from previous tasks. The\nreplay set $\\tilde{S}_c$ for a given category c typically contains a fixed number of replays $\\hat{x} = G_{\\theta^G_{t-1}}(z, c)$.\nOnce the extended dataset is created, the network is trained using joint training (see Fig. 2a) as\n\n$$\\min_{\\theta^G_t} \\left( L^G_{GAN}(\\theta_t, S'_t) + \\lambda_{CLS} L^G_{CLS}(\\theta_t, S'_t) \\right) \\tag{10}$$\n\n$$\\min_{\\theta^D_t} \\left( L^D_{GAN}(\\theta_t, S'_t) + \\lambda_{CLS} L^D_{CLS}(\\theta_t, S'_t) \\right) \\tag{11}$$\n\nThis method could be related to the deep generative replay in [25], where the authors use an\nunconditional GAN and the category is predicted with a classifier. 
In contrast, we use a conditional\nGAN where the category is an input, allowing us finer control of the replay process, with more\nreliable sampling of (x, c) pairs since we avoid potential classification errors and biased sampling\ntowards recent categories.\n\nFigure 2: Memory Replay GANs and mechanisms to prevent forgetting (for a given current task t): (a) joint retraining with replay, (b) replay alignment.\n\n3.2 Replay alignment\n\nWe can also take advantage of the fact that the current generator and the replay generator share\nthe same architecture, inputs and outputs. Their condition spaces (i.e. categories), and, critically,\ntheir latent spaces (i.e. latent vector z) and parameter spaces are also initially aligned, since the\ncurrent generator is initialized with the same parameters as the replay generator. Therefore, we\ncan synchronize the replay generator and the current one so that they generate the same image when given the same\ncategory c and latent vector z as inputs (see Fig. 2b). In these conditions, the generated images x\u0302 and\nx\u0303 should also be aligned pixelwise, so we can include a suitable pixelwise loss to prevent forgetting\n(we use the L2 loss).\nIn contrast to the previous method, in this case the discriminator is only trained with images of the\ncurrent task, and there is no classification task. The problem optimized by the generator includes a\nreplay alignment loss\n\n$$\\min_{\\theta^G_t} L^G_{GAN}(\\theta_t, S_t) + \\lambda_{RA} L_{RA}(\\theta_t, S_t) \\tag{12}$$\n\n$$L_{RA}(\\theta_t, S_t) = \\mathbb{E}_{x \\sim S, z \\sim p_z, c \\sim U\\{1, t-1\\}} \\left[ \\left\\| G_{\\theta^G_t}(z, c) - G_{\\theta^G_{t-1}}(z, c) \\right\\|^2 \\right] \\tag{13}$$\n\nNote that in this case both generators engage in memory replay for all previous tasks. 
The corresponding\nproblem in the discriminator is simply $\\min_{\\theta^D_t} L^D_{GAN}(\\theta_t, S_t)$.\n\nOur approach can be seen as aligned distillation, where distillation requires spatially aligned data.\nNote that in that way it could be related to the learning without forgetting approach [13] to prevent\nforgetting. However, we want to emphasize several subtle yet important differences:\n\nDifferent tasks and data. Our task is image generation where outputs have a spatial structure (i.e.\nimages), while in [13] the task is classification and the output is a vector of category probabilities.\n\nSpatial alignment. Image generation is a one-to-many task with many latent factors of variability\n(e.g. pose, location, color) that can result in completely different images that nevertheless share the same\ninput category. The latent vector z somewhat captures those factors and allows a unique solution\nfor a given (z, c). However, pixelwise comparison of the generated images requires that not only\nthe input but also the output representations are aligned, which is ensured in our case since at the\nbeginning of training both generators have the same parameters. Therefore we can use a pixelwise loss.\n\nTask-agnostic inputs and seen categories. In [13], images of the current classification task are\nused as inputs to extract output features for distillation. Note that this implicitly involves a domain\nshift, since a particular input image is always linked to an unseen category (by definition, in the\nsequential learning problem the network cannot be presented with images of previous tasks), and\ntherefore the outputs for the old task suffer from domain shift. In contrast, our approach does not\nsuffer from that problem since the inputs are not real data, but a category-agnostic latent vector z\nand a category label c. 
In addition, we only replay seen categories for both generators, i.e. 1 to\nt \u2212 1.\n\n4 Experimental results\n\nWe evaluated the proposed approaches on different datasets with different levels of complexity. The\narchitecture and settings are set accordingly. We use the TensorFlow [1] framework with the Adam\noptimizer [11], learning rate 1e-4, batch size 64 and fixed parameters for all experiments: \u03bbEWC =\n1e9, \u03bbRA = 1e-3 and \u03bbCLS = 1, except for \u03bbRA = 1e-2 on the SVHN dataset.\n\n4.1 Digit generation\n\nWe first consider the digit generation problem in two standard digit datasets. Learning to generate\na digit category is considered as a separate task. MNIST [12] consists of images of handwritten\ndigits which are resized to 32 \u00d7 32 pixels in our experiment. SVHN [19] contains cropped digits of\nhouse numbers from real-world street images. The generation task is more challenging since SVHN\ncontains much more variability than MNIST, with diverse backgrounds, variable illumination, font\ntypes, rotated digits, etc.\nThe architecture used in the experiments is based on the combination of AC-GAN and Wasserstein\nloss described in Section 2.1. We evaluated the two variants of the proposed memory replay GANs:\njoint training with replay (MeRGAN-JTR) and replay alignment (MeRGAN-RA). As upper and lower\nbounds we also evaluated joint training (JT) with all data (i.e. non-sequential) and sequential fine\ntuning (SFT). We also implemented two additional methods based on related works: the adaptation of\nEWC to conditional GANs proposed by [24], and the deep generative replay (DGR) module of [25],\nimplemented as an unconditional GAN followed by a classifier to predict the label. For experiments\nwith memory replay we generate one batch of replayed samples (including all tasks) for every batch\nof real data. We use a three-layer DCGAN [21] for both datasets. 
In order to compare the methods in\na more challenging setting, we keep the capacity of the network relatively limited for SVHN.\nFigure 3 compares the images generated by the different methods after sequentially training the\nten tasks. Since DGR is unconditional, the category for visualization is the one predicted by its\nclassifier. We observe that SFT completely forgets previous tasks in both datasets, while the other\nmethods show different degrees of forgetting. The four methods are able to generate MNIST digits\nproperly, although both MeRGANs show sharper ones. In the more challenging setting of SVHN\n(note that the JT baseline also struggles to generate realistic images), the digits generated by EWC\nare hardly recognizable, while DGR is more unpredictable, sometimes generating good images but\noften generating images with ambiguous digits. Those generated by MeRGANs are in general clear\nand more recognizable, but still show some degradation due to the limited capacity of the network.\nWe also trained a classifier with real data, using classification accuracy as a proxy to evaluate\nforgetting. The rationale is that in general bad quality images will confuse the classifier and\n\nFigure 3: Images generated for MNIST and SVHN after learning the ten tasks (panels, per dataset: SFT, EWC, DGR, MeRGAN-JTR, MeRGAN-RA, JT; rows c = 0 to c = 9). Rows are different\nconditions (i.e. 
categories), and columns are different latent vectors.\n\nTable 1: Average classification accuracy (%) in digit generation (ten sequential tasks).\n\n5 tasks (0-4)\nDataset | JT (baseline) | SFT (baseline) | EWC [24] | DGR [25] | MeRGAN-JTR | MeRGAN-RA\nMNIST | 97.66 | 19.87 | 70.62 | 90.39 | 97.93 | 98.19\nSVHN | 85.30 | 19.35 | 39.84 | 61.29 | 76.05 | 80.90\n\n10 tasks (0-9)\nDataset | JT (baseline) | SFT (baseline) | EWC [24] | DGR [25] | MeRGAN-JTR | MeRGAN-RA\nMNIST | 96.92 | 10.06 | 77.03 | 85.40 | 97.00 | 97.01\nSVHN | 84.82 | 10.10 | 33.02 | 47.28 | 66.50 | 66.78\n\nresult in lower classification rates. Table 1 shows the classification accuracy after the first five tasks\n(digits 0 to 4) and after the ten tasks. SFT forgets previous tasks so the accuracy is very low. As\nexpected, EWC performs worse than DGR since it does not leverage replays; however, it significantly\nmitigates the phenomenon of catastrophic forgetting by increasing the accuracy from 19.87 to 70.62\non MNIST, and from 19.35 to 39.84 on SVHN compared to SFT in the case of 5 tasks. The same\nconclusion can be drawn in the case of 10 tasks. By using the memory replay mechanism, MeRGANs\nobtain significant improvement compared to the baselines and the other related methods. In particular,\nour approach performs about 8% better on MNIST and about 21% better on SVHN compared to the\nstrong baseline DGR in the case of 5 tasks. Note that our approach achieves about 12% gain in the\ncase of 10 tasks, which shows that our approach is much more stable with an increasing number of tasks.\nIn the more challenging SVHN dataset, all methods decrease in terms of accuracy; however, MeRGANs\nare able to mitigate forgetting and obtain comparable results to JT.\nAnother interesting way to compare the different methods is through t-SNE visualizations. We use a\nclassifier trained with real digits to extract embeddings of the methods to compare. Fig. 
4a shows\nreal 0s from MNIST and generated 0s from the different methods after training 10 tasks (i.e. the first\ntask, and therefore the most difficult to remember). In contrast to SFT and EWC, the distributions of\n0s generated by MeRGANs greatly overlap with the distribution of real 0s (in red) and no isolated\nclusters of real samples are observed, which suggests that MeRGANs prevent forgetting better while\nkeeping diversity (at least in the t-SNE visualizations). Fig. 4b shows the t-SNE visualizations of real\n0s and 0s generated after learning tasks 0, 1, 3 and 9, with similar conclusions.\n\nFigure 4: t-SNE visualization of generated 0s (panels: SFT, EWC, MeRGAN-JTR, MeRGAN-RA): (a) after all tasks, (b) after tasks 0, 1, 3, 9. Real 0s correspond to red dots. Please view in\nelectronic format with zooming.\n\nTable 2: FID and average classification accuracy (%) on LSUN after the 4th task\n\nMetric | SFT | EWC | DGR | MeRGAN-JTR | MeRGAN-RA\nAcc. (%) | 15.02 | 14.28 | 15.40 | 79.19 | 81.03\nRev acc. (%) | 28.0 | 63.35 | 26.17 | 70.00 | 83.62\nFID | 110.12 | 178.05 | 93.70 | 49.69 | 37.73\n\n4.2 Scene generation\n\nWe also evaluated MeRGANs in a more challenging domain and on higher resolution images (64 \u00d7 64\npixels) using four scene categories of the LSUN dataset [28]. The experiment consists of a sequence\nof tasks, each one involving learning the generative distribution of a new category. The sequence\nof categories is bedroom, kitchen, church (outdoors) and tower, in this order. This sequence allows\nus to have two indoor and two outdoor categories, and transitions between relatively similar categories\n(bedroom to kitchen and church to tower) and also a transition between very different categories\n(kitchen to church). Each category is represented by a set of 100000 training images, and the network\nis trained for 20000 iterations for every task. 
The architectures are based on [7] with 18-layer\nResNet generator and discriminator, and for every batch of training data for the new category we\ngenerate a batch of replayed images per category.\nFigure 5 shows examples of generated images. Each block column corresponds to a different method,\nand inside, each row shows images generated for a particular condition (i.e. category) and each\ncolumn corresponds to images generated after learning a particular task, using the same latent vector.\nNote that we excluded DGR since the generation is not conditioned on the category. We can observe\nthat SFT completely forgets the previous task, and essentially ignores the category condition. EWC\ngenerates images that have characteristics of both new and previous tasks (e.g. bluish outdoor colors,\nindoor shapes), being unable to either successfully learn new tasks or remember previous ones. In\ncontrast, both variants of MeRGAN are able to generate competitive images of new categories while\nstill remembering to generate images of previous categories.\nIn addition to classification accuracy (using a VGG [26] trained over the ten categories in LSUN),\nfor this dataset we report two additional measurements. The first one is reverse accuracy, measured by\na classifier trained with generated data and evaluated with real data. The second one is the Fr\u00e9chet\nInception Distance (FID), which is widely used to evaluate the images generated by GANs. Note that\nFID is sensitive to both quality and diversity [8]. Table 2 shows these metrics after the four tasks are\nlearned. MeRGANs perform better in this more complex and challenging setting, where EWC and\nDGR are severely degraded.\nFigure 6 shows the evolution of these metrics during the whole training process, including transitions\nto new tasks (the curves have been smoothed for easier visualization). 
We can observe not only that\nsequential fine tuning forgets the task completely, but also that this happens early, during the first few\niterations. This also allows the network to exploit its full capacity to focus on the new task and learn\nit quickly. MeRGANs experience forgetting during the initial iterations of a new task but then tend to\nrecover during the training process. In this experiment MeRGAN-RA seems to be more stable and\nslightly more effective than MeRGAN-JTR.\n\nFigure 5: Images generated after sequentially learning each task (column within each block) for\ndifferent methods (block column: sequential fine tuning, EWC, MeRGAN-JTR, MeRGAN-RA), two different latent vectors z (block row) and different conditions\nc (row within each block: bedroom, kitchen, church, tower). The network learned after the first task is the same in all methods. Note\nthat fine tuning forgets previous tasks completely, while the proposed methods still remember them.\n\nFigure 6: Evolution of FID and classification accuracy (%). Best viewed in color.\n\nFigure 6 provides useful insight about the dynamics of learning and forgetting in sequential learning.\nThe evolution of generated images also provides complementary insight, as in the bedroom images\nshown in Figure 7, where we pay special attention to the first iterations. The transition from task 2\nto 3 (i.e. kitchen to church) is particularly revealing, since this new task requires the network to learn\nto generate many completely new visual patterns found in outdoor scenes. 
The clearest example is\nthe need to develop filters that can generate the blue sky regions, which are not found in the previous\nindoor categories seen during tasks 1 and 2. Since the network is not equipped with knowledge to\ngenerate the blue sky, the new task has to reuse and adapt previous ones, interfering with previous\ntasks and causing forgetting. This interference can be observed clearly in the first iterations of task 3,\nwhere the walls of bedroom (and kitchen) images turn blue (also related to the peak in forgetting\nobserved at the same iterations in Figure 6). MeRGANs provide mechanisms that penalize forgetting,\nforcing the network to develop separate filters for the different patterns (e.g. separate filters for\nwall and sky). MeRGAN-JTR seems to effectively decouple both patterns, since we do not observe\nthe same \"blue walls\" interference during task 4. Interestingly, the same interference seems to be\nmilder in MeRGAN-RA, but recurrent, since it appears again during task 4. Nevertheless, the\ninterference is still temporary and disappears after a few more iterations.\n\nFigure 7: Evolution of the generated images (category bedroom and two different values of z) during\nthe sequential learning process (rows), for SFT, MeRGAN-JTR and MeRGAN-RA across task 1 (bedroom), task 2 (kitchen), task 3 (church) and task 4 (tower), shown at iterations 1, 16, 256, 5000 and 20000 of each task. Sequential fine tuning forgets the previous task after just a\nfew iterations (iterations within each task are sampled in a logarithmic fashion). 
Note that fine tuning\nforgets previous tasks completely, while the MeRGANs still remember them.\n\nAnother interesting observation from Figures 5 and 7 is that MeRGAN-RA remembers the same\nbedroom (e.g. same point of view, colors, objects), which is related to the replay alignment mechanism\nthat enforces remembering the instance. On the other hand, MeRGAN-JTR remembers bedrooms in\ngeneral, as the generated image still resembles a bedroom but not exactly the same one as in previous\nsteps. This can be explained by the fact that the classifier and the joint training mechanism enforce\nthe not-forgetting constraint at the category level.\n\n5 Conclusions\n\nWe have studied the problem of sequential learning in the context of image generation with GANs,\nwhere the main challenge is to effectively address catastrophic forgetting. MeRGANs incorporate\nmemory replay as the main mechanism to prevent forgetting, which is then enforced through either\njoint training or replay alignment. Our results show their effectiveness in retaining the ability to\ngenerate competitive images of previous tasks even after learning several new ones. In addition to\nthe application in pure image generation, we believe that MeRGANs, and generative models robust to\nforgetting in general, could have important applications in many other tasks. We also showed that\nimage generation provides an interesting way to visualize the interference between tasks and potential\nforgetting by directly observing generated images.\n\nAcknowledgements\n\nC. Wu, X. Liu, and Y. Wang acknowledge the Chinese Scholarship Council (CSC) grants\nNo.201709110103, No.201506290018 and No.201507040048. Luis Herranz acknowledges the\nEuropean Union research and innovation program under the Marie Sk\u0142odowska-Curie grant\nagreement No. 6655919. 
This work was supported by TIN2016-79717-R, and the CHISTERA project\nM2CR (PCIN-2015-251) of the Spanish Ministry, the ACCIO agency and CERCA Programme\n/ Generalitat de Catalunya, and the EU Project CybSpeed MSCA-RISE-2017-777720. We also\nacknowledge the generous GPU support from NVIDIA.\n\nReferences\n\n[1] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.\n\n[2] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.\n\n[3] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017.\n\n[4] James Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13):3521\u20133526, 2017.\n\n[5] Steffen Gais et al. Sleep transforms the cerebral trace of declarative memories. Proceedings of the National Academy of Sciences, 104(47):18778\u201318783, 2007.\n\n[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.\n\n[7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NIPS, 2017.\n\n[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, G\u00fcnter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In NIPS, 2017.\n\n[9] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.\n\n[10] Ronald Kemker and Christopher Kanan. 
FearNet: Brain-inspired model for incremental learning. In ICLR, 2018.\n\n[11] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\n[12] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.\n\n[13] Zhizhong Li and Derek Hoiem. Learning without forgetting. In ECCV, 2016.\n\n[14] Xialei Liu, Marc Masana, Luis Herranz, Joost van de Weijer, Antonio M Lopez, and Andrew D Bagdanov. Rotate your networks: Better weight consolidation and less catastrophic forgetting. In ICPR, 2018.\n\n[15] David Lopez-Paz et al. Gradient episodic memory for continual learning. In NIPS, 2017.\n\n[16] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017.\n\n[17] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The psychology of learning and motivation, 24:109\u2013165, 1989.\n\n[18] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.\n\n[19] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.\n\n[20] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.\n\n[21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.\n\n[22] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017.\n\n[23] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. 
Connection Science, 7(2):123\u2013146, 1995.\n\n[24] Ari Seff, Alex Beatson, Daniel Suo, and Han Liu. Continual learning in generative adversarial nets. arXiv preprint arXiv:1705.08395v1, 2017.\n\n[25] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In NIPS, 2017.\n\n[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n\n[27] Yaxing Wang, Chenshen Wu, Luis Herranz, Joost van de Weijer, Abel Gonzalez-Garcia, and Bogdan Raducanu. Transferring GANs: generating images from limited data. In ECCV, 2018.\n\n[28] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.\n", "award": [], "sourceid": 2901, "authors": [{"given_name": "Chenshen", "family_name": "Wu", "institution": "Computer Vision Center"}, {"given_name": "Luis", "family_name": "Herranz", "institution": "Computer Vision Center"}, {"given_name": "Xialei", "family_name": "Liu", "institution": "Computer Vision Center"}, {"given_name": "yaxing", "family_name": "wang", "institution": "Centre de Visi\u00f3 per Computador (CVC)"}, {"given_name": "Joost", "family_name": "van de Weijer", "institution": "Computer Vision Center Barcelona"}, {"given_name": "Bogdan", "family_name": "Raducanu", "institution": "Computer Vision Center"}]}