{"title": "Online Continual Learning with Maximal Interfered Retrieval", "book": "Advances in Neural Information Processing Systems", "page_first": 11849, "page_last": 11860, "abstract": "Continual learning, the setting where a learning agent is faced with a never-ending stream of data, continues to be a great challenge for modern machine learning systems. In particular the online or \"single-pass through the data\" setting has gained attention recently as a natural setting that is difficult to tackle. Methods based on replay, either generative or from a stored memory, have been shown to be effective approaches for continual learning, matching or exceeding the state of the art in a number of standard benchmarks. These approaches typically rely on randomly selecting samples from the replay memory or from a generative model, which is suboptimal. In this work, we consider a controlled sampling of memories for replay. We retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameters update. We show a formulation for this sampling criterion in both the generative replay and the experience replay setting, producing consistent gains in performance and greatly reduced forgetting. 
We release an implementation of our method at https://github.com/optimass/Maximally_Interfered_Retrieval", "full_text": "Online Continual Learning with Maximally Interfered Retrieval

Rahaf Aljundi* (KU Leuven) rahaf.aljundi@gmail.com
Lucas Caccia* (Mila) lucas.page-caccia@mail.mcgill.ca
Eugene Belilovsky* (Mila) eugene.belilovsky@umontreal.ca
Massimo Caccia* (Mila) massimo.p.caccia@gmail.com
Min Lin (Mila) mavenlin@gmail.com
Laurent Charlin (Mila) lcharlin@gmail.com
Tinne Tuytelaars (KU Leuven) tinne.tuytelaars@esat.kuleuven.be

Abstract

Continual learning, the setting where a learning agent is faced with a never-ending stream of data, continues to be a great challenge for modern machine learning systems. In particular, the online or \"single-pass through the data\" setting has gained attention recently as a natural setting that is difficult to tackle. Methods based on replay, either generative or from a stored memory, have been shown to be effective approaches for continual learning, matching or exceeding the state of the art in a number of standard benchmarks. These approaches typically rely on randomly selecting samples from the replay memory or from a generative model, which is suboptimal. In this work we consider a controlled sampling of memories for replay. We retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameter update. We show a formulation for this sampling criterion in both the generative replay and the experience replay settings, producing consistent gains in performance and greatly reduced forgetting. We release an implementation of our method at https://github.com/optimass/Maximally_Interfered_Retrieval.

1 Introduction

Artificial neural networks have exceeded human-level performance in accomplishing individual narrow tasks [19]. 
However, such success remains limited compared to human intelligence, which can continually learn and perform an unlimited number of tasks. Humans' ability to learn and accumulate knowledge over their lifetime has been challenging to replicate in modern machine learning algorithms, and particularly in neural networks. In that perspective, continual learning aims for a higher level of machine intelligence by providing artificial agents with the ability to learn online from a non-stationary and never-ending stream of data. A key component of such a never-ending learning process is overcoming the catastrophic forgetting of previously seen data, a problem that neural networks are well known to suffer from [13].

*Authors contributed equally

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The solutions developed so far often relax the problem of continual learning to the easier task-incremental setting, where the data stream can be divided into tasks with clear boundaries and each task is learned offline. One task here could be recognizing handwritten digits, while another could be recognizing different types of vehicles (see [24] for an example).

Existing approaches can be categorized into three major families based on how the information regarding previous task data is stored and used to mitigate forgetting and potentially support the learning of new tasks. These include replay-based methods [8, 30], which store prior samples; dynamic architectures [35, 39], which add and remove components; and prior-focused methods [18, 41, 9, 7], which rely on regularization.

In this work, we consider an online continual setting where a stream of samples is seen only once and is non-i.i.d. This is a much harder and more realistic setting than the milder incremental-task assumption [4] and can be encountered in practice, e.g. in social media applications. 
We focus on the replay-based approach [26, 36], which has been shown to be successful in the online continual learning setting compared to other approaches [26]. In this family of methods, previous knowledge is stored either directly in a replay buffer or compressed in a generative model. When learning from new data, old examples are reproduced from the replay buffer or the generative model.

In this work, assuming a replay buffer or a generative model, we direct our attention towards answering the question of which samples should be replayed from the previous history when new samples are received. We opt for retrieving samples that suffer from an increase in loss given the estimated parameter update of the model. This approach also takes some motivation from neuroscience, where replay of previous memories is hypothesized to be present in the mammalian brain [27, 34], but likely not random. For example, it is hypothesized in [15, 25] that similar mechanisms might occur to accommodate recent events while preserving old memories.

We denote our approach Maximally Interfered Retrieval (MIR) and propose variants using stored memories and generative models. The rest of the text is organized as follows: we discuss closely related work in Sec. 2; we then present our approach based on a replay buffer or a generative model in Sec. 3 and show the effectiveness of our approach compared to random sampling and strong baselines in Sec. 4.

2 Related work

The major challenge of continual learning is the catastrophic forgetting of previous knowledge once new knowledge is acquired [12, 32], which is closely related to the stability/plasticity dilemma [14] present in both biological and artificial neural networks. 
While these problems have been studied in early research works [10, 11, 20, 21, 37], they have received increased attention since the revival of neural networks.

Several families of methods have been developed to prevent or mitigate the catastrophic forgetting phenomenon. Under the fixed-architecture setting, one can identify two main streams of work: i) methods that rely on replaying samples, or virtual (generated) samples, from the previous history while learning new ones, and ii) methods that encode the knowledge of the previous tasks in a prior that is used to regularize the training of the new task [17, 40, 1, 28]. While the prior-focused family can be effective in the task-incremental setting with a small number of disjoint tasks, it often shows poor performance when tasks are similar and models are faced with long sequences, as shown in Farquhar and Gal [9].

Replayed samples from the previous history can be used either to constrain the parameter update based on the new sample, so as to stay in the feasible region of the previous ones [26, 6, 5], or for rehearsal [30, 33]. Here, we consider a rehearsal approach on samples replayed from the previous history, as it is a cheaper and effective alternative to the constrained-optimization approach [8, 5]. Rehearsal methods usually replay random samples from a buffer, or pseudo-samples from a generative model trained on the previous data, as in Shin et al. [36]. These works showed promising results in the offline incremental-task setting and have recently been extended to the online setting [8, 5], where a sequence of tasks forming a non-i.i.d. stream of training data is considered, with one or a few samples at a time. 
However, in the online setting and given a limited computational budget, one cannot replay all buffer samples each time, and it becomes crucial to select the best candidates to be replayed. Here, we propose a strategy that improves on random sampling, improving the learning behaviour and reducing interference.

Figure 1: High-level illustration of a standard rehearsal method (left), such as generative replay or experience replay, which selects samples randomly. This is contrasted with selecting samples based on interference with the estimated update (right).

Continual learning has also been studied recently for the case of learning generative models [29, 22]. Riemer et al. [31] used an autoencoder to store compressed representations instead of raw samples. In this work we leverage this line of research and consider, for the first time, generative modeling in the online continual learning setting.

3 Methods

We consider a (potentially infinite) stream of data where at each time step t the system receives a new set of samples X_t, Y_t drawn non-i.i.d. from a current distribution D_t that could itself experience sudden changes, corresponding to a task switch from D_t to D_{t+1}.

We aim to learn a classifier f parameterized by θ that minimizes a predefined loss L on new sample(s) from the data stream without interfering with, i.e. increasing the loss on, previously observed samples. One way to encourage this is by performing updates on old samples from a stored history, or from a generative model trained on the previous data. The principal idea of our proposal is that instead of using randomly selected or generated samples from the previous history [6, 36], we find samples that would be (maximally) interfered with by the new incoming sample(s), had they been learned in isolation (Figure 1). 
This is motivated by the observation that the loss of some previous samples may be unaffected or even improved, so retraining on them is wasteful. We formulate this first in the context of a small storage of past samples and subsequently using a latent-variable generative model.

3.1 Maximally Interfered Sampling from a Replay Memory

We first instantiate our method in the context of experience replay (ER), a recent and successful rehearsal method [8], which stores a small subset of previous samples and uses them to augment the incoming data. In this approach the learner is allocated a memory M of finite size, which is updated by reservoir sampling [3, 8] as the stream of samples arrives. Typically, samples are drawn randomly from memory and concatenated with the incoming batch.

Given a standard objective min_θ L(f_θ(X_t), Y_t), when receiving sample(s) X_t we estimate the would-be parameter update from the incoming batch as θ^v = θ − α∇L(f_θ(X_t), Y_t), with learning rate α. We can now search for the top-k values x ∈ M using the criterion s_MI-1(x) = l(f_{θ^v}(x), y) − l(f_θ(x), y), where l is the sample loss. We may also augment the memory to additionally store the best l(f_θ(x), y) observed so far for each sample, denoted l(f_{θ*}(x), y). We can then instead evaluate s_MI-2(x) = l(f_{θ^v}(x), y) − min(l(f_θ(x), y), l(f_{θ*}(x), y)). We consider both versions of this criterion in the sequel.

We denote the budget of samples to retrieve B. 
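The virtual update and the s_MI-1 scoring above can be sketched concretely; the following is a minimal numpy illustration for a linear softmax classifier (the model, shapes, and learning rate are assumptions for exposition, not the paper's implementation):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def per_sample_loss(W, X, y):
    # cross-entropy of a linear softmax classifier, one value per sample
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y] + 1e-12)

def mean_grad(W, X, y):
    # gradient of the mean cross-entropy with respect to W
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return X.T @ p / len(y)

def mir_scores(W, X_in, y_in, X_mem, y_mem, lr=0.1):
    # virtual update: theta_v = theta - alpha * grad L(f_theta(X_t), Y_t)
    W_v = W - lr * mean_grad(W, X_in, y_in)
    # s_MI-1(x) = l(f_theta_v(x), y) - l(f_theta(x), y), per memory sample
    return per_sample_loss(W_v, X_mem, y_mem) - per_sample_loss(W, X_mem, y_mem)

def retrieve_top_k(scores, k):
    # indices of the k most interfered memory samples
    return np.argsort(-scores)[:k]
```

Memory samples whose loss increases most under the virtual update receive the highest scores and are the ones replayed.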
To encourage diversity we apply a simple strategy of performing an initial random sampling of the memory, selecting C samples with C > B, before applying the search criterion. This also reduces the compute cost of the search. The ER algorithm with MIR is shown in Algorithm 1. We note that for the case of s_MI-2 the loss of the C selected samples at line 7 is tracked and stored as well.

3.2 Maximally Interfered Sampling from a Generative Model

We now consider the case of replay from a generative model. Assume a function f parameterized by θ (e.g. a classifier) and an encoder q_φ and decoder g_γ parameterized by φ and γ, respectively. We can compute the would-be parameter update θ^v as in the previous section. We want to find, in the given feature space, data points that maximize the difference between their loss before and after the estimated parameter update:

$$\max_{Z}\; L\big(f_{\theta^{v}}(g_{\gamma}(Z)),\,Y^{*}\big) - L\big(f_{\theta'}(g_{\gamma}(Z)),\,Y^{*}\big) \quad \text{s.t.}\quad \lVert z_i - z_j\rVert_2^2 > \epsilon \;\;\forall\, z_i, z_j \in Z,\; z_i \neq z_j \qquad (1)$$

with Z ∈ R^{B×K}, K the feature space dimension, and ε a threshold to encourage the diversity of the retrieved points. Here θ' can correspond to the current model parameters or to a historical model as in Shin et al. [36]. Furthermore, y* denotes the true label, i.e. the one given to the generated sample by the real data distribution; we will explain how to approximate this value shortly. We convert the constraint into a regularizer and optimize Equation 1 with stochastic gradient descent, denoting the strength of the diversity term as λ. 
From these points we reconstruct the full corresponding input samples X' = g_γ(Z) and use them for the parameter update min_θ L(f_θ(X_t ∪ X')).

Using the encoder encourages a better representation of the input samples, where similar samples lie close together. Our intuition is that the most interfered samples share features with the new one(s) but have different labels. For example, in handwritten digit recognition, the digit 9 might be written similarly to some examples from digits {4, 7}, hence learning 9 alone may result in confusing similar 4(s) and 7(s) with 9 (Fig. 2). The retrieval is initialized with Z ∼ q_φ(X_t) and limited to a few gradient updates, limiting its footprint.

To estimate the loss in Eq. 1 we also need an estimate of y*, i.e. the label, when using a generator. A straightforward approach is based on the generative replay idea [36] of storing the predictions of a prior model. We thus suggest using the predicted labels given by f_θ' as pseudo-labels to estimate y*. Denoting y_pre = f_θ'(g_γ(z)) and ŷ = f_{θ^v}(g_γ(z)), we compute the KL divergence, D_KL(y_pre ‖ ŷ), as a proxy for the interference.

Generative models such as VAEs [16] are known to generate blurry images and images with a mix of categories. To avoid such a source of noise in the optimization, we minimize an entropy penalty to encourage generating points for which the previous model is confident. The final objective of the generator-based retrieval is

Figure 2: Most interfered retrieval from a VAE on MNIST. The top row shows incoming data from the final task (8 v 9). The next rows show the samples causing most interference for the classifier (Eq.
1)

$$\max_{Z}\; \sum_{z \in Z}\big[\, D_{KL}(y_{pre}\,\|\,\hat{y}) - \alpha H(y_{pre})\,\big] \quad \text{s.t.}\quad \lVert z_i - z_j\rVert_2^2 > \epsilon \;\;\forall\, z_i, z_j \in Z,\; z_i \neq z_j, \qquad (2)$$

with the entropy H and a hyperparameter α weighting the contribution of each term.

So far we have assumed a perfect encoder/decoder with which to retrieve the interfered samples from the previous history for the function being learned. Since we assume an online continual learning setting, we need to address learning the encoder/decoder continually as well.

We could use a variational autoencoder (VAE) with p_γ(X | z) = N(X | g_γ(z), σ²I), with mean g_γ(z) and covariance σ²I.

As for the classifier, we can also update the VAE based on the incoming samples and the replayed samples. In Eq. 1 we only retrieve samples that are going to be interfered with given the classifier update, assuming a good feature representation. We can also use the same strategy to mitigate catastrophic forgetting in the generator, by retrieving the most interfered samples given an estimated update of both its parameter sets (φ, γ). In this case, the interference is with respect to the VAE's loss, the evidence lower bound (ELBO). Let γ^v, φ^v denote the virtual updates for the encoder and decoder given the incoming batch. We consider the following criterion for retrieving samples for the generator:

$$\max_{Z_{gen}}\; \mathbb{E}_{z \sim q_{\phi^{v}}}\big[-\log p_{\gamma^{v}}(g_{\gamma^{v}}(Z_{gen})\,|\,z)\big] - \mathbb{E}_{z \sim q_{\phi'}}\big[-\log p_{\gamma'}(g_{\gamma'}(Z_{gen})\,|\,z)\big] + D_{KL}\big(q_{\phi^{v}}(z\,|\,g_{\gamma^{v}}(Z_{gen}))\,\|\,p(z)\big) - D_{KL}\big(q_{\phi'}(z\,|\,g_{\gamma'}(Z_{gen}))\,\|\,p(z)\big)$$
$$\text{s.t.}\quad \lVert z_i - z_j\rVert_2^2 > \epsilon \;\;\forall\, z_i, z_j \in Z_{gen},\; z_i \neq z_j \qquad (3)$$

Here (φ', γ') can be the current VAE parameters or those stored at the end of the previous task. Similar to Z, Z_gen is initialized with Z_gen ∼ q_φ(X_t) and limited to a few gradient updates. 
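The per-point retrieval score inside Eq. 2, D_KL(y_pre ‖ ŷ) − αH(y_pre), can be computed directly from the two classifiers' softmax outputs. A minimal sketch (the probability vectors and the α value below are illustrative placeholders):

```python
import numpy as np

def interference_score(p_pre, p_virtual, alpha=0.1, eps=1e-12):
    """D_KL(y_pre || y_hat) - alpha * H(y_pre) for one generated point.

    p_pre:     class probabilities under the previous classifier f_theta'
    p_virtual: class probabilities under the virtually updated classifier f_theta_v
    """
    p_pre = np.asarray(p_pre, dtype=float)
    p_virtual = np.asarray(p_virtual, dtype=float)
    # KL divergence between the old prediction and the would-be new one
    kl = np.sum(p_pre * np.log((p_pre + eps) / (p_virtual + eps)))
    # entropy penalty: prefer points the previous model is confident about
    entropy = -np.sum(p_pre * np.log(p_pre + eps))
    return kl - alpha * entropy
```

A point whose predicted label distribution shifts strongly under the virtual update, and about which the previous model was confident, receives a high score and is retrieved.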
A complete view of the MIR-based generative replay is shown in Algorithm 2.

3.3 A Hybrid Approach

Training generative models in the continual learning setting on more challenging datasets like CIFAR-10 remains an open research problem [23]. Storing samples for replay is also problematic, as it is constrained by storage costs, and very large memories can become difficult to search. To leverage the benefits of both worlds while avoiding the complications of noisy generation, similar to Riemer et al. [31] we use a hybrid approach where an autoencoder is first trained offline to store and compress incoming memories. Differently from their approach, we perform the MIR search in the latent space of the autoencoder using Eq. 1. We then select nearest neighbors from the stored compressed memories to ensure realistic samples. Our strategy has several benefits: by storing lightweight representations, the buffer can hold more data for the same fixed amount of memory. Moreover, the feature space in which encoded samples lie is fully differentiable, which enables the use of gradient methods to search for the most interfered samples. We note that this is not the case for the discrete autoencoder proposed in [31]. Finally, the autoencoder, with its simpler objective, is easier to train in the online setting than a variational autoencoder. 
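The nearest-neighbor step of the hybrid approach can be sketched as follows; this is an illustrative stand-in (buffer contents and shapes are hypothetical), assuming the gradient search in latent space has already produced candidate points:

```python
import numpy as np

def snap_to_stored(z_query, z_buffer):
    """Map each latent point found by the MIR search to its nearest
    stored compressed memory, ensuring replayed samples are realistic.

    z_query:  (B, K) latent points produced by the gradient search
    z_buffer: (N, K) compressed memories stored in the buffer
    Returns the snapped (B, K) latents and their buffer indices.
    """
    # squared Euclidean distance between every query and every stored latent
    d = ((z_query[:, None, :] - z_buffer[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return z_buffer[idx], idx
```

The snapped latents are then decoded (or their cached reconstructions used) for rehearsal, so the classifier only ever replays points that correspond to actual stored memories.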
The method is summarized in Algorithm 3 in the Appendix.

Algorithm 1: Experience MIR (ER-MIR)
Input: learning rate α; subset size C; budget B
1:  Initialize: Memory M; θ
2:  for t ∈ 1..T do
3:    for B_n ∼ D_t do
4:      %% Virtual update
5:      θ^v ← SGD(B_n, α)
6:      % Select C samples
7:      B_C ∼ M
8:      % Select based on score
9:      S ← sort(s_MI(B_C))
10:     B_MC ← {S_i}_{i=1}^B
11:     θ ← SGD(B_n ∪ B_MC, α)
12:     % Add samples to memory
13:     M ← UpdateMemory(B_n)
14:   end
15: end

Algorithm 2: Generative MIR (GEN-MIR)
Input: learning rate α
1:  Initialize: Memory M; θ, φ, γ
2:  for t ∈ 1..T do
3:    θ', φ', γ' ← θ, φ, γ
4:    for B_n ∼ D_t do
5:      % Virtual update
6:      θ^v ← SGD(B_n, α)
7:      B_C ← retrieve samples as per Eq. (2)
8:      B_G ← retrieve samples as per Eq. (3)
9:      % Update classifier
10:     θ ← SGD(B_n ∪ B_C, α)
11:     % Update generative model
12:     φ, γ ← SGD(B_n ∪ B_G, α)
13:   end
14: end

4 Experiments

We now evaluate the proposed method under the generative and experience replay settings. We use the following standard datasets and the shared-classifier setting described below.

• MNIST Split splits the MNIST data to create 5 different tasks with non-overlapping classes. We consider the setting with 1000 samples per task as in [2, 26].

• Permuted MNIST permutes the MNIST pixels to create 10 different tasks. We consider the setting with 1000 samples per task as in [2, 26].

• CIFAR-10 Split splits the CIFAR-10 dataset into 5 disjoint tasks as in Aljundi et al. 
[3]. However, we use a more challenging setting, with all 9,750 samples per task and 250 retained for validation.

• MiniImagenet Split splits the MiniImagenet [38] dataset into 20 disjoint tasks with 5 classes each, as in Chaudhry et al. [8].

In our evaluations we focus on comparing MIR to random sampling in the experience replay (ER) [3, 8] and generative replay [36, 22] approaches, which our method directly modifies. We also consider the following reference baselines:

• fine-tuning trains continuously upon arrival of new tasks without any forgetting-avoidance strategy.

• iid online (upper bound) trains the model with a single pass through the data on the same set of samples, but sampled iid.

• iid offline (upper bound) evaluates the model using multiple passes through the data, sampled iid. We use 5 epochs in all the experiments for this baseline.

• GEM [26] is another method that relies on storing samples and has been shown to be a strong baseline in the online setting. It gives similar results to the recent A-GEM [6].

We do not consider prior-based baselines such as Kirkpatrick et al. [18] as they have been shown to work poorly in the online setting compared to GEM and ER [8, 26]. For evaluation we primarily use accuracy as well as forgetting [8].

Shared Classifier. A common setting for continual learning applies a separate classifier for each task. This does not cover some of the potentially more interesting continual learning scenarios, where task metadata is not available at inference time and the model must decide which classes correspond to the input from among all possible outputs. As in Aljundi et al. [3], we adopt a shared-classifier setup for our experiments, where the model can potentially predict all classes from all tasks. 
This sort of setup is more challenging, yet applies to many realistic scenarios.

Multiple Updates for Incoming Samples. In the one-pass-through-the-data continual learning setup, previous work has been largely restricted to performing only a single gradient update on incoming samples. However, as in [3], we argue this is not a necessary constraint, as the prescribed scenario should permit maximally using the current sample. In particular for replay methods, performing additional gradient updates with additional replay samples can improve performance. In the sequel we will refer to this as performing more iterations.

Comparisons to Reported Results. Note that comparing reported results in continual learning requires great diligence because of the plethora of experimental settings. We remind the reader that our setting, i.e. shared classifier, online, and (in some cases) a lower amount of training data, is more challenging than many of the other reported continual learning settings.

4.1 Experience Replay

Here we evaluate experience replay with MIR, comparing it to vanilla experience replay [8, 3] in a number of shared-classifier settings. In all cases we use a single update for each incoming batch; multiple iterations/updates are evaluated in a final ablation study. We restrict ourselves to the use of reservoir sampling for deciding which samples to store. We first evaluate using MNIST Split and Permuted MNIST (Table 1). We use the same learning rate, 0.05, used in Aljundi et al. [3]. The number of samples drawn from the replay buffer is always fixed to the same amount as the incoming samples, 10, as in [8]. For MIR we select by validation C = 50 and the s_MI-2 criterion for both MNIST datasets. ER-MIR performs well and improves over (standard) ER in both accuracy and forgetting. 
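Reservoir sampling, used above to decide which incoming samples enter the fixed-size buffer, can be sketched as follows (a standard textbook routine, not the authors' exact code):

```python
import random

def reservoir_update(memory, capacity, item, n_seen):
    """Insert `item` into `memory` so that, after n_seen + 1 stream items,
    each item has probability capacity / (n_seen + 1) of being stored.

    n_seen: number of stream items observed before this one.
    """
    if len(memory) < capacity:
        # buffer not yet full: always store
        memory.append(item)
    else:
        # otherwise replace a random slot with the right probability
        j = random.randint(0, n_seen)  # uniform over 0..n_seen inclusive
        if j < capacity:
            memory[j] = item
```

Each arriving sample is processed exactly once, so the buffer stays a uniform sample of the stream without ever revisiting past data.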
We also show the accuracy on seen tasks after each task sequence is completed in Figure 7.

We now consider the more complex setting of CIFAR-10 and use a larger number of samples than in prior work [3]. We study the performance for different memory sizes (Table 2). For MIR we select by validation, at M = 50, C = 50 and the s_MI-1 criterion. We observe that the performance gap increases when more memories are used. We find that the GEM method does not perform well in this setting.

MNIST Split:
Method         Accuracy ↑     Forgetting ↓
iid online     86.8 ± 1.1     N/A
iid offline    92.3 ± 0.5     N/A
fine-tuning    19.0 ± 0.2     97.8 ± 0.2
GEN            79.3 ± 0.6     19.5 ± 0.8
GEN-MIR        82.1 ± 0.3     17.0 ± 0.4
GEM [26]       86.3 ± 1.4     11.2 ± 1.2
ER             82.1 ± 1.5     15.0 ± 2.1
ER-MIR         87.6 ± 0.7     7.0 ± 0.9

Permuted MNIST:
Method         Accuracy ↑     Forgetting ↓
iid online     73.8 ± 1.2     N/A
iid offline    86.6 ± 0.5     N/A
fine-tuning    64.6 ± 1.7     15.2 ± 1.9
GEN            79.7 ± 0.1     5.8 ± 0.2
GEN-MIR        80.4 ± 0.2     4.8 ± 0.2
GEM [26]       78.8 ± 0.4     3.1 ± 0.5
ER             78.9 ± 0.6     3.8 ± 0.6
ER-MIR         80.1 ± 0.4     3.9 ± 0.3

Table 1: Results for MNIST Split (top) and Permuted MNIST (bottom). We report the Average Accuracy (higher is better) and Average Forgetting (lower is better) after the final task. We split results into privileged baselines, methods that don't use memory storage, and those that store memories. For the ER methods, 50 memories per class are allowed. Each approach is run 20 times.

Accuracy ↑:
Method               M = 20        M = 50        M = 100
iid online           60.8 ± 1.0    60.8 ± 1.0    60.8 ± 1.0
iid offline          79.2 ± 0.4    79.2 ± 0.4    79.2 ± 0.4
GEM [26]             16.8 ± 1.1    17.1 ± 1.0    17.5 ± 1.6
iCarl (5 iter) [30]  28.6 ± 1.2    33.7 ± 1.6    32.4 ± 2.1
fine-tuning          18.4 ± 0.3    18.4 ± 0.3    18.4 ± 0.3
ER                   27.5 ± 1.2    33.1 ± 1.7    41.3 ± 1.9
ER-MIR               29.8 ± 1.1    40.0 ± 1.1    47.6 ± 1.1

Forgetting ↓:
Method               M = 20        M = 50        M = 100
iid online           N/A           N/A           N/A
iid offline          N/A           N/A           N/A
GEM [26]             73.5 ± 1.7    70.7 ± 4.5    71.7 ± 1.3
iCarl (5 iter) [30]  49 ± 2.4      40.6 ± 1.1    40 ± 1.8
fine-tuning          85.4 ± 0.7    85.4 ± 0.7    85.4 ± 0.7
ER                   50.5 ± 2.4    35.4 ± 2.0    23.3 ± 2.9
ER-MIR               50.2 ± 2.0    30.2 ± 2.3    17.4 ± 2.1

Table 2: CIFAR-10 results with M memories per class. We report (a) Accuracy and (b) Forgetting (lower is better). For larger memory sizes ER-MIR has better accuracy and an improved forgetting metric. Each approach is run 15 times.

We also consider another baseline, iCarl [30]. Here we boost the iCarl method by permitting it to perform 5 iterations for each incoming sample to maximize its performance. Even in this setting it is only able to match the experience replay baseline, and it is outperformed by ER-MIR for larger buffers.

Number of iterations    1             5
iid online              60.8 ± 1.0    62.0 ± 0.9
ER                      41.3 ± 1.9    42.4 ± 1.1
ER-MIR                  47.6 ± 1.1    49.3 ± 0.1

Table 3: CIFAR-10 accuracy (↑) results for increased iterations and 100 memories per class. Each approach is run 15 times.

Method     Accuracy ↑    Forgetting ↓
ER         24.7 ± 0.7    23.5 ± 1.0
ER-MIR     25.2 ± 0.6    18.0 ± 0.8

Table 4: MiniImagenet results. 
100 memories per class and 3 updates per incoming batch; accuracy is slightly better and forgetting is greatly improved. Each approach is run 15 times.

Increased iterations. We evaluate the use of additional iterations on incoming batches by comparing the 1-iteration results above to running 5 iterations. Results are shown in Table 3. We use ER, and at each iteration we either re-sample randomly or use the MIR criterion. We observe that increasing the number of updates for an incoming sample can improve results for both methods.

Longer Task Sequences. We want to test how our strategy performs on longer sequences of tasks. For this we consider the 20-task sequence of MiniImagenet Split. Note that this dataset is very challenging in our setting, given the shared classifier and the online training. Naive experience replay with 100 memories per class obtains only 17% accuracy at the end of the task sequence. To overcome this difficulty, we allow more iterations per incoming batch. Table 4 compares ER and ER-MIR accuracy and forgetting at the end of the sequence. Our strategy continues to outperform; in particular, we achieve an over 5% decrease in forgetting.

(a) Generation with the best VAE baseline. Complications arising from both properties leave the VAE generating blurry and/or fading digits.

(b) Most interfered samples while learning the last task (8 vs 9). The top row is the incoming batch. Rows 2 and 3 show the most interfered samples for the classifier, rows 4 and 5 for the VAE. We observe that retrieved samples look similar but belong to different categories.

Figure 3: Online and low-data-regime MNIST Split generation. Qualitatively speaking, the most interfered samples are superior to the baseline's.

4.2 Generative Replay

We now study the effect of our proposed retrieval mechanism in the generative replay setting (Alg. 
2). Recall that online continual generative modeling is particularly challenging and, to the best of our knowledge, has never been attempted. This is further exacerbated by the low-data regime we consider.

Results for the MNIST datasets are presented in Table 1. To maximally use the incoming samples, we hyperparameter-searched the number of additional iterations for both GEN and GEN-MIR; in that way, both methodologies are allowed their optimal performance. More hyperparameter details are provided in Appendix B.2. On MNIST Split, MIR outperforms the baseline by 2.8% and 2.5% on accuracy and forgetting, respectively. Methods using stored memory show improved performance, but with greater storage overhead. We provide further insight into these results with a generation comparison (Figure 3). Complications arising from online generative modeling combined with the low-data regime cause blurry and/or fading digits in the VAE baseline (GEN) (Figure 3a). In line with the reported results, the most interfered retrievals appear qualitatively superior (see Figure 3b, where the GEN-MIR retrievals are shown). We note that the quality of the samples causing most interference on the VAE seems higher than that of those on the classifier.

For the Permuted MNIST dataset, GEN-MIR not only outperforms its baselines, but achieves the best performance over all models. This result is quite interesting, as generative replay methods can't store past data and require much more tuning.

The results discussed thus far concern classification. Nevertheless, GEN-MIR alleviates catastrophic forgetting in the generator as well. Table 5 shows results for online continual generative modeling. The loss of the generator is significantly lower on both datasets when it rehearses on maximally interfered samples versus on random samples. 
This result suggests that our method is viable not only in supervised learning, but in generative modeling as well.

Our last generative replay experiment is an ablation study. The results are presented in Table 6. All facets of our proposed methodology seem to help in achieving the best possible results. It seems, however, that the minimization of the label entropy H(y_pre), which ensures that the previous classifier is confident about the retrieved sample's class, is most important and is essential to outperform the baseline.

           MNIST Split    Permuted MNIST
GEN        107.2 ± 0.2    196.7 ± 0.7
GEN-MIR    102.5 ± 0.2    193.7 ± 1.0

Table 5: Generator's loss (↓), i.e. negative ELBO, on the MNIST datasets. Our methodology outperforms the baseline in online continual generative modeling as well.

As noted in [23], training generative models in the continual learning setting on more challenging datasets remains an open research problem. [23] found that generative replay is not yet a viable strategy for CIFAR-10 given the current state of generative modeling. We arrived at the same conclusion, which led us to design the hybrid approach presented next.

                                 Accuracy
GEN-MIR                          83.0
ablate MIR on generator          82.7
ablate MIR on classifier         81.7
ablate D_KL(y_pre ‖ ŷ)           80.7
ablate H(y_pre)                  78.3
ablate diversity constraint      80.7
GEN                              80.0

Table 6: Ablation study of GEN-MIR on the MNIST Split dataset. The H(y_pre) term in the MIR loss function seems to play an important role in the success of our method.

Table 7: Permuted MNIST test accuracy on tasks seen so far for rehearsal methods.

4.3 Hybrid Approach

In this section, we evaluate the hybrid approach proposed in Sec. 3.3 on the CIFAR-10 dataset.
We use an autoencoder to compress the data stream and simplify the MIR search.

We first identify an important failure mode arising from the use of reconstructions, which may also apply to generative replay. During training, the classifier sees real images from the current task (from the data stream) along with reconstructions from the buffer, which belong to old tasks. In the shared-classifier setting, this discrepancy can be exploited by the classifier as a discriminative feature: the classifier will tend to classify all real samples as belonging to the classes of the last task, yielding low test accuracy. To address this problem, we first autoencode the incoming data with the generator before passing it to the classifier. This way, the classifier cannot leverage the distribution shift. We found that this simple correction led to a significant performance increase. An ablation experiment validating this claim can be found in Appendix C, along with further details about the training procedure.

Figure 4: Results for the Hybrid Approach.

In practice, we store a latent representation of size 4 × 4 × 20 = 320, giving us a compression factor of (32 × 32 × 3)/320 = 9.6 (putting aside the size of the autoencoder, which is less than 2% of total parameters for large buffer sizes). We therefore consider buffers that are 10 times as big, i.e. that can hold 1k, 5k, or 10k compressed images, while using memory equivalent to storing 100, 500, or 1k real images. Results are shown in Figure 4. We first note that as the number of compressed samples increases we continue to see performance improvements, suggesting the increased storage capacity gained from the autoencoder can be leveraged.
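The shift correction and the storage arithmetic above can be sketched with a toy linear autoencoder (dimensions, weights, and helper names here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy linear autoencoder: 12-dim "images" compressed to 4-dim latents.
W_enc = 0.1 * rng.normal(size=(12, 4))
W_dec = 0.1 * rng.normal(size=(4, 12))

def encode(x):
    return x @ W_enc

def decode(z):
    return z @ W_dec

# Incoming real samples are autoencoded before classification, so the
# classifier sees reconstructions for *both* streams and cannot use the
# real-vs-reconstruction discrepancy as a shortcut feature.
incoming = rng.normal(size=(5, 12))
current_input = decode(encode(incoming))

# The buffer stores latents only (4 floats instead of 12 per sample here),
# which are decoded at replay time.
buffer_latents = encode(rng.normal(size=(5, 12)))
replay_input = decode(buffer_latents)

# Mirrors the paper's arithmetic: a 4*4*20 = 320-float latent for a
# 32*32*3 CIFAR-10 image yields a compression factor of 9.6.
compression_factor = (32 * 32 * 3) / 320
print(compression_factor)  # 9.6
```

Storing latents trades reconstruction fidelity for roughly 10× more memory slots at the same byte budget, which is exactly the trade-off evaluated in Figure 4.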
We next observe that even though AE-MIR obtains almost the same average accuracy as AE-Random, it achieves a large decrease in the forgetting metric, indicating a better trade-off in the performance of the learned tasks. Finally, we note that a gap still exists between the performance of reconstructions from incrementally learned AE or VAE models and real images; further work is needed to close it.

5 Conclusion

We have proposed and studied a criterion for retrieving relevant memories in an online continual learning setting. We have shown in a number of settings that retrieving interfered samples reduces forgetting and significantly improves on random sampling and standard baselines. Our results and analysis also shed light on the feasibility and challenges of using generative modeling in the online continual learning setting. We have also shown a first result in leveraging encoded memories for more compact memory and more efficient retrieval.

Acknowledgements

We would like to thank Kyle Kastner and Puneet Dokania for helpful discussions. Eugene Belilovsky is funded by IVADO and Rahaf Aljundi is funded by FWO.

References

[1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In ECCV, 2018.

[2] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In CVPR, 2019.

[3] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Online continual learning with no task boundaries. arXiv preprint arXiv:1903.08671, 2019.

[4] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Online continual learning with no task boundaries. arXiv preprint arXiv:1903.08671, 2019.

[6] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2019.

[7] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. arXiv preprint arXiv:1801.10112, 2018.

[8] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.

[9] Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.

[10] Robert M French. Semi-distributed representations and catastrophic forgetting in connectionist networks. Connection Science, 4(3-4):365–377, 1992.

[11] Robert M French. Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference. Network, 1111:00001, 1994.

[12] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

[13] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

[14] Stephen Grossberg. Studies of Mind and Brain: Neural Principles of Learning, Perception, Development, Cognition, and Motor Control. Boston Studies in the Philosophy of Science 70. Reidel, Dordrecht, 1982. ISBN 9027713596.

[15] Christopher J Honey, Ehren L Newman, and Anna C Schapiro.
Switching between internal and external modes: a multiscale learning principle. Network Neuroscience, 1(4):339–356, 2017.

[16] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[17] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. arXiv preprint arXiv:1612.00796, 2016.

[18] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835, 2017.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[20] John K Kruschke. ALCOVE: an exemplar-based connectionist model of category learning. Psychological Review, 99(1):22, 1992.

[21] John K Kruschke. Human category learning: Implications for backpropagation models. Connection Science, 5(1):3–36, 1993.

[22] Frantzeska Lavda, Jason Ramapuram, Magda Gregorova, and Alexandros Kalousis. Continual classification learning using generative models, 2018.

[23] Timothée Lesort, Hugo Caselles-Dupré, Michael Garcia-Ortiz, Andrei Stoian, and David Filliat. Generative models from the perspective of continual learning. arXiv preprint arXiv:1812.09111, 2018.

[24] Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.

[25] Nicole M Long and Brice A Kuhl. Decoding the tradeoff between encoding and retrieval to predict memory for overlapping events. SSRN 3265727,
2018.

[26] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

[27] James L McClelland. Complementary learning systems in the brain: A connectionist approach to explicit and implicit cognition and memory. Annals of the New York Academy of Sciences, 843(1):153–169, 1998.

[28] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.

[29] Jason Ramapuram, Magda Gregorova, and Alexandros Kalousis. Lifelong generative modeling. arXiv preprint arXiv:1705.09847, 2017.

[30] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.

[31] Matthew Riemer, Tim Klinger, Djallel Bouneffouf, and Michele Franceschini. Scalable recollections for continual lifelong learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1352–1359, 2019.

[32] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.

[33] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. Experience replay for continual learning. CoRR, abs/1811.11682, 2018. URL http://arxiv.org/abs/1811.11682.

[34] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. Experience replay for continual learning, 2018.

[35] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[36] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay.
In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.

[37] Steven A Sloman and David E Rumelhart. Reducing interference in distributed memories through episodic gating. Essays in Honor of WK Estes, 1:227–248, 1992.

[38] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[39] Ju Xu and Zhanxing Zhu. Reinforced continual learning. arXiv preprint arXiv:1805.12369, 2018.

[40] Friedemann Zenke, Ben Poole, and Surya Ganguli. Improved multitask learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

[41] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. arXiv preprint arXiv:1703.04200, 2017.