{"title": "Gradient Episodic Memory for Continual Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6467, "page_last": 6476, "abstract": "One major obstacle towards AI is the poor ability of models to solve new problems quicker, and without forgetting previously acquired knowledge. To better understand this issue, we study the problem of continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks. First, we propose a set of metrics to evaluate models learning over a continuum of data. These metrics characterize models not only by their test accuracy, but also in terms of their ability to transfer knowledge across tasks. Second, we propose a model for continual learning, called Gradient Episodic Memory (GEM) that alleviates forgetting, while allowing beneficial transfer of knowledge to previous tasks. Our experiments on variants of the MNIST and CIFAR-100 datasets demonstrate the strong performance of GEM when compared to the state-of-the-art.", "full_text": "Gradient Episodic Memory for Continual Learning\n\nDavid Lopez-Paz and Marc\u2019Aurelio Ranzato\n\nFacebook Arti\ufb01cial Intelligence Research\n\n{dlp,ranzato}@fb.com\n\nAbstract\n\nOne major obstacle towards AI is the poor ability of models to solve new prob-\nlems quicker, and without forgetting previously acquired knowledge. To better\nunderstand this issue, we study the problem of continual learning, where the model\nobserves, once and one by one, examples concerning a sequence of tasks. First,\nwe propose a set of metrics to evaluate models learning over a continuum of data.\nThese metrics characterize models not only by their test accuracy, but also in terms\nof their ability to transfer knowledge across tasks. Second, we propose a model\nfor continual learning, called Gradient Episodic Memory (GEM) that alleviates\nforgetting, while allowing bene\ufb01cial transfer of knowledge to previous tasks. 
Our experiments on variants of the MNIST and CIFAR-100 datasets demonstrate the strong performance of GEM when compared to the state-of-the-art.

1 Introduction

The starting point in supervised learning is to collect a training set Dtr = {(xi, yi)}_{i=1}^{n}, where each example (xi, yi) is composed of a feature vector xi ∈ X and a target vector yi ∈ Y. Most supervised learning methods assume that each example (xi, yi) is an identically and independently distributed (iid) sample from a fixed probability distribution P, which describes a single learning task. The goal of supervised learning is to construct a model f : X → Y, used to predict the target vectors y associated to unseen feature vectors x, where (x, y) ∼ P. To accomplish this, supervised learning methods often employ the Empirical Risk Minimization (ERM) principle [Vapnik, 1998], where f is found by minimizing

(1/|Dtr|) Σ_{(xi, yi) ∈ Dtr} ℓ(f(xi), yi),

where ℓ : Y × Y → [0, ∞) is a loss function penalizing prediction errors. In practice, ERM often requires multiple passes over the training set.

ERM is a major simplification from what we deem as human learning. In stark contrast to learning machines, learning humans observe data as an ordered sequence, seldom observe the same example twice, can only memorize a few pieces of data, and face a sequence of examples concerning different learning tasks. Therefore, the iid assumption, along with any hope of employing the ERM principle, falls apart. In fact, straightforward applications of ERM lead to "catastrophic forgetting" [McCloskey and Cohen, 1989]. That is, the learner forgets how to solve past tasks after it is exposed to new tasks.

This paper narrows the gap between ERM and the more human-like learning description above. In particular, our learning machine will observe, example by example, the continuum of data

(x1, t1, y1), . . . , (xi, ti, yi), . . .
, (xn, tn, yn),    (1)

where, besides input and target vectors, the learner observes ti ∈ T, a task descriptor identifying the task associated to the pair (xi, yi) ∼ Pti. Importantly, examples are not drawn iid from a fixed probability distribution over triplets (x, t, y), since a whole sequence of examples from the current task may be observed before switching to the next task. The goal of continual learning is to construct a model f : X × T → Y, able to predict the target y associated to a test pair (x, t), where (x, y) ∼ Pt. In this setting, we face challenges unknown to ERM:

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1. Non-iid input data: the continuum of data is not iid with respect to any fixed probability distribution P(X, T, Y) since, once tasks switch, a whole sequence of examples from the new task may be observed.

2. Catastrophic forgetting: learning new tasks may hurt the performance of the learner at previously solved tasks.

3. Transfer learning: when the tasks in the continuum are related, there exists an opportunity for transfer learning. This would translate into faster learning of new tasks, as well as performance improvements in old tasks.

The rest of this paper is organized as follows. In Section 2, we formalize the problem of continual learning, and introduce a set of metrics to evaluate learners in this scenario. In Section 3, we propose GEM, a model to learn over continuums of data that alleviates forgetting, while transferring beneficial knowledge to past tasks. In Section 4, we compare the performance of GEM to the state-of-the-art. Finally, we conclude by reviewing the related literature in Section 5, and offer some directions for future research in Section 6.
Our source code is available at https://github.com/facebookresearch/GradientEpisodicMemory.

2 A Framework for Continual Learning

We focus on the continuum of data of (1), where each triplet (xi, ti, yi) is formed by a feature vector xi ∈ X_ti, a task descriptor ti ∈ T, and a target vector yi ∈ Y_ti. For simplicity, we assume that the continuum is locally iid, that is, every triplet (xi, ti, yi) satisfies (xi, yi) ∼iid P_ti(X, Y).

While observing the data (1) example by example, our goal is to learn a predictor f : X × T → Y, which can be queried at any time to predict the target vector y associated to a test pair (x, t), where (x, y) ∼ Pt. Such a test pair can belong to a task that we have observed in the past, the current task, or a task that we will experience (or not) in the future.

Task descriptors. An important component in our framework is the collection of task descriptors t1, . . . , tn ∈ T. In the simplest case, the task descriptors are integers ti = i ∈ Z enumerating the different tasks appearing in the continuum of data. More generally, task descriptors ti could be structured objects, such as a paragraph of natural language explaining how to solve the i-th task. Rich task descriptors offer an opportunity for zero-shot learning, since the relation between tasks could be inferred using new task descriptors alone. Furthermore, task descriptors disambiguate similar learning tasks: the same input xi could appear in two different tasks, but require different targets. Task descriptors can reference the existence of multiple learning environments, or provide additional (possibly hierarchical) contextual information about each of the examples.
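To make this data-streaming protocol concrete, here is a minimal sketch of a continuum with integer task descriptors. The synthetic Gaussian tasks and all names (`continuum`, `examples_per_task`) are our illustrative assumptions, not part of the paper:

```python
import numpy as np

def continuum(task_means, examples_per_task=1000, seed=0):
    """Yield triplets (x, t, y) one at a time: locally iid within each
    task, but non-iid overall, since tasks are streamed in sequence."""
    rng = np.random.RandomState(seed)
    for t, mean in enumerate(task_means):
        for _ in range(examples_per_task):
            y = rng.randint(2)              # toy binary labels
            x = rng.randn(2) + mean + y     # class-dependent input
            yield x, t, y

# Two synthetic tasks whose input distributions differ (shifted means).
stream = list(continuum([0.0, 5.0], examples_per_task=3))
descriptors = [t for _, t, _ in stream]     # [0, 0, 0, 1, 1, 1]
```

The learner sees each triplet exactly once, and all examples of one task arrive before the next task begins, which is precisely why the iid assumption breaks.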
However, in this paper we focus on alleviating catastrophic forgetting when learning from a continuum of data, and leave zero-shot learning for future research. Next, we discuss the training protocol and evaluation metrics for continual learning.

Training Protocol and Evaluation Metrics. Most of the literature about learning over a sequence of tasks [Rusu et al., 2016, Fernando et al., 2017, Kirkpatrick et al., 2017, Rebuffi et al., 2017] describes a setting where i) the number of tasks is small, ii) the number of examples per task is large, iii) the learner performs several passes over the examples concerning each task, and iv) the only metric reported is the average performance across all tasks. In contrast, we are interested in the "more human-like" setting where i) the number of tasks is large, ii) the number of training examples per task is small, iii) the learner observes the examples concerning each task only once, and iv) we report metrics that measure both transfer and forgetting.

Therefore, at training time we provide the learner with only one example at a time (or a small mini-batch), in the form of a triplet (xi, ti, yi). The learner never experiences the same example twice, and tasks are streamed in sequence. We do not need to impose any order on the tasks, since a future task may coincide with a past task.

Besides monitoring its performance across tasks, it is also important to assess the ability of the learner to transfer knowledge. More specifically, we would like to measure:

1. Backward transfer (BWT), which is the influence that learning a task t has on the performance on a previous task k ≺ t. On the one hand, there exists positive backward transfer when learning about some task t increases the performance on some preceding task k. On the other hand, there exists negative backward transfer when learning about some task t decreases the performance on some preceding task k.
Large negative backward transfer is also known as (catastrophic) forgetting.

2. Forward transfer (FWT), which is the influence that learning a task t has on the performance on a future task k ≻ t. In particular, positive forward transfer is possible when the model is able to perform "zero-shot" learning, perhaps by exploiting the structure available in the task descriptors.

For a principled evaluation, we consider access to a test set for each of the T tasks. After the model finishes learning about the task ti, we evaluate its test performance on all T tasks. By doing so, we construct the matrix R ∈ R^{T×T}, where R_{i,j} is the test classification accuracy of the model on task tj after observing the last sample from task ti. Letting b̄ be the vector of test accuracies for each task at random initialization, we define three metrics:

Average Accuracy:  ACC = (1/T) Σ_{i=1}^{T} R_{T,i}    (2)

Backward Transfer:  BWT = (1/(T−1)) Σ_{i=1}^{T−1} (R_{T,i} − R_{i,i})    (3)

Forward Transfer:  FWT = (1/(T−1)) Σ_{i=2}^{T} (R_{i−1,i} − b̄_i)    (4)

The larger these metrics, the better the model. If two models have similar ACC, the most preferable one is the one with larger BWT and FWT. Note that it is meaningless to discuss backward transfer for the first task, or forward transfer for the last task.

For a fine-grained evaluation that accounts for learning speed, one can build a matrix R with more rows than tasks, by evaluating more often. In the extreme case, the number of rows could equal the number of continuum samples n. Then, the number R_{i,j} is the test accuracy on task tj after observing the i-th example in the continuum.
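As a concrete sketch (assuming only `numpy`; the function and variable names are ours, not the paper's), the three metrics can be computed from the matrix R and the random-initialization accuracies b̄ as:

```python
import numpy as np

def continual_metrics(R, b):
    """ACC, BWT and FWT of Equations (2)-(4), given the T x T matrix R,
    where R[i, j] is the test accuracy on task j after training on task
    i, and b[j] is the accuracy on task j at random initialization."""
    T = R.shape[0]
    acc = R[-1, :].mean()                                     # Eq. (2)
    bwt = (R[-1, :-1] - np.diag(R)[:-1]).mean()               # Eq. (3)
    fwt = np.mean([R[i - 1, i] - b[i] for i in range(1, T)])  # Eq. (4)
    return acc, bwt, fwt

# Toy example with T = 2 tasks: accuracy on task 0 drops from 0.9 to
# 0.8 after learning task 1, so backward transfer is negative.
R = np.array([[0.9, 0.3],
              [0.8, 0.7]])
b = np.array([0.1, 0.2])
acc, bwt, fwt = continual_metrics(R, b)  # 0.75, -0.1, 0.1 (up to rounding)
```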
Plotting each column of R results in a learning curve.

3 Gradient Episodic Memory (GEM)

In this section, we propose Gradient Episodic Memory (GEM), a model for the continual learning setting introduced in Section 2. The main feature of GEM is an episodic memory Mt, which stores a subset of the observed examples from task t. For simplicity, we assume integer task descriptors, and use them to index the episodic memory. When using integer task descriptors, one cannot expect significant positive forward transfer (zero-shot learning). Instead, we focus on minimizing negative backward transfer (catastrophic forgetting) by the efficient use of episodic memory.

In practice, the learner has a total budget of M memory locations. If the number of total tasks T is known, we can allocate m = M/T memories for each task. Conversely, if the number of total tasks T is unknown, we can gradually reduce the value of m as we observe new tasks [Rebuffi et al., 2017]. For simplicity, we assume that the memory is populated with the last m examples from each task, although better memory update strategies could be employed (such as building a coreset per task). In the following, we consider predictors fθ parameterized by θ ∈ R^p, and define the loss at the memories from the k-th task as

ℓ(fθ, Mk) = (1/|Mk|) Σ_{(xi, yi) ∈ Mk} ℓ(fθ(xi, k), yi).    (5)

Obviously, minimizing the loss at the current example together with (5) results in overfitting to the examples stored in Mk. As an alternative, we could keep the predictions at past tasks invariant by means of distillation [Rebuffi et al., 2017]. However, this would render positive backward transfer impossible. Instead, we will use the losses (5) as inequality constraints, avoiding their increase but allowing their decrease.
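A minimal sketch of the per-task episodic memory and the memory loss (5) could look as follows; the ring-buffer class and its names are our illustrative assumptions (the paper only specifies keeping the last m examples per task):

```python
import numpy as np

class EpisodicMemory:
    """Per-task buffer holding the last m observed examples, matching
    the simple update rule above (a coreset builder could replace it)."""
    def __init__(self, m):
        self.m = m
        self.data = {}                  # task descriptor -> [(x, y), ...]

    def add(self, x, t, y):
        buf = self.data.setdefault(t, [])
        buf.append((x, y))
        if len(buf) > self.m:           # keep only the last m examples
            buf.pop(0)

    def loss(self, f, t, loss_fn):
        """Average loss of predictor f on the memory of task t, Eq. (5)."""
        return float(np.mean([loss_fn(f(x, t), y) for x, y in self.data[t]]))

# Toy usage: a scalar "predictor" and squared loss on task 0.
mem = EpisodicMemory(m=2)
for x, y in [(1.0, 1.0), (2.0, 0.0), (3.0, 3.0)]:
    mem.add(x, 0, y)                    # only the last two examples survive
f = lambda x, t: x
sq = lambda p, y: (p - y) ** 2
# mem.loss(f, 0, sq) == ((2 - 0)**2 + (3 - 3)**2) / 2 == 2.0
```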
In contrast to the state-of-the-art [Kirkpatrick et al., 2017, Rebuffi et al., 2017], our model therefore allows positive backward transfer.

More specifically, when observing the triplet (x, t, y), we solve the following problem:

minimize_θ  ℓ(fθ(x, t), y)
subject to  ℓ(fθ, Mk) ≤ ℓ(fθ^{t−1}, Mk) for all k < t,    (6)

where fθ^{t−1} is the predictor state at the end of learning of task t − 1.

In the following, we make two key observations to solve (6) efficiently. First, it is unnecessary to store old predictors fθ^{t−1}, as long as we guarantee that the loss at previous tasks does not increase after each parameter update g. Second, assuming that the function is locally linear (as happens around small optimization steps) and that the memory is representative of the examples from past tasks, we can diagnose increases in the loss of previous tasks by computing the angle between their loss gradient vector and the proposed update. Mathematically, we rephrase the constraints of (6) as:

⟨g, gk⟩ := ⟨ ∂ℓ(fθ(x, t), y)/∂θ , ∂ℓ(fθ, Mk)/∂θ ⟩ ≥ 0, for all k < t.    (7)

If all the inequality constraints (7) are satisfied, then the proposed parameter update g is unlikely to increase the loss at previous tasks. On the other hand, if one or more of the inequality constraints (7) are violated, then there is at least one previous task that would experience an increase in loss after the parameter update. If violations occur, we propose to project the proposed gradient g to the closest gradient g̃ (in squared ℓ2 norm) satisfying all the constraints (7).
Therefore, we are interested in:

minimize_g̃  (1/2) ‖g − g̃‖₂²
subject to  ⟨g̃, gk⟩ ≥ 0 for all k < t.    (8)

To solve (8) efficiently, recall the primal of a Quadratic Program (QP) with inequality constraints:

minimize_z  (1/2) zᵀCz + pᵀz
subject to  Az ≥ b,    (9)

where C ∈ R^{p×p}, p ∈ R^p, A ∈ R^{(t−1)×p}, and b ∈ R^{t−1}. The dual problem of (9) is:

minimize_{u,v}  (1/2) uᵀCu − bᵀv
subject to  Aᵀv − Cu = p,  v ≥ 0.    (10)

If (u*, v*) is a solution to (10), then there is a solution z* to (9) satisfying Cz* = Cu* [Dorn, 1960]. Quadratic programs are at the heart of support vector machines [Scholkopf and Smola, 2001].

With these notations in hand, we write the primal GEM QP (8) as:

minimize_z  (1/2) zᵀz − gᵀz + (1/2) gᵀg
subject to  Gz ≥ 0,

where G = (g1, . . . , g_{t−1}) stacks the task gradients as rows (so that Gz ≥ 0 expresses the constraints of (8)), and we discard the constant term gᵀg. This is a QP on p variables (the number of parameters of the neural network), which could be measured in the millions. However, we can pose the dual of the GEM QP as:

minimize_v  (1/2) vᵀGGᵀv + gᵀGᵀv
subject to  v ≥ 0,    (11)

since u = Gᵀv + g and the term gᵀg is constant. This is a QP on t − 1 ≪ p variables, the number of observed tasks so far. Once we solve the dual problem (11) for v*, we can recover the projected gradient update as g̃ = Gᵀv* + g. In practice, we found that adding a small constant γ ≥ 0 to v* biased the gradient projection towards updates that favoured beneficial backward transfer.

Algorithm 1 summarizes the training and evaluation protocol of GEM over a continuum of data.
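As an illustration, the PROJECT step used by Algorithm 1 can be sketched in a few lines of `numpy` by solving the dual QP in v ≥ 0 with projected gradient descent; an off-the-shelf QP solver is the robust choice, and this simplified stand-in, its names, and the fixed iteration budget are our assumptions. Here G stacks the task gradients g_k as rows, so the feasibility check matches the constraints in (8):

```python
import numpy as np

def project_gradient(g, task_grads, n_iters=2000, gamma=0.0):
    """Return g if all constraints <g, g_k> >= 0 hold; otherwise solve
    the dual QP (approximately) by projected gradient descent on v >= 0
    and recover the projected update g_tilde = G^T v + g."""
    G = np.stack(task_grads)             # rows: g_1, ..., g_{t-1}
    if np.all(G @ g >= 0):               # constraints (7) satisfied
        return g
    H = G @ G.T                          # (t-1) x (t-1) dual Hessian
    q = G @ g                            # linear term of the dual
    v = np.zeros(len(task_grads))
    lr = 1.0 / (np.trace(H) + 1e-12)     # conservative step size
    for _ in range(n_iters):             # projected gradient descent
        v = np.maximum(0.0, v - lr * (H @ v + q))
    return G.T @ (v + gamma) + g         # gamma biases backward transfer

# One previous task: g conflicts with g_1, so g is projected onto the
# half-space of directions that do not increase the loss of that task.
g_tilde = project_gradient(np.array([-1.0, 1.0]), [np.array([1.0, 0.0])])
# g_tilde is (numerically) the Euclidean projection, close to [0, 1]
```

Since the dual has only t − 1 variables, this projection stays cheap even for networks with millions of parameters.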
The pseudo-code includes the computation of the matrix R, containing the sufficient statistics to compute the metrics ACC, FWT, and BWT described in Section 2.

A causal compression view. We can interpret GEM as a model that learns the subset of correlations common to a set of distributions (tasks). Furthermore, GEM can (and will in our MNIST experiments) be used to predict target vectors associated to previous or new tasks without making use of task descriptors. This is a desired feature in causal inference problems, since causal predictions are invariant across different environments [Peters et al., 2016], and therefore provide the most compressed representation of a set of distributions [Schölkopf et al., 2016].

Algorithm 1 Training a GEM over an ordered continuum of data

procedure TRAIN(fθ, Continuum_train, Continuum_test)
    Mt ← {} for all t = 1, . . . , T
    R ← 0 ∈ R^{T×T}
    for t = 1, . . . , T do
        for (x, y) in Continuum_train(t) do
            Mt ← Mt ∪ (x, y)
            g ← ∇θ ℓ(fθ(x, t), y)
            gk ← ∇θ ℓ(fθ, Mk) for all k < t
            g̃ ← PROJECT(g, g1, . . . , g_{t−1}), see (11)
            θ ← θ − α g̃
        end for
        R_{t,:} ← EVALUATE(fθ, Continuum_test)
    end for
    return fθ, R
end procedure

procedure EVALUATE(fθ, Continuum)
    r ← 0 ∈ R^T
    for k = 1, . . . , T do
        r_k ← 0
        for (x, y) in Continuum(k) do
            r_k ← r_k + accuracy(fθ(x, k), y)
        end for
        r_k ← r_k / len(Continuum(k))
    end for
    return r
end procedure

4 Experiments

We perform a variety of experiments to assess the performance of GEM in continual learning.

4.1 Datasets

We consider the following datasets:

• MNIST Permutations [Kirkpatrick et al., 2017], a variant of the MNIST dataset of handwritten digits [LeCun et al., 1998], where each task is transformed by a fixed permutation of pixels.
In this dataset, the input distributions of the different tasks are unrelated to one another.

• MNIST Rotations, a variant of MNIST where each task contains digits rotated by a fixed angle between 0 and 180 degrees.

• Incremental CIFAR100 [Rebuffi et al., 2017], a variant of the CIFAR object recognition dataset with 100 classes [Krizhevsky, 2009], where each task introduces a new set of classes. For a total number of T tasks, each new task concerns examples from a disjoint subset of 100/T classes. Here, the input distribution is similar for all tasks, but different tasks require different output distributions.

For all the datasets, we considered T = 20 tasks. On the MNIST datasets, each task has 1000 examples from 10 different classes. On the CIFAR100 dataset, each task has 2500 examples from 5 different classes. The model observes the tasks in sequence, and each example once. The evaluation for each task is performed on the test partition of each dataset.

4.2 Architectures

On the MNIST tasks, we use fully-connected neural networks with two hidden layers of 100 ReLU units. On the CIFAR100 tasks, we use a smaller version of ResNet18 [He et al., 2015], with three times fewer feature maps across all layers. Also on CIFAR100, the network has a final linear classifier per task. This is one simple way to leverage the task descriptor, in order to adapt the output distribution to the subset of classes for each task. We train all the networks and baselines using plain SGD on mini-batches of 10 samples. All hyper-parameters are optimized using a grid-search (see Appendix A), and the best results for each model are reported.

Figure 1: Left: ACC, BWT, and FWT for all datasets and methods.
Right: evolution of the test accuracy at the first task, as more tasks are learned.

Table 1: CPU training time (s) of the MNIST experiments for all methods.

task          single  independent  multimodal  EWC  GEM
permutations  11      11           14          179  77
rotations     11      16           13          169  135

4.3 Methods

We compare GEM to five alternatives:

1. A single predictor trained across all tasks.

2. One independent predictor per task. Each independent predictor has the same architecture as "single", but with T times fewer hidden units. Each new independent predictor can be initialized at random, or be a clone of the last trained predictor (decided by grid-search).

3. A multimodal predictor, which has the same architecture as "single", but with a dedicated input layer per task (only for the MNIST datasets).

4. EWC [Kirkpatrick et al., 2017], where the loss is regularized to avoid catastrophic forgetting.

5. iCARL [Rebuffi et al., 2017], a class-incremental learner that classifies using a nearest-exemplar algorithm, and prevents catastrophic forgetting by using an episodic memory. iCARL requires the same input representation across tasks, so this method only applies to our experiment on CIFAR100.

GEM, iCARL and EWC have the same architecture as "single", plus an episodic memory.

Table 2: ACC as a function of the episodic memory size for GEM and iCARL, on CIFAR100.

memory size  200    1,280  2,560  5,120
GEM          0.487  0.579  0.633  0.654
iCARL        0.436  0.494  0.500  0.508

Table 3: ACC/BWT on the MNIST Rotations dataset, when varying the number of epochs per task.

method                 1 epoch     2 epochs    5 epochs
single, shuffled data  0.83/-0.00  0.87/-0.00  0.89/-0.00
single                 0.53/-0.08  0.49/-0.25  0.43/-0.40
independent            0.56/-0.00  0.64/-0.00  0.67/-0.00
multimodal             0.76/-0.02  0.72/-0.11  0.59/-0.28
EWC                    0.55/-0.19  0.59/-0.17  0.61/-0.11
GEM                    0.86/+0.05  0.88/+0.02  0.89/-0.02

4.4 Results

Figure 1 (left) summarizes the average accuracy (ACC, Equation 2), backward transfer (BWT, Equation 3) and forward transfer (FWT, Equation 4) for all datasets and methods. We provide the full evaluation matrices R in Appendix B. Overall, GEM performs similarly to or better than the multimodal model (which is very well suited to the MNIST tasks). GEM minimizes negative backward transfer, while exhibiting negligible or positive forward transfer.

Figure 1 (right) shows the evolution of the test accuracy of the first task throughout the continuum of data. GEM exhibits minimal forgetting, and positive backward transfer on CIFAR100.

Overall, GEM performs significantly better than other continual learning methods like EWC, while spending less computation (Table 1). GEM's efficiency comes from optimizing over a number of variables equal to the number of tasks (T = 20 in our experiments), instead of optimizing over a number of variables equal to the number of parameters (p = 1109240 for CIFAR100, for instance). GEM's bottleneck is the necessity of computing previous task gradients at each learning iteration.

4.4.1 Importance of memory, number of passes, and order of tasks

Table 2 shows the final ACC in the CIFAR-100 experiment for both GEM and iCARL as a function of their episodic memory size.
Also seen in Table 2, the final ACC of GEM is an increasing function of the size of the episodic memory, eliminating the need to carefully tune this hyper-parameter. GEM outperforms iCARL for a wide range of memory sizes.

Table 3 illustrates the importance of memory as we do more than one pass through the data in the MNIST Rotations experiment. Multiple training passes exacerbate the catastrophic forgetting problem. For instance, in the last column of Table 3 (except for the result in the first row), each model is shown the examples of a task five times (in random order) before switching to the next task. Table 3 shows that memory-less methods (like "single" and "multimodal") exhibit larger negative BWT, leading to lower ACC. On the other hand, memory-based methods such as EWC and GEM achieve higher ACC as the number of passes through the data increases. However, GEM suffers less negative BWT than EWC, leading to a higher ACC.

Finally, to relate the performance of GEM to the best possible performance on the proposed datasets, the first row of Table 3 reports the ACC of "single" when trained with iid data from all tasks. This mimics usual multi-task learning, where each mini-batch contains examples drawn from a random selection of tasks. By comparing the first and last rows of Table 3, we see that GEM matches the "oracle performance upper-bound" ACC provided by iid learning, and minimizes negative BWT.

5 Related work

Continual learning [Ring, 1994], also called lifelong learning [Thrun, 1994, Thrun and Pratt, 2012, Thrun, 1998, 1996], considers learning through a sequence of tasks, where the learner has to retain knowledge about past tasks and leverage that knowledge to quickly acquire new skills.
This learning setting has led to implementations [Carlson et al., 2010, Ruvolo and Eaton, 2013, Ring, 1997] and theoretical investigations [Baxter, 2000, Balcan et al., 2015, Pentina and Urner, 2016], although the latter have been restricted to linear models. In this work, we revisited continual learning, but proposed to focus on the more realistic setting where examples are seen only once, memory is finite, and the learner is also provided with (potentially structured) task descriptors. Within this framework, we introduced a new set of metrics, a training and testing protocol, and a new algorithm, GEM, that outperforms the current state-of-the-art in terms of limiting forgetting.

The use of task descriptors is similar in spirit to recent work in reinforcement learning [Sutton et al., 2011, Schaul et al., 2015], where task or goal descriptors are also fed as input to the system. The CommAI project [Mikolov et al., 2015, Baroni et al., 2017] shares our motivations, but focuses on highly structured task descriptors, such as strings of text. In contrast, we focus on the problem of catastrophic forgetting [McCloskey and Cohen, 1989, French, 1999, Ratcliff, 1990, McClelland et al., 1995, Goodfellow et al., 2013].

Several approaches have been proposed to avoid catastrophic forgetting. The simplest approach in neural networks is to freeze early layers, while cloning and fine-tuning later layers on the new task [Oquab et al., 2014] (which we considered in our "independent" baseline). This relates to methods that leverage a modular structure of the network, with primitives that can be shared across tasks [Rusu et al., 2016, Fernando et al., 2017, Aljundi et al., 2016, Denoyer and Gallinari, 2015, Eigen et al., 2014].
Unfortunately, it has proven very hard to scale up these methods to large numbers of modules and tasks, given the combinatorial number of compositions of modules.

Our approach is most similar to the regularization approaches that consider a single model, but modify its learning objective to prevent catastrophic forgetting. Within this class of methods, there are approaches that leverage "synaptic" memory [Kirkpatrick et al., 2017, Zenke et al., 2017], where learning rates are adjusted to minimize changes in parameters important for previous tasks. Other approaches are instead based on "episodic" memory [Jung et al., 2016, Li and Hoiem, 2016, Rannen Triki et al., 2017, Rebuffi et al., 2017], where examples from previous tasks are stored and replayed to keep predictions invariant by means of distillation [Hinton et al., 2015]. GEM is related to these latter approaches but, unlike them, allows for positive backward transfer.

More generally, there are a variety of setups in the machine learning literature related to continual learning. Multitask learning [Caruana, 1998] considers the problem of maximizing the performance of a learning machine across a variety of tasks, but assumes simultaneous access to all tasks at once. Similarly, transfer learning [Pan and Yang, 2010] and domain adaptation [Ben-David et al., 2010] assume the simultaneous availability of multiple learning tasks, but focus on improving the performance at one of them in particular. Zero-shot learning [Lampert et al., 2009, Palatucci et al., 2009] and one-shot learning [Fei-Fei et al., 2003, Vinyals et al., 2016, Santoro et al., 2016, Bertinetto et al., 2016] aim at performing well on unseen tasks, but ignore the catastrophic forgetting of previously learned tasks.
Curriculum learning considers learning a sequence of data [Bengio et al., 2009], or a sequence of tasks [Pentina et al., 2015], sorted by increasing difficulty.

6 Conclusion

We formalized the scenario of continual learning. First, we defined training and evaluation protocols to assess the quality of models in terms of their accuracy, as well as their ability to transfer knowledge forward and backward between tasks. Second, we introduced GEM, a simple model that leverages an episodic memory to avoid forgetting and favor positive backward transfer. Our experiments demonstrate the competitive performance of GEM against the state-of-the-art.

GEM has three points for improvement. First, GEM does not leverage structured task descriptors, which may be exploited to obtain positive forward transfer (zero-shot learning). Second, we did not investigate advanced memory management (such as building coresets of tasks [Lucic et al., 2017]). Third, each GEM iteration requires one backward pass per task, increasing computation time. These are exciting research directions to extend learning machines beyond ERM, and towards continuums of data.

Acknowledgements

We are grateful to M. Baroni, L. Bottou, M. Nickel, Y. Olivier and A. Szlam for their insight. We are grateful to Martin Arjovsky for the QP interpretation of GEM.

References

R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. CVPR, 2016.

M.-F. Balcan, A. Blum, and S. Vempala. Efficient representations for lifelong learning and autoencoding. COLT, 2015.

M. Baroni, A. Joulin, A. Jabri, G. Kruszewski, A. Lazaridou, K. Simonic, and T. Mikolov. CommAI: Evaluating the first steps towards a useful general AI. arXiv, 2017.

J. Baxter. A model of inductive bias learning. JAIR, 2000.

S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman Vaughan. A theory of learning from different domains.
Machine Learning Journal, 2010.\n\nY. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. ICML, 2009.\n\nL. Bertinetto, J. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. NIPS,\n\n2016.\n\nA. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M. Mitchell. Toward an architecture for\n\nnever-ending language learning. AAAI, 2010.\n\nR. Caruana. Multitask learning. In Learning to learn. Springer, 1998.\n\nL. Denoyer and P. Gallinari. Deep sequential neural networks. EWRL, 2015.\n\nW. S. Dorn. Duality in quadratic programming. Quarterly of Applied Mathematics, 1960.\n\nD. Eigen, I. Sutskever, and M. Ranzato. Learning factored representations in a deep mixture of experts. ICLR,\n\n2014.\n\nL. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories.\n\nICCV, 2003.\n\nC. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet:\n\nEvolution channels gradient descent in super neural networks. arXiv, 2017.\n\nR. M. French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 1999.\n\nI. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An Empirical Investigation of Catastrophic\n\nForgetting in Gradient-Based Neural Networks. arXiv, 2013.\n\nK. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv, 2015.\n\nG. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv, 2015.\n\nH. Jung, J. Ju, M. Jung, and J. Kim. Less-forgetting Learning in Deep Neural Networks. arXiv, 2016.\n\nJ. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho,\n\nA. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.\n\nA. Krizhevsky. Learning multiple layers of features from tiny images. 
Technical report, University of Toronto, 2009.

C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. CVPR, 2009.

Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist/.

Z. Li and D. Hoiem. Learning without forgetting. ECCV, 2016.

M. Lucic, M. Faulkner, A. Krause, and D. Feldman. Training mixture models at scale via coresets. arXiv, 2017.

J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 1995.

M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 1989.

T. Mikolov, A. Joulin, and M. Baroni. A roadmap towards machine intelligence. arXiv, 2015.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. CVPR, 2014.

M. Palatucci, D. A. Pomerleau, G. E. Hinton, and T. Mitchell. Zero-shot learning with semantic output codes. NIPS, 2009.

S. J. Pan and Q. Yang. A survey on transfer learning. TKDE, 2010.

A. Pentina and R. Urner. Lifelong learning with weighted majority votes. NIPS, 2016.

A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. CVPR, 2015.

J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, 2016.

A. Rannen Triki, R. Aljundi, M. B. Blaschko, and T. Tuytelaars. Encoder based lifelong learning. arXiv, 2017.

R. Ratcliff.
Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 1990.

S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. CVPR, 2017.

M. B. Ring. Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712, 1994.

M. B. Ring. CHILD: A first step towards continual learning. Machine Learning, 1997.

A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. NIPS, 2016.

P. Ruvolo and E. Eaton. ELLA: An efficient lifelong learning algorithm. ICML, 2013.

A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. One-shot learning with memory-augmented neural networks. arXiv, 2016.

T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. ICML, 2015.

B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2001.

B. Schölkopf, D. Janzing, and D. Lopez-Paz. Causal and statistical learning. In Learning Theory and Approximation. Oberwolfach Research Institute for Mathematics, 2016.

R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. The 10th International Conference on Autonomous Agents and Multiagent Systems, 2011.

S. Thrun. A lifelong learning perspective for mobile robot control. Proceedings of the IEEE/RSJ/GI Conference on Intelligent Robots and Systems, 1994.

S. Thrun. Is learning the n-th thing any easier than learning the first? NIPS, 1996.

S. Thrun. Lifelong learning algorithms. In Learning to learn. Springer, 1998.

S. Thrun and L. Pratt. Learning to learn.
Springer Science & Business Media, 2012.

V. Vapnik. Statistical learning theory. Wiley New York, 1998.

O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. Matching networks for one shot learning. NIPS, 2016.

F. Zenke, B. Poole, and S. Ganguli. Improved multitask learning through synaptic intelligence. arXiv, 2017.