{"title": "Meta-Learning Representations for Continual Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1820, "page_last": 1830, "abstract": "The reviews had two major concerns: lack of a benchmarking on a complex dataset, and unclear writing. To address these two major issues we: \n1- Rewrote experiments section with improved terminology to make the paper more clear. Previously we were using the term Pretraining to refer to both a baseline and the meta-training stage. As the reviewers pointed out, this was confusing. We have replaced one of the usages with 'meta-training.' We have also changed evaluation to meta-testing. \n2- Added mini-imagenet experiments to show that the proposed method scales to more complex datasets. \n\nMoreover, it wasn't clear if the objective we introduced improved over a maml like objective that also learned representations. We added MAML-Rep as a baseline that shows that our method -- which minimizes interference in addition to maximizing fast adaptation -- performs noticeably better. \n\nWe also added the pseudo-code of the algorithms to the main paper as requested by reviewers. Moreover, we contrast our algorithm with MAML to highlight the difference between the two. We believe that this makes the current version significantly more clear to anyone who already understands the MAML objective. \n\nWe have fixed various minor issues in writing and included some missing related work. (bengio2019meta, nagabandi19, al2017continuous) that we have discovered since our initial submission. 
\n\nFinally, we thank the reviewers and the meta-reviewer for the feedback, which allowed us to improve the work in several aspects.", "full_text": "Meta-Learning Representations for Continual Learning

Khurram Javed, Martha White
Department of Computing Science, University of Alberta, T6G 1P8
kjaved@ualberta.ca, whitem@ualberta.ca

Abstract

A continual learning agent should be able to build on top of existing knowledge to learn on new data quickly while minimizing forgetting. Current intelligent systems based on neural network function approximators arguably do the opposite—they are highly prone to forgetting and rarely trained to facilitate future learning. One reason for this poor behavior is that they learn from a representation that is not explicitly trained for these two goals. In this paper, we propose OML, an objective that directly minimizes catastrophic interference by learning representations that accelerate future learning and are robust to forgetting under online updates in continual learning. We show that it is possible to learn naturally sparse representations that are more effective for online updating. Moreover, our algorithm is complementary to existing continual learning strategies, such as MER and GEM. Finally, we demonstrate that a basic online updating strategy on representations learned by OML is competitive with rehearsal-based methods for continual learning.1

1 Introduction

Continual learning—also called cumulative learning and lifelong learning—is the problem setting where an agent faces a continual stream of data, and must continually make and learn new predictions. The two main goals of continual learning are (1) to exploit existing knowledge of the world to quickly learn predictions on new samples (accelerate future learning) and (2) to reduce interference in updates, particularly avoiding overwriting older knowledge. Humans, as intelligent agents, are capable of doing both.
For instance, an experienced programmer can learn a new programming language significantly faster than someone who has never programmed before, and does not need to forget the old language to learn the new one. Current state-of-the-art learning systems, on the other hand, struggle with both (French, 1999; Kirkpatrick et al., 2017).

Several methods have been proposed to address catastrophic interference. These can generally be categorized into methods that (1) modify the online update to retain knowledge, (2) replay or generate samples for more updates, and (3) use semi-distributed representations. Knowledge retention methods prevent important weights from changing too much, by introducing a regularization term for each parameter weighted by its importance (Kirkpatrick et al., 2017; Aljundi et al., 2018; Zenke et al., 2017; Lee et al., 2017; Liu et al., 2018). Rehearsal methods interleave online updates with updates on samples from a model. Samples from a model can be obtained by replaying samples from older data (Lin, 1992; Mnih et al., 2015; Chaudhry et al., 2019; Riemer et al., 2019; Rebuffi et al., 2017; Lopez-Paz and Ranzato, 2017; Aljundi et al., 2019), by using a generative model learned on previous data (Sutton, 1990; Shin et al., 2017), or by using knowledge distillation, which generates targets using predictions from an older predictor (Li and Hoiem, 2018). These ideas are all complementary to that of learning representations that are suitable for online updating.

1 Code accompanying the paper is available at https://github.com/khurramjaved96/mrcl

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: An example of our proposed architecture for learning representations for continual learning. During the inner gradient steps for computing the meta-objective, we only update the parameters in the prediction learning network (PLN). We then update both the representation learning network (RLN) and the prediction learning network (PLN) by taking a gradient step with respect to our meta-objective. The online updates for continual learning also only modify the PLN. Both RLN and PLN can be arbitrary models.

Early work on catastrophic interference focused on learning semi-distributed (also called sparse) representations (French, 1991, 1999). Recent work has revisited the utility of sparse representations for mitigating interference (Liu et al., 2019) and for using model capacity more conservatively to leave room for future learning (Aljundi et al., 2019). These methods, however, use sparsity as a proxy, which alone does not guarantee robustness to interference. A recently proposed online update for neural networks implicitly learns representations to obtain non-interfering updates (Riemer et al., 2019). Their objective maximizes the dot product between gradients computed for different samples. The idea is to encourage the network to reach an area in the parameter space where updates to the entire network have minimal interference and positive generalization. This idea is powerful: to specify an objective that explicitly mitigates interference—rather than implicitly with sparse representations.

In this work, we propose to explicitly learn a representation for continual learning that avoids interference and promotes future learning. We propose to train the representation with OML – a meta-objective that uses catastrophic interference as a training signal by directly optimizing through an online update. The goal is to learn a representation such that the stochastic online updates the agent will use at meta-test time improve the accuracy of its predictions in general. We show that using our objective, it is possible to learn representations that are more effective for online updating in sequential regression and classification problems.
Moreover, these representations are naturally highly sparse. Finally, we show that existing continual learning strategies, like Meta Experience Replay (Riemer et al., 2019), can learn more effectively from these representations.

2 Problem Formulation

A Continual Learning Prediction (CLP) problem consists of an unending stream of samples

T = (X1, Y1), (X2, Y2), . . . , (Xt, Yt), . . .

for inputs Xt and prediction targets Yt, from sets X and Y respectively.2 The random vector Yt is sampled according to an unknown distribution p(Y|Xt). We assume the process X1, X2, . . . , Xt, . . . has a marginal distribution µ : X → [0, ∞) that reflects how often each input is observed. This assumption allows for a variety of correlated sequences. For example, Xt could be sampled from a distribution potentially dependent on past variables Xt−1 and Xt−2. The targets Yt, however, are dependent only on Xt, and not on past Xi. We define Sk = (Xj+1, Yj+1), (Xj+2, Yj+2), . . . , (Xj+k, Yj+k), a random trajectory of length k sampled from the CLP problem T. Finally, p(Sk|T) gives a distribution over all trajectories of length k that can be sampled from problem T.

For a given CLP problem, our goal is to learn a function fW,θ that can predict Yt given Xt. More concretely, let ℓ : Y × Y → R be the function that defines the loss between a prediction ŷ ∈ Y and target y as ℓ(ŷ, y). If we assume that inputs X are seen proportionally to some density µ : X → [0, ∞), then we want to minimize the following objective for a CLP problem:

LCLP(W, θ) := E[ℓ(fW,θ(X), Y)] = ∫ [ ∫ ℓ(fW,θ(x), y) p(y|x) dy ] µ(x) dx,   (1)

where W and θ represent the set of parameters that are updated to minimize the objective. To minimize LCLP, we limit ourselves to learning by online updates on a single k-length trajectory sampled from p(Sk|T). This changes the learning problem from the standard iid setting – the agent sees a single trajectory of correlated samples of length k, rather than getting to directly sample from p(x, y) = p(y|x)µ(x). This modification can cause significant issues when simply applying standard algorithms for the iid setting. Instead, we need to design algorithms that take this correlation into account.

2 This definition encompasses the continual learning problem where the tuples also include task descriptors Tt (Lopez-Paz and Ranzato, 2017). Tt in the tuple (Xt, Tt, Yt) can simply be considered as part of the inputs.

Figure 2: Effect of the representation on continual learning, for a problem where targets are generated from three different distributions p1(Y|x), p2(Y|x) and p3(Y|x). The representation results in different solution manifolds for the three distributions; we depict two different possibilities here. We show the learning trajectory when training incrementally on data generated first by p1, then p2 and p3. On the left, the online updates interfere, jumping between distant points on the manifolds. On the right, the online updates either generalize appropriately—for parallel manifolds—or avoid interference because the manifolds are orthogonal.

A variety of continual problems can be represented by this formulation.
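To make the CLP setup concrete, here is a minimal pure-Python sketch (our own illustration, not the paper's code; the function names are hypothetical) of a toy CLP stream whose trajectory S_k is correlated, with samples arriving grouped by task rather than iid:

```python
import random

def make_clp_problem(num_tasks=3, seed=0):
    """A toy CLP problem: task t maps x to slopes[t] * x (deterministic p(Y|X))."""
    rng = random.Random(seed)
    slopes = [rng.uniform(-2.0, 2.0) for _ in range(num_tasks)]
    def target(task_id, x):
        return slopes[task_id] * x
    return slopes, target

def sample_trajectory(target, num_tasks, per_task, seed=1):
    """Sample S_k as a *correlated* stream: all of task 0, then task 1, ..."""
    rng = random.Random(seed)
    traj = []
    for t in range(num_tasks):            # inputs arrive grouped by task
        for _ in range(per_task):
            x = rng.uniform(-1.0, 1.0)
            traj.append(((t, x), target(t, x)))
    return traj

slopes, target = make_clp_problem()
S_k = sample_trajectory(target, num_tasks=3, per_task=4)
# The stream is correlated: the first four samples all come from task 0.
assert [inp[0] for inp, _ in S_k[:4]] == [0, 0, 0, 0]
```

An iid learner would shuffle S_k before updating; the CLP setting forbids exactly that, which is what makes naive online SGD interfere.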
One example is an online regression problem, such as predicting the next spatial location for a robot given the current location; another is the existing incremental classification benchmarks. The CLP formulation also allows for targets Yt that are dependent on a history of the most recent m observations. This can be obtained by defining each Xt to be the last m observations. The overlap between Xt and Xt−1 does not violate the assumptions on the correlated sequence of inputs. Finally, the prediction problem in reinforcement learning—predicting the value of a policy from a state—can be represented by considering the inputs Xt to be states and the targets to be sampled returns or bootstrapped targets.

3 Meta-learning Representations for Continual Learning

Neural networks, trained end-to-end, are not effective at minimizing the CLP loss using a single trajectory sampled from p(Sk|T) for two reasons. First, they are extremely sample-inefficient, requiring multiple epochs of training to converge to reasonable solutions. Second, they suffer from catastrophic interference when learning online from a correlated stream of data (French, 1991). Meta-learning is effective at making neural networks more sample efficient (Finn et al., 2017). Recently, Nagabandi et al. (2019); Al-Shedivat et al. (2018) showed that it can also be used for quick adaptation from a stream of data. However, they do not look at the catastrophic interference problem.
Moreover, their work meta-learns a model initialization, an inductive bias we found insufficient for solving the catastrophic interference problem (see Appendix B.1).

To apply neural networks to the CLP problem, we propose meta-learning a function φθ(X) – a deep Representation Learning Network (RLN), parametrized by θ – from X → Rd. We then learn another function gW from Rd → Y, called a Prediction Learning Network (PLN). By composing the two functions we get fW,θ(X) = gW(φθ(X)), which constitutes our model for CLP tasks, as shown in Figure 1. We treat θ as meta-parameters that are learned by minimizing a meta-objective and then later fixed at meta-test time. After learning θ, we learn gW from Rd → Y for a CLP problem from a single trajectory S using fully online SGD updates in a single pass. A similar idea has been proposed by Bengio et al. (2019) for learning causal structures.

For meta-training, we assume a distribution over CLP problems given by p(T). We consider two meta-objectives for updating the meta-parameters θ: (1) MAML-Rep, a MAML-like (Finn et al., 2017) few-shot-learning objective that learns an RLN instead of a model initialization, and (2) OML (Online aware Meta-learning) – an objective that also minimizes interference, in addition to maximizing fast adaptation, for learning the RLN. Our OML objective is defined as:

min_{W,θ} OML(W, θ) := Σ_{Ti ∼ p(T)} Σ_{S^j_k ∼ p(Sk|Ti)} [ L_{CLPi}( U(W, θ, S^j_k) ) ],   (2)

where S^j_k = (X^i_{j+1}, Y^i_{j+1}), (X^i_{j+2}, Y^i_{j+2}), . . . , (X^i_{j+k}, Y^i_{j+k}). U(Wt, θ, S^j_k) = (Wt+k, θ) represents an update function, where Wt+k is the weight vector after k steps of stochastic gradient descent. The jth update step in U is taken using parameters (Wt+j−1, θ) on sample (X^i_{t+j}, Y^i_{t+j}) to give (Wt+j, θ).

The MAML-Rep and OML objectives can be implemented as Algorithms 1 and 2 respectively, with the primary difference between the two lying in the inner update loop. Note that MAML-Rep uses the complete batch of data Sk to do l inner updates – where l is a hyper-parameter – whereas OML uses one data point from Sk for one update. This allows OML to take the effects of online continual learning – such as catastrophic forgetting – into account.

The goal of the OML objective is to learn representations suitable for online continual learning. For an illustration of what would constitute an effective representation for continual learning, suppose that we have three clusters of inputs, which have significantly different p(Y|x), corresponding to p1, p2 and p3. For a fixed 2-dimensional representation φθ : X → R2, we can consider the manifold of solutions W ∈ R2, given by a linear model, that provide equivalently accurate solutions for each pi. These three manifolds are depicted as three different colored lines in the W ∈ R2 parameter space in Figure 2. The goal is to find one parameter vector W that is effective for all three distributions by learning online on samples from the three distributions sequentially. For two different representations, these manifolds and their intersections can look very different. The intuition is that online updates from a W are more effective when the manifolds are either parallel—allowing for positive generalization—or orthogonal—avoiding interference.
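Optimizing through the online updates, as equation (2) requires, can be illustrated on a deliberately tiny instance: a scalar representation parameter θ, prediction f(x) = w·θ·x, squared loss, and inner SGD steps on w only. The sketch below is our own worked example, not the paper's implementation; it tracks dw/dθ through the inner loop by hand and checks the resulting meta-gradient against central finite differences:

```python
def inner_loop(theta, w0, train, alpha):
    """Run online SGD steps on w (the PLN) with theta (the RLN) held fixed,
    tracking dw/dtheta so we can differentiate *through* the updates."""
    w, dw = w0, 0.0
    for x, y in train:
        err = w * theta * x - y
        derr = dw * theta * x + w * x        # d(err)/dtheta before the update
        w = w - alpha * err * theta * x      # inner SGD step on w only
        dw = dw - alpha * (derr * theta * x + err * x)
    return w, dw

def meta_loss(theta, w0, train, test, alpha):
    w, _ = inner_loop(theta, w0, train, alpha)
    x, y = test
    return 0.5 * (w * theta * x - y) ** 2

def meta_grad(theta, w0, train, test, alpha):
    """dL_test/dtheta, including the path through the inner updates."""
    w, dw = inner_loop(theta, w0, train, alpha)
    x, y = test
    err = w * theta * x - y
    return err * (dw * theta * x + w * x)

train = [(0.5, 1.0), (-0.3, 0.6), (0.8, -0.2)]
test = (0.4, 0.9)
g = meta_grad(theta=1.2, w0=0.1, train=train, test=test, alpha=0.05)
eps = 1e-6
num = (meta_loss(1.2 + eps, 0.1, train, test, 0.05)
       - meta_loss(1.2 - eps, 0.1, train, test, 0.05)) / (2 * eps)
assert abs(g - num) < 1e-6
```

In practice the paper's networks would rely on an autodiff framework to unroll this computation graph; the recurrence on dw above is exactly what such a framework computes.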
It is unlikely that a representation producing such manifolds would emerge naturally. Instead, we will have to explicitly find it. By taking into account the effects of online continual learning, the OML objective optimizes for such a representation.

We can optimize this objective similarly to other gradient-based meta-learning objectives. Early work on learning-to-learn considered optimizing parameters through learning updates themselves, though typically considering approaches using genetic algorithms (Schmidhuber, 1987). Improvements in automatic differentiation have made it more feasible to compute gradient-based meta-learning updates (Finn, 2018). Some meta-learning algorithms have similarly considered optimizing through multiple steps of updating for the few-shot learning setting (Finn et al., 2017; Li et al., 2017; Al-Shedivat et al., 2018; Nagabandi et al., 2019) for learning model initializations. The successes of these previous works in optimizing similar objectives motivate OML as a feasible objective for Meta-learning Representations for Continual Learning.

Algorithm 1: Meta-Training: MAML-Rep
Require: p(T): distribution over CLP problems
Require: α, β: step-size hyperparameters
Require: l: number of inner gradient steps
1: randomly initialize θ
2: while not done do
3:   randomly initialize W
4:   Sample CLP problem Ti ∼ p(T)
5:   Sample Strain from p(Sk|Ti)
6:   W0 = W
7:   for j in 1, 2, . . . , l do
8:     Wj = Wj−1 − α ∇Wj−1 ℓi(fθ,Wj−1(Strain[:, 0]), Strain[:, 1])
9:   end for
10:  Sample Stest from p(Sk|Ti)
11:  Update θ ← θ − β ∇θ ℓi(fθ,Wl(Stest[:, 0]), Stest[:, 1])
12: end while

4 Evaluation

In this section, we investigate the question: can we learn a representation for continual learning that promotes future learning and reduces interference?
We investigate this question by meta-learning the representations offline on a meta-training dataset. At meta-test time, we initialize the continual learner with this representation and measure prediction error as the agent learns the PLN online on a new set of CLP problems (see Figure 1).

Algorithm 2: Meta-Training: OML
Require: p(T): distribution over CLP problems
Require: α, β: step-size hyperparameters
1: randomly initialize θ
2: while not done do
3:   randomly initialize W
4:   Sample CLP problem Ti ∼ p(T)
5:   Sample Strain from p(Sk|Ti)
6:   W0 = W
7:   for j = 1, 2, . . . , k do
8:     (Xj, Yj) = Strain[j]
9:     Wj = Wj−1 − α ∇Wj−1 ℓi(fθ,Wj−1(Xj), Yj)
10:  end for
11:  Sample Stest from p(Sk|Ti)
12:  Update θ ← θ − β ∇θ ℓi(fθ,Wk(Stest[:, 0]), Stest[:, 1])
13: end while

4.1 CLP Benchmarks

We evaluate on a simulated regression problem and a sequential classification problem using real data.

Incremental Sine Waves: An Incremental Sine Wave CLP problem is defined by ten (randomly generated) sine functions, with x = (z, n) for z ∈ [−5, 5] as input to the sine function and n a one-hot vector over {1, . . . , 10} indicating which function to use. The targets are deterministic, where (x, y) corresponds to y = sinn(z). Each sine function is generated once by randomly selecting an amplitude in the range [0.1, 5] and a phase in [0, π]. A trajectory S400 from the CLP problem consists of 40 mini-batches from the first sine function in the sequence (each mini-batch has eight elements), then 40 from the second, and so on. Such a trajectory has sufficient information to minimize the loss for the complete CLP problem. We use a single regression head to predict all ten functions, where the input id n makes it possible to differentiate outputs for the different functions.
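A data generator matching this description might look as follows; this is a hedged sketch under our reading of the benchmark (amplitude in [0.1, 5], phase in [0, π], one-hot task id appended to z), with illustrative names:

```python
import math
import random

def make_sine_tasks(num_tasks=10, seed=0):
    """Each task: y = a * sin(z + b), amplitude a in [0.1, 5], phase b in [0, pi]."""
    rng = random.Random(seed)
    return [(rng.uniform(0.1, 5.0), rng.uniform(0.0, math.pi))
            for _ in range(num_tasks)]

def sine_trajectory(tasks, batches_per_task=40, batch_size=8, seed=1):
    """A correlated trajectory: 40 mini-batches of function 1, then function 2, ..."""
    rng = random.Random(seed)
    num_tasks = len(tasks)
    traj = []
    for t, (amp, phase) in enumerate(tasks):
        one_hot = [1.0 if i == t else 0.0 for i in range(num_tasks)]
        for _ in range(batches_per_task):
            batch = []
            for _ in range(batch_size):
                z = rng.uniform(-5.0, 5.0)
                x = [z] + one_hot          # input is (z, one-hot task id)
                batch.append((x, amp * math.sin(z + phase)))
            traj.append(batch)
    return traj

tasks = make_sine_tasks()
traj = sine_trajectory(tasks)
assert len(traj) == 400                    # S_400: 40 mini-batches per function
```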
Though learnable, this input results in significant interference across the different functions.

Split-Omniglot: Omniglot is a dataset of 1623 characters from 50 different alphabets (Lake et al., 2015). Each character has 20 hand-written images. The dataset is divided into two parts: the first 963 classes constitute the meta-training dataset, whereas the remaining 660 constitute the meta-testing dataset. To define a CLP problem on this dataset, we sample an ordered set of 200 classes (C1, C2, C3, . . . , C200). X and Y then consist of all images of these classes. A trajectory S1000 from such a problem is a trajectory of images – five images per class – where we see all five images of C1 followed by five images of C2 and so on. This makes k = 5 × 200 = 1000. Note that the sampling operation defines a distribution p(T) over problems that we use for meta-training.

4.2 Meta-Training Details

Incremental Sine Waves: We sample 400 functions to create our meta-training set and 500 for benchmarking the learned representation. We meta-train by sampling multiple CLP problems. During each meta-training step, we sample ten functions from our meta-training set and assign them task ids from one to ten. We concatenate 40 mini-batches – each with 32 (x, y) pairs – generated from function one, then function two and so on, to create our training trajectory S400. For evaluation, we similarly randomly sample ten functions from the test set and create a single trajectory. We use SGD on the MSE loss with a mini-batch size of 8 for online updates, and Adam (Kingma and Ba, 2014) for optimizing the OML objective. Note that the OML objective involves computing gradients through a network unrolled for 400 steps. At evaluation time, we use the same learning rate as used during the inner updates in the meta-training phase for OML. For our baselines, we do a grid search over learning rates and report the results for the best performing parameter.

We found that having a deeper representation learning network (RLN) improved performance. We use six layers for the RLN and two layers for the PLN. Each hidden layer has a width of 300. The RLN is only updated with the meta-update and acts as a fixed feature extractor during the inner updates in the meta-learning objective and at evaluation time.

Figure 3: Mean squared error across all 10 regression tasks. The x-axis in (a) corresponds to seeing all samples of function 1, then function 2 and so on. These learning curves are averaged over 50 runs, with error bars representing 95% confidence intervals drawn by 1,000 bootstraps. We can see that the representation trained on iid data—Pre-training—is not effective for online updating. Notice that in the final prediction accuracy in (b), the Pre-training and SR-NN representations have accurate predictions for task 10, but high error for earlier tasks. OML, on the other hand, has a slight skew in error towards tasks learned later but is largely robust. Oracle uses iid sampling and multiple epochs and serves as a best-case bound.

Split-Omniglot: We learn an encoder – a deep CNN with six convolution and two fully connected (FC) layers – using the MAML-Rep and the OML objectives. We treat the convolution parameters as θ and the FC layer parameters as W. Because optimizing the OML objective is computationally expensive for k = 1000 (it involves unrolling the computation graph for 1,000 steps), we approximate the two objectives. For MAML-Rep, we learn φθ by maximizing fast adaptation for a 5-shot, 5-way classifier. For OML, instead of doing |Strain| inner gradient steps as described in Algorithm 2, we go over Strain five steps at a time.
For the kth set of five steps in the inner loop, we accumulate our meta-loss on Stest[0 : 5 × k], and update our meta-parameters using these accumulated gradients at the end, as explained in Algorithm 3 in the Appendix. This allows us to never unroll our computation graphs for more than five steps (similar to truncated back-propagation through time) and still take the effects of interference at meta-training into account.

Finally, both MAML-Rep and OML use five inner gradient steps and similar network architectures for a fair comparison. Moreover, for both methods, we try multiple values for the inner learning rate α and report the results for the best parameter. For more details about hyper-parameters, see the Appendix. For more details on implementation, see Appendix A.

4.3 Baselines

We compare MAML-Rep and OML – the two meta-learning-based representation learning methods – to three baselines.

Scratch simply learns online from a random network initialization, with no meta-training.

Pre-training uses standard gradient descent to minimize prediction error on the meta-training set. We then fix the first few layers in online training. Rather than restricting to the same 6-2 architecture for the RLN and PLN, we pick the best split using a validation set.

SR-NN uses the Set-KL method to learn a sparse representation (Liu et al., 2019) on the meta-training set. We use multiple values of the hyper-parameter β for SR-NN and report results for the one that performs best. We include this baseline to compare to a method that learns a sparse representation.

Figure 4: Comparison of representations learned by the MAML-Rep and OML objectives and the baselines on Split-Omniglot.
All curves are averaged over 50 CLP runs, with 95% confidence intervals drawn using 1,000 bootstraps. At every point on the x-axis, we only report accuracy on the classes seen so far. Even though MAML-Rep and OML learn representations that result in comparable performance for classifiers trained under the IID setting (c and d), OML outperforms MAML-Rep when learning online on a highly correlated stream of data, showing that it learns representations more robust to interference. SR-NN, which does not do meta-learning, performs worse even under the IID setting, showing that it learns worse representations.

4.4 Meta-Testing

We report results of LCLP(Wonline, θmeta) for fully online updates on a single Sk for each CLP problem. For each of the methods, we separately tune the learning rate on five validation trajectories and report results for the best performing parameter.

Incremental Sine Waves: We plot the average mean squared error over 50 runs on the full testing set, when learning online on unseen sequences of functions, in Figure 3 (left). OML can learn new functions with a negligible increase in average MSE. The Pre-training baseline, on the other hand, clearly suffers from interference, with increasing error as it tries to learn more and more functions. SR-NN, with its sparse representation, also suffers from noticeably more interference than OML. From the distribution of errors for each method on the ten functions, shown in Figure 3 (right), we can see that both Pre-training and SR-NN have high errors for functions learned at the beginning, whereas OML performs only slightly worse on those.

Split-Omniglot: We report classification accuracy on the training trajectory (Strain) as well as the test set in Figure 4. Note that training accuracy is a meaningful metric in continual learning, as it measures forgetting. The test set accuracy reflects both forgetting and generalization error.
Our method can learn the training trajectory almost perfectly with minimal forgetting. The baselines, on the other hand, suffer from forgetting as they learn more classes sequentially. The higher training accuracy of our method also translates into better generalization on the test set. The difference between the train and test performance is mainly due to how few samples are given per class: only 15 for training and 5 for testing.

As a sanity check, we also trained classifiers by sampling data IID for 5 epochs and report the results in Figure 4 (c) and (d). The fact that OML and MAML-Rep do equally well with IID sampling indicates that the quality of the representations (φθ : X → Rd) learned by the two objectives is comparable, and that the higher performance of OML is indeed because its representations are more suitable for incremental learning.

Moreover, to test whether OML can learn representations on more complex datasets, we run the same experiments on Mini-ImageNet and report the results in Figure 5.

Figure 5: OML scales to more complex datasets such as Mini-ImageNet. We use the existing meta-training/meta-testing split of Mini-ImageNet.
At meta-testing, we learn a 20-way classifier using 30 samples per class.

Figure 6: We reshape the 2304-length representation vectors into 32×72, normalize them to have a maximum value of one, and visualize them; here, "random instance" means the representation for a randomly chosen input from the training set, whereas "average activation" is the mean representation over the complete dataset. For SR-NN, we re-train the network with a different value of the parameter β to have the same instance sparsity as OML. Note that SR-NN achieves this sparsity by never using a large part of the representation space. OML, on the other hand, uses the full representation space. In fact, OML has no dead neurons, whereas even Pre-training results in some part of the representation never being used.

4.5 What kind of representations does OML learn?

As discussed earlier, French (1991) proposed that sparse representations could mitigate forgetting. Ideally, such a representation is instance sparse—using a small percentage of activations to represent an input—while also utilizing the representation to its fullest. This means that while most neurons would be inactive for a given input, every neuron would participate in representing some input. Dead neurons, which are inactive for all inputs, are undesirable and may as well be discarded.
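Both quantities are straightforward to measure from a matrix of post-activation values. A small sketch (our own, on a hypothetical toy activation matrix) that treats a neuron as active when its post-ReLU activation is positive:

```python
def sparsity_stats(activations):
    """activations: list of per-input activation vectors (post-ReLU).
    Returns (instance_sparsity, dead_fraction):
      - instance sparsity: average fraction of neurons active per input
      - dead fraction: fraction of neurons active for *no* input
    """
    num_inputs = len(activations)
    num_neurons = len(activations[0])
    active_per_input = [sum(1 for a in row if a > 0) / num_neurons
                        for row in activations]
    instance_sparsity = sum(active_per_input) / num_inputs
    dead = sum(1 for j in range(num_neurons)
               if all(row[j] <= 0 for row in activations))
    return instance_sparsity, dead / num_neurons

# Hypothetical 3-input, 4-neuron representation.
acts = [[0.0, 1.2, 0.0, 0.0],
        [0.7, 0.0, 0.0, 0.0],
        [0.0, 0.3, 0.0, 0.0]]
inst, dead = sparsity_stats(acts)
assert abs(inst - 0.25) < 1e-12   # one of four neurons active per input
assert dead == 0.5                # the last two neurons never fire
```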
An instance-sparse representation with no dead neurons reduces forgetting because each update changes only a small number of weights, which in turn should only affect a small number of inputs. We hypothesize that the representation learned by OML will be sparse, even though the objective does not explicitly encourage this property.

We compute the average instance sparsity on the Omniglot training set for OML, SR-NN, and Pre-training. OML produces the sparsest network, without any dead neurons. The network learned by Pre-training, in comparison, uses over 10 times more neurons on average to represent an input. The best performing SR-NN, used in Figure 4, uses 4 times more neurons. We also re-trained SR-NN with a parameter chosen to achieve a similar level of sparsity as OML, to compare representations of similar sparsity rather than representations chosen based on accuracy. We use β = 0.05, which results in an instance sparsity similar to OML's.

We visualize all the solutions in Figure 6. The plots highlight that OML learns a highly sparse and well-distributed representation, taking the most advantage of the large capacity of the representation. Surprisingly, OML has no dead neurons, a well-known problem when learning sparse representations (Liu et al., 2019). Even Pre-training, which does not have an explicit penalty to enforce sparsity, has some dead neurons. Instance sparsity and dead-neuron percentages for each method are reported in Table 1.

Table 1: Instance sparsity and dead-neuron percentage for different methods. OML learns highly sparse representations without any dead neurons.
Even Pre-training, which does not optimize for sparsity, ends up with some dead neurons.

Method           Instance Sparsity   Dead Neurons
OML              3.8%                0%
SR-NN (Best)     15%                 0.7%
SR-NN (Sparse)   4.9%                14%
Pre-Training     38%                 3%

5 Improvements by Combining with Knowledge Retention Approaches

We have shown that OML learns effective representations for continual learning. In this section, we answer a different question: how does OML behave when it is combined with existing continual

Table 2: OML combined with existing continual learning methods. All memory-based methods use a buffer of 200. Error margins represent one standard deviation over 10 runs. The performance of all methods is considerably improved when they learn from representations learned by OML; moreover, with OML, even online updates are competitive with rehearsal-based methods. Finally, online updates on OML outperform all other methods when those methods learn from other representations.
Note that MER does better than approximate IID in some cases because it performs multiple rehearsal-based updates for every sample.

Split-Omniglot, one class per task (50 tasks):

Method         Standard        OML             Pre-training
Online         04.64 ± 2.61    64.72 ± 2.57    21.16 ± 2.71
Approx IID     53.95 ± 5.50    75.12 ± 3.24    54.29 ± 3.48
ER-Reservoir   52.56 ± 2.12    68.16 ± 3.12    36.72 ± 3.06
MER            54.88 ± 4.12    76.00 ± 2.07    62.76 ± 2.16
EWC            05.08 ± 2.47    64.44 ± 3.13    18.72 ± 3.97

Split-Omniglot, five classes per task (20 tasks):

Method         Standard        OML             Pre-training
Online         01.40 ± 0.43    55.32 ± 2.25    11.80 ± 1.92
Approx IID     48.02 ± 5.67    67.03 ± 2.10    46.02 ± 2.83
ER-Reservoir   24.32 ± 5.37    60.92 ± 2.41    37.44 ± 1.67
MER            29.02 ± 4.01    62.05 ± 2.19    42.05 ± 3.71
EWC            02.04 ± 0.35    56.03 ± 3.20    10.03 ± 1.53

learning methods? We test the performance of EWC (Kirkpatrick et al., 2017), MER (Riemer et al., 2019), and ER-Reservoir (Chaudhry et al., 2019), in their standard form (learning the whole network online) as well as with pre-trained fixed representations. We use pre-trained representations from OML and Pre-training, obtained in the same way as described in earlier sections. For the Standard online form of these algorithms, to avoid the unfair advantage of meta-training, we initialize the networks by learning IID on the meta-training set.

As baselines, we also report results for (a) fully online SGD updates that update one point at a time, in order on the trajectory, and (b) approximate IID training, where SGD updates are used on a random shuffling of the trajectory, removing the correlation.

We report the test-set results for learning 50 tasks with one class per task and learning 20 tasks with five classes per task in Split-Omniglot in Table 2.
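The ER-Reservoir baseline maintains its fixed-size replay buffer with reservoir sampling, which keeps every stream item in the buffer with equal probability without knowing the stream length in advance. A minimal sketch of that buffer-maintenance rule (not the authors' code; the function name is illustrative):

```python
import random

def reservoir_update(buffer, item, num_seen, capacity):
    """Update a fixed-capacity replay buffer via reservoir sampling.

    num_seen: number of items observed so far, including `item`
    (1-indexed). After n items, each one remains in the buffer
    with probability capacity / n.
    """
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = random.randrange(num_seen)   # uniform draw in [0, num_seen)
        if j < capacity:
            buffer[j] = item             # replace a random existing slot

# Stream 1000 items through a buffer of capacity 200 (the buffer
# size used for the memory-based methods in Table 2).
random.seed(0)
buffer = []
for n, x in enumerate(range(1000), start=1):
    reservoir_update(buffer, x, n, capacity=200)
```

The replacement probability capacity / num_seen is what makes the final buffer an approximately uniform sample of the whole (correlated) trajectory, which rehearsal updates then draw from.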
For each of the methods, we do a 15/5 train/test split for each Omniglot class, test multiple values for all hyperparameters, and report results for the best setting. The conclusions are surprisingly clear: (1) OML improves all the algorithms; (2) simply providing a fixed representation, as in Pre-training, does not provide nearly the same gains as OML; and (3) OML with a basic online updating strategy is already competitive, outperforming all the continual learning methods without OML. There are a few additional outcomes of note. OML outperforms even approximate IID sampling, suggesting it is not only mitigating interference but also making learning faster on new data. Finally, the difference between online and experience-replay-based algorithms is not as pronounced for OML as it is for other representations.

6 Conclusion

In this paper, we proposed a meta-learning objective to learn representations that are robust to interference under online updates and promote future learning. We showed that, using our representations, it is possible to learn from highly correlated data streams with significantly improved robustness to forgetting. We found that sparsity emerges as a property of our learned representations, without explicitly training for it. Finally, we showed that our method is complementary to existing state-of-the-art continual learning methods and can be combined with them to achieve significant improvements over each approach alone.

An important next step for this work is to demonstrate how to learn these representations online, without a separate meta-training phase. Initial experiments suggest it is effective to periodically optimize the representation on a recent buffer of data and then continue online updates with this updated fixed representation.
This matches common paradigms in continual learning, based on the ideas of a sleep phase and background planning, and is a plausible strategy for continually adapting the representation network to a continual stream of data. Another interesting extension of this work would be to use the OML objective to meta-learn some other aspect of the learning process, such as a local learning rule (Metz et al., 2019) or an attention mechanism, by minimizing interference.

7 Acknowledgements

The authors would like to thank Hugo Larochelle, Nicolas Le Roux, and Chelsea Finn for helpful questions and feedback, and the anonymous reviewers for useful comments.

References

Al-Shedivat, Maruan, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel (2018). Continuous adaptation via meta-learning in nonstationary and competitive environments. International Conference on Learning Representations.

Aljundi, Rahaf, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars (2018). Memory aware synapses: Learning what (not) to forget. In European Conference on Computer Vision.

Aljundi, Rahaf, Min Lin, Baptiste Goujaud, and Yoshua Bengio (2019). Gradient based sample selection for online continual learning. Advances in Neural Information Processing Systems.

Aljundi, Rahaf, Marcus Rohrbach, and Tinne Tuytelaars (2019). Selfless sequential learning. International Conference on Learning Representations.

Bengio, Yoshua, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal (2019). A meta-transfer objective for learning to disentangle causal mechanisms. arXiv:1901.10912.

Chaudhry, Arslan, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny (2019). Efficient lifelong learning with A-GEM.
International Conference on Learning Representations.

Chaudhry, Arslan, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato (2019). Continual learning with tiny episodic memories. arXiv:1902.10486.

Finn, Chelsea (2018). Learning to Learn with Gradients. Ph.D. thesis, EECS Department, University of California, Berkeley.

Finn, Chelsea, Pieter Abbeel, and Sergey Levine (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning.

French, Robert M (1991). Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In Annual Cognitive Science Society Conference. Erlbaum.

French, Robert M (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences.

Kingma, Diederik P and Jimmy Ba (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.

Kirkpatrick, James, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences.

Lake, Brenden M, Ruslan Salakhutdinov, and Joshua B Tenenbaum (2015). Human-level concept learning through probabilistic program induction. Science.

Lee, Sang-Woo, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang (2017). Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems.

Li, Zhizhong and Derek Hoiem (2018). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Li, Zhenguo, Fengwei Zhou, Fei Chen, and Hang Li (2017). Meta-SGD: Learning to learn quickly for few-shot learning. arXiv:1707.09835.

Lin, Long-Ji (1992).
Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.

Liu, Vincent, Raksha Kumaraswamy, Lei Le, and Martha White (2019). The utility of sparse representations for control in reinforcement learning. AAAI Conference on Artificial Intelligence.

Liu, Xialei, Marc Masana, Luis Herranz, Joost Van de Weijer, Antonio M Lopez, and Andrew D Bagdanov (2018). Rotate your networks: Better weight consolidation and less catastrophic forgetting. In International Conference on Pattern Recognition.

Lopez-Paz, David and Marc'Aurelio Ranzato (2017). Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems.

Metz, Luke, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein (2019). Meta-learning update rules for unsupervised representation learning. International Conference on Learning Representations.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. (2015). Human-level control through deep reinforcement learning. Nature.

Nagabandi, Anusha, Chelsea Finn, and Sergey Levine (2019). Deep online learning via meta-learning: Continual adaptation for model-based RL. International Conference on Learning Representations.

Rebuffi, Sylvestre-Alvise, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert (2017). iCaRL: Incremental classifier and representation learning. In Conference on Computer Vision and Pattern Recognition.

Riemer, Matthew, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro (2019). Learning to learn without forgetting by maximizing transfer and minimizing interference. International Conference on Learning Representations.

Schmidhuber, Jürgen (1987). Evolutionary principles in self-referential learning, or on learning how to learn. Ph.D.
thesis, Institut für Informatik, Technische Universität München.

Shin, Hanul, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim (2017). Continual learning with deep generative replay. In Advances in Neural Information Processing Systems.

Sutton, Richard (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning.

Zenke, Friedemann, Ben Poole, and Surya Ganguli (2017). Continual learning through synaptic intelligence. In International Conference on Machine Learning.