{"title": "Gradient based sample selection for online continual learning", "book": "Advances in Neural Information Processing Systems", "page_first": 11816, "page_last": 11825, "abstract": "A continual learning agent learns online with a non-stationary and never-ending stream of data. The key to such a learning process is to overcome the catastrophic forgetting of previously seen data, which is a well-known problem of neural networks. To prevent forgetting, a replay buffer is usually employed to store the previous data for the purpose of rehearsal. Previous works often depend on task boundary and i.i.d. assumptions to properly select samples for the replay buffer. In this work, we formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning. The goal is to select a fixed subset of constraints that best approximates the feasible region defined by the original constraints. We show that this is equivalent to maximizing the diversity of samples in the replay buffer with the parameter gradient as the feature. We further develop a greedy alternative that is cheap and efficient. The advantage of the proposed method is demonstrated by comparison to other alternatives under the continual learning setting. Further comparisons are made against state-of-the-art methods that rely on task boundaries, which show comparable or even better results for our method.", "full_text": "Gradient based sample selection for online continual learning\n\nRahaf Aljundi∗ (KU Leuven) rahaf.aljundi@gmail.com\nMin Lin (Mila) mavenlin@gmail.com\nBaptiste Goujaud (Mila) baptiste.goujaud@gmail.com\nYoshua Bengio (Mila) yoshua.bengio@mila.quebec\n\nAbstract\n\nA continual learning agent learns online with a non-stationary and never-ending stream of data. The key to such a learning process is to overcome the catastrophic forgetting of previously seen data, which is a well-known problem of neural networks. 
To prevent forgetting, a replay buffer is usually employed to store the previous data for the purpose of rehearsal. Previous works often depend on task boundary and i.i.d. assumptions to properly select samples for the replay buffer. In this work, we formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning. The goal is to select a fixed subset of constraints that best approximates the feasible region defined by the original constraints. We show that this is equivalent to maximizing the diversity of samples in the replay buffer with the parameter gradient as the feature. We further develop a greedy alternative that is cheap and efficient. The advantage of the proposed method is demonstrated by comparison to other alternatives under the continual learning setting. Further comparisons are made against state-of-the-art methods that rely on task boundaries, which show comparable or even better results for our method.\n\n1 Introduction\n\nThe central problem of continual learning is to overcome the catastrophic forgetting problem of neural networks. Current continual learning methods can be categorized into three major families based on how the information of previous data is stored and used. We describe each of them below.\n\nThe first family is the prior-focused one. Prior-focused methods use a penalty term to regularize the parameters rather than a hard constraint. The parameter gradually drifts away from the feasible regions of previous tasks, especially when there is a long chain of tasks and when the tasks resemble each other [6]. It is often necessary to hybridize the prior-focused approach with the replay-based methods for better results [13, 6].\n\nAnother major family is the parameter isolation methods, which dedicate different parameters to different tasks to prevent interference. Dynamic architectures that freeze/grow the network belong to this family. 
However, it is also possible to isolate parameters without changing the architecture [7, 11].\n\n∗Work mostly done while the first author was a visiting researcher at Mila.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nBoth of the above strategies explicitly associate neurons with different tasks, with the consequence that task boundaries are mandatory during both training and testing. Due to the dependency on task boundaries during testing, this family of methods tilts more towards multi-task learning than continual learning.\n\nThe replay-based approach stores the information in the example space, either directly in a replay buffer or in a generative model. When learning new data, old examples are reproduced from the replay buffer or the generative model and used for rehearsal/retraining; the old examples can also be used to provide constraints for the current learning [10].\n\nMost of the works in the above three families use a relaxed task-incremental assumption: the data are streamed one task at a time, with different distributions for each task, while keeping the independent and identically distributed (i.i.d.) assumption and performing offline training within each task. Consequently, they are not directly applicable to the more general setting where data are streamed online with neither the i.i.d. assumption nor task boundary information. Both prior-focused methods and replay-based methods have the potential to be adapted to the general setting. 
However, we are mostly interested in the replay-based methods in this work, since it has been shown that prior-focused methods lead to no improvement or only marginal improvement when applied on top of a replay-based method. Specifically, we develop strategies to populate the replay buffer under the most general condition, where no assumptions are made about the online data stream.\n\nOur contributions are as follows: 1) We formulate replay buffer population as a constraint selection problem and formalize it as a solid angle minimization problem. 2) We propose a surrogate objective for it and empirically verify that the surrogate objective aligns with the goal of solid angle minimization. 3) As a cheap alternative for large-scale sample selection, we propose a greedy algorithm that is as efficient as reservoir sampling yet immune to imbalanced data streams. 4) We compare our method to different selection strategies and show the ability of our solutions to always select a subset of samples that best represents the previous history. 5) We perform experiments on continual learning benchmarks and show that our method is on par with, or better than, the previous methods, while requiring no i.i.d. assumptions or task boundaries.\n\n2 Related Work\n\nOur continual learning approach belongs to the replay-based family. Methods in this family alleviate forgetting by replaying stored samples from the previous history when learning new ones. Although storing the original examples in memory for rehearsal dates back to the 1990s [16], to date it is still a rule of thumb for overcoming catastrophic forgetting in practical problems. For example, experience replay is widely used in reinforcement learning, where the data distributions are usually non-stationary and prone to catastrophic forgetting [9, 12].\n\nRecent works that use a replay buffer for continual learning include iCaRL [14] and GEM [10], both of which allocate memory to store a core-set of examples from each task. 
These methods still require task boundaries in order to divide the storage resource evenly among the tasks. There are also a few previous works that deal with the situation where task boundaries and the i.i.d. assumption are not available. For example, reservoir sampling has been employed in [5, 8] so that the data distribution in the replay buffer follows the data distribution that has already been seen. The problem with reservoir sampling is that minor modes of the distribution, with small probability mass, may fail to be represented in the replay buffer. As a remedy to this problem, coverage maximization is also proposed in [8]. It intends to keep diverse samples in the replay buffer using the Euclidean distance as a difference measure. While the Euclidean distance may be enough for low dimensional data, it could be uninformative when the data lie on a structured manifold embedded in a high dimensional space. In contrast to previous works, we start from the constrained optimization formulation of continual learning, and show that the data selection for the replay buffer is effectively a constraint reduction problem.\n\n3 Continual Learning as Constrained Optimization\n\nWe consider the supervised learning problem with an online stream of data, where one or a few pairs of examples (x, y) are received at a time. The data stream is non-stationary, with no assumption on the distribution such as the i.i.d. hypothesis. Our goal is to optimize the loss on the current example(s) without increasing the losses on the previously learned examples.\n\n3.1 Problem Formulation\n\nWe formulate our goal as the following constrained optimization problem. Without loss of generality, we assume the examples are observed one at a time.\n\nθt = argmin_θ ℓ(f(xt; θ), yt)   s.t.   ℓ(f(xi; θ), yi) ≤ ℓ(f(xi; θt−1), yi); ∀i ∈ [0..t−1]   (1)\n\nf(·; θ) is a model parameterized by θ and ℓ is the loss function. t is the index of the current example and i indexes the previous examples.\n\nAs suggested by [10], the original constraints can be rephrased as constraints in the gradient space:\n\n⟨g, gi⟩ = ⟨∂ℓ(f(xt; θ), yt)/∂θ, ∂ℓ(f(xi; θ), yi)/∂θ⟩ ≥ 0; ∀i ∈ [0..t−1]   (2)\n\nHowever, the number of constraints in the above optimization problem increases linearly with the number of previous examples. The required computation and storage resources for an exact solution of the above problem will increase indefinitely with time. It is thus more desirable to solve the above problem approximately with a fixed computation and storage budget. In practice, a replay buffer M limited to M memory slots is often used to keep the previous examples. The constraints are thus only active for (xi, yi) ∈ M. How to populate the replay buffer then becomes a crucial research problem.\n\nGradient episodic memory (GEM) assumes access to task boundaries and an i.i.d. distribution within each task episode. It divides the memory budget evenly among the tasks, i.e., m = M/T slots are allocated for each task, where T is the number of tasks. The last m examples from each task are kept in the memory. This has clear limitations when the task boundaries are not available or when the i.i.d. assumption is not satisfied. In this work, we consider the problem of how to populate the replay buffer in a more general setting where the above assumptions are not available.\n\n3.2 Sample Selection as Constraint Reduction\n\nMotivated by Eq. 1, we set our goal to selecting M examples so that the feasible region formed by the corresponding reduced constraints is close to the feasible region of the original problem. 
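To make the gradient constraints concrete: checking Eq. 2 for an incoming example amounts to one inner product per stored gradient. A minimal sketch (random arrays stand in for real parameter gradients; the helper name is ours, not from the paper):

```python
import numpy as np

def violated_constraints(g_new, G_buffer):
    """Return indices i with <g_new, g_i> < 0, i.e. the constraints of
    Eq. 2 that the current update direction would violate."""
    dots = G_buffer @ g_new          # one inner product per stored gradient
    return np.flatnonzero(dots < 0)

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 100))        # 5 stored "gradients" of dimension 100
g = 2.0 * G[0]                       # an update aligned with the first gradient
assert 0 not in violated_constraints(g, G)   # <g, g_0> = 2 * ||g_0||^2 >= 0
```

When the returned index set is empty, the update keeps all stored losses non-increasing to first order; otherwise a projection (as in GEM [10]) or a rehearsal step is needed.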
We first convert the original constraints in Eq. 2 to the corresponding feasible region:\n\nC = ⋂i∈[0..t−1] {g | ⟨g, gi⟩ ≥ 0}   (3)\n\nWe assume here that C is generally not empty. Emptiness is highly unlikely when the number of parameters is much larger than the number of gradients gi, unless we encounter an outlier that has a different label yi for the same input xi; in this work we don't consider the existence of outliers. Geometrically, C is the intersection of the half-spaces described by ⟨g, gi⟩ ≥ 0, which forms a polyhedral convex cone. The relaxed feasible region corresponding to the replay buffer is:\n\nC̃ = ⋂gi∈M {g | ⟨g, gi⟩ ≥ 0}   (4)\n\nFor the best approximation of the original feasible region, we require C̃ to be as close to C as possible. It is easy to see that C ⊂ C̃ because M ⊂ [g0..gt−1]. We illustrate the relation between C and C̃ in Figure 1.\n\nOn the left, C is represented, while the blue hyperplane on the right corresponds to a constraint that has been removed. Therefore, C̃ (on the right) is larger than C for the inclusion partial order. As we want C̃ to be as close to C as possible, we actually want the \"smallest\" C̃, where \"small\" here remains to be defined, as the inclusion order is not a complete order. A potential measure of the size of a convex cone is its solid angle, defined as the intersection between the cone and the unit sphere:\n\nminimize_M λd−1(Sd−1 ∩ ⋂gi∈M {g | ⟨g, gi⟩ ≥ 0})   (5)\n\nFigure 1: Feasible region (polyhedral cone) before and after constraint selection. The selected constraints (excluding the blue one) are chosen to best approximate the original feasible region.\n\nFigure 2: Correlation between the solid angle and our proposed surrogate in a 200-dimensional space, in log scale. 
Note that we only need monotonicity for our objective to hold.\n\nFigure 3: Relation between the angle formed by two vectors (α) and the associated feasible set (grey region).\n\nwhere d denotes the dimension of the space, Sd−1 the unit sphere in this space, and λd−1 the Lebesgue measure in dimension d − 1.\n\nTherefore, solving Eq. 5 would achieve our goal. Note that, in practice, the number of constraints, and thus the number of gradients, is usually smaller than the dimension of the gradient, which means that the feasible space can be seen as the Cartesian product of its own intersection with span(M) and the orthogonal subspace of span(M). That being said, we can actually restrict our interest to the size of the solid angle in the M-dimensional space span(M), as in Eq. 6:\n\nminimize_M λM−1(S^span(M)_M−1 ∩ ⋂gi∈M {g | ⟨g, gi⟩ ≥ 0})   (6)\n\nwhere S^span(M)_M−1 denotes the unit sphere in span(M).\n\nNote that even if the subspaces span(M) are different from each other, they all have the same dimension M, which is fixed, hence comparing their λM−1-measures makes sense. However, this objective is hard to minimize since the formula of the solid angle is complex, as shown in [15] and [2]. Therefore, we propose, in the next section, a surrogate for this objective that is easier to deal with.\n\n3.3 An Empirical Surrogate to Feasible Region Minimization\n\nIntuitively, to decrease the feasible set, one must increase the angles between each pair of gradients. Indeed, this is directly visible in 2D in Figure 3. Based on this observation, we propose the surrogate in Eq. 7:\n\nminimize_M Σi,j∈M ⟨gi, gj⟩/(‖gi‖‖gj‖)   s.t.   M ⊂ [0..t−1]; |M| = M   (7)\n\nWe empirically studied the relationship between the solid angle and the surrogate function in higher dimensional spaces using randomly sampled vectors as the gradients. 
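This empirical study is easy to reproduce. A small numpy sketch of the surrogate of Eq. 7 and the variance identity of Eq. 8, on randomly sampled stand-in gradients (dimensions and set size are arbitrary choices of ours):

```python
import numpy as np

def surrogate(G):
    """Objective of Eq. 7: sum of pairwise cosine similarities between the
    rows of G (the i == j terms contribute a constant |G|)."""
    N = G / np.linalg.norm(G, axis=1, keepdims=True)   # unit-norm gradients
    return float((N @ N.T).sum())

def direction_variance(G):
    """Variance of the normalized gradient directions (left side of Eq. 8);
    the mean squared norm of unit vectors is 1."""
    N = G / np.linalg.norm(G, axis=1, keepdims=True)
    return float(1.0 - np.linalg.norm(N.mean(axis=0)) ** 2)

rng = np.random.default_rng(0)
G = rng.normal(size=(10, 200))   # 10 random "gradients" in 200 dimensions
M = len(G)
# Eq. 8: minimizing the surrogate is maximizing the direction variance
assert np.isclose(direction_variance(G), 1.0 - surrogate(G) / M**2)
```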
Given a set of sampled vectors, the surrogate value is computed analytically and the solid angle is estimated using a Monte Carlo approximation of Eq. 6. The results are presented in Figure 2, which shows a monotonic relation between the solid angle and our surrogate.\n\nIt is worth noting that minimization of Eq. 7 is equivalent to maximization of the variance of the gradient directions, as shown in Eq. 8:\n\nVar_M[g/‖g‖] = (1/M) Σk∈M ‖gk/‖gk‖‖² − ‖(1/M) Σk∈M gk/‖gk‖‖² = 1 − (1/M²) Σi,j∈M ⟨gi, gj⟩/(‖gi‖‖gj‖)   (8)\n\nThis brings up a new interpretation of the surrogate, which is maximizing the diversity of samples in the replay buffer using the parameter gradient as the feature. Intuitively, keeping diverse samples in the replay buffer could be an efficient way to use the memory budget. It is also possible to maximize the variance directly on the samples or on the hidden representations; however, we argue that the parameter gradient could be a better option given its root in Eq. 1. This is also verified with experiments.\n\n3.4 Online Sample Selection\n\n3.4.1 Online sample selection with Integer Quadratic Programming\n\nWe assume an infinite input stream of data where at each time step a new sample (or a few) is received. 
From this stream we keep a fixed buffer of size M to be used as a representative of the previous samples. To reduce the computational burden, we use a “recent” buffer in which we store the incoming examples; once it is full, we perform selection on the union of the replay buffer and the “recent” buffer, and replace the samples in the replay buffer with the selection. To perform the selection of M samples, we solve Eq. 7 as an integer quadratic programming problem, as shown in Appendix A.2. The exact procedure is described in Algorithm 1. While this is reasonable in the case of small buffer sizes, we observed a big overhead when utilizing larger buffers, which is likely the case in practical scenarios. The overhead comes both from the need to get the gradient of each sample in the buffer and the recent buffer, and from solving the quadratic problem, whose cost is polynomial in the size of the buffer. Since this might limit the scalability of our approach, we suggest an alternative greedy method.\n\n3.4.2 An inexact greedy alternative\n\nWe propose an alternative greedy method based on a heuristic, which could achieve the same goal of keeping diverse examples in the replay buffer, but is much cheaper than performing integer quadratic programming. The key idea is to maintain a score for each sample in the replay buffer. The score is computed as the maximal cosine similarity of the current sample with a fixed number of other random samples in the buffer. When there are two samples similar to each other in the buffer, their scores are more likely to be larger than the others'. In the beginning, when the buffer is not full, we add incoming samples along with their scores to the replay buffer. Once the buffer is full, we randomly select a sample from the replay buffer as the candidate to be replaced. We use the normalized score as the probability of this selection. 
The score of the candidate is then compared to the score of the new sample to determine whether the replacement should happen or not.\n\nMore formally, denote by Ci the score of sample i in the buffer. Sample i is selected as a candidate to be replaced with probability P(i) = Ci/Σj Cj. The replacement is a Bernoulli event that happens with probability Ci/(c + Ci), where Ci is the score of the candidate and c is the score of the new data. We can apply the same procedure for each example when a batch of new data is received.\n\nAlgorithm 2 describes the main steps of our gradient-based greedy sample selection procedure. It can be seen that the major cost of this selection procedure corresponds only to the estimation of the gradients of the selected candidates, which is a big computational advantage over the other selection strategies.\n\n3.5 Constraint vs Regularization\n\nProjecting the gradient of the new sample(s) exactly into the feasible region is computationally very expensive, especially when using a large buffer. A usual workaround for constrained optimization is to convert the constraints into a soft regularization loss. In our case, this is equivalent to performing rehearsal on the buffer. Note that [4] suggests constraining with only one random gradient direction from the buffer, a cheap alternative that works as well as constraining with the gradients of the previous tasks; it was later shown by the same authors [5] that rehearsal on the buffer has competitive performance. In our method we do rehearsal, while in Appendix B.2 we evaluate both rehearsal and constrained optimization on a small subset of disjoint MNIST and show comparable results.\n\n3.6 Summary of the Proposed Approach\n\nTo recap, we start from the constrained optimization view of continual learning, which needs to be relaxed by constraint selection. 
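As an illustration of this replacement rule, the following sketch implements the scoring and Bernoulli replacement of Algorithm 2 on plain vectors standing in for parameter gradients (function names, the buffer representation, and all sizes are our own choices, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_score(g, G):
    """Maximal cosine similarity of g to the rows of G, plus 1 so that
    scores are positive (line 7 of Algorithm 2)."""
    sims = (G @ g) / (np.linalg.norm(G, axis=1) * np.linalg.norm(g))
    return float(sims.max()) + 1.0

def greedy_update(buffer, scores, g, max_size, n=10):
    """One step of greedy sample selection: append g while the buffer is
    filling, otherwise probabilistically replace a high-score entry."""
    if len(buffer) < max_size:
        c = cosine_score(g, np.array(buffer)) if buffer else 1.0
        buffer.append(g)
        scores.append(c)
        return
    idx = rng.choice(len(buffer), size=min(n, len(buffer)), replace=False)
    c = cosine_score(g, np.array(buffer)[idx])
    if c < 1.0:  # negative max cosine similarity: g adds diversity
        p = np.array(scores) / np.sum(scores)
        i = int(rng.choice(len(buffer), p=p))   # P(i) = C_i / sum_j C_j
        if rng.uniform() < scores[i] / (scores[i] + c):
            buffer[i], scores[i] = g, c         # Bernoulli replacement

buf, sc = [], []
for _ in range(50):
    greedy_update(buf, sc, rng.normal(size=20), max_size=15)
assert len(buf) == 15
```

Compared with the IQP selection, each step touches only the incoming sample and n random buffer entries, which is what makes this variant as cheap as reservoir sampling.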
Instead of random selection, we perform constraint selection by minimizing the solid angle formed by the constraints. We propose a surrogate for the solid angle objective, and show their relation numerically. We further propose a greedy alternative that is computationally more efficient. Finally, we test the effectiveness of the proposed approach on continual learning benchmarks in the following section.\n\nAlgorithm 1 IQP Sample Selection\n1: Input: Mr, Mb\n2: function SelectSamples(M, M)\n3:   M̂ ← argmin_M̂ Σi,j∈M̂ ⟨gi, gj⟩/(‖gi‖‖gj‖)\n4:     s.t. M̂ ⊂ M; |M̂| = M\n5:   return M̂\n6: end function\n7: Initialize: Mr, Mb\n8: Receive: (x, y) ⊲ one or a few consecutive examples\n9: Update(x, y, Mb)\n10: Mr ← Mr ∪ {(x, y)}\n11: if len(Mr) > Mr then\n12:   Mb ← Mb ∪ Mr\n13:   Mr ← {}\n14:   if len(Mb) > Mb then\n15:     Mb ← SelectSamples(Mb, Mb)\n16:   end if\n17: end if\n\nAlgorithm 2 Greedy Sample Selection\n1: Input: n, M\n2: Initialize: M, C\n3: Receive: (x, y)\n4: Update(x, y, M)\n5: X, Y ← RandomSubset(M, n)\n6: g ← ∇θ ℓ(x, y); G ← ∇θ ℓ(X, Y)\n7: c ← maxi(⟨g, Gi⟩/(‖g‖‖Gi‖)) + 1 ⊲ make the score positive\n8: if len(M) ≥ M then\n9:   if c < 1 then ⊲ cosine similarity < 0\n10:     i ∼ P(i) = Ci/Σj Cj\n11:     r ∼ uniform(0, 1)\n12:     if r < Ci/(Ci + c) then\n13:       Mi ← (x, y); Ci ← c\n14:     end if\n15:   end if\n16: else\n17:   M ← M ∪ {(x, y)}; C ← C ∪ {c}\n18: end if\n\n4 Experiments\n\nThis section serves to validate our approach and show its effectiveness at dealing with continual learning problems where task boundaries are not available.\n\nBenchmarks\n\nWe consider 3 different benchmarks, detailed below.\n\nDisjoint MNIST: the MNIST dataset divided into 5 tasks based on the labels, with two labels in each task. 
We use 1k examples per task for training and report results on all test examples.\n\nPermuted MNIST: We perform 10 unique permutations on the pixels of the MNIST images. The permutations result in 10 different tasks with the same distribution of labels but different distributions of the input images. Following [10], each of the tasks in permuted MNIST contains only 1k training examples. The test set for this dataset is the union of the MNIST test set under each of the different permutations.\n\nDisjoint CIFAR-10: Similar to disjoint MNIST, the dataset is split into 5 tasks according to the labels, with two labels in each task. As this is harder than MNIST, we use a total of 10k training examples, with 2k examples per task.\n\nIn all experiments, we use a fixed batch size of 10 samples and perform a few iterations over each batch (1-5); note that this is different from multiple epochs over the whole data. For disjoint MNIST, we report results using different buffer sizes in Table 1. For permuted MNIST, results are reported using buffer size 300, while for disjoint CIFAR-10 we couldn't get sensible performance for the studied methods with a buffer size smaller than 1k. All results are averaged over 3 different random seeds.\n\nModels: Following [10], for disjoint and permuted MNIST we use a two-layer neural network with 100 neurons per layer, while for CIFAR-10 we use ResNet18. Note that we employ a shared head in the incremental classification experiments, which is much more challenging than the multi-head used in [10]. 
In all experiments, we use the SGD optimizer with a learning rate of 0.05 for disjoint MNIST and permuted MNIST, and 0.01 for disjoint CIFAR-10.\n\nTable 1: Average test accuracy of sample selection methods on disjoint MNIST with different buffer sizes.\nMethod | 300 | 400 | 500\nRand | 37.5 ± 1.3 | 45.9 ± 4.8 | 57.9 ± 4.1\nGSS-IQP (ours) | 75.9 ± 2.5 | 82.1 ± 0.6 | 84.1 ± 2.4\nGSS-Clust | 75.7 ± 2.2 | 81.4 ± 4.4 | 83.9 ± 1.6\nFSS-Clust | 75.8 ± 1.7 | 80.6 ± 2.7 | 83.4 ± 2.6\nGSS-Greedy (ours) | 82.6 ± 2.9 | 84.6 ± 0.9 | 84.8 ± 1.8\n\nTable 2: Comparison of different selection strategies on the permuted MNIST benchmark.\nMethod | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | Avg\nRand | 67.01 ± 2.7 | 62.18 ± 4.6 | 69.63 ± 3.2 | 62.05 ± 2.4 | 68.41 ± 1.0 | 72.81 ± 3.0 | 77.67 ± 2.3 | 77.28 ± 1.8 | 83.92 ± 0.6 | 84.52 ± 0.3 | 72.54 ± 0.4\nGSS-IQP (ours) | 74.1 ± 2.2 | 69.73 ± 0.6 | 70.77 ± 4.9 | 70.5 ± 2.5 | 73.34 ± 4.8 | 78.6 ± 2.8 | 81.8 ± 0.6 | 81.8 ± 0.7 | 86.4 ± 0.8 | 85.45 ± 0.4 | 77.3 ± 0.5\nGSS-Clust | 75.3 ± 1.3 | 75.22 ± 1.9 | 76.66 ± 0.9 | 75.09 ± 1.6 | 78.76 ± 0.9 | 81.14 ± 1.1 | 81.32 ± 2.0 | 83.87 ± 0.7 | 84.52 ± 1.2 | 85.52 ± 0.6 | 79.74 ± 0.2\nFSS-Clust | 82.2 ± 0.9 | 71.34 ± 2.3 | 76.9 ± 1.3 | 70.5 ± 4.1 | 70.56 ± 1.4 | 74.9 ± 1.5 | 77.68 ± 3.3 | 79.56 ± 2.6 | 82.7 ± 1.5 | 85.3 ± 0.6 | 77.8 ± 0.3\nGSS-Greedy (ours) | 83.35 ± 1.1 | 70.84 ± 1.3 | 72.48 ± 1.7 | 70.5 ± 3.4 | 72.8 ± 1.7 | 73.75 ± 3.8 | 79.86 ± 1.8 | 80.45 ± 2.9 | 82.56 ± 1.1 | 84.8 ± 1.6 | 77.3 ± 0.5\n\nTable 3: Comparison of different selection strategies on the disjoint CIFAR-10 
benchmark.\nMethod | T1 | T2 | T3 | T4 | T5 | Avg\nRand | 0 ± 0.0 | 0.49 ± 0.4 | 5.68 ± 4.4 | 52.18 ± 0.8 | 84.96 ± 4.4 | 28.6 ± 1.2\nGSS-Clust | 0.35 ± 0.5 | 15.27 ± 8.3 | 7.96 ± 6.3 | 9.97 ± 2.1 | 77.83 ± 0.7 | 22.5 ± 0.4\nFSS-Clust | 0.2 ± 0.2 | 0.8 ± 0.5 | 5.4 ± 0.7 | 38.12 ± 5.2 | 87.90 ± 3.1 | 26.7 ± 1.5\nGSS-Greedy (ours) | 42.36 ± 12.1 | 14.61 ± 2.7 | 13.60 ± 4.5 | 19.30 ± 2.7 | 77.83 ± 4.2 | 33.56 ± 1.7\n\n4.1 Comparison with Sample Selection Methods\n\nWe want to study buffer population in the context of the online continual learning setting, when no task information is present and no assumption on the data-generating distribution is made. Since most existing works assume knowledge of task boundaries, we deploy 3 baselines along with our two proposed methods². Given a fixed buffer size M, we compare the following:\n\nRandom (Rand): Whenever a new batch is received, it joins the buffer. When the buffer is full, we randomly select M samples to keep from the new batch and the samples already in the buffer.\n\nOnline Clustering: A possible way to keep diverse samples in the buffer is online clustering, with the goal of selecting a set of M centroids. This can be done either in the feature space (FSS-Clust), where we use as a metric the distance between the samples' features (here, the last layer before classification), or in the gradient space (GSS-Clust), where as a metric we consider the Euclidean distance between the normalized gradients. We adapted the doubling algorithm for incremental clustering described in [3].\n\nIQP Gradients (GSS-IQP): Our surrogate to select samples that minimize the feasible region, described in Eq. 7 and solved as an integer quadratic programming problem. 
Due to the computational cost, we report GSS-IQP on permuted MNIST and disjoint MNIST only.\n\nGradient greedy selection (GSS-Greedy): Our greedy selection variant, detailed in Algo. 2. Note that, differently from the previous selection strategies, it doesn't require re-processing all the recent and buffer samples to perform the selection, which is a huge gain in the online streaming setting.\n\n4.2 Performance of Sample Selection Methods\n\nTables 1, 2 and 3 report the test accuracy on each task at the end of the data stream for disjoint MNIST, permuted MNIST and disjoint CIFAR-10, respectively. First of all, the accuracies reported in the tables might appear lower than state-of-the-art numbers. This is due to the strict online setting: we use a shared head and, more importantly, no information about task boundaries. In contrast, all previous works assume availability of task boundaries either at training or at both training and testing. The performance of the random baseline Rand clearly indicates the difficulty of this setting.\n\n²The code is available at https://github.com/rahafaljundi/Gradient-based-Sample-Selection\n\nTable 4: Comparison with reservoir sampling on different imbalanced data sequences from disjoint MNIST.\nMethod | Seq1 | Seq2 | Seq3 | Seq4 | Seq5 | Avg\nReservoir | 63.7 ± 0.8 | 69.4 ± 0.7 | 66.8 ± 4.8 | 69.1 ± 2.4 | 76.6 ± 1.6 | 69.12 ± 4.3\nGSS-IQP (ours) | 75.9 ± 3.2 | 76.2 ± 4.1 | 79.06 ± 0.7 | 76.6 ± 2.0 | 74.7 ± 1.8 | 76.49 ± 1.4\nGSS-Greedy (ours) | 71.2 ± 3.6 | 78.5 ± 2.7 | 81.5 ± 2.3 | 79.5 ± 0.6 | 79.1 ± 0.7 | 77.96 ± 3.5\n\nIt can be seen that both of our selection methods stably outperform the baselines across different buffer sizes on different benchmarks. 
Notably, the gradient-based clustering GSS-Clust performs comparably to, and on permuted MNIST even favorably against, the feature clustering FSS-Clust, suggesting the effectiveness of a gradient-based metric in the continual learning setting. Surprisingly, GSS-Greedy performs on par with, and even better than, the other selection strategies, especially on disjoint CIFAR-10, indicating a cheap yet strong sample selection strategy. It is worth noting that Rand achieves high accuracy on T4 and T5 of the CIFAR-10 sequence. In fact, this is an artifact of the random selection strategy: at the end of the sequence, the buffer sampled by Rand is composed of very few samples from the first tasks (24 samples from T1, T2 and T3) but many more from the recent T4 (127) and T5 (849). As such, it forgets the more recent tasks less, at the cost of the older ones.\n\n4.3 Comparison with Reservoir Sampling\n\nReservoir sampling [17] is a simple replacement strategy to fill the memory buffer when the task boundaries are unknown, based on the underlying assumption that the overall data stream is i.i.d. It works well when each of the tasks has a similar number of examples. However, it could lose the information on under-represented tasks if some of the tasks have significantly fewer examples than the others. In this paper we study and propose algorithms to sample from an imbalanced stream of data. Our strategy makes no assumption on the data stream distribution, hence it could be less affected by imbalanced data, which is often encountered in practice.\n\nWe test this scenario on disjoint MNIST. 
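For contrast, the reservoir sampling baseline [17] can be sketched in a few lines (a standard textbook formulation; the stream contents and sizes here are illustrative, not the experimental setup):

```python
import random

def reservoir_sample(stream, buffer_size, seed=0):
    """Classic reservoir sampling: maintains a uniform random subset of the
    stream seen so far, so the buffer mirrors the overall stream distribution
    and under-represents rare tasks in an imbalanced stream."""
    rng = random.Random(seed)
    buffer = []
    for t, item in enumerate(stream):
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            j = rng.randint(0, t)      # uniform over all t+1 items seen so far
            if j < buffer_size:        # keep item with prob buffer_size/(t+1)
                buffer[j] = item
    return buffer

# An imbalanced stream: task 0 contributes 10x more samples than task 1,
# so task 1 gets roughly 200/2200 of the slots in expectation.
stream = [0] * 2000 + [1] * 200
buffer = reservoir_sample(stream, 300)
assert len(buffer) == 300
```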
We modify the data stream to settings where one of the tasks has an order of magnitude more samples than the rest, generating 5 different sequences: the first sequence has 2000 samples of the first task and 200 from each other task, the second sequence has 2000 samples from the second task and 200 from the others, and the same strategy applies to the rest of the sequences. Table 4 reports the average accuracy at the end of each sequence over 3 runs, with 300 samples as the buffer size. It can be clearly seen that our selection strategies outperform reservoir sampling, especially when the first tasks are under-represented and many learning steps afterwards lead to forgetting. Our improvement reaches 15%. While we don't show the individual task accuracies here due to the space limit, it is worth noting that reservoir sampling suffers severely on under-represented tasks, resulting in very low accuracy.\n\nHaving shown the robustness of our selection strategies, both GSS-IQP and GSS-Greedy, we now move on to compare with state-of-the-art replay-based methods that allocate a separate buffer per task and only replay samples from previous tasks during the learning of others.\n\n4.4 Comparison with State-of-the-art Task Aware Methods\n\nOur method ignores any task information, which places us at a disadvantage, because the methods that we compare to utilize the task boundaries as extra information. In spite of this disadvantage, we show that our method performs similarly on these datasets.\n\nCompared Methods\n\nSingle: a model trained online on the stream of data without any mechanism to prevent forgetting.\n\ni.i.d.online: the Single baseline trained on an i.i.d. stream of the data.\n\ni.i.d.offline: a model trained offline for multiple epochs with i.i.d. sampled batches. As such, i.i.d.online trains on the i.i.d. 
stream for just one pass, while i.i.d.offline takes multiple epochs.

GEM [10]: stores a fixed amount of random examples per task and uses them to provide constraints when learning new examples.

Figure 4: Comparison with state-of-the-art task-aware replay methods. Figures show test accuracy.

(a) Disjoint MNIST

(b) Permuted MNIST

(c) Disjoint CIFAR-10

iCaRL [14]: follows an incremental classification setting. It also stores a fixed number of examples per class but uses them to rehearse the network when learning new information.

For ours, we report both GSS-IQP and GSS-Greedy on permuted and disjoint MNIST, and only GSS-Greedy on disjoint CIFAR-10 due to the computational burden.

Since we perform multiple iterations over a given batch, while still in the online setting, we treat the number of iterations as a hyperparameter for GEM and iCaRL. We found that GEM's performance consistently deteriorates with multiple iterations, while iCaRL's improves.

Figure 4a shows the test accuracy on disjoint MNIST, evaluated during the training procedure at an interval of 100 training examples with a buffer size of 300. For the i.i.d. baselines, we only show the performance achieved at the end of training. For iCaRL, we only show the accuracy at the end of each task, because iCaRL uses the selected exemplars for prediction, which only happens at the end of each task. We observe that both variants of our method have a very similar learning curve to GEM, except for the last few iterations where the GSS-IQP performance drops slightly.
Figure 4b compares our methods with the baselines and GEM on the permuted MNIST dataset. Note that iCaRL is not included, as it is designed only for incremental classification. From the performance of the Single baseline it is apparent that permuted MNIST has less interference between the different tasks.
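The constraint mechanism that GEM relies on in these comparisons can be illustrated with a minimal single-constraint sketch. This is the simplified, averaged-gradient variant in the spirit of A-GEM [4]; GEM [10] itself enforces one such constraint per previous task by solving a small quadratic program. The function name is illustrative.

```python
import numpy as np

def project_gradient(g, g_ref):
    """If the current gradient g conflicts with the reference gradient
    g_ref computed on replayed memory samples (negative dot product),
    project g onto the closest gradient whose dot product with g_ref is
    non-negative. Single-constraint sketch in the spirit of A-GEM [4]."""
    dot = float(np.dot(g, g_ref))
    if dot >= 0.0:
        return g  # no interference with the memory: keep g unchanged
    # Remove the conflicting component along g_ref.
    return g - (dot / float(np.dot(g_ref, g_ref))) * g_ref
```

After projection, the dot product with `g_ref` is zero for a previously violated constraint, so a step along the negative projected gradient does not, to first order, increase the loss on the replayed samples.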
Our methods perform better than GEM and get close to the i.i.d.online performance.
Figure 4c shows the accuracy on disjoint CIFAR-10, evaluated during the training procedure at an interval of 100 training examples. GSS-Greedy shows better performance than GEM and iCaRL, and it even achieves a better average test performance at the end of the sequence than i.i.d.online.
We found that GEM suffers more forgetting on previous tasks, while iCaRL shows lower performance on the last task. Note that our setting is much harder than the offline per-task training used in iCaRL or the multi-head setting used in other works.

5 Conclusion

In this paper, we show that in the online continual learning setting we can smartly select a finite number of samples to be representative of all previously seen data, without knowing the task boundaries. We aim for sample diversity in the gradient space and introduce a greedy selection approach that is efficient and consistently outperforms other selection strategies. We still perform as well as algorithms that use the knowledge of task boundaries to select the representative examples. Moreover, our selection strategy gives us an advantage in settings where the task boundaries are blurry or the data are imbalanced.

Acknowledgements

Rahaf Aljundi is funded by FWO.

References

[1] Rahaf Aljundi, Marcus Rohrbach, and Tinne Tuytelaars. Selfless sequential learning. In ICLR, 2019.

[2] Matthias Beck, Sinai Robins, and Steven V Sam.
Positivity theorems for solid-angle polynomials. arXiv preprint arXiv:0906.4031, 2009.

[3] Moses Charikar, Chandra Chekuri, Tomás Feder, and Rajeev Motwani. Incremental clustering and dynamic information retrieval. SIAM Journal on Computing, 33(6):1417–1440, 2004.

[4] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.

[5] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.

[6] Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.

[7] Robert M French. Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference. Network, 1111:00001, 1994.

[8] David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[9] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.

[10] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

[11] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.

[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[13] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.

[14] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.

[15] Jason M Ribando. Measuring solid angles beyond dimension three. Discrete & Computational Geometry, 36(3):479–487, 2006.

[16] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.

[17] Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.