{"title": "Uncertainty-based Continual Learning with Adaptive Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 4392, "page_last": 4402, "abstract": "We introduce a new neural network-based continual learning algorithm, dubbed Uncertainty-regularized Continual Learning (UCL), which builds on the traditional Bayesian online learning framework with variational inference. We focus on two significant drawbacks of the recently proposed regularization-based methods: a) the considerable additional memory cost for determining the per-weight regularization strengths and b) the absence of a graceful forgetting scheme, which can prevent performance degradation in learning new tasks. In this paper, we show UCL can solve these two problems by introducing a fresh interpretation of the Kullback-Leibler (KL) divergence term of the variational lower bound for the Gaussian mean-field approximation. Based on this interpretation, we propose the notion of node-wise uncertainty, which drastically reduces the number of additional parameters needed to implement per-weight regularization. Moreover, we devise two additional regularization terms that enforce \\emph{stability} by freezing important parameters for past tasks and allow \\emph{plasticity} by controlling the actively learning parameters for a new task. Through extensive experiments, we show UCL convincingly outperforms most recent state-of-the-art baselines not only on popular supervised learning benchmarks, but also on challenging lifelong reinforcement learning tasks. 
The source code of our algorithm is available at https://github.com/csm9493/UCL.", "full_text": "Uncertainty-based Continual Learning with\n\nAdaptive Regularization\n\nHongjoon Ahn1\u2217, Sungmin Cha2\u2217, Donggyu Lee2 and Taesup Moon1,2\n\n1Department of Artificial Intelligence, 2Department of Electrical and Computer Engineering,\n\nSungkyunkwan University, Suwon, Korea 16419\n\n{hong0805, csm9493, ldk308, tsmoon}@skku.edu\n\nAbstract\n\nWe introduce a new neural network-based continual learning algorithm, dubbed Uncertainty-regularized Continual Learning (UCL), which builds on the traditional Bayesian online learning framework with variational inference. We focus on two significant drawbacks of the recently proposed regularization-based methods: a) the considerable additional memory cost for determining the per-weight regularization strengths and b) the absence of a graceful forgetting scheme, which can prevent performance degradation in learning new tasks. In this paper, we show UCL can solve these two problems by introducing a fresh interpretation of the Kullback-Leibler (KL) divergence term of the variational lower bound for the Gaussian mean-field approximation. Based on this interpretation, we propose the notion of node-wise uncertainty, which drastically reduces the number of additional parameters needed to implement per-weight regularization. Moreover, we devise two additional regularization terms that enforce stability by freezing important parameters for past tasks and allow plasticity by controlling the actively learning parameters for a new task. Through extensive experiments, we show UCL convincingly outperforms most recent state-of-the-art baselines not only on popular supervised learning benchmarks, but also on challenging lifelong reinforcement learning tasks. 
The source code of our algorithm is available at https://github.com/csm9493/UCL.\n\n1 Introduction\n\nContinual learning, also called lifelong learning, is a long-standing open problem in machine learning in which data from multiple tasks continuously arrive and the learning algorithm should constantly adapt to new tasks while not forgetting what it has learned in the past. The main challenge is to resolve the so-called stability-plasticity dilemma [2, 18]. Namely, a learning agent should be able to preserve what it has learned, but focusing too much on stability may hinder it from quickly learning a new task. On the other hand, when the agent focuses too much on plasticity, it tends to quickly forget what it has learned. Particularly, artificial neural network (ANN)-based models, which have become the mainstream of machine learning methods, are well known to be prone to this catastrophic forgetting phenomenon [17, 4]. As opposed to ANNs, humans are able to maintain obtained knowledge while learning a new task, and forgetting in the human brain happens gradually rather than drastically. This difference motivates active research in developing neural network based continual learning algorithms.\nAs given in a comprehensive survey [20] on this topic, approaches for tackling catastrophic forgetting in neural network based continual learning can be roughly grouped into three categories: regularization-based [14, 12, 30, 19], dynamic network architecture-based [23, 29], and dual memory system-based [22, 15, 27, 10]. While each category has its own merit, of particular interest are the regularization-based methods, since they aim to maximally utilize the limited network capacity by imposing constraints on the update of the network given a new task. \n\n\u2217Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n
Computationally, they are typically realized by adding regularization terms that penalize changes in the network parameters when learning a new task. This approach makes sense since it is well known that neural network models are highly over-parametrized, and, once successful, it can also be complementary to other approaches, since it can lead to an efficient usage of network capacity as the number of tasks grows, as in [25].\nThe recent state-of-the-art regularization-based methods typically implement per-weight regularization strengths based on several different principles for inferring the importance of each parameter for the given tasks; e.g., the diagonal Fisher information matrix for EWC [12], the variance term associated with each weight parameter for VCL [19], and the path integrals of the gradient vector fields for SI [30]. While these methods are shown to be very effective on several continual learning benchmarks, a common caveat is that the amount of memory required to store the model is twice the size of the original neural network parameters, since they need to store the individual regularization parameters. We note that this could be a limiting factor for deployment with large networks.\nIn this paper, we propose a new regularization-based continual learning algorithm, dubbed Uncertainty-regularized Continual Learning (UCL), that stores a much smaller number of additional parameters for the regularization terms than the recent state of the art, yet achieves much better performance on several benchmark datasets. 
The following summarizes our key contributions.\n\n\u2022 We adopt the standard Bayesian online learning framework, but make a fresh interpretation of the Kullback-Leibler (KL) divergence term of the variational lower bound for the Gaussian mean-field approximation case.\n\u2022 We define a novel notion of \u201cuncertainty\u201d for each hidden node in a network by tying the learnable variances of the incoming weights of a node. Moreover, we add two additional regularization terms to freeze the weights identified as important, to gracefully forget what was learned before, and to control the actively learning weights.\n\u2022 We achieve state-of-the-art performance on a number of continual learning benchmarks, including supervised learning (SL) tasks with deep convolutional neural networks and reinforcement learning (RL) tasks with different state-action spaces. Performing well on both SL and RL continual learning tasks is a unique strength of our UCL.\n\n2 Related Work\n\nContinual learning There are numerous approaches to continual learning, and we refer the readers to [20] for an extensive review. We only list work relevant to our method. The main approach of regularization-based methods in continual learning is to identify the important weights for the learned tasks and penalize large updates on those weights when learning a new task. LwF [14] contains task-specific layers and keeps similar outputs for the old tasks by utilizing knowledge distillation [9]. In EWC [12], the diagonal of the Fisher information matrix at the learned parameters of the given task is used for setting the relative regularization strength. An extended version of EWC, IMM [13], merges the posteriors based on the mean and the mode of the old and new parameters. SI [30] computes the parameter importance considering a path integral of gradient vector fields during the parameter updates. 
VCL [19] also adopts the Bayesian online learning framework as we do, but simply applies standard techniques, which results in some drawbacks that are elaborated in Section 3.1.\nSome work approaches continual learning differently from the regularization-based methods for the limited network capacity case. PackNet [16] picks out task-specific weights based on a weight pruning method, which requires saving learnable binary masks for the weights. HAT [26] employs a node-wise attention mechanism per layer using a task identifier embedding, but requires knowing the number of tasks a priori, which is a critical limitation.\nVariational inference In standard Bayesian learning, the main idea is to efficiently approximate the posterior distribution over models. [6] introduces a practical variational inference technique for neural networks, suggesting that the variational parameters can be learned using back-propagation. Another approach in variational inference is [11], which introduces an approximate lower bound of the likelihood and learns the variational parameters using the reparameterization trick. [1] introduces an unbiased Monte Carlo gradient estimator that also uses back-propagation, but allows many kinds of priors to be used. In addition, there are several practical methods for variational inference in neural networks, such as using dropout [5] or Expectation-Propagation [8].\n\n3 Uncertainty-regularized Continual Learning (UCL)\n\n3.1 Notations and a review of Bayesian online learning\nConsider a discriminative neural network model, $p(y|x,\\mathcal{W})$, that returns a probability distribution over the output $y$ given an input $x$ and parameters $\\mathcal{W}$. 
In standard Bayesian learning, $\\mathcal{W}$ is assumed to be sampled from some prior distribution $p(\\mathcal{W}|\\alpha)$ that depends on some parameter $\\alpha$, and after observing some data $\\mathcal{D}=\\{(x_i,y_i)\\}_{i=1}^{n}$, obtaining the posterior $p(\\mathcal{W}|\\alpha,\\mathcal{D})$ becomes the central problem of learning the model parameters. Since exactly obtaining the posterior is intractable, variational inference [1, 3, 6] instead tries to approximate this posterior with a more tractable distribution $q(\\mathcal{W}|\\theta)$. The approximation is done by minimizing (over $\\theta$) the so-called variational free energy, which can be written as\n\n$\\mathcal{F}(\\mathcal{D},\\theta)=\\mathbb{E}_{q(\\mathcal{W}|\\theta)}[-\\log p(\\mathcal{D}|\\mathcal{W})]+D_{KL}(q(\\mathcal{W}|\\theta)\\,\\|\\,p(\\mathcal{W}|\\alpha))$,  (1)\n\nin which $\\log p(\\mathcal{D}|\\mathcal{W})$ is the log-likelihood of the data $\\mathcal{D}$ determined by the model $p(y|x,\\mathcal{W})$, and $D_{KL}(\\cdot)$ is the Kullback-Leibler divergence. Moreover, the commonly used $q(\\mathcal{W}|\\theta)$ is the so-called Gaussian mean-field approximation, $q(\\mathcal{W}|\\theta)=\\prod_i\\mathcal{N}(w_i|\\mu_i,\\sigma_i)$ with $\\theta=(\\mu,\\sigma)$, and $\\theta$ can be learned via the reparameterization trick [11] and standard back-propagation.\nIn the Bayesian online learning framework, standard variational inference can be applied to the continual learning setting. Namely, when the dataset for task $t$, $\\mathcal{D}_t$, arrives, the framework minimizes\n\n$\\mathcal{F}(\\mathcal{D}_t,\\theta_t)=\\mathbb{E}_{q(\\mathcal{W}|\\theta_t)}[-\\log p(\\mathcal{D}_t|\\mathcal{W})]+D_{KL}(q(\\mathcal{W}|\\theta_t)\\,\\|\\,q(\\mathcal{W}|\\theta_{t-1}))$  (2)\n\nover $\\theta_t=(\\mu_t,\\sigma_t)$, in which $q(\\mathcal{W}|\\theta_{t-1})$ stands for the posterior learned after observing $\\mathcal{D}_{t-1}$, acting as a prior for learning $q(\\mathcal{W}|\\theta_t)$. Note that in (2) the KL-divergence term naturally acts as a regularization term. In VCL [19], it was shown that the network learned by sequentially solving (2) with the projection operator of variational inference for each task $t$ can successfully combat the catastrophic forgetting problem to some extent.\nHowever, we argue that this Bayesian approach of VCL has several drawbacks as well. 
First, due to the Monte-Carlo sampling of the model weights for computing the likelihood term in (2), the time and space complexity of learning grows with the sample size. Second, since the variance term is defined for every weight parameter, the number of parameters to maintain becomes exactly twice the size of the network weights. This becomes problematic when deploying a large-sized network, as is the case in modern deep learning. In this paper, we present a novel approach which can resolve the above problems. Our key idea is rooted in a fresh interpretation of the closed form of the KL-divergence term in (2) for the Gaussian mean-field approximation and the Bayesian neural network pruning [6, 1].\n\n3.2 Interpreting KL-divergence and motivation of UCL\n\nWhile the KL divergence in (2) acts as a generic regularization term, we give a closer look at it, particularly for the Gaussian mean-field approximation case. Namely, after some algebra and evaluating the Gaussian integral, the closed form of $D_{KL}(q(\\mathcal{W}|\\theta_t)\\,\\|\\,q(\\mathcal{W}|\\theta_{t-1}))$ becomes:\n\n$\\frac{1}{2}\\sum_{l=1}^{L}\\Big[\\underbrace{\\Big\\|\\frac{\\mu^{(l)}_t-\\mu^{(l)}_{t-1}}{\\sigma^{(l)}_{t-1}}\\Big\\|_2^2}_{(a)}+\\mathbf{1}^{\\top}\\underbrace{\\Big\\{\\Big(\\frac{\\sigma^{(l)}_t}{\\sigma^{(l)}_{t-1}}\\Big)^2-\\log\\Big(\\frac{\\sigma^{(l)}_t}{\\sigma^{(l)}_{t-1}}\\Big)^2\\Big\\}}_{(b)}\\Big]$,  (3)\n\nin which $L$ is the number of layers in the network, $(\\mu^{(l)}_t,\\sigma^{(l)}_t)$ are the mean and standard deviation of the weight matrix for layer $l$ that are subject to learning for task $t$, $(\\mu^{(l)}_{t-1},\\sigma^{(l)}_{t-1})$ are the same quantities learned up to the previous task, the fraction notation means element-wise division between tensors, and $\\|\\cdot\\|_2^2$ stands for the squared Frobenius norm of a matrix. 
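As a quick numerical sanity check (ours, not part of the paper), the bracketed terms (a) and (b) in (3) agree with the exact KL divergence between two diagonal Gaussians up to an additive constant that does not depend on the learned parameters, and hence does not affect the minimization:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of weights in one layer

# posterior learned up to task t-1, and a candidate posterior for task t
mu_prev = rng.normal(size=n)
sigma_prev = rng.uniform(0.05, 0.5, size=n)
mu_t = mu_prev + 0.1 * rng.normal(size=n)
sigma_t = rng.uniform(0.05, 0.5, size=n)

# term (a): squared Mahalanobis distance with covariance diag(sigma_prev^2)
term_a = np.sum(((mu_t - mu_prev) / sigma_prev) ** 2)

# term (b): (sigma_t/sigma_prev)^2 - log (sigma_t/sigma_prev)^2, summed
ratio_sq = (sigma_t / sigma_prev) ** 2
term_b = np.sum(ratio_sq - np.log(ratio_sq))

# exact KL divergence between the two diagonal Gaussians
kl = np.sum(np.log(sigma_prev / sigma_t)
            + (sigma_t ** 2 + (mu_t - mu_prev) ** 2) / (2 * sigma_prev ** 2)
            - 0.5)

# (1/2)[(a) + (b)] equals the exact KL plus the constant n/2
assert np.isclose(0.5 * (term_a + term_b), kl + n / 2)
```

Since each summand of (b) has the form x − log x ≥ 1, term (b) is bounded below and attains its minimum exactly when the ratio equals one, matching the interpretation in the text.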
The detailed derivation of (3) is given in the Supplementary Materials. The term (a) in (3) can be interpreted as the squared Mahalanobis distance between the vectorized $\\mu^{(l)}_t$ and $\\mu^{(l)}_{t-1}$, in which the covariance matrix is $\\mathrm{diag}((\\sigma^{(l)}_{t-1})^2)$, and it acts as a regularization term for $\\mu^{(l)}_t$ deviating from $\\mu^{(l)}_{t-1}$. Namely, when minimizing (3) over $\\theta^{(l)}_t=(\\mu^{(l)}_t,\\sigma^{(l)}_t)$, the inverse of the variance learned up to task $(t-1)$ acts as a per-weight regularization strength for $\\mu^{(l)}_t$. This makes sense, since each element of $(\\sigma^{(l)}_{t-1})^2$ can be regarded as an uncertainty measure for the corresponding mean weight of $\\mu^{(l)}_{t-1}$, and a weight with small uncertainty should be treated as important, such that a high penalty is imposed when it gets significantly updated for a new task $t$. Moreover, the term (b) in (3), which is convex in $(\\sigma^{(l)}_t)^2$ and is minimized when $\\sigma^{(l)}_t=\\sigma^{(l)}_{t-1}$, acts as a regularization term for $(\\sigma^{(l)}_t)^2$. Note it promotes preserving the learned uncertainty measure when updating for a new task. This also makes sense for preventing catastrophic forgetting, since the weights identified as important in previous tasks should be kept as important for future tasks as well, such that the weights do not get updated too much by the term (a). Based on this interpretation, we modify each term and devise a new loss function for UCL.\n\n3.3 Modifying the term (a)\n\nWe modify the term (a) in (3) based on the following three intuitions. First, instead of maintaining the uncertainty measure for each mean weight parameter of $\\mu_t$, we devise a notion of uncertainty for each node of the network. Second, based on the node uncertainty, we set a high regularization strength for a weight when either of the nodes it connects has low uncertainty. 
Third, we add an additional $\\ell_1$-regularizer such that a weight gets an even more stringent penalty for getting updated when the weight has a large magnitude or low uncertainty, inspired by [1, 6]. We elaborate on each of these intuitions below.\nWhile it is plausible to maintain the weight-level importance as in other work [12, 19, 30], we believe maintaining the importance (or, in our case, uncertainty) at the level of nodes makes more sense, not only for the purpose of reducing the model parameters, but also because the node value (or activation) is the basic unit for representing the learned information from a task. A similar intuition of working at the node level also appears in HAT [26], which devised a hard attention mechanism for important nodes, and in dropout [28], which randomly drops nodes while training. In our setting, we define the uncertainty of a node as illustrated in Figure 1: we first constrain the incoming weights of the node to have the same standard deviation parameter, as for node $j$ of layer $(l-1)$ in Figure 1, and then set the variance as the uncertainty of the node. For the Gaussian mean-field approximation case, this constraint corresponds to adding zero-mean i.i.d. Gaussian noise (with different variances for different nodes) to the incoming weights when sampling for the variational learning.\nFor our second intuition, we derive the weight-level regularization scheme based on the following arguments. Namely, as shown in Figure 1, suppose a node is identified as important (the orange nodes), i.e., has low uncertainty, for the past tasks, and the learning of a new task is taking place. 
We believe there are two major sources that can cause catastrophic forgetting of the past tasks when a weight update for a new task happens: 1) the negative transfer (blue region) happening in the incoming weights of an important node, and 2) the information loss (pink region) happening in the outgoing weights of an important node. From the perspective of the important node, it is clear that when any of the incoming weights are significantly updated during the learning of the new task, the node\u2019s representation of the past tasks will get significantly altered, as the node will combine information from the lower layer differently, hurting the accuracy on the past tasks. On the other hand, when the outgoing weights of the important node are significantly updated, the information of that node will get washed out during forward propagation; hence, it may not play an important role in computing the prediction, causing an accuracy drop for the past tasks.\nFrom the above argument, we devise the weight-level regularization such that a weight gets a high regularization strength when either of the nodes it connects has low uncertainty. 
Figure 1: Information loss and negative transfer of an important node.\n\nThis is realized by replacing the term (a) of (3) with the following:\n\n$\\frac{1}{2}\\sum_{l=1}^{L}\\big\\|\\Lambda^{(l)}\\odot(\\mu^{(l)}_t-\\mu^{(l)}_{t-1})\\big\\|_2^2$, where $\\Lambda^{(l)}_{ij}\\triangleq\\max\\Big\\{\\frac{\\sigma^{(l)}_{init}}{\\sigma^{(l)}_{t-1,i}},\\ \\frac{\\sigma^{(l-1)}_{init}}{\\sigma^{(l-1)}_{t-1,j}}\\Big\\}$,  (4)\n\nin which $\\sigma^{(l)}_{init}$ is the initial standard deviation hyperparameter for all weights in the $l$-th layer, $L$ is the number of layers in the network, $\\mu^{(l)}_t$ is the mean weight matrix for layer $l$ and task $t$, $\\odot$ is the element-wise multiplication between matrices, and the matrix $\\Lambda^{(l)}$ defines the regularization strength for the weight $\\mu^{(l)}_{t,ij}$; i.e., when either $\\sigma^{(l)}_{t-1,i}$ or $\\sigma^{(l-1)}_{t-1,j}$ is small, $\\mu^{(l)}_{t,ij}$ gets a high regularization strength. We note that setting $\\sigma^{(l)}_{init}$ correctly is important for controlling the stability of the learning process.\nWhile (4) is a sensible replacement of the term (a) in (3), our third intuition above is based on the observation that (4) does not take into account the magnitude of the learned weights, i.e., $\\mu^{(l)}_{t-1}$. In [1, 6], a heuristic was applied for pruning network weights learned by variational inference; i.e., a weight is kept only if the magnitude of the ratio $\\mu/\\sigma$ is large, and pruned otherwise. 
Inspired by the pruning heuristic, we devise an additional $\\ell_1$-norm based regularizer\n\n$\\sum_{l=1}^{L}(\\sigma^{(l)}_{init})^2\\Big\\|\\Big(\\frac{\\mu^{(l)}_{t-1}}{\\sigma^{(l)}_{t-1}}\\Big)^2\\odot(\\mu^{(l)}_t-\\mu^{(l)}_{t-1})\\Big\\|_1$,  (5)\n\nin which the division and square inside the $\\ell_1$-norm should be understood as element-wise operations. Note $\\sigma^{(l)}_{t-1}$ has the same dimension as $\\mu^{(l)}_{t-1}$, and the $i$-th row of $\\sigma^{(l)}_{t-1}$ has the same variance value associated with the $i$-th node in layer $l$. Thus, in (5), if the ratio $(\\mu^{(l)}_{t-1,ij}/\\sigma^{(l)}_{t-1,i})^2$ is large, the $\\ell_1$-norm will promote sparsity and $\\mu^{(l)}_{t,ij}$ will tend to freeze to $\\mu^{(l)}_{t-1,ij}$.\n\n3.4 Modifying the term (b)\n\nRegarding the term (b) in (3), we can also devise a similar loss on the uncertainties associated with nodes. As mentioned in Section 3.2, the loss will promote $\\sigma^{(l)}_t=\\sigma^{(l)}_{t-1}$, meaning that once a node becomes important at task $(t-1)$, it tends to stay important for a new task as well. While this makes sense for preventing catastrophic forgetting, as it may induce high regularization strengths for the penalties in (4) and (5), one caveat is that the network capacity can quickly fill up as the number of tasks grows. Therefore, we choose to add one more regularization term to the term (b) in (3),\n\n$\\mathbf{1}^{\\top}\\Big(\\frac{1}{2}(\\sigma^{(l)}_t)^2-\\log(\\sigma^{(l)}_t)^2\\Big)$,  (6)\n\nwhich inflates $\\sigma^{(l)}_t$ to get close to $\\sqrt{2}\\,\\sigma^{(l)}_{t-1}$ when minimized together with the term (b). The detailed derivation of the minimizer is given in the Supplementary Materials. Therefore, if a node becomes uncertain when training on the current task, the regularization strength becomes smaller. 
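To make the node-wise uncertainty of Section 3.3 concrete, a single reparameterized weight sample with one standard deviation shared by all incoming weights of a node can be written as follows (our illustrative sketch with assumed shapes, not the authors' implementation):

```python
import numpy as np

def sample_layer_weights(mu, sigma_node, rng):
    """One reparameterized sample W = mu + sigma * eps for a dense layer.

    mu         : (n_out, n_in) mean weights of the layer
    sigma_node : (n_out,) one shared std dev for all incoming weights of a node
    """
    eps = rng.standard_normal(mu.shape)  # zero-mean i.i.d. Gaussian noise
    return mu + sigma_node[:, None] * eps


rng = np.random.default_rng(0)
mu = np.ones((2, 3))
sigma_node = np.array([0.0, 0.5])  # node 0 fully certain, node 1 uncertain
w = sample_layer_weights(mu, sigma_node, rng)
```

A node whose standard deviation has shrunk toward zero is effectively frozen: its incoming weights are sampled with essentially no noise, which is exactly the "certain equals important" behavior described above.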
Since our initial standard deviation $\\sigma^{(l)}_{init}$ is usually set to be small, the additional term in (6), compared to the term (b) in (3), will tend to increase the number of \u201cactively\u201d learning nodes that have incoming weights with sufficiently large standard deviation values for exploration. Moreover, when a new task arrives while most of the nodes have low uncertainty, (6) will force some of them to increase their uncertainty level to learn the new task, resulting in gracefully forgetting the past tasks.\n\nFigure 2: Colored hidden nodes and edges denote important nodes and highly regularized weights due to (4), respectively. The width of a colored edge denotes the regularization strength of (5). Note that as a new task comes, the uncertainty level of a node can vary due to (6), represented with color changes.\n\n3.5 Final loss function for UCL\n\nCombining (4), (5), and (6), the final loss function of our UCL for task $t$ becomes\n\n$-\\log p(\\mathcal{D}_t|\\mathcal{W})+\\sum_{l=1}^{L}\\Big[\\frac{1}{2}\\big\\|\\Lambda^{(l)}\\odot(\\mu^{(l)}_t-\\mu^{(l)}_{t-1})\\big\\|_2^2+(\\sigma^{(l)}_{init})^2\\Big\\|\\Big(\\frac{\\mu^{(l)}_{t-1}}{\\sigma^{(l)}_{t-1}}\\Big)^2\\odot(\\mu^{(l)}_t-\\mu^{(l)}_{t-1})\\Big\\|_1+\\frac{\\beta}{2}\\mathbf{1}^{\\top}\\Big\\{\\Big(\\frac{\\sigma^{(l)}_t}{\\sigma^{(l)}_{t-1}}\\Big)^2-\\log\\Big(\\frac{\\sigma^{(l)}_t}{\\sigma^{(l)}_{t-1}}\\Big)^2+(\\sigma^{(l)}_t)^2-\\log(\\sigma^{(l)}_t)^2\\Big\\}\\Big]$,  (7)\n\nwhich is minimized over $\\{\\mu^{(l)}_t,\\sigma^{(l)}_t\\}_{l=1}^{L}$ and has two hyperparameters, $\\{\\sigma^{(l)}_{init}\\}_{l=1}^{L}$ and $\\beta$. The former serves as pivot values determining the degree of uncertainty of each node, and the latter controls the increasing or decreasing speed of $\\sigma^{(l)}_t$. 
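For concreteness, the per-layer regularizer that (7) adds to the negative log-likelihood can be sketched for a fully connected layer as follows; this is our illustrative NumPy sketch under assumed shapes (node-wise standard deviations) and a hypothetical default value for β, not the authors' released implementation:

```python
import numpy as np

def ucl_layer_reg(mu_t, mu_prev, sigma_t, sigma_prev, sigma_prev_lower,
                  sigma_init, sigma_init_lower, beta=0.03):
    """Per-layer UCL regularizer combining the terms of (4), (5), and (6).

    mu_t, mu_prev       : (n_out, n_in) mean weights for tasks t and t-1
    sigma_t, sigma_prev : (n_out,) node-wise std devs of layer l
    sigma_prev_lower    : (n_in,) node-wise std devs of layer l-1, task t-1
    sigma_init(_lower)  : initial std dev hyperparameters of layers l, l-1
    """
    d_mu = mu_t - mu_prev
    # (4): strength is high when either connected node has low uncertainty
    lam = np.maximum(sigma_init / sigma_prev[:, None],
                     sigma_init_lower / sigma_prev_lower[None, :])
    quad = 0.5 * np.sum((lam * d_mu) ** 2)
    # (5): l1 term freezing weights with a large (mu/sigma)^2 ratio
    ratio_sq = (mu_prev / sigma_prev[:, None]) ** 2
    l1 = sigma_init ** 2 * np.sum(np.abs(ratio_sq * d_mu))
    # term (b) of (3) plus (6): preserve learned uncertainty, but let
    # sigma_t inflate so some nodes stay "actively" learning
    r_sq = (sigma_t / sigma_prev) ** 2
    var_term = 0.5 * beta * np.sum(r_sq - np.log(r_sq)
                                   + sigma_t ** 2 - np.log(sigma_t ** 2))
    return quad + l1 + var_term


# keeping all parameters unchanged leaves only the variance penalty
sigma = np.full(4, 0.06)
mu = np.zeros((4, 3))
base = ucl_layer_reg(mu, mu, sigma, sigma, np.full(3, 0.06), 0.06, 0.06)
```

Note that only the two node-wise standard deviation vectors are stored per layer, rather than one variance per weight, which is where the memory saving over VCL comes from.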
As elaborated in the sections above, it is clear that the uncertainty of a node plays a critical role in setting the regularization strengths; hence, this justifies the name UCL. An illustration of the regularization mechanism of UCL is given in Figure 2. At the beginning epoch of task $t$, we sample from $q(\\mathcal{W}|\\theta_t)$ with $\\theta_t=\\theta_{t-1}$, then continue to update $\\theta_t$ in the subsequent iterations. The model parameters are sampled every iteration, as in the usual Monte Carlo sampling, but we set the number of samples to 1 for each iteration. This is an important differentiation that enables the application of UCL to reinforcement learning tasks, which was impossible for VCL [19].\n\n4 Experimental Results\n\n4.1 Supervised learning\n\nWe evaluate the performance of UCL together with EWC [12], SI [30], VCL [19], and HAT [26]. We also make a comparison with Coreset VCL proposed in [19]. The number of sampled weights was 10 for VCL and 1 for UCL. All of the results are averaged over 5 different seeds. For the experiments with the MNIST datasets, we used fully-connected neural networks (FNN), and with the CIFAR-10/100 and Omniglot datasets, we used convolutional neural networks (CNN). The detailed architectures are given in each experiment section. Moreover, the initial standard deviations for UCL, $\\{\\sigma^{(l)}_{init}\\}_{l=1}^{L}$, were set to 0.06 for FNNs and adaptively set like the He initialization [7] for deeper CNNs, of which the details are given in the Supplementary Materials. The hyperparameter selections among the baselines are done fairly, and we list the selected hyperparameters in the Supplementary Materials.\n\nFigure 3: Experimental results on Permuted / Row Permuted MNIST with a single-headed network.\n\nPermuted / Row Permuted MNIST We first test on the popular Permuted MNIST dataset. We used a single-headed FNN that has two hidden layers with 400 nodes and ReLU activations for all methods. 
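Both permutation protocols apply one fixed random permutation per task; a minimal sketch (ours, for illustration; data loading and preprocessing omitted):

```python
import numpy as np

def make_permuted_tasks(x, n_tasks, row_permute=False, seed=0):
    """Build Permuted / Row Permuted MNIST tasks from images x: (N, 28, 28).

    Each task applies one fixed random permutation of all 784 pixels
    (Permuted MNIST) or of the 28 rows only (Row Permuted MNIST).
    """
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(n_tasks):
        if row_permute:
            rows = rng.permutation(28)
            tasks.append(x[:, rows, :].reshape(len(x), -1))
        else:
            perm = rng.permutation(28 * 28)
            tasks.append(x.reshape(len(x), -1)[:, perm])
    return tasks
```

Since every task keeps the original labels, a single-headed network must absorb all permuted versions in shared weights, which is what makes this a continual learning benchmark.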
We compare the average test accuracy over the learned tasks in Figure 3 (left). After training on 10 tasks sequentially, EWC, SI, and VCL show little difference in performance among them, achieving 91.8%, 91.1%, and 91.3%, respectively. Although VCL with a coreset size of 200 makes an improvement of 2%, UCL outperforms all other baselines, achieving 94.5%. Interestingly, HAT keeps almost the same average accuracy as UCL for the first 5 tasks, but it starts to significantly deteriorate after task 7. This points out the limitation of applying HAT in a single-headed network.\nAs a variation of Permuted MNIST, we shuffled only the rows of the MNIST images instead of shuffling all the image pixels, and we denote this as Row Permuted MNIST. We empirically find that all algorithms are more prone to forgetting on Row Permuted MNIST. Looking at the accuracy scale of Figure 3 (right), all the methods show severe degradation of performance compared to Permuted MNIST. This may be because permuting the correlated row blocks causes more weight changes in the network. After 10 tasks, UCL again achieved the highest average accuracy, 86.5%, in this experiment as well.\nFor a better understanding of our model, Figure 4 visualizes the learned standard deviations of the nodes in all layers as training proceeds. After the model is trained on task 1, we find that just a few of them become smaller than the initialized value of 0.06, and most of them become much larger in the first hidden layer. Interestingly, the uncertain nodes in layer 1 show a drastic decline of their standard deviations at a specific task as learning progresses, which means the model had to make them certain for adapting to the new task. On the other hand, all the nodes in the output layer had to reduce their uncertainty as early as possible, considering that even a small randomness can lead to a totally different prediction. 
Most of the nodes in layer 2, in addition, do not show a monotonic tendency. This can be interpreted as many of them not needing to belong to a particular task. As a result, this gives our UCL its plasticity and graceful forgetting traits.\n\nFigure 4: Standard deviation histogram in the Permuted MNIST experiment. We randomly selected 100 standard deviations for layers 1 and 2. In layer 3, all 10 nodes are shown.\n\nFigure 5: Ablation study on Permuted MNIST. Each line denotes the test accuracy.\n\nWe also carry out an ablation study on UCL\u2019s additional regularization terms. Figure 5 shows the results of three variations that each lack one ingredient of the proposed UCL on Permuted MNIST. \u201cUCL w/o upper freeze\u201d stands for using $\\Lambda^{(l)}_{ij}=\\sigma^{(l)}_{init}/\\sigma^{(l)}_{t-1,i}$ in (4), and we observe that regularizing the outgoing weights of an important node is very important in UCL. \u201cUCL w/o (5)\u201d stands for removing (5) from (7), and we clearly see that the pruning-heuristic-based weight freezing is also very important. \u201cUCL w/o (6)\u201d stands for not using (6); it shows that while the accuracies on Tasks 1 and 2 are even higher than UCL\u2019s, the accuracy drastically decreases after Task 3. This is due to the rapid decrease of model capacity, since the number of \u201cactively\u201d learning weights shrinks when (6) is not used.\nSplit MNIST We also test in the split-dataset setting, in which each task consists of 2 consecutive classes of the MNIST dataset. This benchmark was used in [30, 19] and has 5 tasks in total. 
We used a multi-headed FNN that has two hidden layers with 256 nodes and ReLU activations for all methods.\n\nFigure 6: Experimental results on Split MNIST (top) and Split notMNIST (bottom).\n\nIn Figure 6 (top), we compare the test accuracy of each task together with the average accuracy over all observed tasks at the right end. UCL accomplishes the same 5-task average accuracy as HAT, 99.7%, which is slightly better than the results of SI and VCL with coreset, 99.0% and 98.7%, respectively. Note UCL significantly outperforms EWC and VCL. We also point out that HAT makes the critical assumption of knowing the number of tasks a priori, while UCL need not.\nSplit notMNIST Here, we make an assessment on another split-dataset task with the notMNIST dataset, which has 10 character classes. We split the characters of notMNIST into 5 groups, same as VCL [19]: A/F, B/G, C/H, D/I, and E/J. We used a multi-headed FNN that has four hidden layers with 150 nodes and ReLU activations for all methods. Unlike the previous experiments, SI shows similar results to EWC, around 84% average accuracy, and VCL attains a better result of 90.1% (Figure 6, bottom). 
Our UCL again achieves an outstanding result of 95.7%, which is higher than HAT and VCL with coreset: 95.2% and 93.7%, respectively.

Split CIFAR and Omniglot To check the effectiveness of UCL beyond the MNIST tasks, we evaluated UCL on three additional datasets: Split CIFAR-100, Split CIFAR-10/100, and Omniglot. For Split CIFAR-100, each task consists of 10 consecutive classes of CIFAR-100; for Split CIFAR-10/100, we combined CIFAR-10 and Split CIFAR-100; and for Omniglot, each alphabet is treated as a single task, and we used all 50 alphabets. For Omniglot, as in [25], we rescaled all images to 28 × 28 and augmented the dataset by including 20 random permutations (rotations and shifting) for each image. For these datasets, unlike in the previous experiments using FNNs, we used deeper CNN architectures, in which the notion of uncertainty in a convolution layer is defined for each channel (i.e., filter). We used multi-headed outputs for all experiments, and results are averaged over 8 different random seed runs for all datasets. The details of the experiments using CNNs, including the architectures and hyperparameters, are given in the Supplementary Materials. In Figure 7, we compare UCL with EWC and SI, and we carried out an extensive hyperparameter search for a fair comparison. We did not compare with VCL since it does not report any results on vision datasets with CNN architectures.

Figure 7: Experiments on supervised learning using convolutional neural networks. (a) Split CIFAR-100, (b) Split CIFAR-10/100, (c) Omniglot.

In Split CIFAR-100, EWC and SI achieve 60.5% and 60.0%, respectively. However, UCL outperforms SI and EWC, achieving 63.4%. In a slightly different task, Split CIFAR-10/100, which prevents overfitting on Split CIFAR-100 by using a model pre-trained on CIFAR-10, UCL also outperforms the baselines by achieving 73.2%.
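The Omniglot preprocessing described above (rescaling to 28 × 28 and adding 20 randomly transformed copies of each image) can be sketched as follows. This is our illustration only: the nearest-neighbour rescale, the multiples of 90 degrees for rotation, and the ±2-pixel shift range are assumptions, not the paper's exact transformations:

```python
import numpy as np

rng = np.random.default_rng(0)

def rescale_28(img):
    """Nearest-neighbour rescale to 28x28 (a real pipeline would use an
    image library; this keeps the sketch dependency-free)."""
    h, w = img.shape
    rows = np.arange(28) * h // 28
    cols = np.arange(28) * w // 28
    return img[np.ix_(rows, cols)]

def random_transform(img):
    """One random rotation (a multiple of 90 degrees here, our assumption)
    followed by a small random shift of at most 2 pixels."""
    out = np.rot90(img, k=int(rng.integers(0, 4)))
    dy, dx = rng.integers(-2, 3, size=2)
    return np.roll(out, (int(dy), int(dx)), axis=(0, 1))

def augment(img, n_copies=20):
    """Rescaled original plus `n_copies` randomly transformed versions."""
    img28 = rescale_28(img)
    return [img28] + [random_transform(img28) for _ in range(n_copies)]

batch = augment(np.ones((105, 105)))  # Omniglot images are 105 x 105
```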
In Omniglot, although UCL becomes slightly unstable for the first task, it eventually achieves 83.9% average accuracy on all 50 tasks, while EWC and SI achieve only 68.1% and 74.2%, respectively, much lower than UCL. From the above three results, we observe that UCL clearly outperforms the baselines on more diverse and sophisticated vision datasets and with deeper CNN architectures.

Table 1: The number of parameters used for each benchmark.

Dataset \ Method     Vanilla   EWC     SI      VCL     HAT    UCL
Permuted MNIST       478K      1435K   1435K   1914K   486K   960K
Split MNIST          270K      808K    808K    1077K   272K   538K
Split notMNIST       187K      559K    559K    749K    190K   375K
Split CIFAR-10/100   839K      2467K   2467K   -       -      1655K
Omniglot             1773K     1995K   1995K   -       -      1884K

Comparison of model parameters Table 1 shows the number of model parameters in each experiment. Vanilla stands for the base network architecture of all methods. UCL uses fewer parameters than the other regularization-based approaches. In particular, UCL has almost half the number of parameters of VCL, which is based on a similar variational framework. Although HAT uses the fewest parameters, we stress that it has the critical drawback of requiring the number of tasks to be known a priori.

4.2 Reinforcement learning

Here, we also tested UCL on continual reinforcement learning tasks. Roboschool [24] consists of 12 tasks, and each task has a different state shape, continuous action space, and goal. From these tasks, we randomly chose eight tasks and sequentially learned each task (with 5 million update steps) in the following order: {Walker-HumanoidFlagrun-Hopper-Ant-InvertedDoublePendulum-Cheetah-Humanoid-InvertedPendulum}.
We trained an FNN model using PPO [24] as the training algorithm and selected EWC and Fine-tuning as baselines. All baselines were run under exactly the same conditions, and we carried out an extensive hyperparameter search for a fair comparison. More experimental details, network architectures, and hyperparameters are given in the Supplementary Materials.

Figure 8 shows the cumulative normalized rewards up to the learned task, and Figure 9 shows the normalized rewards for each task, with vertical dotted lines showing the boundaries of the tasks. The normalization in the figures was done for each task with the maximum rewards obtained by EWC (λ = 10). A high cumulative sum thus corresponds to effectively combating catastrophic forgetting (CF), and we note that Fine-tuning mostly suffers from CF (e.g., Task 2 or Task 4). Note that we show two versions of UCL with different β hyperparameter values. In Figure 8, we observe that both versions of UCL significantly outperform both EWC and Fine-tuning. We believe the reason EWC does not excel as in Figure 4B of the original EWC paper [12] is that we consider a pure continual learning setting, while [12] allows learning tasks multiple times in a recurring fashion. Moreover, a possible reason why UCL achieves such high rewards in the RL setting may be a by-product of our weight sampling procedure; namely, the Gaussian perturbation of the weights for variational inference enables effective exploration of policies in RL, as suggested in [21]. Figure 9 shows that UCL overwhelmingly surpasses EWC, particularly for Task 1 and Task 3 (by both achieving high rewards and not forgetting), and this contributes to the significant gap from EWC in Figure 8.
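The reward normalization just described (dividing each task's rewards by the maximum reward EWC with λ = 10 obtained on that task, then summing the normalized rewards over all tasks seen so far) can be sketched as follows; this is a toy illustration, and `rewards` and `ewc_max` hold made-up values:

```python
import numpy as np

def normalized_rewards(rewards_per_task, ewc_max_per_task):
    """Divide each task's reward curve by EWC's best reward on that task."""
    return {t: np.asarray(r, dtype=float) / ewc_max_per_task[t]
            for t, r in rewards_per_task.items()}

def cumulative_normalized(norm_rewards, step):
    """Sum of normalized rewards over all tasks at one evaluation step;
    a high value means little catastrophic forgetting."""
    return float(sum(r[step] for r in norm_rewards.values()))

# Toy curves at 3 checkpoints: task 1 degrades (forgetting), while task 2
# is learned from the second checkpoint on.
rewards = {1: [50.0, 40.0, 30.0], 2: [0.0, 80.0, 80.0]}
ewc_max = {1: 50.0, 2: 100.0}
norm = normalized_rewards(rewards, ewc_max)
score = cumulative_normalized(norm, step=2)  # 30/50 + 80/100
```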
We also experimentally checked the role of β in gracefully forgetting; although UCL with β = 5 × 10−6 results in overall better rewards, UCL with β = 5 × 10−5 does better in learning new tasks, e.g., Tasks 5/7/8, by adding more plasticity to the network. To the best of our knowledge, this result shows for the first time that pure continual learning is possible for reinforcement learning with continuous action spaces and different observation shapes. We stress that very few algorithms in the literature work well in both the SL and RL continual learning settings, and our UCL is very competitive in that sense.

Figure 8: Cumulative normalized rewards. The two UCL versions use σ(l)_init = 5 × 10−3 with β = 5 × 10−5 and β = 5 × 10−6; the baselines are EWC (λ = 10) and Fine-tuning.

Figure 9: Normalized rewards for each task throughout learning of 8 RL tasks. Each task is learned with 5 million training steps. UCL excels in both not forgetting past tasks and learning new tasks.

5 Conclusion

We proposed UCL, a new uncertainty-based regularization method for overcoming catastrophic forgetting. We proposed the notion of node-wise uncertainty motivated by the Bayesian online learning framework and devised novel regularization terms for dealing with the stability-plasticity dilemma. As a result, UCL convincingly outperformed other state-of-the-art baselines in both supervised and reinforcement learning benchmarks with far fewer additional parameters.

Acknowledgements

This work is supported in part by ICT R&D Program [No.
2016-0-00563, Research on adaptive machine learning technology development for intelligent autonomous digital companion], AI Graduate School Support Program [No. 2019-0-00421], and ITRC Support Program [IITP-2019-2018-0-01798] of MSIT / IITP of the Korean government.

References

[1] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning (ICML), pages 1613-1622, 2015.

[2] Gail A. Carpenter and Stephen Grossberg. ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23):4919-4930, 1987.

[3] Geoffrey E. Hinton and Drew Van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of COLT-93, 1993.

[4] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135, 1999.

[5] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), pages 1050-1059, 2016.

[6] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 2348-2356, 2011.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

[8] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning (ICML), pages 1861-1869, 2015.

[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[10] Ronald Kemker and Christopher Kanan.
FearNet: Brain-inspired model for incremental learning. In International Conference on Learning Representations (ICLR), 2018.

[11] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

[12] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.

[13] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems (NIPS), pages 4652-4662, 2017.

[14] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935-2947, 2017.

[15] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NIPS), pages 6467-6476, 2017.

[16] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[17] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109-165. Elsevier, 1989.

[18] Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology, 4:504, 2013.

[19] Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E.
Turner. Variational continual learning. In International Conference on Learning Representations (ICLR), 2018.

[20] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. CoRR, abs/1802.07569, 2018.

[21] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In International Conference on Learning Representations (ICLR), 2018.

[22] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2001-2010, 2017.

[23] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[24] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[25] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning (ICML), pages 4528-4537, 2018.

[26] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning (ICML), pages 4548-4557, 2018.

[27] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems (NIPS), pages 2990-2999, 2017.

[28] Nitish Srivastava, Geoffrey
Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

[29] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations (ICLR), 2018.

[30] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning (ICML), pages 3987-3995, 2017.