{"title": "Overcoming Catastrophic Forgetting by Incremental Moment Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 4652, "page_last": 4662, "abstract": "Catastrophic forgetting is a problem of neural networks that loses the information of the first task after training the second task. Here, we propose a method, i.e. incremental moment matching (IMM), to resolve this problem. IMM incrementally matches the moment of the posterior distribution of the neural network which is trained on the first and the second task, respectively. To make the search space of posterior parameter smooth, the IMM procedure is complemented by various transfer learning techniques including weight transfer, L2-norm of the old and the new parameter, and a variant of dropout with the old parameter. We analyze our approach on a variety of datasets including the MNIST, CIFAR-10, Caltech-UCSD-Birds, and Lifelog datasets. The experimental results show that IMM achieves state-of-the-art performance by balancing the information between an old and a new network.", "full_text": "Overcoming Catastrophic Forgetting by\n\nIncremental Moment Matching\n\nSang-Woo Lee1, Jin-Hwa Kim1, Jaehyun Jun1, Jung-Woo Ha2, and Byoung-Tak Zhang1,3\n\nSeoul National University1\n\nClova AI Research, NAVER Corp2\n\nSurromind Robotics3\n\n{slee,jhkim,jhjun}@bi.snu.ac.kr jungwoo.ha@navercorp.com\n\nbtzhang@bi.snu.ac.kr\n\nAbstract\n\nCatastrophic forgetting is a problem of neural networks that loses the information\nof the \ufb01rst task after training the second task. Here, we propose a method, i.e. in-\ncremental moment matching (IMM), to resolve this problem. IMM incrementally\nmatches the moment of the posterior distribution of the neural network which is\ntrained on the \ufb01rst and the second task, respectively. 
To make the search space\nof posterior parameter smooth, the IMM procedure is complemented by various\ntransfer learning techniques including weight transfer, L2-norm of the old and the\nnew parameter, and a variant of dropout with the old parameter. We analyze our ap-\nproach on a variety of datasets including the MNIST, CIFAR-10, Caltech-UCSD-\nBirds, and Lifelog datasets. The experimental results show that IMM achieves\nstate-of-the-art performance by balancing the information between an old and a\nnew network.\n\n1\n\nIntroduction\n\nCatastrophic forgetting is a fundamental challenge for arti\ufb01cial general intelligence based on neural\nnetworks. The models that use stochastic gradient descent often forget the information of previous\ntasks after being trained on a new task [1, 2]. Online multi-task learning that handles such problems\nis described as continual learning. This classic problem has resurfaced with the renaissance of deep\nlearning research [3, 4].\nRecently, the concept of applying a regularization function to a network trained by the old task to\nlearning a new task has received much attention. This approach can be interpreted as an approxima-\ntion of sequential Bayesian [5, 6]. Representative examples of this regularization approach include\nlearning without forgetting [7] and elastic weight consolidation [8]. These algorithms succeeded in\nsome experiments where their own assumption of the regularization function \ufb01ts the problem.\nHere, we propose incremental moment matching (IMM) to resolve the catastrophic forgetting prob-\nlem. IMM uses the framework of Bayesian neural networks, which implies that uncertainty is intro-\nduced on the parameters in neural networks, and that the posterior distribution is calculated [9, 10].\nThe dimension of the random variable in the posterior distribution is the number of the parameters\nin the neural networks. 
IMM approximates the mixture-of-Gaussians posterior, in which each component represents the parameters for a single task, with one Gaussian distribution for the combined task. To merge the posteriors, we introduce two novel moment matching methods. One is mean-IMM, which simply averages the parameters of the two networks for the old and the new task; this average minimizes the weighted sum of the KL-divergences between the approximated posterior distribution for the combined task and each Gaussian posterior for a single task [11]. The other is mode-IMM, which merges the parameters of the two networks using a Laplace approximation [9] to approximate a mode of the mixture of the two Gaussian posteriors that represent the parameters of the two networks.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Geometric illustration of incremental moment matching (IMM). Mean-IMM simply averages the parameters of two neural networks, whereas mode-IMM tries to find a maximum of the mixture of Gaussian posteriors. For IMM to be reasonable, the search space of the loss function between the posterior means µ_1 and µ_2 should be reasonably smooth and convex-like. To find a µ_2 which satisfies this condition of a smooth and convex-like path from µ_1, we propose applying various transfer techniques in the IMM procedure.

In general, it is too naïve to assume that the final posterior distribution for the whole task is Gaussian. To make IMM work, the search space of the loss function between the posterior means needs to be smooth and convex-like. In other words, there should not be high cost barriers between the means of the two networks for an old and a new task.
To make our assumption of a Gaussian distribution for the neural network reasonable, we apply three main transfer learning techniques in the IMM procedure: weight transfer, an L2-norm of the old and the new parameters, and our newly proposed variant of dropout using the old parameters. The whole procedure of IMM is illustrated in Figure 1.

2 Previous Works on Catastrophic Forgetting

One of the major approaches to preventing catastrophic forgetting is to use an ensemble of neural networks. When a new task arrives, the algorithm creates a new network and shares the representation between the tasks [12, 13]. However, this approach has a complexity issue, especially at inference time, because the number of networks grows with the number of new tasks that need to be learned.

Another line of work studies methods that use implicit, distributed storage of information in typical stochastic gradient descent (SGD) learning. These methods use the idea of dropout, maxout, or neural modules to distributively store the information for each task, making use of the large capacity of the neural network [4]. Unfortunately, most studies following this approach had limited success and failed to preserve performance on the old task when an extreme change to the environment occurred [3]. Alternatively, Fernando et al. [14] proposed PathNet, which extends the idea of the ensemble approach for parameter reuse [13] within a single network. In PathNet, a neural network has ten or twenty modules in each layer, and three or four modules per layer are picked for one task by an evolutionary approach. This method alleviates the complexity issue of the ensemble approach to continual learning in a plausible way.

The approach with a regularization term has also received attention.
Learning without forgetting (LwF) is one example of this approach, which uses pseudo-training data from the old task [7]. Before learning the new task, LwF feeds the training data of the new task into the old network and uses the outputs as pseudo-labels of the pseudo-training data. By optimizing both on the pseudo-training data of the old task and the real data of the new task, LwF attempts to prevent catastrophic forgetting. This framework is promising when the properties of the pseudo-training set are similar to those of the ideal training set. Elastic weight consolidation (EWC), another example of this approach, uses sequential Bayesian estimation to update neural networks for continual learning [8]. In EWC, the posterior distribution trained on the previous task is used as the new prior distribution. This new prior is used for learning the new posterior distribution of the new task in a Bayesian manner. EWC assumes that the covariance matrix of the posterior is diagonal and that there are no correlations between the nodes. Though this assumption is fragile, EWC performs well in some domains.

EWC is a monumental recent work that uses sequential Bayesian estimation for continual learning of neural networks. However, updating the parameters of complex hierarchical models by sequential Bayesian estimation is not new [5]. Sequential Bayes was used to learn topic models from streaming data by Broderick et al. [6]. Huang et al. applied sequential Bayesian estimation to adapt a deep neural network to a specific user in the speech recognition domain [15, 16]. They assigned a layer for user adaptation and applied MAP estimation to this single layer.
Similar to our IMM method, Bayesian moment matching has been used for sum-product networks, a kind of deep hierarchical probabilistic model [17]. Though sum-product networks are usually not scalable to large datasets, their online learning method is useful, and it achieves performance similar to the batch learner. Our method using moment matching focuses on continual learning and deals with significantly different statistics between tasks, unlike the previous method.

3 Incremental Moment Matching

In incremental moment matching (IMM), the moments of posterior distributions are matched in an incremental way. In our work, we use a Gaussian distribution to approximate the posterior distribution of the parameters. Given K sequential tasks, we want to find the optimal parameters $\mu^*_{1:K}$ and $\Sigma^*_{1:K}$ of the Gaussian approximation function $q_{1:K}$ from the posterior parameters $(\mu_k, \Sigma_k)$ of each kth task:

$$p_{1:K} \equiv p(\theta | X_1, \cdots, X_K, y_1, \cdots, y_K) \approx q_{1:K} \equiv q(\theta | \mu_{1:K}, \Sigma_{1:K}) \quad (1)$$
$$p_k \equiv p(\theta | X_k, y_k) \approx q_k \equiv q(\theta | \mu_k, \Sigma_k) \quad (2)$$

$q_{1:K}$ denotes an approximation of the true posterior distribution $p_{1:K}$ for the whole task, and $q_k$ denotes an approximation of the true posterior distribution $p_k$ over the training dataset $(X_k, y_k)$ for the kth task. $\theta$ denotes the vectorized parameters of the neural network. The dimension of $\mu_k$ and $\mu_{1:k}$ is D, and the dimension of $\Sigma_k$ and $\Sigma_{1:k}$ is D × D, where D is the dimension of $\theta$. For example, a multi-layer perceptron (MLP) with layer sizes [784-800-800-800-10] has D = 1,917,610 parameters, including bias terms.

Next, we explain the two proposed moment matching algorithms for the continual learning of modern deep neural networks.
The two algorithms generate two different moments of the Gaussian, with different objective functions for the same dataset.

3.1 Mean-based Incremental Moment Matching (mean-IMM)

Mean-IMM averages the parameters of the two networks in each layer, using mixing ratios $\alpha_k$ with $\sum_k^K \alpha_k = 1$. The objective function of mean-IMM is to minimize the following local KL-distance, i.e., the weighted sum of KL-divergences between each $q_k$ and $q_{1:K}$ [11, 18]:

$$\mu^*_{1:K}, \Sigma^*_{1:K} = \operatorname*{argmin}_{\mu_{1:K}, \Sigma_{1:K}} \sum_k^K \alpha_k \, \mathrm{KL}(q_k \,\|\, q_{1:K}) \quad (3)$$
$$\mu^*_{1:K} = \sum_k^K \alpha_k \mu_k \quad (4)$$
$$\Sigma^*_{1:K} = \sum_k^K \alpha_k \left( \Sigma_k + (\mu_k - \mu^*_{1:K})(\mu_k - \mu^*_{1:K})^T \right) \quad (5)$$

$\mu^*_{1:K}$ and $\Sigma^*_{1:K}$ are the optimal solution of the local KL-distance. Notice that covariance information is not needed for mean-IMM, since calculating $\mu^*_{1:K}$ does not require any $\Sigma_k$; a series of $\mu_k$ is sufficient to perform the task. The idea of mean-IMM is commonly used in shallow networks [19, 20]. However, the contribution of this paper is to discover when and how mean-IMM can be applied in modern deep neural networks, and to show that it can perform better with other transfer techniques. Future work may include other measures for merging the networks, including the KL-divergence between $q_{1:K}$ and the mixture of each $q_k$ (i.e., $\mathrm{KL}(q_{1:K} \,\|\, \sum_k^K \alpha_k q_k)$) [18].

3.2 Mode-based Incremental Moment Matching (mode-IMM)

Mode-IMM is a variant of mean-IMM which uses the covariance information of the Gaussian posteriors. In general, a weighted average of the mean vectors of two Gaussian distributions is not a mode of the mixture of Gaussians (MoG). In discriminative learning, the maximum of the distribution is of primary interest.
According to Ray and Lindsay [21], all the modes of an MoG with K components lie on the (K − 1)-dimensional hypersurface $\{\theta \,|\, \theta = (\sum_k^K a_k \Sigma_k^{-1})^{-1} (\sum_k^K a_k \Sigma_k^{-1} \mu_k),\ 0 < a_k < 1 \text{ and } \sum_k a_k = 1\}$. See Appendix A for more details.

Motivated by the above description, mode-IMM approximates the MoG with a Laplace approximation [9], in which the logarithm of the function is expressed by a Taylor expansion. Using the Laplace approximation, the MoG is approximated as follows:

$$\log q_{1:K} \approx \sum_k^K \alpha_k \log q_k + C = -\frac{1}{2} \theta^T \Big( \sum_k^K \alpha_k \Sigma_k^{-1} \Big) \theta + \Big( \sum_k^K \alpha_k \Sigma_k^{-1} \mu_k \Big)^T \theta + C' \quad (6)$$
$$\mu^*_{1:K} = \Sigma^*_{1:K} \cdot \Big( \sum_k^K \alpha_k \Sigma_k^{-1} \mu_k \Big) \quad (7)$$
$$\Sigma^*_{1:K} = \Big( \sum_k^K \alpha_k \Sigma_k^{-1} \Big)^{-1} \quad (8)$$

For Equation 8, in practice we add $\epsilon I$ to the term to be inverted, with an identity matrix $I$ and a small constant $\epsilon$.

Here, we assume diagonal covariance matrices, which means that there is no correlation among the parameters. This diagonal assumption is useful, since it decreases the number of parameters of each covariance matrix from $O(D^2)$ to $O(D)$, where D is the dimension of the parameters.

For the covariance, we use the inverse of the Fisher information matrix, following [8, 22]. The main idea of this approximation is that the square of the gradient with respect to a parameter is a good indicator of its precision, which is the inverse of its variance.
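As a concrete illustration, the two merging rules above can be sketched under the diagonal-covariance assumption as follows. This is our own minimal NumPy sketch, not the authors' released code; per-task variance vectors stand in for the diagonals of each Sigma_k.

```python
import numpy as np

def mean_imm(mus, alphas):
    """Mean-IMM (Eq. 4): mixing-ratio-weighted average of per-task parameters."""
    mus = np.stack(mus)                        # K x D
    alphas = np.asarray(alphas)[:, None]       # K x 1, sums to 1
    return (alphas * mus).sum(axis=0)

def mode_imm(mus, variances, alphas, eps=1e-8):
    """Mode-IMM (Eqs. 7-8) with diagonal covariances.

    `variances` holds the diagonal of each Sigma_k; its elementwise inverse
    is the precision, which the paper estimates via the Fisher information.
    """
    mus = np.stack(mus)                        # K x D
    precisions = 1.0 / np.stack(variances)     # K x D, diag of Sigma_k^{-1}
    alphas = np.asarray(alphas)[:, None]       # K x 1
    merged_precision = (alphas * precisions).sum(axis=0) + eps   # Eq. 8, plus eps*I
    merged_mu = (alphas * precisions * mus).sum(axis=0) / merged_precision  # Eq. 7
    return merged_mu, 1.0 / merged_precision
```

When all tasks share the same variances, mode-IMM reduces to mean-IMM; when one task has a much smaller variance (higher precision) in a coordinate, mode-IMM pulls that coordinate toward that task's parameter.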
The Fisher information matrix $F_k$ for the kth task is defined as:

$$F_k = \mathbb{E} \left[ \frac{\partial}{\partial \mu_k} \ln p(\tilde{y}|x, \mu_k) \cdot \Big( \frac{\partial}{\partial \mu_k} \ln p(\tilde{y}|x, \mu_k) \Big)^T \right], \quad (9)$$

where the expectation is taken over $x \sim \pi_k$ and $\tilde{y} \sim p(y|x, \mu_k)$, and $\pi_k$ denotes the empirical distribution of $X_k$.

4 Transfer Techniques for Incremental Moment Matching

In general, the loss function of a neural network is not convex. Consider that shuffling the nodes and their weights in a neural network preserves the original performance. If the parameters of two neural networks initialized independently are averaged, the result might perform poorly because of the high cost barriers between the parameters of the two networks [23]. However, we will show that various transfer learning techniques can be used to ease this problem and to make the assumption of a Gaussian distribution for neural networks reasonable. In this section, we introduce three practical techniques for IMM: weight-transfer, L2-transfer, and drop-transfer.

4.1 Weight-Transfer

Weight-transfer initializes the parameters µ_k for the new task with the parameters µ_{k-1} of the previous task [24]. In our experiments, the use of weight-transfer was critical to continual learning performance. For this reason, the experiments on IMM in this paper use the weight-transfer technique by default.

The weight-transfer technique is motivated by a geometric property of neural networks discovered in previous work [23]. The authors found that there is a straight path from the initial point to the solution without any high cost barrier, for various types of neural networks and datasets. This discovery suggests that weight-transfer from the previous task to the new task makes a smooth loss surface between the two solutions for the tasks, so that the optimal solution for both tasks lies on an interpolated point between the two solutions.

Figure 2: Experimental results on visualizing the effect of weight-transfer. The geometric property of the parameter space of the neural network is analyzed. Brighter is better. $\theta_1$, $\theta_2$, and $\theta_3$ are the vectorized parameters of networks trained on randomly selected subsets of the CIFAR-10 dataset. This figure shows that there are better solutions between the three locally optimized parameters.

To empirically validate the concept of weight-transfer, we use the linear path analysis proposed by Goodfellow et al. [23] (Figure 2). We randomly chose 18,000 instances from the training dataset of CIFAR-10 and divided them into three subsets of 6,000 instances each. These three subsets are used for sequential training of CNN models parameterized by $\theta_1$, $\theta_2$, and $\theta_3$, respectively. Here, $\theta_2$ is initialized from $\theta_1$, and then $\theta_3$ is initialized from $\theta_2$, in the same way as weight-transfer. In this analysis, the loss and accuracy are evaluated at a series of points $\theta = \theta_1 + \alpha(\theta_2 - \theta_1) + \beta(\theta_3 - \theta_2)$, varying $\alpha$ and $\beta$. In Figure 2, the loss surface of the model on each online subset is nearly convex. The figure shows that the parameter at $\frac{1}{3}(\theta_1 + \theta_2 + \theta_3)$, which is the solution of mean-IMM, performs better than any of the reference points $\theta_1$, $\theta_2$, or $\theta_3$. However, when $\theta_2$ is not initialized by $\theta_1$, the convex-like shape disappears, since there is a high cost barrier between the loss functions of $\theta_1$ and $\theta_2$.

4.2 L2-transfer

L2-transfer is a variant of L2-regularization. L2-transfer can be interpreted as a special case of EWC in which the prior distribution is Gaussian with $\lambda I$ as a covariance matrix.
In L2-transfer, a regularization term on the distance between µ_{k-1} and µ_k is added to the objective function for finding µ_k, where $\lambda$ is a hyperparameter:

$$\log p(y_k | X_k, \mu_k) - \lambda \cdot \| \mu_k - \mu_{k-1} \|_2^2 \quad (10)$$

The concept of L2-transfer is commonly used in transfer learning [25, 26] and continual learning [7, 8] with a large $\lambda$. Unlike this previous usage, we use a small $\lambda$ in the IMM procedure. In other words, µ_k is first trained by Equation 10 with a small $\lambda$, and is then merged into µ_{1:k} by IMM. This is because we want to make the loss surface between µ_{k-1} and µ_k smooth, not to minimize the distance between µ_{k-1} and µ_k. In convex optimization, an L2-regularizer makes a convex function strictly convex. Similarly, we hope that L2-transfer with a small $\lambda$ helps to find a µ_k with a convex-like loss surface between µ_{k-1} and µ_k.

4.3 Drop-transfer

Drop-transfer is a novel method devised in this paper. Drop-transfer is a variant of dropout in which µ_{k-1} is the zero point of the dropout procedure. In the training phase, the following $\hat{\mu}_{k,i}$ is used for the weight vector corresponding to the ith node, $\mu_{k,i}$:

$$\hat{\mu}_{k,i} = \begin{cases} \mu_{k-1,i}, & \text{if the } i\text{th node is turned off} \\ \frac{1}{1-p}\,\mu_{k,i} - \frac{p}{1-p}\,\mu_{k-1,i}, & \text{otherwise} \end{cases} \quad (11)$$

where $p$ is the dropout ratio. Notice that the expectation of $\hat{\mu}_{k,i}$ is $\mu_{k,i}$.

Table 1: The averaged accuracies on the disjoint MNIST for two sequential tasks (Top) and the shuffled MNIST for three sequential tasks (Bottom). The untuned setting refers to the most natural hyperparameter in the equation of each algorithm, whereas the tuned setting refers to using heuristically hand-tuned hyperparameters. Hyperparam denotes the main hyperparameter of each algorithm.
For IMM with transfer, only α is tuned. The numbers in parentheses refer to standard deviations. Every IMM uses weight-transfer.

Disjoint MNIST Experiment

Method | Explanation of Hyperparam | Untuned Hyperparam | Untuned Accuracy | Tuned Hyperparam | Tuned Accuracy
SGD [3] | epoch per dataset | 10 | 47.72 (± 0.11) | 0.05 | 71.32 (± 1.54)
L2-transfer [25] | λ in (10) | - | - | 0.05 | 85.81 (± 0.52)
Drop-transfer | p in (11) | 0.5 | 51.72 (± 0.79) | 0.5 | 51.72 (± 0.79)
EWC [8] | λ in (20) | 1.0 | 47.84 (± 0.04) | 600M | 52.72 (± 1.36)
Mean-IMM | α2 in (4) | 0.50 | 90.45 (± 2.24) | 0.55 | 91.92 (± 0.98)
Mode-IMM | α2 in (7) | 0.50 | 91.49 (± 0.98) | 0.45 | 92.02 (± 0.73)
L2-transfer + Mean-IMM | λ / α2 | 0.001 / 0.50 | 78.34 (± 1.82) | 0.001 / 0.60 | 92.62 (± 0.95)
L2-transfer + Mode-IMM | λ / α2 | 0.001 / 0.50 | 92.52 (± 0.41) | 0.001 / 0.45 | 92.73 (± 0.35)
Drop-transfer + Mean-IMM | p / α2 | 0.5 / 0.50 | 80.75 (± 1.28) | 0.5 / 0.60 | 92.64 (± 0.60)
Drop-transfer + Mode-IMM | p / α2 | 0.5 / 0.50 | 93.35 (± 0.49) | 0.5 / 0.50 | 93.35 (± 0.49)
L2, Drop-transfer + Mean-IMM | λ / p / α2 | 0.001 / 0.5 / 0.50 | 66.10 (± 3.19) | 0.001 / 0.5 / 0.75 | 93.97 (± 0.23)
L2, Drop-transfer + Mode-IMM | λ / p / α2 | 0.001 / 0.5 / 0.50 | 93.97 (± 0.32) | 0.001 / 0.5 / 0.45 | 94.12 (± 0.27)

Shuffled MNIST Experiment

Method | Explanation of Hyperparam | Untuned Hyperparam | Untuned Accuracy | Tuned Hyperparam | Tuned Accuracy
SGD [3] | epoch per dataset | 60 | 89.15 (± 2.34) | - | ~95.5 [8]
L2-transfer [25] | λ in (10) | - | - | 1e-3 | 96.37 (± 0.62)
Drop-transfer | p in (11) | 0.5 | 94.75 (± 0.62) | 0.2 | 96.86 (± 0.21)
EWC [8] | λ in (20) | - | - | - | ~98.2 [8]
Mean-IMM | α3 in (4) | 0.33 | 93.23 (± 1.37) | 0.55 | 95.02 (± 0.42)
Mode-IMM | α3 in (7) | 0.33 | 98.02 (± 0.05) | 0.60 | 98.08 (± 0.08)
L2-transfer + Mean-IMM | λ / α3 | 1e-4 / 0.33 | 90.38 (± 1.74) | 1e-4 / 0.65 | 95.93 (± 0.31)
L2-transfer + Mode-IMM | λ / α3 | 1e-4 / 0.33 | 98.16 (± 0.08) | 1e-4 / 0.60 | 98.30 (± 0.08)
Drop-transfer + Mean-IMM | p / α3 | 0.5 / 0.33 | 90.79 (± 1.30) | 0.5 / 0.65 | 96.49 (± 0.44)
Drop-transfer + Mode-IMM | p / α3 | 0.5 / 0.33 | 97.80 (± 0.07) | 0.5 / 0.55 | 97.95 (± 0.08)
L2, Drop-transfer + Mean-IMM | λ / p / α3 | 1e-4 / 0.5 / 0.33 | 89.51 (± 2.85) | 1e-4 / 0.5 / 0.90 | 97.36 (± 0.19)
L2, Drop-transfer + Mode-IMM | λ / p / α3 | 1e-4 / 0.5 / 0.33 | 97.83 (± 0.10) | 1e-4 / 0.5 / 0.50 | 97.92 (± 0.05)

There are studies [27, 20] that have interpreted dropout as an exponential ensemble of weak learners. From this perspective, since marginalizing the output distribution over the whole set of weak learners is intractable, the parameters multiplied by the inverse of the dropout rate are used in the test phase; in other words, the parameters of the weak learners sampled by dropout are, in effect, simply averaged. In our continual learning setting, we hypothesize that the drop-transfer process makes the averaged point of two arbitrary sampled points a good estimator.

To supplement the evidence for this hypothesis, we investigated the search space of the loss function of an MLP trained on the MNIST handwritten digit recognition dataset, with and without dropout regularization. Dropout regularization increases the accuracy of a point sampled from the dropout distribution from 0.450 (± 0.084) to 0.950 (± 0.009), and the accuracy of the average of two sampled parameter vectors from 0.757 (± 0.065) to 0.974 (± 0.003). In both cases, with and without dropout, the space between two arbitrary samples is empirically convex and fits a second-order equation.
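To make the two transfer regularizers of this section concrete, here is a minimal NumPy sketch of the L2-transfer penalty (Equation 10) and the drop-transfer update (Equation 11). The function names are ours, and this is an illustration under the stated equations rather than the authors' implementation.

```python
import numpy as np

def l2_transfer_penalty(mu_k, mu_prev, lam):
    """L2-transfer (Eq. 10): lam * ||mu_k - mu_{k-1}||_2^2, subtracted from
    the log-likelihood; a small lam only smooths the path toward mu_{k-1}."""
    return lam * np.sum((mu_k - mu_prev) ** 2)

def drop_transfer(mu_k, mu_prev, p, rng):
    """Drop-transfer (Eq. 11): dropout whose zero point is mu_{k-1}.

    A dropped node keeps its old weights mu_{k-1,i}; a kept node is rescaled
    so that the expectation over dropout masks equals mu_{k,i}.
    """
    kept = rng.random(mu_k.shape) >= p                       # False: node dropped
    rescaled = mu_k / (1.0 - p) - (p / (1.0 - p)) * mu_prev  # kept-node value
    return np.where(kept, rescaled, mu_prev)
```

Averaging p * mu_prev + (1 - p) * rescaled recovers mu_k exactly, which is the unbiasedness property noted after Equation 11.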
Based on this experiment, we expect not only that the search space of the loss function between modern neural networks can easily be nearly convex [23], but also that regularizers such as dropout make the search space smooth, so that points in the search space achieve good accuracy in continual learning.

5 Experimental Results

We evaluate our approach on four experiments, whose settings are intensively used in previous works [4, 8, 7, 12]. For more details and experimental results, see Appendix D. The source code for the experiments is available in a GitHub repository (https://github.com/btjhjeon/IMM_tensorflow).

Figure 3: Test accuracies of two IMM models with weight-transfer on the disjoint MNIST (Left), the shuffled MNIST (Middle), and the ImageNet2CUB experiment (Right). α is a hyperparameter that balances the information between the old and the new task.

Figure 4: Test accuracies of IMM with various transfer techniques on the disjoint MNIST. Both L2-transfer and drop-transfer boost the performance of IMM and make the optimal value of α larger than 1/2. However, drop-transfer tends to make the accuracy curve smoother than L2-transfer does.

Disjoint MNIST Experiment. The first experiment is the disjoint MNIST experiment [4]. In this experiment, the MNIST dataset is divided into two datasets: the first dataset consists of only the digits {0, 1, 2, 3, 4} and the second dataset consists of the remaining digits {5, 6, 7, 8, 9}. Our task is 10-class joint categorization, unlike the setting in the previous work, which considers two independent tasks of 5-class categorization. Because the inference should decide whether a new instance comes from the first or the second task, our task is more difficult than that of the previous work.

We evaluate the models in both the untuned setting and the tuned setting.
The untuned setting refers to the most natural hyperparameter in the equation of each algorithm. The tuned setting refers to using heuristically hand-tuned hyperparameters. Note that tuned hyperparameter settings are often used in previous works on continual learning, as it is difficult to define a validation set in their scenario. For example, when the model needs to learn a new task after learning an old task, a low learning rate, early stopping without a validation set, or an arbitrary hyperparameter for balancing is used [3, 8]. We search for hyperparameters in the tuned setting not only to find the oracle performance of each algorithm, but also to show that there exist paths consisting of points that perform reasonably well for both tasks. Hyperparam in Table 1 denotes the hyperparameter mainly searched over in the tuned setting. Table 1 (Top) and Figure 3 (Left) show the experimental results of the disjoint MNIST experiment.

In our experimental setting, the usual SGD-based optimizers always perform below 50%, because the biases of the output layer for the old task are always pushed to large negative values, which illustrates that our task is difficult. Figure 4 also shows that mode-IMM is robust with respect to α, and that the optimal α of mean-IMM is larger than 1/2 in the disjoint MNIST experiment.

Shuffled MNIST Experiment. The second experiment is the shuffled MNIST experiment [3, 8] with three sequential tasks. In previous work, EWC reaches the performance level of the batch learner, and it is argued that EWC overcomes catastrophic forgetting in some domains. The experimental details are similar to those of the disjoint MNIST experiment, except that all models are allowed to use dropout regularization.
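A shuffled-MNIST-style task of this kind can be constructed with a fixed pixel permutation per task. The following is our own toy sketch (small array shapes, not the paper's exact data pipeline):

```python
import numpy as np

def make_permuted_task(images, seed):
    """Build one shuffled-MNIST-style task: apply one fixed, random pixel
    permutation (derived from `seed`) to every flattened image."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(images.shape[1])   # one fixed permutation per task
    return images[:, perm]

# Task 1 is the original data; tasks 2 and 3 each get their own permutation.
```

Because the permutation is drawn once per task, every image within a task is scrambled in the same way, so each task is exactly as hard as the original but requires a different solution.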
In this experiment, the first dataset is the original MNIST dataset, while in the second and the third dataset the input pixels of all images are shuffled, each with its own fixed, random permutation. Therefore, the difficulty of the three datasets is the same, though a different solution is required for each dataset.

Table 1 (Bottom) and Figure 3 (Middle) show the experimental results of the shuffled MNIST experiment. Notice that the accuracy of drop-transfer (p = 0.2) alone is 96.86 (± 0.21), and that of L2-transfer (λ = 1e-4) + drop-transfer (p = 0.4) alone is 97.61 (± 0.15). These results are competitive with EWC without dropout, whose performance is around 97.0.

Table 2: Experimental results on the Lifelog dataset among different classes (location, sub-location, and activity) and different subjects (A, B, C). Every IMM uses weight-transfer.

Method | Location | Sub-location | Activity | A | B | C
Dual memory architecture [12] | 78.11 | 52.92 | 72.36 | 67.02 | 58.80 | 77.57
Mean-IMM | 77.60 | 52.74 | 73.78 | 67.03 | 57.73 | 79.35
Mode-IMM | 77.14 | 54.07 | 75.76 | 67.97 | 60.12 | 78.89

ImageNet to CUB Dataset. The third experiment is the ImageNet2CUB experiment [7], the continual learning problem from the ImageNet dataset to the Caltech-UCSD Birds-200-2011 fine-grained classification (CUB) dataset [28]. The numbers of classes in the ImageNet and CUB datasets are around 1K and 200, and the numbers of training instances are 1M and 5K, respectively. In the ImageNet2CUB experiment, the last layer is separate for the ImageNet and the CUB task. The structure of AlexNet is used for the trained model of ImageNet [29]. In our experiment, we match the moments of the last-layer fine-tuning model and the LwF model, with mean-IMM and mode-IMM.

Figure 3 (Right) shows that mean-IMM moderately balances the performance of the two tasks between the two networks. However, the balanced hyperparameter of mode-IMM is far from α = 0.5. We think this is because the scale of the Fisher matrix F differs between the ImageNet and the CUB task. Since the numbers of training instances of the two tasks are different, the mean of the square of the gradients, which is the definition of F, tends to be different. This implies that the assumption of mode-IMM does not always hold for heterogeneous tasks. See Appendix D.3 for more information, including the learning methods of IMM when a different class output layer or a different scale of dataset is used.

Our results of IMM with LwF exceed the previous state-of-the-art performance, whose model is also LwF.
This is because, in the previous works, the LwF model is initialized by the last-layer fine-tuning model, not directly by the original AlexNet. In this case, the performance loss on the old task is reduced, but the performance gain on the new task is also reduced. The accuracies of our mean-IMM (α = 0.5) are 56.20 and 56.73 for the ImageNet task and the CUB task, respectively. The gains compared to the previous state-of-the-art are +1.13 and -1.14. In the case of mean-IMM (α = 0.8) and mode-IMM (α = 0.99), the accuracies are 55.08 and 59.08 (+0.01, +1.12), and 55.10 and 59.12 (+0.02, +1.35), respectively.

Lifelog Dataset. Lastly, we evaluate the proposed methods on the Lifelog dataset [12]. The Lifelog dataset consists of 660,000 instances of egocentric video stream data, collected over 46 days from three participants using Google Glass [30]. Three class categories, location, sub-location, and activity, are labeled on each frame of video. In the Lifelog dataset, the class distribution changes continuously and new classes appear as the days pass. Table 2 shows that mean-IMM and mode-IMM are competitive with the dual-memory architecture, the previous state-of-the-art ensemble model, even though IMM uses a single network.

6 Discussion

A Shift of Optimal Hyperparameter of IMM. The tuned setting shows that there often exists some α which makes the performance of mean-IMM close to that of mode-IMM. However, in the untuned hyperparameter setting, mean-IMM performs worse when more transfer techniques are applied. Our Bayesian interpretation of IMM assumes that the SGD training of the k-th network µk is mainly affected by the k-th task and is rarely affected by the information of the previous tasks. However, transfer techniques break this assumption; thus the optimal α is shifted to larger than 1/k. Fortunately, mode-IMM works more robustly than mean-IMM when transfer techniques are applied. Figure 4 illustrates how the test accuracy curve changes with the applied transfer techniques, and the resulting shift of the optimal α in mean-IMM and mode-IMM.

Bayesian Approach on Continual Learning. Kirkpatrick et al. [8] interpreted the Fisher matrix F as weight importance in explaining their EWC model. In the shuffled MNIST experiment, since a large number of pixels always have a value of zero, the corresponding elements of the Fisher matrix are also zero. Therefore, EWC works by allowing the weights that are not used in the previous tasks to change. On the other hand, mode-IMM also works by selectively balancing between two sets of weights using variance information. However, these assumptions on weight importance do not always hold, especially in the disjoint MNIST experiment. The most important weight in the disjoint MNIST experiment is the bias term in the output layer. Nevertheless, these bias parts of the Fisher matrix are not guaranteed to have the highest values, nor can they be used to balance the class distribution between the first and the second task. We believe that using only the diagonal of the covariance matrix in Bayesian neural networks is too naïve in general, and that this is why EWC failed in the disjoint MNIST experiment. We think this could be alleviated in future work by using a more complex prior, such as a matrix Gaussian distribution that considers the correlations between nodes in the network [31].

Balancing the Information of an Old and a New Task. The IMM procedure produces a neural network µk with no performance loss on the k-th task, which is better than the final solution µ1:k in terms of the performance on the k-th task. Furthermore, IMM can easily weigh the importance of tasks in IMM models in real time.
For example, αt can easily be changed in the solution of mean-IMM, µ1:k = Σ_{t=1}^{k} αt µt. In actual service situations at IT companies, the importance of the old and the new task frequently changes in real time, and IMM can handle this problem. This property differentiates IMM from other continual learning methods that use the regularization approach, including LwF and EWC.

7 Conclusion

Our contributions are fourfold. First, we applied mean-IMM to the continual learning of modern deep neural networks. Mean-IMM achieves results competitive with those of comparative models and balances the information between an old and a new network. We also interpreted the success of IMM in the Bayesian framework with a Gaussian posterior. Second, we extended mean-IMM to mode-IMM with the interpretation of mode-finding in a mixture of Gaussian posteriors. Mode-IMM outperforms mean-IMM and comparative models on various datasets. Third, we introduced drop-transfer, a novel method proposed in this paper. Experimental results showed that drop-transfer alone performs well and is comparable to EWC without dropout in the domain where EWC rarely forgets. Fourth, we applied various transfer techniques in the IMM procedure to make our Gaussian assumption reasonable. We argued not only that the search space of the loss function between neural networks can easily be made nearly convex, but also that regularizers, such as dropout, smooth the search space so that points in it have good accuracy. Experimental results showed that applying transfer techniques often boosts the performance of IMM.
Overall, we achieved state-of-the-art performance on various continual learning datasets and explored geometrical properties and a Bayesian perspective of deep neural networks.

Acknowledgments

The authors would like to thank Jiseob Kim, Min-Oh Heo, Donghyun Kwak, Insu Jeon, Christina Baek, and Heidi Tessmer for helpful comments and editing. This work was supported by the Naver Corp. and partly by the Korean government (IITP-R0126-16-1072-SW.StarLab, IITP-2017-0-01772-VTT, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF). Byoung-Tak Zhang is the corresponding author.

References

[1] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.

[2] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

[3] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

[4] Rupesh K Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems, pages 2310–2318, 2013.

[5] Zoubin Ghahramani. Online variational Bayesian learning. In NIPS Workshop on Online Learning, 2000.

[6] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.

[7] Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.

[8] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.

[9] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[10] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1613–1622, 2015.

[11] Jacob Goldberger and Sam T Roweis. Hierarchical clustering of a mixture model. In Advances in Neural Information Processing Systems, pages 505–512, 2005.

[12] Sang-Woo Lee, Chung-Yeon Lee, Dong Hyun Kwak, Jiwon Kim, Jeonghee Kim, and Byoung-Tak Zhang. Dual-memory deep learning architectures for lifelong learning of everyday human behaviors. In Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1669–1675, 2016.

[13] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[14] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks.
arXiv preprint arXiv:1701.08734, 2017.

[15] Zhen Huang, Jinyu Li, Sabato Marco Siniscalchi, I-Fan Chen, Chao Weng, and Chin-Hui Lee. Feature space maximum a posteriori linear regression for adaptation of deep neural networks. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[16] Zhen Huang, Sabato Marco Siniscalchi, I-Fan Chen, Jinyu Li, Jiadong Wu, and Chin-Hui Lee. Maximum a posteriori adaptation of network parameters in deep models. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[17] Abdullah Rashwan, Han Zhao, and Pascal Poupart. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1469–1477, 2016.

[18] Kai Zhang and James T Kwok. Simplifying mixture models through function approximation. IEEE Transactions on Neural Networks, 21(4):644–658, 2010.

[19] Manas Pathak, Shantanu Rane, and Bhiksha Raj. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems, pages 1876–1884, 2010.

[20] Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pages 2814–2822, 2013.

[21] Surajit Ray and Bruce G Lindsay. The topography of multivariate normal mixtures. Annals of Statistics, pages 2042–2065, 2005.

[22] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

[23] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.

[24] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks?
In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.

[25] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.

[26] Wolf Kienzle and Kumar Chellapilla. Personalized handwriting recognition via biased regularization. In Proceedings of the 23rd International Conference on Machine Learning, pages 457–464. ACM, 2006.

[27] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[28] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Tech. Rep. CNS-TR-2011-001, 2011.

[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[30] Sang-Woo Lee, Chung-Yeon Lee, Dong-Hyun Kwak, Jung-Woo Ha, Jeonghee Kim, and Byoung-Tak Zhang. Dual-memory neural networks for modeling cognitive activities of humans via wearable sensors. Neural Networks, 2017.

[31] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. arXiv preprint arXiv:1603.04733, 2016.

[32] Surajit Ray and Dan Ren. On the upper bound of the number of modes of a multivariate normal mixture. Journal of Multivariate Analysis, 108:41–52, 2012.

[33] Carlos Améndola, Alexander Engström, and Christian Haase. Maximum number of modes of Gaussian mixtures. arXiv preprint arXiv:1702.05066, 2017.

[34] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes.
arXiv preprint arXiv:1312.6114, 2013.