Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
This work proposes applying a Kronecker-factored online Laplace approximation to overcome catastrophic forgetting in neural networks. My main criticism of this paper is its lack of novelty/originality. As mentioned in the paper, online Laplace propagation for continual learning of neural networks has already been explored in elastic weight consolidation (EWC) and its variants. Also, the Kronecker-factored approximation of the Hessian has already been studied by Botev et al. Hence it was hard for me to find any part of the work that did not come from existing works. Still, I think this work makes a useful contribution to the field by building on the popular framework of Laplace propagation with state-of-the-art Hessian approximations, and it might be worth accepting to the conference. In particular, it is worth pointing out that the empirical results show strong performance of the proposed method and should be of broad interest.

As a minor question, I am confused by the overall experimental results, in particular that both EWC and synaptic intelligence (SI) seem to underperform by a large margin compared to a simple diagonal approximation of the Hessian, which is contrary to the experiments in the respective papers. Are there any possible explanations for this?

-------
I have read the authors' feedback and am keeping my score as it is. However, I am still a bit surprised that the online Laplace algorithm outperforms the two other algorithms in Figure 1, since the comparison between EWC and online Laplace is still under debate. In particular, the reference mentioned by the authors has been answered in "Reply to Huszár: The elastic weight consolidation penalty is empirically valid" by Kirkpatrick et al. Furthermore, SI was originally reported to outperform both EWC and online Laplace. However, I also believe that this might result from different experimental settings, especially the use of up to 50 tasks. Hence, I would like to suggest that the authors add a footnote describing this discrepancy and indicating that it goes against some existing works, to prevent readers' confusion.
*** Edit after the rebuttal period: I have read the other reviews and the author feedback. The experiments in this paper are solid and the storyline is clear; however, the proposed method is of arguably low novelty. I will keep my evaluation and rating as is. ***

Summary: The paper considers an online structured Laplace approximation for training neural networks to mitigate catastrophic forgetting. The idea is to perform a Laplace approximation for a task to obtain a Gaussian approximation to the posterior, use this Gaussian as the prior over the weights for the next task, and repeat. This adds an L2 regularisation term to the objective and requires the Hessian of the objective evaluated at the MAP estimate, which is approximated by a Kronecker-factored empirical Fisher information matrix. Adding multiple penalties as in EWC and reweighting the empirical FIM term were also considered. The take-home message seems to be that using a non-diagonal approximation to the posterior gives a significant gain for all methods across many supervised training tasks.

Comments: The paper is very well written and in my opinion deserves acceptance. The key idea is relatively simple in hindsight: use structured curvature information for the Laplace approximation (Ritter et al., 2018) in the continual learning setting. Previous works such as EWC, which can be interpreted as an online Laplace approximation, only consider a diagonal approximation to the Fisher information matrix. The experiments show that the structured approximation is clearly beneficial to online learning, in the same way that it yields a better Laplace approximation in the batch setting (Ritter et al., 2018). This paper combines several existing techniques and approximations to form a principled approach to continual learning.
One could argue that the novelty of this paper is incremental, but the practical advantage of the proposed method seems significant and, as such, could generate interest in the community, as continual learning has been an active area of research in the last couple of years. One thing I dislike is that this method relies on a validation set to tune the regularisation hyperparameters and that the performance is very sensitive to these hyperparameters; it feels unsatisfying that vanilla (approximate) online Bayesian updating is not sufficient to guarantee good performance. The variational continual learning framework of Nguyen et al. (2018), for example, does not introduce any additional hyperparameters to reweight the regularisation term and the expected likelihood term, while still performing quite well with a diagonal Gaussian approximation.
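
For concreteness, the recursion summarised in this review can be sketched as below. The notation is assumed here for illustration and is not taken from the paper: $\theta^*_t$ is the MAP estimate after task $t$, $\Lambda_t$ the accumulated Gaussian precision, $H_t$ the curvature term for task $t$, $\lambda$ the tuned reweighting hyperparameter mentioned above, and $A_t^{(\ell)}$, $G_t^{(\ell)}$ the two Kronecker factors for layer $\ell$.

```latex
\begin{align}
\theta^*_t &= \arg\max_{\theta}\; \log p(\mathcal{D}_t \mid \theta)
  - \frac{\lambda}{2}\,(\theta - \theta^*_{t-1})^\top \Lambda_{t-1}\,(\theta - \theta^*_{t-1}), \\
\Lambda_t &= \Lambda_{t-1} + H_t,
  \qquad H_t = -\nabla^2_{\theta} \log p(\mathcal{D}_t \mid \theta)\Big|_{\theta = \theta^*_t}, \\
H_t^{(\ell)} &\approx A_t^{(\ell)} \otimes G_t^{(\ell)}
  \quad \text{(block-diagonal over layers, Kronecker-factored within each layer).}
\end{align}
```

Setting $\lambda = 1$ corresponds to the vanilla approximate online Bayesian update; a diagonal $H_t$ corresponds to the diagonal baseline discussed in the reviews.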
The proposed method improves upon the recently developed EWC and SI by incorporating a Kronecker-factored approximation of the Hessian matrix. Whereas previous work used the diagonal of the Fisher information matrix to approximate the precision matrix of the Gaussian posterior over the parameters, this work exploits the block structure of the Hessian matrix. The paper also uses an online version of the approximation, which is based on Bayesian online learning, and shows experimentally that the proposed method significantly outperforms the two baselines, EWC and SI. To improve the manuscript, I think it would be better to point out the computational complexity of the suggested method. Namely, for EWC, computing the regularizer simply requires computing the gradient for each parameter, but the proposed method may need more computation. Clearly pointing out the complexity will make the paper stronger.

######### Update after rebuttal
I have read the authors' rebuttal. Regarding the computational complexity, I think what the authors argued in the rebuttal seems reasonable. My initial rating of the paper was mainly based on the strong experimental result given in Fig. 1: running the experiments up to 50 tasks and showing the big gap between the proposed method and EWC/SI, which constitute the current state of the art. However, after reading the other reviewers' comments and checking the references more carefully, I also realized that the technical novelty, including the Kronecker-factored approximation of the Hessian, is low. Nevertheless, I still think the experimental contribution of Fig. 1 is strong and valuable to the wider community. So, I decided to reduce my original rating to 7.
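
The complexity question raised in this review can be illustrated with a minimal sketch of the per-layer quadratic penalty. This is not the authors' implementation: the function names and the factor shapes are hypothetical, chosen only for illustration. It assumes a layer with weight matrix of shape `(n_out, n_in)`, a Kronecker-factored precision `A ⊗ G` with `A` of shape `(n_in, n_in)` and `G` of shape `(n_out, n_out)`, and column-major vectorisation of the weights.

```python
import numpy as np


def diag_penalty(delta_w, diag_precision):
    """Diagonal penalty: one precision entry per weight."""
    return 0.5 * np.sum(diag_precision * delta_w ** 2)


def kf_penalty(delta_w, a_factor, g_factor):
    """Kronecker-factored penalty 0.5 * vec(dW)^T (A kron G) vec(dW).

    Uses the identity (A kron G) vec(dW) = vec(G dW A^T), so the full
    (n_in * n_out) x (n_in * n_out) matrix is never formed.
    """
    return 0.5 * np.sum(delta_w * (g_factor @ delta_w @ a_factor.T))


def dense_penalty(delta_w, a_factor, g_factor):
    """Reference version that builds the full Kronecker product explicitly."""
    full = np.kron(a_factor, g_factor)
    vec = delta_w.flatten(order="F")  # column-major vec, matching A kron G
    return 0.5 * vec @ full @ vec
```

Under these assumptions, the diagonal penalty stores `n_in * n_out` numbers per layer, the Kronecker-factored penalty stores `n_in**2 + n_out**2`, and the dense form would store `(n_in * n_out)**2`. The factored form therefore costs more than the diagonal one but avoids the dense matrix entirely.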