Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Summary: The paper presents a regularization-based continual learning method, UCL, in which, while training on the current task, the parameters of the network are regularized according to their uncertainty on the previous tasks (lower uncertainty means that a parameter is important and should not be altered in future tasks). Instead of measuring uncertainty at the parameter level, as done in earlier work such as Variational Continual Learning (VCL), the authors propose to measure uncertainty over the neurons, which results in fewer learnable parameters (means and variances) to store. To compute a neuron's uncertainty, UCL imposes the constraint that all the weights going into a neuron share a common variance. To learn the parameters, a variational objective is used: the authors expand the KL term in the ELBO and modify its components to impose constraints on the variances of different neurons. Results are reported on MNIST benchmarks and RL tasks.

Positives:
1 - The paper is well-written and the final model is well-motivated.
2 - The KL term (Eq. 3) gives interesting insights, and the authors exploit these insights carefully to impose different restrictions on the parameter updates, avoiding forgetting while keeping enough network capacity to learn new tasks.
3 - I quite like the ablation performed on different facets of the objective function in Figure 5.

Negatives: I overall quite liked the paper. While the method is well-motivated, I am concerned about the experiments (especially the supervised learning experiment). Reporting results only on MNIST is a growing concern in the CL community, and there is enough literature showing that MNIST is not a very good benchmark for testing whether a continual learning method works. While most of the other benchmarks that people use, e.g. Split-[CIFAR, CUB, miniImageNet], have their own problems, they at least make the problem setup more complex and interesting. At the very least, I expect the authors to try their method on these benchmarks and see how it fares compared to others. I am leaning towards borderline pending the requested experiments.

Post-rebuttal: I read the authors' response, and it adequately addresses my main concern, namely that the approach was tested only on MNIST in the supervised learning experiments. As for the novelty, I still believe the proposed interpretation of the KL term in the ELBO is original and gives interesting insights into online Bayesian learning frameworks. I will recommend acceptance.
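To make the shared-variance constraint in the summary above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes factorized Gaussian posteriors, uses the standard closed-form KL divergence between univariate Gaussians, and ties one standard deviation to all weights entering each output neuron. The layer sizes, the numeric values, and the helper name `gaussian_kl` are illustrative assumptions.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Summed elementwise KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) )."""
    ratio = (sigma_q / sigma_p) ** 2
    mean_term = ((mu_q - mu_p) / sigma_p) ** 2
    return 0.5 * np.sum(mean_term + ratio - np.log(ratio) - 1.0)

rng = np.random.default_rng(0)
n_in, n_out = 4, 3  # hypothetical layer sizes

# Per-weight means after the previous task and during the current task.
mu_prev = rng.normal(size=(n_in, n_out))
mu_curr = mu_prev + 0.1 * rng.normal(size=(n_in, n_out))

# One standard deviation per output neuron (assumed values), shared by
# every weight entering that neuron, i.e. by a whole column of the matrix.
sigma_prev_node = np.array([0.05, 0.5, 1.0])
sigma_curr_node = np.array([0.05, 0.4, 0.9])

# Broadcast the node-level values to the shape of the weight matrix.
sigma_prev = np.broadcast_to(sigma_prev_node, (n_in, n_out))
sigma_curr = np.broadcast_to(sigma_curr_node, (n_in, n_out))

kl = gaussian_kl(mu_curr, sigma_curr, mu_prev, sigma_prev)

# Storage comparison: a mean per weight in both cases, plus either one
# variance per weight or one variance per neuron.
per_weight_params = 2 * n_in * n_out
per_neuron_params = n_in * n_out + n_out
```

In this sketch the mean-difference term is divided by the previous standard deviation, so a change to a weight entering a low-uncertainty neuron is penalized more heavily, which matches the intuition in the summary that low uncertainty marks a parameter as important.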
Post-rebuttal: I have read the rebuttal, and I think it has sufficiently addressed my questions. With the new experiments for supervised learning and reinforcement learning, I think the paper is much stronger now. So, I will vote for accepting this paper.
---------
- Originality: I think the proposed method is novel.
- Quality: 1. I find the proposed regularization term a bit messy, with 3 components added to the original objective function. However, the paper explains the reasons behind the modifications well. 2. On L206: did you draw a single weight sample once at the beginning of the optimization process, or did you draw a new sample at every iteration? From Eq. 7, it seems that you took the first approach, and I find it very strange that this could work at all, since the log-likelihood term would not depend on the parameters of the model in that case.
- Clarity: The paper is mostly clear, except that the authors use "so-called" too many times in Sec 3.1, which gives the impression that they do not agree with the names of the methods in the literature. Please consider fixing this if it is not intentional.
- Significance: I think the contributions of this paper are reasonably significant.
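The sampling question raised under Quality can be illustrated with a small NumPy sketch. This is one possible reading of the concern, not a reconstruction of the paper's Eq. 7: a toy one-dimensional linear model with a squared-error stand-in for the negative log-likelihood, where the gradient with respect to the mean is estimated by finite differences. All names and values here (`neg_log_lik`, `grad_mu_fd`, `mu`, `sigma`) are assumptions made for illustration.

```python
import numpy as np

def neg_log_lik(w, x, y):
    """Squared-error stand-in for the negative log-likelihood of y = w * x."""
    return 0.5 * np.mean((w * x - y) ** 2)

def grad_mu_fd(loss_of_mu, mu, h=1e-5):
    """Central finite-difference gradient of a scalar loss with respect to mu."""
    return (loss_of_mu(mu + h) - loss_of_mu(mu - h)) / (2 * h)

rng = np.random.default_rng(0)
x = rng.normal(size=32)
y = 2.0 * x            # toy data generated with a true weight of 2
mu, sigma = 0.5, 0.3   # variational mean and standard deviation (assumed)

# Reading A: the weight is drawn once and then held as a constant, so the
# likelihood term no longer changes when mu changes.
w_fixed = mu + sigma * rng.normal()
grad_fixed = grad_mu_fd(lambda m: neg_log_lik(w_fixed, x, y), mu)

# Reading B: the weight is reparameterized as w = mu + sigma * eps, so the
# likelihood term remains a function of mu (and sigma).
eps = rng.normal()
grad_reparam = grad_mu_fd(lambda m: neg_log_lik(m + sigma * eps, x, y), mu)
```

Under reading A the likelihood gradient with respect to the mean is exactly zero, which is the failure mode described in the question; under reading B it is not, whether the noise is redrawn at every iteration or only once.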
This paper changes the regularization used in the VCL algorithm based on some heuristic intuitions. The idea of maintaining the uncertainty for each hidden node is actually interesting. It also makes more sense to give a weight high regularization strength when either of the nodes it connects has low uncertainty. However, I think the proposed method focuses on modifying the regularization, which provides only an incremental improvement over VCL. The rationale behind the proposed method is not well explained from a theoretical viewpoint. The effectiveness of the proposed algorithm cannot be verified through a fully connected network with two hidden layers alone. Since this paper uses two hyperparameters to control the uncertainty, it is important to include experimental results that show their influence.
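The node-level rule mentioned above, where a weight is regularized strongly when either node it connects has low uncertainty, can be sketched as follows. This is a hedged illustration rather than the paper's exact formula: it assumes the strength is the larger of the two inverse standard deviations and applies it to a squared penalty on the change in the weight means. The function name `weight_reg_strength` and all values are hypothetical.

```python
import numpy as np

def weight_reg_strength(sigma_in, sigma_out):
    """Strength for weight (i, j): large if either connected node is certain."""
    inv_in = 1.0 / sigma_in[:, None]    # shape (n_in, 1)
    inv_out = 1.0 / sigma_out[None, :]  # shape (1, n_out)
    return np.maximum(inv_in, inv_out)  # shape (n_in, n_out)

# Assumed node uncertainties learned on previous tasks.
sigma_in = np.array([0.1, 1.0])        # input-side nodes
sigma_out = np.array([1.0, 0.2, 2.0])  # output-side nodes

strength = weight_reg_strength(sigma_in, sigma_out)

# Penalty on moving each weight mean by 0.1 away from its previous value.
mu_prev = np.zeros((2, 3))
mu_curr = np.full((2, 3), 0.1)
penalty = np.sum((strength * (mu_curr - mu_prev)) ** 2)
```

Here every weight leaving the first input node receives strength 10 because that node has low uncertainty, regardless of the output node it connects to, while weights leaving the second input node are strongly regularized only when they enter the low-uncertainty output node.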