NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5
Title: Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation

Reviewer 1

Overall, the proposed method is original (it harnesses the linear modulation idea from the visual question answering and style transfer fields) and efficient. The paper is well written and clear. The scope of the algorithm is broad, which makes the proposed method useful and significant. The algorithms of the MAML family (Finn, C., Abbeel, P., & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. 2017) are designed to find model weights that are a good starting point for fine-tuning on a task not seen during training. To achieve this, these algorithms rely on a set of different tasks, a task being a set of annotated data and an error criterion. The model is trained in such a way that a few steps of fine-tuning on these different tasks yield the lowest possible error on them. Because several tasks were seen during training, the result is an efficient initialization for fine-tuning on a new test-time task. These frameworks yield strong performance on several families of tasks, such as few-shot learning and reinforcement learning. MAML (and its relatives) seeks a common initialization for all the tasks at hand, no matter how different they are. This paper argues that seeking a single initialization for an entire task distribution limits the achievable performance over that distribution, and will thus prevent the algorithm from working on a diverse task distribution. The paper then proposes a new meta-learning method that overcomes this limitation by having an auxiliary network modulate the initialization depending on the task mode, and evaluates its effectiveness on several families of tasks. The proposed method is state-of-the-art on all the evaluated tasks. The Introduction and Related Works sections deal with meta-learning in general and its limitations. No major works are missing from the bibliography.
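For reference, the bi-level training loop summarized above can be sketched minimally. This is a hypothetical first-order illustration on a toy linear-regression task distribution, not the authors' implementation; all names and hyper-parameters are made up.

```python
import numpy as np

def task_loss_grad(theta, x, y):
    """Squared-error loss and gradient for a scalar linear model y_hat = theta * x."""
    pred = theta * x
    loss = np.mean((pred - y) ** 2)
    grad = np.mean(2 * (pred - y) * x)
    return loss, grad

def fomaml_step(theta, tasks, inner_lr=0.1, outer_lr=0.1, inner_steps=1):
    """One first-order MAML meta-update: adapt per task on its support set,
    then average the post-adaptation (query-set) gradients to move the
    shared initialization."""
    meta_grad = 0.0
    for (x_tr, y_tr, x_val, y_val) in tasks:
        phi = theta
        for _ in range(inner_steps):                   # inner loop: fine-tune
            _, g = task_loss_grad(phi, x_tr, y_tr)
            phi = phi - inner_lr * g
        _, g_val = task_loss_grad(phi, x_val, y_val)   # evaluate adapted params
        meta_grad += g_val
    return theta - outer_lr * meta_grad / len(tasks)

# Toy task distribution: y = a * x with a task-specific slope a.
rng = np.random.default_rng(0)
theta = 0.0
for _ in range(200):
    tasks = []
    for _ in range(4):
        a = rng.uniform(0.5, 1.5)
        x = rng.normal(size=8)
        tasks.append((x[:4], a * x[:4], x[4:], a * x[4:]))
    theta = fomaml_step(theta, tasks)
```

After training, `theta` lies near the middle of the slope range, so that a few inner-loop steps suffice to reach any single task, which is exactly the initialization-sharing behavior the review describes.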
The Related Works and Preliminaries sections focus on the MAML algorithm, which is natural because the current algorithm is built upon MAML and is fairly different from other kinds of meta-learning methods. The method is well explained, although reading the supplementary material may be required to understand the details of the parameter modulation. The FiLM modulation operation is taken from a paper in the visual question answering field, but the style transfer field has also used similar methods (AdaIN) to control the style of the output image based on the style of an input image. The figures are clear, but some of them (Figures 2 and 3) need to be viewed in color. The Experiments section follows the same structure as the MAML experiments: regression, few-shot classification, and reinforcement learning. The experiments are adapted to the method but do not look hand-crafted for this particular algorithm; for instance, merging datasets (with the same labels) seems a reasonable strategy to create a multimodal task distribution. The section covers all aspects of the conducted experiments, and many more details are available in the supplementary material (such as the network architectures and training hyper-parameters used). The comparison baselines are well chosen, even though the MAML method has since received a few general improvements. The source code is released along with the paper and is of great quality.
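The FiLM/AdaIN operation mentioned above is, at its core, a per-channel affine transformation whose scale and shift are predicted from side information (the task embedding in MMAML, the question in VQA, the style image in AdaIN). A minimal sketch with illustrative values:

```python
import numpy as np

def film(features, gamma, beta):
    """FiLM-style modulation: scale and shift each feature channel with
    parameters gamma and beta predicted by a separate conditioning network
    (omitted here)."""
    return gamma * features + beta

feats = np.ones((2, 3))            # (batch, channels) feature activations
gamma = np.array([2.0, 0.5, 1.0])  # per-channel scale
beta = np.array([0.0, 1.0, -1.0])  # per-channel shift
out = film(feats, gamma, beta)     # -> [[2.0, 1.5, 0.0], [2.0, 1.5, 0.0]]
```

The same operation underlies both the VQA and style transfer usages the review contrasts; the fields differ only in what conditions the predicted `gamma` and `beta`.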

Reviewer 2

This paper suggests an extension to MAML that focuses on integrating model-based meta-learners and gradient-based meta-learners. The method recognizes the task "mode" and adapts an initialization that can then be refined through a few gradient steps. A practical optimization algorithm is presented for learning the proposed framework. The paper is technically sound. MMAML is tested on three different problems: regression on synthetic data, image classification on challenging datasets, and RL on standard benchmarks. In the image classification tests, Mini-ImageNet is used for the two-mode case, while the additional datasets in the 3-mode and 5-mode cases are not very challenging. The paper claims that the gap between the proposed method and MAML is larger when there are more modes, suggesting that the impact of the method is better seen in cases with more modes. However, it is not clear whether the improvement in performance is due to a better ability to handle a higher number of modes, or because the additional modes are simple tasks with fairly similar distributions. The two-digit MNIST and three-digit MNIST datasets added in the 5-mode case are both easy tasks and have relatively similar distributions. While lower accuracies are expected when more modes are present, the five-mode experiments show very high accuracies compared to the 2-mode tests, perhaps because the MNIST-based datasets are too easy. Other classification datasets such as CIFAR might be a better choice for evaluation. The paper is generally well written and clearly structured. The idea has good practical impact and addresses a limitation of MAML. Update after author feedback: the authors have addressed the concerns, and I would like to update my overall score to 8.

Reviewer 3

This paper studies meta-learning in the context of a multimodal task distribution (e.g., few-shot image classification where input-output pairs come from entirely different datasets). The authors first note that MAML—which finds a single initialization of the parameters and updates those parameters to the task via standard gradient updates—is not well suited to this setup because the diversity of the tasks likely requires substantially different parameters. Motivated by this observation, the paper proposes an extension to MAML, called multimodal MAML (MMAML), which is designed to capture the multimodality in parameter space. More specifically, the paper uses a separate modulation network that adapts the task parameters through an affine feature transformation. The modulated task parameters then undergo the usual MAML gradient update. The paper evaluates the proposed extension on three tasks: a synthetic few-shot regression problem, a few-shot image classification problem, and a meta-RL problem. The authors compare their method against two baselines: MAML and multi-MAML, a variant of MAML that has access to the ground-truth task mode label. For the regression task, they show that 1) MMAML significantly outperforms MAML (especially when the number of modes increases), 2) surprisingly, MMAML also outperforms multi-MAML, and 3) the modulation step does most of the work, already resulting in decent performance on the regression tasks without using the gradient steps. More or less similar findings are reported for the image classification problem. Strengths: 1. The paper is well written, and the core idea is well motivated and easy to follow. 2. The method is quite versatile and is evaluated on a diverse set of tasks (ranging from synthetic regression to image classification and reinforcement learning). 3. I believe the baselines are sensible (though the paper would benefit from comparison to stronger methods; see weaknesses). Weaknesses: 1.
I believe the paper would be much stronger if it compared against stronger baselines like prototypical networks, Proto-MAML, Bayesian MAML, and TADAM. Though the authors mention these methods in the related work, they do not directly compare against them. In the supplementary material, they say the following about Bayesian MAML: "We believe the model distribution is still unimodal (with a Gaussian prior), which is not well-designed to address multimodal task distributions (similar to MAML)." Such claims should be backed by empirical evaluations. 2. Although the authors show through t-SNE visualizations that the modulation network successfully separates the modes of the task distribution, I believe more can be done to investigate the learned modulation step. For example, one of the premises of the proposed method is that the pure gradient step of MAML cannot take you far enough in parameter space to obtain good performance across modes. It would be informative to check the norm of the modulation step versus the norm of the gradient updates, and to compare these with the norm of the MAML updates. Typos: L83, "Those". UPDATE: After reading the rebuttal, I have updated my score to 7.
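The norm diagnostic suggested in Reviewer 3's second weakness could be computed as follows; the function and the numeric values are purely illustrative (flattened parameter vectors before modulation, after modulation, and after the gradient steps), not results from the paper.

```python
import numpy as np

def update_norms(theta_init, theta_modulated, theta_adapted):
    """Compare how far the modulation step vs. the subsequent gradient
    steps move the parameters (all arguments are flat parameter vectors)."""
    mod_norm = np.linalg.norm(theta_modulated - theta_init)
    grad_norm = np.linalg.norm(theta_adapted - theta_modulated)
    return mod_norm, grad_norm

# Illustrative numbers only: a large modulation step, then a small gradient step.
init = np.zeros(4)
modulated = np.array([1.0, -1.0, 0.5, 0.0])
adapted = modulated + np.array([0.1, 0.0, -0.1, 0.05])
mod_norm, grad_norm = update_norms(init, modulated, adapted)
```

If the premise holds, one would expect `mod_norm` to dominate `grad_norm` for MMAML, while a plain MAML update of comparable total magnitude would have to be covered by gradient steps alone.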