Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation

Risto Vuorio*1    Shao-Hua Sun*2    Hexiang Hu2    Joseph J. Lim2

1University of Michigan    2University of Southern California
vuoristo@gmail.com    {shaohuas, hexiangh, limjj}@usc.edu

Abstract

Model-agnostic meta-learners aim to acquire meta-learned parameters from similar tasks to adapt to novel tasks from the same distribution with few gradient updates. With the flexibility in the choice of models, these frameworks demonstrate appealing performance on a variety of domains such as few-shot image classification and reinforcement learning. However, one important limitation of such frameworks is that they seek a common initialization shared across the entire task distribution, which substantially limits the diversity of the task distributions they are able to learn from. In this paper, we augment MAML [5] with the capability to identify the mode of tasks sampled from a multimodal task distribution and to adapt quickly through gradient updates. Specifically, we propose a multimodal MAML (MMAML) framework, which modulates its meta-learned prior parameters according to the identified mode, allowing more efficient fast adaptation. We evaluate the proposed model on a diverse set of few-shot learning tasks, including regression, image classification, and reinforcement learning. The results not only demonstrate the effectiveness of our model in modulating the meta-learned prior in response to the characteristics of tasks but also show that training on a multimodal distribution can produce an improvement over unimodal training. The code for this project is publicly available at https://vuoristo.github.io/MMAML.

1 Introduction

Humans make effective use of prior knowledge to acquire new skills rapidly. When the skill of interest is related to a wide range of skills that one has mastered before, we can recall relevant knowledge of prior skills and exploit it to accelerate the new skill acquisition procedure.
For example, imagine\nthat we are learning a novel snowboarding trick with knowledge of basic skills about snowboarding,\nskiing, and skateboarding. We accomplish this feat quickly by exploiting our basic snowboarding\nknowledge together with inspiration from our skiing and skateboarding experience.\nCan machines likewise quickly master a novel skill based on a variety of related skills they have\nalready acquired? Recent advances in meta-learning [48, 6, 4] have attempted to tackle this problem.\nThey offer machines a way to rapidly adapt to a new task using few samples by \ufb01rst learning an\ninternal representation that matches similar tasks. Such representations can be learned by considering\na distribution over similar tasks as the training data distribution. Model-based (i.e. RNN-based)\nmeta-learning approaches [4, 52, 27, 25] propose to recognize the task identity from a few sample\ndata, use the task identity to adjust a model\u2019s state (e.g. RNN\u2019s internal state or an external memory)\nand make the appropriate predictions with the adjusted model. Those methods demonstrate good\nperformance at the expense of having to hand-design architectures, yet the optimal strategy of\ndesigning a meta-learner for arbitrary tasks may not always be obvious to humans. 
On the other hand, model-agnostic meta-learning frameworks [5, 7, 15, 18, 8, 28, 36, 35] seek an initialization of model parameters such that a small number of gradient updates leads to superior performance on a new task. With the flexibility in the model choices, these frameworks demonstrate appealing performance on a variety of domains, including regression, image classification, and reinforcement learning.

*Contributed equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

While most of the existing model-agnostic meta-learners rely on a single initialization, different tasks sampled from a complex task distribution can require substantially different parameters, making it difficult to find a single initialization that is close to all target parameters. If the task distribution is multimodal with disjoint and far-apart modes (e.g. snowboarding, skiing), one can imagine that a set of separate meta-learners, each covering one mode, could better master the full distribution. However, associating each task with one of the meta-learners not only requires additional task identity information, which is often not available or could be ambiguous when the modes are not clearly disjoint, but also prevents transferring knowledge across different modes of the task distribution. To overcome this issue, we aim to develop a meta-learner that is able to acquire mode-specific prior parameters and adapt quickly given tasks sampled from a multimodal task distribution.
To this end, we leverage the strengths of the two main lines of existing meta-learning techniques: model-based and model-agnostic meta-learning. Specifically, we propose to augment MAML [5] with the capability of generalizing across a multimodal task distribution.
Instead of learning a single initialization point in the parameter space, we propose to first compute the task identity of a sampled task by examining task-related data samples. Given the estimated task identity, our model then performs modulation to condition the meta-learned initialization on the inferred task mode. Then, with these modulated parameters as the initialization, a few steps of gradient-based adaptation are performed on the target task to progressively improve its performance. An illustration of our proposed framework is shown in Figure 1.
To investigate whether our method can acquire meta-learned prior parameters by learning tasks sampled from multimodal task distributions, we design and conduct experiments on a variety of domains, including regression, image classification, and reinforcement learning. The results demonstrate the effectiveness of our approach against other systems. A further analysis also shows that our method learns to identify task modes without extra supervision.
The main contributions of this paper are three-fold:
• We identify and empirically demonstrate the limitation of having to rely on a single initialization in a family of widely used model-agnostic meta-learners.
• We propose a framework together with an algorithm to address this limitation. Specifically, it generates a set of meta-learned prior parameters and adapts quickly given tasks from a multimodal task distribution, leveraging both model-based and model-agnostic meta-learning.
• We design a set of multimodal meta-learning problems and demonstrate that our model offers better generalization ability in a variety of domains, including regression, image classification, and reinforcement learning.

2 Related Work

The idea of empowering machines with the capability of learning to learn [44] has been widely explored by the machine learning community.
To improve upon the efficiency of handcrafted optimizers, a flurry of recent works has focused on learning to optimize a learner model. Pioneered by [38, 2], optimization algorithms with learned parameters have been proposed, enabling the automatic exploitation of the structure of learning problems. From a reinforcement learning perspective, [21] represents an optimization algorithm as a learning policy. [1] trains LSTM optimizers to learn update rules from the gradient history, and [34] trains a meta-learner LSTM to update a learner's parameters. A similar approach for continual learning is explored in [49].
Recently, investigating how we can replicate the ability of humans to learn new concepts from one or a few instances, known as few-shot learning, has drawn attention due to its broad applicability to different fields. To classify images with few examples, metric-based meta-learning frameworks have been proposed [16, 48, 42, 41, 43, 29, 3], which strive to learn a metric or distance function that can be used to compare two different samples effectively. Recent works along this line [29, 53, 19] share a conceptually similar idea with ours and seek to perform task-specific adaptation with different types of transformations. Due to the limited space, we defer the detailed discussion to the supplementary material. While impressive results have been shown, it is nontrivial to adopt these methods for complex tasks such as acquiring robotic skills using reinforcement learning [12, 22, 14, 33, 9, 10, 20].
On the other hand, instead of learning a metric, model-based (i.e. RNN-based) meta-learning models learn to adjust model states (e.g.
a state of an RNN [25, 4, 51] or external memory [37, 27]) using a training dataset, and output the parameters of a learned model or the predictions given test inputs. While these methods have the capacity to learn any mapping from datasets and test samples to their labels, they can suffer from overfitting and show limited generalization ability [6].
Model-agnostic meta-learners [5, 7, 15, 18, 8, 28, 36, 35] are agnostic to concrete model configurations. Specifically, they aim to learn a parameter initialization under a certain task distribution that provides a favorable inductive bias for fast gradient-based adaptation. With their model-agnostic nature, appealing results have been shown on a variety of learning problems. However, assuming tasks are sampled from a concentrated distribution and pursuing a common initialization for all tasks can substantially limit the performance of such methods on multimodal task distributions, where the center in the task space becomes ambiguous.
In this paper, we aim to develop a more powerful model-agnostic meta-learning framework that is able to deal with complex multimodal task distributions. To this end, we propose a framework which first identifies the mode of sampled tasks, similar to model-based meta-learning approaches, and then modulates the meta-learned prior parameters to make the model better fit the identified mode. Finally, the model is rapidly fine-tuned on the target task through gradient steps.

3 Preliminaries

The goal of meta-learning is to quickly learn task-specific functions that map between input data and the desired output {(x_k, y_k)}_{k=1}^{K_t} for different tasks t, where the number of data points K_t is small.
A task is defined by the underlying data generating distribution P(X) and a conditional probability P_t(Y | X). For instance, we consider five-way image classification tasks with x_k being images and y_k being the corresponding labels, sampled from a task distribution. The data generating distribution is unimodal if it contains classification tasks that belong to a single input and label domain (e.g. classifying different combinations of digits). A multimodal counterpart therefore contains classification tasks from multiple different input and label domains (e.g. classifying digits vs. classifying birds). We denote the latter distribution of tasks the multimodal task distribution.
In this paper, we aim to rapidly adapt to a novel task sampled from a multimodal task distribution. We consider a target dataset D consisting of tasks sampled from a multimodal distribution. The dataset is split into meta-training and meta-testing sets, which are further divided into task-specific training D^train_T and validation D^val_T sets. A meta-learner learns about the underlying structure of the task distribution through training on the meta-training set and is evaluated on the meta-testing set.
Our work builds upon the Model-Agnostic Meta-Learning (MAML) algorithm [5]. MAML seeks an initialization of parameters θ for a meta-learner such that it can be optimized towards a new task with a small number of gradient steps minimizing the task-specific objectives on the training data D^train_T, such that the adapted parameters generalize well to the validation data D^val_T.
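To make this bi-level structure concrete, the toy sketch below (our illustration, not the authors' code) implements a first-order variant of the procedure for a scalar linear model with an analytic gradient; full MAML additionally differentiates through the inner gradient steps.

```python
# Toy first-order MAML sketch: the model is y = theta * x with squared
# loss, so gradients are analytic and no autodiff is needed.

def loss_grad(theta, data):
    """Mean squared error and its gradient w.r.t. theta."""
    n = len(data)
    mse = sum((theta * x - y) ** 2 for x, y in data) / n
    grad = sum(2.0 * (theta * x - y) * x for x, y in data) / n
    return mse, grad

def adapt(theta, train_data, alpha=0.3, steps=5):
    """Inner loop: a few gradient steps from the shared initialization."""
    for _ in range(steps):
        _, g = loss_grad(theta, train_data)
        theta -= alpha * g
    return theta

def meta_train(tasks, theta=0.0, alpha=0.3, beta=0.5, iters=200):
    """Outer loop (first-order): move the initialization so the adapted
    parameters do well on each task's validation split."""
    for _ in range(iters):
        outer_grad = 0.0
        for train_data, val_data in tasks:
            theta_task = adapt(theta, train_data, alpha)
            _, g_val = loss_grad(theta_task, val_data)
            outer_grad += g_val  # first-order approximation of the meta-gradient
        theta -= beta * outer_grad / len(tasks)
    return theta

def make_task(slope):
    """A 'task' is regressing y = slope * x; train and val share data here."""
    data = [(x, slope * x) for x in (-1.0, -0.5, 0.5, 1.0)]
    return data, data

# The learned initialization lands between the two task optima (2 and 3),
# so either task is reachable with a few inner steps.
theta0 = meta_train([make_task(2.0), make_task(3.0)])
```

With two tasks of slopes 2 and 3, the meta-learned initialization settles near 2.5, after which five inner steps recover either slope, which is exactly the single-initialization behavior the next paragraphs argue breaks down when modes are far apart.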
The initialization of the parameters is trained by sampling mini-batches of tasks from D, computing the adapted parameters for all D^train_T in the batch, evaluating the adapted parameters to compute the validation losses on D^val_T, and finally updating the initial parameters θ using the gradients from the validation losses.

4 Method

Our goal is to develop a framework to quickly master a novel task from a multimodal task distribution. We call the proposed framework Multimodal Model-Agnostic Meta-Learning (MMAML). The main idea of MMAML is to leverage two complementary neural networks to quickly adapt to a novel task. First, a network called the modulation network predicts the identity of the mode of a task. Then the predicted mode identity is used as an input by a second network called the task network, which is further adapted to the task using gradient-based optimization. Specifically, the modulation network accesses data points from the target task and produces a set of task-specific parameters to modulate the meta-learned prior parameters of the task network.
Finally, the modulated task network (but not the task-specific parameters from the modulation network) is further adapted to the target task through gradient-based optimization. A conceptual illustration can be found in Figure 1.

Algorithm 1 MMAML meta-training procedure
1: Input: Task distribution P(T), hyper-parameters α and β
2: Randomly initialize θ and ω.
3: while not DONE do
4:   Sample a batch of tasks T_j ~ P(T)
5:   for all T_j do
6:     Infer υ = h({x, y}_K; ω_h) with K samples from D^train_{T_j}
7:     Generate parameters τ = {g_i(υ; ω_g) | i = 1, ..., N} to modulate each block of the task network f
8:     Evaluate ∇_θ L_{T_j}(f(x; θ, τ); D^train_{T_j}) w.r.t. the K samples
9:     Compute adapted parameters with gradient descent: θ'_{T_j} = θ − α ∇_θ L_{T_j}(f(x; θ, τ); D^train_{T_j})
10:  end for
11:  Update θ with β ∇_θ Σ_{T_j ~ P(T)} L_{T_j}(f(x; θ'_{T_j}, τ); D^val_{T_j})
12:  Update ω_g with β ∇_{ω_g} Σ_{T_j ~ P(T)} L_{T_j}(f(x; θ'_{T_j}, τ); D^val_{T_j})
13:  Update ω_h with β ∇_{ω_h} Σ_{T_j ~ P(T)} L_{T_j}(f(x; θ'_{T_j}, τ); D^val_{T_j})
14: end while

Figure 1: Model overview. The modulation network produces a task embedding υ, which is used to generate parameters {τ_i} that modulate the task network. The task network adapts the modulated parameters to fit the target task.

In the rest of this section, we introduce our modulation network and a variety of modulation operators in Section 4.1. Then we describe our task network and the training details for MMAML in Section 4.2.

4.1 Modulation Network

As mentioned above, the modulation network is responsible for identifying the mode of a sampled task and generating a set of parameters specific to the task.
To achieve this, it first takes the given K data points and their labels {x_k, y_k}_{k=1,...,K} as input to the task encoder h and produces an embedding vector υ that encodes the characteristics of the task:

υ = h({(x_k, y_k) | k = 1, ..., K}; ω_h)    (1)

Then the task-specific parameters τ are computed based on the encoded task embedding vector υ and are further used to modulate the meta-learned prior parameters of the task network. The task network (introduced later in Section 4.2) can be an arbitrarily parameterized function with multiple building blocks (or layers), such as a deep convolutional network [11] or a multi-layer recurrent network [32]. To modulate the parameters of each block in the task network into a good initialization for solving the target task, we apply block-wise transformations to scale and shift the output activation of each hidden unit in the network (i.e. the output of a channel of a convolutional layer or a neuron of a fully-connected layer). Specifically, the modulation network produces the modulation vectors for each block i, denoted as

τ_i = g_i(υ; ω_g), where i = 1, ..., N,    (2)

where N is the number of blocks in the task network. We formalize the procedure of applying modulation as φ_i = θ_i ⊙ τ_i, where φ_i denotes the modulated prior parameters of the task network and ⊙ represents a general modulation operator. We investigate some representative modulation operations, including attention-based (softmax) modulation [26, 47] and feature-wise linear modulation (FiLM) [31, 30, 13]. We empirically observe that FiLM performs better and is more stable than attention-based modulation (see Section 5 for details), and therefore use FiLM as the default operator for modulation.
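As a concrete illustration of the FiLM-style operator, the sketch below scales and shifts each hidden unit's output of one fully-connected block; the numbers are made up, and the pair (gamma, beta) stands in for a modulation vector τ_i that would be produced by g_i.

```python
# FiLM-style block-wise modulation on one hidden layer (toy numbers).

def linear(x, weights, bias):
    """Plain fully-connected block: one output per hidden unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film_modulate(activations, gamma, beta):
    """FiLM: element-wise scale and shift of a block's activations."""
    return [g * a + b for a, g, b in zip(activations, gamma, beta)]

# Toy block with two hidden units and a 3-dimensional input.
x = [1.0, 2.0, -1.0]
weights = [[0.5, 0.0, 1.0], [0.0, 1.0, 0.0]]
bias = [0.0, 0.5]
h = linear(x, weights, bias)            # [-0.5, 2.5]

# (gamma, beta) would come from the modulation network given the task
# embedding; fixed values here just show the effect.
gamma, beta = [2.0, 1.0], [0.1, -0.5]
h_mod = film_modulate(h, gamma, beta)   # approximately [-0.9, 2.0]
```

Note that the meta-learned weights themselves are untouched; only the activations are rescaled and shifted per hidden unit, which is what makes the operator cheap to generate per task.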
The details of these modulation operators can be found in the supplementary material.

4.2 Task Network

The parameters of each block of the task network are modulated using the task-specific parameters τ = {τ_i | i = 1, ..., N} generated by the modulation network, which yields a mode-aware initialization in the parameter space, f(x; θ, τ). After the modulation step, a few steps of gradient descent are performed on the meta-learned prior parameters of the task network to further optimize the objective function for a target task T_i. Note that the task-specific parameters τ_i are kept fixed and only the meta-learned prior parameters of the task network are updated. We describe the concrete procedure in the form of pseudo-code in Algorithm 1.

Figure 2: Qualitative visualization of regression on the five-mode simple functions dataset (modes: sinusoidal, linear, quadratic, transformed ℓ1 norm, and tanh). (a): We compare the predicted function shapes of modulated MMAML against the prior models of MAML and Multi-MAML, before gradient updates. Our model can fit the target function with limited observations and no gradient updates. (b): The predicted function shapes after five steps of gradient updates; MMAML is qualitatively better. More visualizations can be found in the supplementary material.
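The per-task portion of Algorithm 1 (embed the task, generate τ, then adapt only θ while τ stays fixed) can be sketched in scalar toy form; the encoder and generator below are stand-ins with assumed functional forms, not the paper's networks.

```python
# Scalar toy of Algorithm 1's per-task steps: the task network is
# f(x) = (theta * tau) * x, i.e. the prior theta modulated by a single
# FiLM-like scale tau. Only theta receives gradient updates; tau is
# generated once and then kept fixed, as in the paper.

def encode_task(train_data, w_h=1.0):
    """Stand-in task encoder h: embeds a task as its mean slope."""
    return w_h * sum(y / x for x, y in train_data) / len(train_data)

def generate_tau(embedding, w_g=0.5):
    """Stand-in parameter generator g: one scale for the single block."""
    return 1.0 + w_g * embedding

def mse_grad(theta, tau, data):
    """Gradient of mean squared error w.r.t. the prior theta only."""
    return sum(2.0 * (theta * tau * x - y) * tau * x
               for x, y in data) / len(data)

def mmaml_adapt(theta, tau, train_data, alpha=0.05, steps=3):
    """Gradient-based adaptation of theta with tau held fixed."""
    for _ in range(steps):
        theta -= alpha * mse_grad(theta, tau, train_data)
    return theta

# One task from a "slope 3" mode (nonzero inputs for the toy encoder):
train = [(1.0, 3.0), (2.0, 6.0)]
ups = encode_task(train)          # task embedding: 3.0
tau = generate_tau(ups)           # modulation scale: 2.5
theta = mmaml_adapt(1.0, tau, train)
```

After modulation the effective slope theta * tau starts at 2.5, already near the target 3.0, so a few gradient steps on theta alone suffice, which mirrors the division of labor between the two networks.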
The same procedure of modulation and gradient-based optimization is used both at meta-training and meta-testing time. Detailed network architectures and training hyper-parameters differ by application domain; we defer the complete details to the supplementary material.

5 Experiments

We evaluate our method (MMAML) and baselines in a variety of domains, including regression, image classification, and reinforcement learning, under multimodal task distributions. We consider the following model-agnostic meta-learning baselines:
• MAML [5] represents the family of model-agnostic meta-learners. The architecture of MAML on each individual domain is designed to be the same as the task network in MMAML.
• Multi-MAML consists of M (the number of modes) MAML models, each of which is specifically trained on the tasks sampled from a single mode. The performance of this baseline is evaluated by choosing models based on ground-truth task-mode labels. This baseline can be viewed as an upper bound on the performance of MAML: if it outperforms MAML, it indicates that MAML's performance is degraded by the multimodality of the task distribution. Note that directly comparing the other algorithms to Multi-MAML is not fair, as it uses additional information that is not available in real-world scenarios.
Note that we aim to develop a general model-agnostic meta-learning framework, and therefore comparisons to methods that achieve great performance on only an individual domain are omitted. A more detailed discussion can be found in the supplementary material.

5.1 Regression Experiments

Setups. We experiment with our models on multimodal few-shot regression. In our setup, five pairs of input/output data {x_k, y_k}_{k=1,...,K} are sampled from a one-dimensional function and provided to a learning model.
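A multimodal regression task distribution of this kind can be sketched as follows; the function families match the setup described in this section, while the parameter ranges and the noise level (σ = 0.3, as in Table 1) are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of a multimodal few-shot regression task sampler: pick a mode,
# instantiate a function from that family, and draw K noisy pairs.
import math
import random

MODES = {
    "sinusoidal": lambda a, b: lambda x: a * math.sin(x + b),
    "linear":     lambda a, b: lambda x: a * x + b,
    "quadratic":  lambda a, b: lambda x: a * x ** 2 + b,
    "l1_norm":    lambda a, b: lambda x: a * abs(x) + b,
    "tanh":       lambda a, b: lambda x: a * math.tanh(x) + b,
}

def sample_task(rng, k_shot=5, sigma=0.3):
    """Draw a mode, a function from it, and K noisy (x, y) pairs."""
    mode = rng.choice(sorted(MODES))
    f = MODES[mode](rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0))
    xs = [rng.uniform(-5.0, 5.0) for _ in range(k_shot)]
    data = [(x, f(x) + rng.gauss(0.0, sigma)) for x in xs]
    return mode, data

rng = random.Random(0)
mode, data = sample_task(rng)
```

The mode label is returned only for constructing Multi-MAML-style baselines; MMAML itself sees only the (x, y) pairs.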
The model is asked to predict L output values y^q_1, ..., y^q_L for input queries x^q_1, ..., x^q_L. To construct the multimodal task distribution, we set up five different functions: sinusoidal, linear, quadratic, transformed ℓ1 norm, and hyperbolic tangent functions, and treat them as discrete task modes. We then evaluate three different task combinations, containing two, three, and five of these functions respectively. For each task, five pairs of data are sampled, and Gaussian noise is added to the output value y, which further increases the difficulty of identifying which function generated the data. Please refer to the supplementary material for details and parameters of the regression experiments.

Table 1: Mean square error (MSE) on the multimodal 5-shot regression with 2, 3, and 5 modes. Gaussian noise with μ = 0 and σ = 0.3 is applied. Multi-MAML uses ground-truth task modes to select the corresponding MAML model. Our method (with FiLM modulation) outperforms the other methods by a margin. Columns report MSE after modulation only (Post Mod.) and after gradient-based adaptation (Post Adapt.).

Method | 2 Modes (Post Mod. / Post Adapt.) | 3 Modes (Post Mod. / Post Adapt.) | 5 Modes (Post Mod. / Post Adapt.)
MAML [5] | - / 1.085 | - / 1.231 | - / 1.668
Multi-MAML | - / 0.433 | - / 0.713 | - / 1.082
LSTM Learner | 0.362 / - | 0.548 / - | 0.898 / -
Ours: MMAML (Softmax) | 1.548 / 0.361 | 2.213 / 0.444 | 2.421 / 0.939
Ours: MMAML (FiLM) | 2.421 / 0.336 | 1.923 / 0.444 | 2.166 / 0.868

Table 2: Classification testing accuracies on the multimodal few-shot image classification with 2, 3, and 5 modes. Multi-MAML uses ground-truth dataset labels to select the corresponding MAML models. Our method outperforms MAML and achieves comparable results with Multi-MAML in all scenarios.

Method | 2 Modes (5-way 1-shot / 5-way 5-shot / 20-way 1-shot) | 3 Modes (5-way 1-shot / 5-way 5-shot / 20-way 1-shot) | 5 Modes (5-way 1-shot / 5-way 5-shot / 20-way 1-shot)
MAML [5] | 66.80% / 77.79% / 44.69% | 54.55% / 67.97% / 28.22% | 44.09% / 54.41% / 28.85%
Multi-MAML | 66.85% / 73.07% / 53.15% | 55.90% / 62.20% / 39.77% | 45.46% / 55.92% / 33.78%
MMAML (ours) | 69.93% / 78.73% / 47.80% | 57.47% / 70.15% / 36.27% | 49.06% / 60.83% / 33.97%

Baselines and Our Approach. As mentioned before, we have MAML and Multi-MAML as two baseline methods, both with MLP task networks. Our method (MMAML) augments the task network with a modulation network. We choose an LSTM to serve as the modulation network due to its ability to handle sequential inputs and produce predictive outputs. Data points (sorted by x value) are first input to this network to generate task-specific parameters that modulate the task network. The modulated task network is then further adapted using gradient-based optimization. Two variants of modulation operators, softmax and FiLM, are explored in our approach. Additionally, to study the effectiveness of the LSTM model, we evaluate another baseline (referred to as the LSTM Learner) that uses the LSTM as the modulation network (with FiLM) but does not perform gradient-based updates. Please refer to the supplementary material for the concrete specification of each model.
Results. The quantitative results are shown in Table 1. We observe that MAML has the highest error in all settings and that incorporating task identity (Multi-MAML) can improve over MAML significantly.
This suggests that MAML degenerates under multimodal task distributions. The LSTM Learner outperforms both MAML and Multi-MAML, showing that the sequence model can effectively tackle this regression task. MMAML improves over the LSTM Learner significantly, which indicates that with a better initialization (produced by the modulation network), gradient-based optimization can lead to superior performance. Finally, since FiLM outperforms softmax consistently in the regression experiments, we use it as the modulation method in the rest of the experiments.
We visualize the predicted function shapes of MAML, Multi-MAML and MMAML (with FiLM) in Figure 2. We observe that modulation can significantly modify the prediction of the initial network to be close to the target function (see Figure 2 (a)). The prediction is then further improved by gradient-based optimization (see Figure 2 (b)). A tSNE [23] visualization of the task embeddings (Figure 3) shows that our embedding learns to separate the input data of different tasks, which can be seen as evidence of the mode identification capability of MMAML.

5.2 Image Classification

Setup. The task of few-shot image classification considers the problem of classifying images into N classes with a small number (K) of labeled samples available (i.e. N-way K-shot). To create a multimodal few-shot image classification task, we combine multiple widely used datasets (OMNIGLOT [17], MINI-IMAGENET [34], FC100 [29], CUB [50], and AIRCRAFT [24]) to form a meta-dataset, following the train/test splits used in prior work, similar to [46]. The details of all the datasets can be found in the supplementary material.

Figure 3: tSNE plots of the task embeddings produced by our model from randomly sampled tasks, for (a) regression, (b) image classification, (c) RL Reacher, and (d) RL Point Mass; marker color indicates different modes of a task distribution. The plots (b) and (d) reveal a clear clustering according to different task modes, which demonstrates that MMAML is able to identify the task from a few samples and produce a meaningful embedding υ. (a) Regression: the distance between modes aligns with the intuition of the similarity of functions (e.g. a quadratic function can sometimes be similar to a sinusoidal or a linear function, while a sinusoidal function is usually different from a linear function). (b) Few-shot image classification: each dataset (i.e. mode) forms its own cluster. (c-d) Reinforcement learning: the numbered clusters represent different modes of the task distribution. The tasks from different modes are clearly clustered together in the embedding space.

We train models on the meta-datasets with two modes (OMNIGLOT and MINI-IMAGENET), three modes (OMNIGLOT, MINI-IMAGENET, and FC100), and five modes (all five datasets). We use a 4-layer convolutional network for both MAML and our task network.
Results. The results, shown in Table 2, demonstrate that our method achieves better results than MAML and performs comparably to Multi-MAML. The performance gap between ours and MAML becomes larger as the number of modes increases, suggesting our method can handle multimodal task distributions better than MAML. Also, compared to Multi-MAML, our method achieves slightly better results, partially because our method learns from all the datasets, whereas each Multi-MAML is likely to overfit to a single dataset with a smaller number of classes (e.g. MINI-IMAGENET and FC100). This finding aligns with the current trend of meta-learning from multiple datasets [46].
The detailed performance on each dataset can be found in the supplementary material.
To gain insight into the task embeddings υ produced by our model, we randomly sample 2000 5-mode 5-way 1-shot tasks and employ tSNE to visualize υ in Figure 3 (b), showing that our task embedding network captures the relationship among modes, where each dataset forms an individual cluster. This structure shows that our task encoder learns a reasonable task embedding space, which allows the modulation network to modulate the parameters of the task network accordingly.

5.3 Reinforcement Learning

Figure 4: RL environments. Three environments are used to explore the capability of MMAML to adapt in multimodal task distributions in RL. In all of the environments, the agent is tasked to reach a goal, marked by a star or a sphere in the figures. The goals are sampled from a multimodal distribution in two or three dimensions, depending on the environment. In POINT MASS (a) the agent navigates a simple point mass agent in 2 dimensions. In REACHER (b) the agent controls a 3-link robot arm in 2 dimensions. In ANT (c) the agent controls a four-legged ant robot and has to navigate to the goal. The goals are sampled from the 2-dimensional distribution presented in figure (d), while the agent itself is 3-dimensional.

Figure 5: Visualizations of MMAML and ProMP trajectories in the 4-mode Point Mass 2D environment. Each trajectory originates at the green star. The contours present the multimodal goal distribution. Multiple trajectories are shown per update step.
For each column: the leftmost\n\ufb01gure depicts the initial exploratory trajectories without modulation or gradient adaptation applied.\nThe middle \ufb01gure presents ProMP after one gradient adaptation step and MMAML after a gradient\nadaptation step and the modulation step, which are computed based on the same initial trajectories.\nThe \ufb01gure on the right presents the methods after two gradient adaptation steps in addition to the\nMMAML modulation step.\n\nFigure 6: Visualizations of MMAML and ProMP trajectories in the ANT and REACHER environments.\nThe \ufb01gures represent randomly sampled trajectories after the modulation step and two gradient steps\nfor REACHER and three for ANT. Each frame sequence represents a complete trajectory, with the\nbeginning, middle and end of the trajectories captured by the left, middle and right frames respectively.\nVideos of the trained agents can be found at https://vuoristo.github.io/MMAML/.\n\nSetup. Along with few-shot classi\ufb01cation and regression, reinforcement learning (RL) has been a\ncentral problem where meta-learning has been studied [40, 39, 52, 5, 25, 35]. Similarly to the other\ndomains, the objective in meta-reinforcement learning (meta-RL) is to adapt to a novel task based\non limited experience with the task. For RL problems, the inner loop updates of gradient-based\nmeta-learning take the form of policy gradient updates. For a more detailed description of the\nmeta-RL problem setting, we refer the reader to [35].\nWe seek to verify the ability of MMAML to learn to adapt to tasks sampled from multimodal task\ndistributions based on limited interaction with the environment. We do so by evaluating MMAML\nand the baselines on four continuous control environments using the MuJoCo physics simulator [45].\nIn all of the environments, the agent is rewarded on every time step for minimizing the distance\nto the goal. 
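A minimal sketch of this setup, with a mixture-of-Gaussians goal distribution and a negative-distance reward, is given below; the mixture form, mode centers, and scales are our assumptions for illustration, while the actual environment parameters are in the supplementary material.

```python
# Sketch of a multimodal goal distribution and per-step reward for the
# navigation environments: goals come from one of M modes (here a
# mixture of Gaussians whose centers sit on a circle), and the agent is
# rewarded for minimizing its distance to the unobserved goal.
import math
import random

def sample_goal(rng, num_modes=4, radius=2.0, scale=0.1):
    """Pick a mode uniformly; jitter around the mode's center."""
    m = rng.randrange(num_modes)
    angle = 2.0 * math.pi * m / num_modes
    cx, cy = radius * math.cos(angle), radius * math.sin(angle)
    return (cx + rng.gauss(0.0, scale), cy + rng.gauss(0.0, scale))

def reward(agent_pos, goal):
    """Per-step reward: negative Euclidean distance to the goal."""
    return -math.hypot(agent_pos[0] - goal[0], agent_pos[1] - goal[1])

rng = random.Random(1)
goal = sample_goal(rng)
r0 = reward((0.0, 0.0), goal)  # roughly -radius for an agent at the origin
```

Because the goal never appears in the observation, the initial trajectories carry the only information about the active mode, which is why MMAML embeds them before modulating the policy.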
The goals are sampled from multimodal goal distributions with environment-specific parameters. The agent does not observe the location of the goal directly but has to learn to find it based on the reward instead. To provide intuition about the environments, illustrations of the robots are presented in Figure 4. Example trajectories are presented in Figure 5 for POINT MASS and in Figure 6 for ANT and REACHER. Complete details of the environments and goal distributions can be found in the supplementary material.
Baselines and Our Approach. To identify the mode of a task distribution with MMAML, we run the initial policy to interact with the environment and collect a batch of trajectories. These initial trajectories serve two purposes: computing the adapted parameters using a gradient-based update and modulating the updated parameters based on the task embedding υ computed by the modulation network. The modulation vectors τ are kept fixed for the subsequent gradient updates. Descriptions of the network architectures and training hyperparameters are deferred to the supplementary material. Due to credit-assignment problems present in the MAML for RL algorithm [5], as identified in [35], we optimize our policies and modulation networks with the ProMP [35] algorithm, which resolves these issues.
We use ProMP both as the training algorithm for MMAML and as a baseline. Multi-ProMP is an artificial baseline that shows the performance of training one policy for each mode using ProMP. In practice we train an agent for only one of the modes, since the task distributions are symmetric and the agent is initialized to a random pose.

Table 3: The mean and standard deviation of cumulative reward per episode for multimodal reinforcement learning problems with 2, 4, and 6 modes, reported across 3 random seeds. Multi-ProMP is ProMP trained on an easier task distribution consisting of a single mode of the multimodal distribution, to provide an approximate upper limit on the performance on each task.

             POINT MASS 2D                     REACHER                                ANT
Method       2 Modes    4 Modes    6 Modes    2 Modes      4 Modes      6 Modes      2 Modes    4 Modes
ProMP [35]   -397 ± 20  -523 ± 51  -330 ± 10  -12 ± 2.0    -13.8 ± 2.5  -14.9 ± 2.9  -761 ± 48  -953 ± 46
Multi-ProMP  -109 ± 6   -109 ± 6   -92 ± 4    -4.3 ± 0.1   -4.3 ± 0.1   -4.3 ± 0.1   -624 ± 38  -611 ± 31
Ours         -136 ± 8   -209 ± 32  -169 ± 48  -10.0 ± 1.0  -11.0 ± 0.8  -10.9 ± 1.1  -711 ± 25  -904 ± 37

Results. The results for the meta-RL experiments, presented in Table 3, show that MMAML consistently outperforms the unmodulated ProMP. The good performance of Multi-ProMP, which only considers a single mode, suggests that the difficulty of adaptation in our environments results mainly from the multiple modes. We find that the difficulty of the RL tasks does not scale directly with the number of modes, i.e., the performance gap between MMAML and ProMP for POINT MASS with 6 modes is smaller than the gap between them for 4 modes.
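The modulation-then-adaptation procedure described above (collect initial trajectories, compute the task embedding υ, fix the modulation vectors τ, then take gradient steps on the policy parameters) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the network sizes, the averaging task encoder, the FiLM-style scale-and-shift modulation, and the squared-error surrogate standing in for the policy-gradient objective are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper).
OBS, HID, ACT, EMB = 4, 8, 2, 3

# Meta-learned prior parameters theta (random stand-ins here).
theta = {"W1": 0.1 * rng.normal(size=(HID, OBS)),
         "W2": 0.1 * rng.normal(size=(ACT, HID))}

# Hypothetical modulation-network weights.
W_scale = 0.1 * rng.normal(size=(HID, EMB))
W_shift = 0.1 * rng.normal(size=(HID, EMB))

def encode_task(trajectories):
    """Toy task encoder: average trajectory features into an embedding."""
    return np.mean([t.mean(axis=0) for t in trajectories], axis=0)[:EMB]

def modulate(v):
    """Map the task embedding to FiLM-style scale/shift vectors tau."""
    return 1.0 + W_scale @ v, W_shift @ v

def policy(obs, theta, tau):
    scale, shift = tau
    h = np.tanh(theta["W1"] @ obs) * scale + shift  # modulated hidden layer
    return theta["W2"] @ h

# 1) Run the prior policy to collect a batch of initial trajectories
#    (random observation sequences stand in for real rollouts).
trajectories = [rng.normal(size=(10, OBS)) for _ in range(5)]

# 2) Compute the task embedding and modulation once; tau then stays fixed.
v = encode_task(trajectories)
tau = modulate(v)

# 3) Inner-loop adaptation of theta with tau held fixed. A squared-error
#    loss toward imitation targets stands in for the policy-gradient
#    objective so the sketch stays self-contained.
obs_batch = rng.normal(size=(32, OBS))
target_actions = rng.normal(size=(32, ACT))

def surrogate_loss(theta):
    preds = np.stack([policy(o, theta, tau) for o in obs_batch])
    return float(np.mean((preds - target_actions) ** 2))

loss_before = surrogate_loss(theta)
lr, eps = 0.05, 1e-5
for _ in range(2):  # two gradient adaptation steps
    for name, W in theta.items():
        g = np.zeros_like(W)  # finite-difference gradient, for brevity
        for idx in np.ndindex(W.shape):
            W[idx] += eps
            up = surrogate_loss(theta)
            W[idx] -= 2 * eps
            down = surrogate_loss(theta)
            W[idx] += eps
            g[idx] = (up - down) / (2 * eps)
        theta[name] = W - lr * g
loss_after = surrogate_loss(theta)
```

The property mirrored from the text above is the separation of concerns: τ is computed once from the initial trajectories and held fixed, while θ alone is adapted by the subsequent gradient steps.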
We hypothesize that the more distinct the modes of the task distribution are, the more difficult it is for a single policy initialization to master all of them. Therefore, adding intermediate modes (going from 4 to 6 modes) can in some cases make the task distribution easier to learn.
The tSNE visualizations of embeddings of random tasks in the POINT MASS and REACHER domains are presented in Figure 3. The clearly clustered embedding space shows that the task encoder is capable of identifying the task mode, and the good results MMAML achieves suggest that the modulation network effectively utilizes the task embeddings to tackle the multimodal task distribution.

6 Conclusion

We present a novel approach that leverages the strengths of both model-based and model-agnostic meta-learners to discover and exploit the structure of multimodal task distributions. Given a few samples from a target task, our modulation network first identifies the mode of the task distribution and then modulates the meta-learned prior in parameter space. Next, the gradient-based meta-learner efficiently adapts to the target task through gradient updates. We empirically observe that our modulation network is capable of effectively recognizing the task modes and producing embeddings that capture the structure of a multimodal task distribution. We evaluated our proposed model on multimodal few-shot regression, image classification, and reinforcement learning, and achieved superior generalization performance on tasks sampled from multimodal task distributions.

Acknowledgment

This work was initiated when Risto Vuorio was at SK T-Brain and was partially supported by SK T-Brain. The authors are grateful for the fruitful discussions with Kuan-Chieh Wang, Max Smith, and Youngwoon Lee.
The authors appreciate the anonymous NeurIPS reviewers, as well as the anonymous reviewers who rejected this paper but provided constructive feedback for improving it in previous submission cycles.

References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016. 2

[2] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, 1992. 2

[3] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019. 2

[4] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016. 1, 3

[5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017. 1, 2, 3, 5, 6, 8, 9

[6] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In International Conference on Learning Representations, 2018. 1, 3

[7] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018. 1, 3

[8] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations, 2018. 1, 3

[9] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine.
Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation, 2017. 3

[10] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018. 3

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. 4

[12] Hexiang Hu, Liyu Chen, Boqing Gong, and Fei Sha. Synthesized policies for transfer and adaptation across tasks and environments. In Advances in Neural Information Processing Systems, 2018. 3

[13] Minyoung Huh, Shao-Hua Sun, and Ning Zhang. Feedback adversarial learning: Spatial feedback for improving generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2019. 4

[14] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018. 3

[15] Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018. 1, 3

[16] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In Deep Learning Workshop at International Conference on Machine Learning, 2015. 2

[17] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society, 2011. 6

[18] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace.
In International Conference on Machine Learning, 2018. 1, 3

[19] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, 2018. 2

[20] Youngwoon Lee, Shao-Hua Sun, Sriram Somasundaram, Edward S. Hu, and Joseph J. Lim. Composing complex skills by learning transition policies. In International Conference on Learning Representations, 2019. 3

[21] Ke Li and Jitendra Malik. Learning to optimize. In International Conference on Learning Representations, 2016. 2

[22] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016. 3

[23] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008. 6

[24] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. 6

[25] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018. 1, 3, 8

[26] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, 2014. 4

[27] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International Conference on Machine Learning, 2017. 1, 3

[28] Alex Nichol and John Schulman. Reptile: A scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018. 1, 3

[29] Boris N. Oreshkin, Pau Rodriguez, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, 2018.
2, 6

[30] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition, 2019. 4

[31] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Association for the Advancement of Artificial Intelligence, 2018. 4

[32] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics, 2018. 4

[33] Aravind Rajeswaran*, Vikash Kumar*, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Robotics: Science and Systems (RSS), 2018. 3

[34] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017. 2, 6

[35] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. In International Conference on Learning Representations, 2019. 1, 3, 8, 9

[36] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019. 1, 3

[37] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, 2016. 3

[38] Jürgen Schmidhuber. Evolutionary principles in self-referential learning. (On learning how to learn: The meta-meta-... hook.) Diploma thesis, 1987.
2

[39] Jürgen Schmidhuber, Jieyu Zhao, and Nicol N. Schraudolph. Reinforcement learning with self-modifying policies. In Learning to Learn, 1998. 8

[40] Jürgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 1997. 8

[41] Pranav Shyam, Shubham Gupta, and Ambedkar Dukkipati. Attentive recurrent comparators. In International Conference on Machine Learning, 2017. 2

[42] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017. 2

[43] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2

[44] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 2012. 2

[45] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012. 8

[46] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. In Meta-Learning Workshop at Neural Information Processing Systems, 2018. 7

[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. 4

[48] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.
1, 2

[49] Risto Vuorio, Dong-Yeon Cho, Daejoong Kim, and Jiwon Kim. Meta continual learning. arXiv preprint arXiv:1806.06928, 2018. 2

[50] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset, 2011. 6

[51] Jane X. Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Demis Hassabis, and Matthew Botvinick. Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience, 2018. 3

[52] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016. 1, 8

[53] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Learning embedding adaptation for few-shot learning. arXiv preprint arXiv:1812.03664, 2018. 2