{"title": "Bayesian Model-Agnostic Meta-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7332, "page_last": 7342, "abstract": "Due to the inherent model uncertainty, learning to infer a Bayesian posterior from a few-shot dataset is an important step towards robust meta-learning. In this paper, we propose a novel Bayesian model-agnostic meta-learning method. The proposed method combines efficient gradient-based meta-learning with nonparametric variational inference in a principled probabilistic framework. Unlike previous methods, during fast adaptation the method is capable of learning complex uncertainty structure beyond a simple Gaussian approximation, and during meta-update a novel Bayesian mechanism prevents meta-level overfitting. Remaining a gradient-based method, it is also the first Bayesian model-agnostic meta-learning method applicable to various tasks including reinforcement learning. Experimental results show the accuracy and robustness of the proposed method on sinusoidal regression, image classification, active learning, and reinforcement learning.", "full_text": "Bayesian Model-Agnostic Meta-Learning

Jaesik Yoon∗3, Taesup Kim∗‡2, Ousmane Dia1, Sungwoong Kim4, Yoshua Bengio2,5, Sungjin Ahn‡6

1Element AI, 2MILA, Université de Montréal, 3SAP, 4Kakao Brain, 5CIFAR Senior Fellow, 6Rutgers University

Abstract

Due to the inherent model uncertainty, learning to infer a Bayesian posterior from a few-shot dataset is an important step towards robust meta-learning. In this paper, we propose a novel Bayesian model-agnostic meta-learning method. The proposed method combines efficient gradient-based meta-learning with nonparametric variational inference in a principled probabilistic framework. 
Unlike previous methods, during fast adaptation the method is capable of learning complex uncertainty structure beyond a simple Gaussian approximation, and during meta-update a novel Bayesian mechanism prevents meta-level overfitting. Remaining a gradient-based method, it is also the first Bayesian model-agnostic meta-learning method applicable to various tasks including reinforcement learning. Experimental results show the accuracy and robustness of the proposed method on sinusoidal regression, image classification, active learning, and reinforcement learning.

1 Introduction

Two-year-old children can infer a new category from only one instance (Smith & Slone, 2017). This is presumed to be because, during early learning, a human brain develops foundational structures such as the "shape bias" in order to learn the learning procedure (Landau et al., 1988). This ability, also known as learning to learn or meta-learning (Biggs, 1985; Bengio et al., 1990), has recently obtained much attention in machine learning, where it is formulated as few-shot learning (Lake et al., 2015; Vinyals et al., 2016). Because a neural network initialized from scratch can hardly learn anything meaningful from so few data points, a learning algorithm should be able to extract the statistical regularity of past tasks to enable a warm start for subsequent tasks.

Learning a new task from a few examples inherently induces a significant amount of uncertainty. This is apparent when we train a complex model such as a neural network using only a few examples. It is also empirically supported by the fact that a challenge in existing few-shot learning algorithms is their tendency to overfit (Mishra et al., 2017). A robust meta-learning algorithm therefore must be able to systematically deal with such uncertainty in order to be applicable to critical problems such as healthcare and self-driving cars. 
Bayesian inference provides a principled way to address this issue. It brings us not only robustness to overfitting but also numerous benefits such as improved prediction accuracy by Bayesian ensembling (Balan et al., 2015), active learning (Gal et al., 2016), and principled/safe exploration in reinforcement learning (Houthooft et al., 2016). Therefore, developing a Bayesian few-shot learning method is an important step towards robust meta-learning.

Motivated by the above arguments, in this paper we propose a Bayesian meta-learning method called Bayesian MAML. By introducing Bayesian methods for fast adaptation and meta-update, the proposed method learns to quickly obtain an approximate posterior for a given unseen task and thus provides the benefits of having access to uncertainty. Being an efficient and scalable gradient-based meta-learner which encodes the meta-level statistical regularity in the initial model parameters, our method is the first Bayesian model-agnostic meta-learning method, and is thus applicable to various tasks including reinforcement learning. Combining an efficient nonparametric variational inference method with gradient-based meta-learning in a principled probabilistic framework, it can learn complex uncertainty structures while remaining simple to implement.

∗Equal contribution. Correspondence to sungjin.ahn@rutgers.edu. ‡Work done also while working at Element AI.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The main contributions of the paper are as follows. We propose a novel Bayesian method for meta-learning. The proposed method is based on a novel Bayesian fast adaptation method and a new meta-update loss called the Chaser loss. 
To our knowledge, the Bayesian fast adaptation is the first method in meta-learning that provides the flexibility to capture complex uncertainty structure of the task-posterior beyond a simple Gaussian approximation. Furthermore, unlike previous methods, the Chaser loss prevents meta-level overfitting. In experiments, we show that our method is efficient, accurate, robust, and applicable to various problems: sinusoidal regression, image classification, reinforcement learning, and active learning.

2 Preliminaries

Consider a model y = f_θ(x) parameterized by, and differentiable with respect to, θ. A task τ is specified by a K-shot dataset D_τ that consists of a small number of training examples, e.g., K pairs (x_k, y_k) per class for classification. We assume that tasks are sampled from a task distribution τ ∼ p(T) such that the sampled tasks share the statistical regularity of the task distribution. A meta-learning algorithm leverages this regularity to improve the learning efficiency of subsequent tasks. The whole dataset of tasks is divided into training/validation/test tasksets, and the dataset of each task is further divided into task-training/task-validation/task-test datasets.

Model-Agnostic Meta-Learning (MAML), proposed by Finn et al. (2017), is a gradient-based meta-learning framework. Because it works purely by gradient-based optimization without requiring additional parameters or model modification, it is simple and generally applicable to any model as long as the gradient can be estimated.

In Algorithm 1, we briefly review MAML. At each meta-train iteration t, it performs: (i) Task-Sampling: a mini-batch T_t of tasks is sampled from the task distribution p(T). Each task τ ∈ T_t provides task-train data D^trn_τ and task-validation data D^val_τ. 
(ii) Fast Adaptation (or Inner-Update): the parameter for each task τ in T_t is updated by starting from the current generic initial model θ_0 and then performing n gradient-descent steps on the task-train loss, an operation which we denote by GD_n(θ_0; D^trn_τ, α) with α being a step size. (iii) Meta-Update (or Outer-Update): the generic initial parameter θ_0 is updated by gradient descent. The meta-loss is the summation of the task-validation losses over all tasks in T_t, i.e., Σ_{τ∈T_t} L(θ_τ; D^val_τ). At meta-test time, given an unseen test-task τ̄ ∼ p(T), starting from the optimized initial model θ*_0 we obtain a model θ_τ̄ by taking a small number of inner-update steps using the K-shot task-training data D^trn_τ̄. Then, the learned model θ_τ̄ is evaluated on the task-test dataset D^tst_τ̄.

Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) is a recently proposed nonparametric variational inference method. SVGD combines the strengths of MCMC and variational inference. Unlike traditional variational inference, SVGD does not confine the family of approximate distributions to tractable parametric distributions, while still remaining a simple algorithm. Also, it converges faster than MCMC because its update rule is deterministic and leverages the gradient of the target distribution. Specifically, to obtain M samples from a target distribution p(θ), SVGD maintains M instances of model parameters, called particles. We denote the particles by Θ = {θ^m}_{m=1}^M. 
At iteration t, each particle θ_t ∈ Θ_t is updated by the following rule:

θ_{t+1} ← θ_t + ε_t φ(θ_t),  where  φ(θ_t) = (1/M) Σ_{j=1}^{M} [ k(θ^j_t, θ_t) ∇_{θ^j_t} log p(θ^j_t) + ∇_{θ^j_t} k(θ^j_t, θ_t) ],    (1)

where ε_t is a step size and k(x, x′) is a positive-definite kernel. We can see that a particle determines its own update direction by consulting the gradients of the other particles. The importance of the other particles is weighted according to the kernel distance, relying more on closer particles. The last term ∇_{θ^j} k(θ^j, θ^m) enforces a repulsive force between particles so that they do not collapse to a point. The resulting particles can be used to obtain the posterior predictive distribution p(y|x, D_τ) = ∫ p(y|x, θ) p(θ|D_τ) dθ ≈ (1/M) Σ_m p(y|x, θ^m), where θ^m ∼ p(θ|D_τ).

Algorithm 1 MAML
  Sample a mini-batch of tasks T_t from p(T)
  for each task τ ∈ T_t do
    θ_τ ← GD_n(θ_0; D^trn_τ, α)
  end for
  θ_0 ← θ_0 − β ∇_{θ_0} Σ_{τ∈T_t} L(θ_τ; D^val_τ)

Algorithm 2 Bayesian Fast Adaptation
  Sample a mini-batch of tasks T_t from p(T)
  for each task τ ∈ T_t do
    Θ_τ(Θ_0) ← SVGD_n(Θ_0; D^trn_τ, α)
  end for
  Θ_0 ← Θ_0 − β ∇_{Θ_0} Σ_{τ∈T_t} L_BFA(Θ_τ(Θ_0); D^val_τ)

A few properties of SVGD are particularly relevant to the proposed method: (i) when the number of particles M equals 1, SVGD becomes standard gradient ascent on the objective log p(θ); (ii) under a certain condition, an SVGD update increasingly reduces the distance, in the sense of Kullback-Leibler (KL) divergence, between the approximate distribution defined by the particles and the target distribution (Liu & Wang, 2016); and finally (iii) it is straightforward to apply to reinforcement learning by using Stein Variational Policy Gradient (SVPG) (Liu et al., 2017).

3 Proposed Method

3.1 Bayesian Fast Adaptation

Our goal is to learn to infer, by developing a Bayesian gradient-based meta-learning method that efficiently obtains the task-posterior p(θ_τ|D^trn_τ) of a novel task. As our method is in the same class as MAML, in the sense that it encodes the meta-knowledge in the initial model by gradient-based optimization, we first consider the following probabilistic interpretation of MAML with one inner-update step:

p(D^val_T | θ_0, D^trn_T) = Π_{τ∈T} p(D^val_τ | θ′_τ = θ_0 + α ∇_{θ_0} log p(D^trn_τ | θ_0)),    (2)

where p(D^val_τ|θ′_τ) = Π_{i=1}^{|D^val_τ|} p(y_i|x_i, θ′_τ), D^trn_T denotes all task-train sets in the training taskset T, and D^val_T has the same meaning but for the task-validation sets. From the above, we can see that the inner-update step of MAML amounts to obtaining a task model θ′_τ from which the likelihood of the task-validation set D^val_τ is computed. The meta-update step then performs maximum likelihood estimation of this model w.r.t. the initial parameter θ_0. This probabilistic interpretation can be extended further to applying empirical Bayes to a hierarchical probabilistic model (Grant et al., 2018) as follows:

p(D^val_T | θ_0, D^trn_T) = Π_{τ∈T} ( ∫ p(D^val_τ | θ_τ) p(θ_τ | D^trn_τ, θ_0) dθ_τ ).    (3)

We see that the probabilistic MAML model in Eq. (2) is a special case of Eq. 
(3) that approximates the task-train posterior p(θ_τ|θ_0, D^trn_τ) by a point estimate θ′_τ. That is, p(θ_τ|D^trn_τ, θ_0) = δ_{θ′_τ}(θ_τ), where δ_y(x) = 1 if x = y, and 0 otherwise. To model the task-train posterior, which also becomes the prior of the task-validation set, Grant et al. (2018) used an isotropic Gaussian distribution with a fixed variance.

"Can we use a more flexible task-train posterior than a point estimate or a simple Gaussian distribution while maintaining the efficiency of gradient-based meta-learning?" This is an important question because, as discussed in Grant et al. (2018), the task-train posterior of a Bayesian neural network (BNN) trained with a few-shot dataset would have a significant amount of uncertainty which, according to the Bayesian central limit theorem (Le Cam, 1986; Ahn et al., 2012), cannot be well approximated by a Gaussian distribution.

Our first step in designing such an algorithm is to notice that SVGD performs deterministic updates, and thus gradients can be backpropagated through the particles. This means that we now maintain M initial particles Θ_0 and, by obtaining samples from the task-train posterior p(θ_τ|D^trn_τ, Θ_0) using SVGD (which is now conditioned on Θ_0 instead of θ_0), we can optimize the following Monte Carlo approximation of Eq. (3) by computing the gradient of the meta-loss log p(D^val_T|Θ_0, D^trn_T) w.r.t. Θ_0:

p(D^val_T | Θ_0, D^trn_T) ≈ Π_{τ∈T} [ (1/M) Σ_{m=1}^{M} p(D^val_τ | θ^m_τ) ],  where θ^m_τ ∼ p(θ_τ | D^trn_τ, Θ_0).    (4)

Being updated by gradient descent, it hence remains an efficient meta-learning method while providing a more flexible way to capture the complex uncertainty structure of the task-train posterior than a point estimate or a simple Gaussian approximation.

Algorithm 2 describes an implementation of the above model. Specifically, at iteration t, for each task τ in a sampled mini-batch T_t, the particles initialized to Θ_0 are updated for n steps by applying the SVGD updater, denoted by SVGD_n(Θ_0; D^trn_τ, α), where the target distribution (the p(θ^j_t) in Eq. (1)) is set to the task-train posterior p(θ_τ|D^trn_τ) ∝ p(D^trn_τ|θ_τ) p(θ_τ)¹. This results in task-wise particles Θ_τ for each task τ ∈ T_t. Then, for the meta-update, we can use the following meta-loss:

log p(D^val_{T_t} | Θ_0, D^trn_{T_t}) ≈ Σ_{τ∈T_t} L_BFA(Θ_τ(Θ_0); D^val_τ),  where  L_BFA(Θ_τ(Θ_0); D^val_τ) = log[ (1/M) Σ_{m=1}^{M} p(D^val_τ|θ^m_τ) ].    (5)

Here, we use Θ_τ(Θ_0) to explicitly denote that Θ_τ is a function of Θ_0. Note that, in the above model, all the initial particles in Θ_0 are jointly updated in such a way as to find the best joint formation among them. From these optimized initial particles, the task-posterior of a new task can be obtained quickly, i.e., by taking a small number of update steps, and efficiently, i.e., with a small number of samples. We call this Bayesian Fast Adaptation (BFA). The method can be considered a Bayesian ensemble in which, unlike in non-Bayesian ensemble methods, the particles interact with each other to find the best formation representing the task-train posterior. 
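As a concrete reference for the particle updates used above, the SVGD rule of Eq. (1) can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the 1-D Gaussian target, RBF bandwidth, step size, and iteration count are all illustrative assumptions.

```python
import numpy as np

def svgd_step(theta, score, h=0.5, eps=0.05):
    """One SVGD update (Eq. (1)): each particle moves along a kernel-weighted
    average of all particles' log-density gradients, plus a repulsive term."""
    diff = theta[:, None] - theta[None, :]      # diff[i, j] = theta_i - theta_j
    k = np.exp(-diff ** 2 / (2 * h ** 2))       # RBF kernel k(theta_j, theta_i)
    grad_k = diff / h ** 2 * k                  # d k(theta_j, theta_i) / d theta_j
    phi = (k * score(theta)[None, :] + grad_k).mean(axis=1)
    return theta + eps * phi

# Toy target "posterior" p(theta) = N(2, 0.5^2), known only through its score.
score = lambda th: -(th - 2.0) / 0.25

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=10)       # M = 10 particles
for _ in range(500):
    particles = svgd_step(particles, score)

# The particles now approximate the target: their mean is near 2, and the
# repulsive term keeps them spread out instead of collapsing to the mode.
# Predictions would then ensemble-average p(y|x, theta_m) over the M particles.
```

Note that with M = 1 the kernel term is constant, so the update collapses to plain gradient ascent on log p(θ), which is exactly the property exploited next.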
Because SVGD with a single particle, i.e., M = 1, is equal to gradient ascent, Algorithm 2 reduces to MAML when M = 1.

Although the above algorithm brings the power of Bayesian inference to fast adaptation, it can be numerically unstable due to the product of the task-validation likelihood terms. More importantly, the meta-update is not performing Bayesian inference. Instead, it looks for the initial prior Θ_0 such that SVGD updates lead to minimizing the empirical loss on the task-validation sets. Therefore, like other meta-learning methods, the BFA model can still suffer from overfitting despite the fact that we use flexible Bayesian inference in the inner update. The reason is somewhat apparent: because we perform only a small number of inner-updates while the meta-update is based on empirical risk minimization, the initial model Θ_0 can be overfitted to the task-validation sets when we use highly complex models like deep neural networks. Therefore, to become a fully robust meta-learning approach, the method should retain the uncertainty during the meta-update as well, while remaining an efficient gradient-based method.

3.2 Bayesian Meta-Learning with Chaser Loss

Motivated by the above observation, we propose a novel meta-loss. For this, we start by defining the loss as the dissimilarity between the approximate task-train posterior p^n_τ ≡ p^n(θ_τ|D^trn_τ; Θ_0) and the true task-posterior p^∞_τ ≡ p(θ_τ|D^trn_τ ∪ D^val_τ). Note that p^n_τ is obtained by taking n fast-adaptation steps from the initial model. Assuming that we can obtain samples Θ^n_τ and Θ^∞_τ respectively from these two distributions, the new meta-learning objective can be written as

arg min_{Θ_0} Σ_τ d_p(p^n_τ ∥ p^∞_τ) ≈ arg min_{Θ_0} Σ_τ d_s(Θ^n_τ(Θ_0) ∥ Θ^∞_τ).    (6)

Here, d_p(p∥q) is a dissimilarity between two distributions p and q, and d_s(s_1∥s_2) is a distance between two sample sets. We then want to minimize this distance using the gradient w.r.t. Θ_0. This is to find an optimized Θ_0 from which the task-train posterior can be obtained quickly and closely to the true task-posterior. However, this objective is intractable because we have access neither to the true posterior p^∞_τ nor to its samples Θ^∞_τ.

To this end, we approximate Θ^∞_τ by Θ^{n+s}_τ. This is done by (i) obtaining Θ^n_τ from p^n(θ_τ|D^trn_τ; Θ_0) and then (ii) taking s additional SVGD steps with the updated target distribution p(θ_τ|D^trn_τ ∪ D^val_τ), i.e., augmented with the additional observations D^val_τ. Although it is valid in theory not to augment the leader with the validation set, to help fast convergence we take advantage of it like other meta-learning methods. Note that, because SVGD updates provide increasingly better approximations of the target distribution as s increases, the leader Θ^{n+s}_τ becomes closer to the target distribution Θ^∞_τ than the chaser Θ^n_τ. This gives us the following meta-loss:

L_BMAML(Θ_0) = Σ_{τ∈T_t} d_s(Θ^n_τ ∥ Θ^{n+s}_τ) = Σ_{τ∈T_t} Σ_{m=1}^{M} ∥θ^{n,m}_τ − θ^{n+s,m}_τ∥²_2.    (7)

Here, to compute the distance between the two sample sets, we make a one-to-one mapping between the leader particles and the chaser particles and compute the Euclidean distance between the paired particles. Note that we do not backpropagate through the leader particles because we use them as targets for the chaser particles to follow. A more sophisticated method like maximum mean discrepancy (Borgwardt et al., 2006) could also be used here. In our experiments, setting n and s to a small number like n = s = 1 worked well.

Minimizing the above loss w.r.t. Θ_0 places Θ_0 in a region from which the chaser Θ^n_τ can efficiently chase the leader Θ^{n+s}_τ in n SVGD-update steps. Thus, we call this meta-loss the Chaser loss. Because the leader converges to the posterior distribution instead of performing empirical risk minimization, it retains a proper level of uncertainty and thus prevents meta-level overfitting.

Algorithm 3 Bayesian Meta-Learning with Chaser Loss (BMAML)
1: Initialize Θ_0
2: for t = 0, ... until converged do
3:   Sample a mini-batch of tasks T_t from p(T)
4:   for each task τ ∈ T_t do
5:     Compute chaser Θ^n_τ(Θ_0) = SVGD_n(Θ_0; D^trn_τ, α)
6:     Compute leader Θ^{n+s}_τ(Θ_0) = SVGD_s(Θ^n_τ(Θ_0); D^trn_τ ∪ D^val_τ, α)
7:   end for
8:   Θ_0 ← Θ_0 − β ∇_{Θ_0} Σ_{τ∈T_t} d_s(Θ^n_τ(Θ_0) ∥ stopgrad(Θ^{n+s}_τ(Θ_0)))
9: end for

¹In our experiments, we put a hyperprior on the variance of the prior (the mean is set to 0). Thus, the posterior of the hyperparameter is also learned automatically by SVGD, i.e., the particle vectors include the prior parameters.

In Algorithm 3, we describe the algorithm for supervised learning. 
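To make the chaser construction concrete, here is a deliberately tiny numpy sketch of the loop for a single task with a single particle (M = 1, where the SVGD inner update reduces to plain gradient ascent). The linear model, data, step sizes, and the finite-difference meta-gradient are illustrative assumptions; the method itself backpropagates through the SVGD updates instead.

```python
import numpy as np

def grad_steps(theta, xs, ys, n, alpha=0.2):
    """n gradient-ascent steps on the Gaussian log-likelihood of y = theta * x.
    With a single particle (M = 1), SVGD_n reduces to exactly this inner update."""
    for _ in range(n):
        theta = theta + alpha * np.sum((ys - theta * xs) * xs)
    return theta

def chaser_loss(theta0, x_trn, y_trn, x_val, y_val, n=1, s=1):
    chaser = grad_steps(theta0, x_trn, y_trn, n)      # n steps on train only
    leader = grad_steps(chaser,                        # s more steps on train + val
                        np.concatenate([x_trn, x_val]),
                        np.concatenate([y_trn, y_val]), s)
    return (chaser - leader) ** 2   # leader is a fixed target (the stopgrad in Alg. 3)

# One toy task y = 1.5 x + noise; meta-update theta0 with a numerical gradient.
rng = np.random.default_rng(0)
x_trn, x_val = np.linspace(-1.0, 1.0, 5), np.linspace(-0.8, 0.8, 5)
y_trn = 1.5 * x_trn + 0.1 * rng.normal(size=5)
y_val = 1.5 * x_val + 0.1 * rng.normal(size=5)

theta0, beta, d = 0.0, 0.5, 1e-4
for _ in range(200):
    g = (chaser_loss(theta0 + d, x_trn, y_trn, x_val, y_val)
         - chaser_loss(theta0 - d, x_trn, y_trn, x_val, y_val)) / (2 * d)
    theta0 -= beta * g

# After meta-training, one inner step from theta0 lands where the leader no
# longer moves, i.e., near the maximum-likelihood solution of train + val.
```

The point of the sketch is the structure of the objective: the chaser sees only the task-train set, the leader additionally sees the validation set, and only the gap between them is penalized, rather than the validation loss itself.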
One limitation of the method is that, like other ensemble methods, it needs to maintain M model instances. Because this could be an issue when training a large model, in the Experiments section we introduce a way to share parameters among the particles.

4 Related Works

There have been many studies in the past that formulate meta-learning and learning-to-learn from a probabilistic modeling perspective (Tenenbaum, 1999; Fe-Fei et al., 2003; Lawrence & Platt, 2004; Daumé III, 2009). Since then, the remarkable advances in deep neural networks (Krizhevsky et al., 2012; Goodfellow et al., 2016) and the introduction of new few-shot learning datasets (Lake et al., 2015; Ravi & Larochelle, 2017) have rekindled interest in this problem from the perspective of deep networks for few-shot learning (Santoro et al., 2016; Vinyals et al., 2016; Snell et al., 2017; Duan et al., 2016; Finn et al., 2017; Mishra et al., 2017). Among these, Finn et al. (2017) proposed MAML, which formulates meta-learning as gradient-based optimization.

Grant et al. (2018) reinterpreted MAML as a hierarchical Bayesian model and proposed a way to perform implicit posterior inference. However, unlike our proposed model, the posterior on the validation set is approximated by a local Laplace approximation, using a relatively complex second-order optimization based on K-FAC (Martens & Grosse, 2015). The fast adaptation is also approximated by a simple isotropic Gaussian with fixed variance. As pointed out by Grant et al. (2018), this approximation would not work well for skewed distributions, which is likely to be the case for BNNs trained on a few-shot dataset. The authors also pointed out that their method is limited in that the predictive distribution over new data points is approximated by a point estimate. Our method resolves these limitations. Although it can be expensive when training many large networks, we mitigate this cost by sharing parameters among the particles. 
In addition, Bauer et al. (2017) also proposed a Gaussian approximation of the task-posterior and a scheme of splitting the network into a feature extractor and a classifier, which is similar to what we use for the image classification task. Lacoste et al. (2017) proposed learning a distribution of stochastic input noise while fixing the BNN model parameters.

Figure 1: Sinusoidal regression experimental results (meta-testing performance), varying the number of examples (K-shot) given for each task and the number of tasks |T| used for meta-training.

5 Experiments

We evaluated our proposed model (BMAML) on various few-shot learning tasks: sinusoidal regression, image classification, active learning, and reinforcement learning. Because our method is a Bayesian ensemble, as a baseline model we used an ensemble of independent MAML models (EMAML), from which we can easily recover regular MAML by setting the number of particles to 1. In all our experiments, we configured BMAML and EMAML to have the same network architecture and used the RBF kernel. The experiments are designed to examine the effects of uncertainty along various axes, such as accuracy, robustness, and efficient exploration.

Regression: The population of tasks is defined by a sinusoidal function y = A sin(wx + b) + ε, parameterized by amplitude A, frequency w, phase b, and observation noise ε. To sample a task, we sample the parameters uniformly at random, A ∈ [0.1, 5.0], b ∈ [0.0, 2π], w ∈ [0.5, 2.0], and add observation noise ε ∼ N(0, (0.01A)²). The K-shot dataset is obtained by sampling x from [−5.0, 5.0] and then computing its corresponding y with noise ε. Note that, because of the highly varying frequency and the observation noise, this is a more challenging setting containing more uncertainty than the setting used in Finn et al. (2017). 
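The task distribution just described takes only a few lines to sample. The parameter ranges below are taken from the text; the function and variable names are illustrative:

```python
import numpy as np

def sample_sinusoid_task(K, rng):
    """Sample one K-shot regression task: y = A sin(w x + b) + eps, with
    A ~ U[0.1, 5.0], b ~ U[0, 2*pi], w ~ U[0.5, 2.0],
    and observation noise eps ~ N(0, (0.01 * A)^2)."""
    A = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, 2 * np.pi)
    w = rng.uniform(0.5, 2.0)
    x = rng.uniform(-5.0, 5.0, size=K)
    y = A * np.sin(w * x + b) + rng.normal(0.0, 0.01 * A, size=K)
    return x, y, (A, w, b)

rng = np.random.default_rng(0)
x, y, params = sample_sinusoid_task(K=5, rng=rng)   # one 5-shot task
```

Because both the amplitude and the noise scale vary per task, two very different sinusoids can produce nearly identical 5-shot datasets, which is exactly the ambiguity a posterior over task models is meant to capture.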
For the regression model, we used a neural network with 3 layers, each consisting of 40 hidden units.

In Fig. 1, we show the mean squared error (MSE) performance on the test tasks. To see the effect of the degree of uncertainty, we set the number of training tasks |T| to 100 and 1000, and the number of observation shots K to 5 and 10. A lower number of training tasks and observation shots is expected to induce a larger degree of uncertainty. We observe, as we claimed, that both MAML (which is EMAML with M = 1) and EMAML overfit severely in the settings with high uncertainty, although EMAML with multiple particles seems to be slightly better than MAML. BMAML with the same number of particles provides significantly better robustness and accuracy in all settings. Also, having more particles tends to improve the results further.

Classification: To evaluate the proposed method on a more complex model, we test the performance on the miniImagenet classification task (Vinyals et al., 2016), involving task adaptation of 5-way classification with 1 shot. The dataset consists of 60,000 color images of 84×84 dimension. The images cover 100 classes in total, and each class contains 600 examples. The classes are split into 64, 12, and 24 classes for meta-train, meta-validation, and meta-test, respectively. We generated the tasks following the same procedure as in Finn et al. (2017).

In order to reduce the space and time complexity of the ensemble models (i.e., BMAML and EMAML) in this large-network setting, we used the following parameter-sharing scheme among the particles, similarly to Bauer et al. (2017). We split the network architecture into the feature extractor layers and the classifier. The feature extractor is a convolutional network with 5 hidden layers with 64 filters. The classifier is a single-layer fully-connected network with softmax output. 
The output of the feature extractor, which has 256 dimensions, is input to the classifier. We share the feature extractor across all the particles while each particle has its own classifier. Therefore, the space complexity of the network is O(|θ_feature| + M|θ_classifier|). Both the classifier and the feature extractor are updated during the meta-update, but in the inner-update only the classifier is updated. The baseline models are updated in the same manner. We describe more details of the setting in Appendix A.2.

Figure 2: Experimental results on the miniImagenet dataset: (a) few-shot image classification using different numbers of particles, (b) using different numbers of tasks for meta-training, and (c) the active learning setting.

We can see from Fig. 2 (a) that for both M = 5 and M = 10 BMAML provides more accurate predictions than EMAML. However, the performance of both BMAML and EMAML with 10 particles is slightly lower than with 5 particles². Because a similar instability is also observed in the SVGD paper (Liu & Wang, 2016), we presume that one possible reason is the instability of SVGD, such as its sensitivity to kernel function parameters. To increase the inherent uncertainty further, in Fig. 2 (b) we reduced the number of training tasks |T| from 800K to 10K. We see that BMAML provides robust predictions even for such a small number of training tasks while EMAML overfits easily.

Active Learning: In addition to the ensembled prediction accuracy, we can also evaluate the effectiveness of the measured uncertainty by applying it to active learning. To demonstrate this, we use the miniImagenet classification task. Given an unseen task τ at test time, we first run a fast adaptation from the meta-trained initial particles Θ*_0 to obtain samples Θ_τ of the task-train posterior p(θ_τ|D_τ; Θ*_0). 
For this, we used a 5-way 1-shot labeled dataset. Then, from a pool of unlabeled data X_τ = {x_1, ..., x_20}, we choose the item x* = arg max_{x∈X_τ} H[y|x, D_τ] with the maximum predictive entropy, where H[y|x, D_τ] = −Σ_{y′} p(y′|x, D_τ) log p(y′|x, D_τ). The chosen item x* is then removed from X_τ and added to D_τ along with its label. We repeat this process until we consume all the data in X_τ. We set M to 5. As we can see from Fig. 2 (c), active learning using the Bayesian fast adaptation provides consistently better results than EMAML. In particular, the performance gap increases as more examples are added. This shows that the examples picked by BMAML to reduce the uncertainty provide proper discriminative information, capturing a reasonable approximation of the task-posterior. We presume that the performance degradation observed in the early stage might be due to the class imbalance induced by choosing examples without considering the class balance.

Reinforcement Learning: SVPG is a simple way to apply SVGD to policy optimization. Liu et al. (2017) showed that maximum entropy policy optimization can be recast as Bayesian inference. In this framework, the particle update rule (a particle is now the parameters of a policy) simply replaces the target distribution log p(θ) in Eq. (1) with the objective of maximum entropy policy optimization, i.e., E_{q(θ)}[J(θ)] + ηH[q], where q(θ) is a distribution over policies, J(θ) is the expected return of policy θ, and η is a parameter for exploration control. Deploying multiple agents (particles) with a principled Bayesian exploration mechanism, SVPG encourages diverse policy behaviours while being easy to parallelize.

We test and compare the models on the same MuJoCo continuous control tasks (Todorov et al., 2012) as are used in Finn et al. (2017). 
In the goal velocity task, the agent receives higher rewards as its current velocity approaches the goal velocity of the task. In the goal direction task, the reward is the magnitude of the velocity in either the forward or backward direction. We tested these tasks on two simulated robots, the ant and the cheetah. The goal velocity is sampled uniformly at random from [0.0, 2.0] for the cheetah and from [0.0, 3.0] for the ant. As the goal velocity and the goal direction change per task, a meta-learner is required to learn a given unseen task after trying K episodes. We implemented the policy network with two hidden layers, each with 100 ReLU units. We tested the number of particles M ∈ {1, 5, 10}, with M = 1 only for non-ensembled MAML. We describe more details of the experiment setting in Appendix C.1.

²We found a similar instability in the relationship between the number of particles and the prediction accuracy in the original implementation by the authors of the SVGD paper.

Figure 3: Locomotion comparison results of SVPG-TRPO and VPG-TRPO

Figure 4: Locomotion comparison results of SVPG-Chaser and VPG-Reptile

For the meta-update, MAML uses TRPO (Schulman et al., 2015), which is designed specifically for reinforcement learning and uses a rather expensive second-order optimization. In contrast, the meta-update by the chaser loss is general-purpose and based on first-order optimization³. Thus, for a fair comparison, we consider the following two experiment designs. First, in order to evaluate the performance of the inner updates using Bayesian fast adaptation, we compare standard MAML, which uses vanilla policy gradient (REINFORCE, Williams (1992)) for inner-updates and TRPO for meta-updates, with Bayesian fast adaptation using TRPO meta-updates. We label the former VPG-TRPO and the latter SVPG-TRPO. Second, we compare SVPG-Chaser with VPG-Reptile. 
Because Reptile (Nichol et al., 2018), similarly to the chaser loss, performs 1st-order gradient optimization based on a distance in the model parameter space, it provides a fair baseline for evaluating the chaser loss in RL. VPG-TRPO and VPG-Reptile are implemented with multiple independent agents. We tested the compared methods for M ∈ {1, 5, 10}. More details of the experimental setting are provided in Appendix C.3.

As shown in Fig. 3 and Fig. 4, we can see that, overall, BMAML shows superior performance to EMAML. In particular, BMAML performs significantly and consistently better than EMAML when using the TRPO meta-updater. In addition, BMAML performs much better than EMAML on the goal direction tasks. We presume that this is because in the goal direction task there is no goal velocity, so a higher reward can always be obtained by searching for a better policy. This therefore demonstrates that BMAML can learn a better exploration policy than EMAML. In contrast, in the goal velocity task, exploration becomes less effective because it is not desired once a policy reaches the given goal velocity. This explains the results on the goal velocity task, in which BMAML provides only slightly better performance than EMAML. For some experiments, we also see that having more particles does not necessarily provide further improvements. As in the case of classification, we hypothesize that one of the reasons could be the instability of SVGD. In Appendix C.4, we also provide results on a 2D Navigation task, where we observe a similar superiority of BMAML over EMAML.

6 Discussions

In this section, we discuss some of the issues underlying the design of the proposed method.

Is BMAML tied to SVGD? In principle, it is more generally applicable to any inference algorithm that can provide differentiable samples.
Gradient-based MCMC methods like HMC (Neal et al., 2011) or SGLD (Welling & Teh, 2011) are such methods. We chose SVGD specifically for BMAML because jointly updating all the particles together is more efficient for capturing the distribution quickly within a small number of update steps. In contrast, MCMC would require waiting many more iterations until the chain mixes sufficiently, as well as long backpropagation through the chain.

3 When considering the inner update together, TRPO, Chaser, and Reptile are 3rd/2nd/1st-order, respectively.

Parameter space vs. prediction space? We defined the chaser loss by a distance in the model parameter space, although it is also possible to define it by a distance in prediction space, i.e., by prediction error. We chose the parameter space because (1) we save the computation of the forward passes needed for predictions, and (2) it empirically showed better performance for RL and similar performance for the other tasks. The advantages of working in the parameter space are also discussed in Nichol et al. (2018).

Do a small number of SVGD steps converge to the posterior? In our small-data-big-network setting, a large area of the true task-posterior is meaningless for other tasks. Thus, it is not desirable to fully capture the task-posterior; instead, we need to find an area that will be broadly useful for many tasks. This is the goal of hierarchical Bayes, which our method approximates by finding such an area and placing Θ0 there. In theory, the task-posterior could be fully captured with an infinite number of particles and update steps, which would dilute the initialization effect.
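For concreteness, the generic SVGD update underlying this discussion (Liu & Wang, 2016) can be sketched on a toy 1-D Gaussian target. This is a minimal illustration under assumed toy settings, not the paper's implementation: there, each particle is a full set of neural-network parameters, and in the RL case ∇ log p(θ) is replaced by the gradient of the maximum-entropy policy objective.

```python
import numpy as np

def rbf_kernel(theta):
    """RBF kernel matrix K and its repulsive gradient term for 1-D SVGD particles,
    using the median heuristic for the bandwidth h."""
    diff = theta[:, None] - theta[None, :]             # diff[j, i] = theta_j - theta_i
    sq = diff ** 2
    h = np.median(sq) / np.log(theta.size + 1) + 1e-8  # median heuristic
    K = np.exp(-sq / h)
    grad_K = -(2.0 / h) * K * diff                     # d k(theta_j, theta_i) / d theta_j
    return K, grad_K

def svgd_step(theta, dlogp, eps):
    """One SVGD update:
    theta_i += eps * (1/M) sum_j [ k(theta_j, theta_i) * dlogp_j
                                   + grad_{theta_j} k(theta_j, theta_i) ]."""
    K, grad_K = rbf_kernel(theta)
    phi = (K @ dlogp + grad_K.sum(axis=0)) / theta.size
    return theta + eps * phi

# Toy target: standard normal, so grad log p(theta) = -theta.
rng = np.random.default_rng(0)
theta = rng.normal(loc=5.0, scale=0.5, size=10)  # particles start far from the target
for _ in range(1000):
    theta = svgd_step(theta, dlogp=-theta, eps=0.1)
# The attractive term drives the particles toward the target mean (0), while the
# repulsive kernel-gradient term keeps them spread out over the posterior.
```

The kernel-gradient term is what makes the particles interact: it is exactly the mechanism the text above credits for capturing the distribution within a small number of update steps.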
In practice, however, full coverage would be neither achievable nor desirable, because SVGD or MCMC would have difficulty covering all areas of a complex multimodal task-posterior such as that of a neural network.

7 Conclusion

Motivated by the hierarchical probabilistic modeling perspective on gradient-based meta-learning, we proposed a Bayesian gradient-based meta-learning method. To do this, we combined Stein Variational Gradient Descent with gradient-based meta-learning in a probabilistic framework, and proposed the Bayesian Fast Adaptation and the Chaser loss for meta-update. As the method remains model-agnostic, we evaluated it in experiments on various types of learning tasks including supervised learning, active learning, and reinforcement learning, and showed its superior performance in prediction accuracy, robustness to overfitting, and efficient exploration.

As a Bayesian ensemble method, along with its advantages, the proposed method also inherits the generic shortcomings of ensemble methods, particularly space/time complexity proportional to the number of particles. Although we showed that our parameter-sharing scheme is effective in mitigating this issue, it would still be interesting to improve the efficiency further in this direction. In addition, because the performance of SVGD can be sensitive to the parameters of the kernel function, incorporating fast adaptation of the kernel parameters into meta-learning would also be an interesting future direction.

Acknowledgments

JY thanks SAP and Kakao Brain for their support. TK thanks NSERC, MILA and Kakao Brain for their support. YB thanks CIFAR, NSERC, IBM, Google, Facebook and Microsoft for their support. SA, Element AI Fellow, thanks Nicolas Chapados, Pedro Oliveira Pinheiro, Alexandre Lacoste, and Negar Rostamzadeh for helpful discussions and feedback.

References

Sungjin Ahn, Anoop Korattikara, and Max Welling.
Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv preprint arXiv:1206.6380, 2012.

Anoop Korattikara Balan, Vivek Rathod, Kevin Murphy, and Max Welling. Bayesian dark knowledge. arXiv preprint arXiv:1506.04416, 2015.

Matthias Bauer, Mateo Rojas-Carulla, Jakub Bartłomiej Świątkowski, Bernhard Schölkopf, and Richard E Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.

Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche opérationnelle, 1990.

John B Biggs. The role of metalearning in study processes. British Journal of Educational Psychology, 55(3):185–212, 1985.

Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.

Hal Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 135–142. AUAI Press, 2009.

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

Li Fe-Fei et al. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 1134–1141. IEEE, 2003.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani.
Deep Bayesian active learning with image data. In Bayesian Deep Learning Workshop, NIPS, 2016.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Alexandre Lacoste, Thomas Boquet, Negar Rostamzadeh, Boris Oreshki, Wonchang Chung, and David Krueger. Deep prior. arXiv preprint arXiv:1712.05016, 2017.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Barbara Landau, Linda B Smith, and Susan S Jones. The importance of shape in early lexical learning. Cognitive Development, 3(3):299–321, 1988.

Neil D Lawrence and John C Platt. Learning to learn with the informative vector machine. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 65. ACM, 2004.

L. M. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer, 1986.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.

Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng.
Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.

Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850, 2016.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Linda B Smith and Lauren K Slone. A developmental approach to machine learning? Frontiers in Psychology, 8:2124, 2017.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4080–4090, 2017.

Joshua Brett Tenenbaum. A Bayesian Framework for Concept Learning. PhD thesis, Massachusetts Institute of Technology, 1999.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning.
In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Springer, 1992.