{"title": "Probabilistic Model-Agnostic Meta-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9516, "page_last": 9527, "abstract": "Meta-learning for few-shot learning entails acquiring a prior over previous tasks and experiences, such that new tasks be learned from small amounts of data. However, a critical challenge in few-shot learning is task ambiguity: even when a powerful prior can be meta-learned from a large number of prior tasks, a small dataset for a new task can simply be too ambiguous to acquire a single model (e.g., a classifier) for that task that is accurate. In this paper, we propose a probabilistic meta-learning algorithm that can sample models for a new task from a model distribution. Our approach extends model-agnostic meta-learning, which adapts to new tasks via gradient descent, to incorporate a parameter distribution that is trained via a variational lower bound. At meta-test time, our algorithm adapts via a simple procedure that injects noise into gradient descent, and at meta-training time, the model is trained such that this stochastic adaptation procedure produces samples from the approximate model posterior. Our experimental results show that our method can sample plausible classifiers and regressors in ambiguous few-shot learning problems. We also show how reasoning about ambiguity can also be used for downstream active learning problems.", "full_text": "Probabilistic Model-Agnostic Meta-Learning\n\nChelsea Finn\u2217, Kelvin Xu\u2217, Sergey Levine\n\nUC Berkeley\n\n{cbfinn,kelvinxu,svlevine}@eecs.berkeley.edu\n\nAbstract\n\nMeta-learning for few-shot learning entails acquiring a prior over previous tasks and\nexperiences, such that new tasks be learned from small amounts of data. 
However, a critical challenge in few-shot learning is task ambiguity: even when a powerful prior can be meta-learned from a large number of prior tasks, a small dataset for a new task can simply be too ambiguous to acquire a single accurate model (e.g., a classifier) for that task. In this paper, we propose a probabilistic meta-learning algorithm that can sample models for a new task from a model distribution. Our approach extends model-agnostic meta-learning, which adapts to new tasks via gradient descent, to incorporate a parameter distribution that is trained via a variational lower bound. At meta-test time, our algorithm adapts via a simple procedure that injects noise into gradient descent, and at meta-training time, the model is trained such that this stochastic adaptation procedure produces samples from the approximate model posterior. Our experimental results show that our method can sample plausible classifiers and regressors in ambiguous few-shot learning problems. We also show how reasoning about ambiguity can be used for downstream active learning problems.

1 Introduction

Learning from a few examples is a key aspect of human intelligence. One way to make it possible to acquire solutions to complex tasks from only a few examples is to leverage past experience to learn a prior over tasks. The process of learning this prior entails discovering the shared structure across different tasks from the same family, such as commonly occurring visual features or semantic cues. Structure is useful insofar as it yields efficient learning of new tasks – a mechanism known as learning-to-learn, or meta-learning [3]. 
However, when the end goal of few-shot meta-learning is to learn solutions to new tasks from small amounts of data, a critical issue that must be dealt with is task ambiguity: even with the best possible prior, there might simply not be enough information in the examples for a new task to resolve that task with high certainty. It is therefore quite desirable to develop few-shot meta-learning methods that can propose multiple potential solutions to an ambiguous few-shot learning problem. Such a method could be used to evaluate uncertainty (by measuring agreement between the samples), perform active learning, or elicit direct human supervision about which sample is preferable. For example, in safety-critical applications, such as few-shot medical image classification, uncertainty is crucial for determining if the learned classifier should be trusted. When learning from such small amounts of data, uncertainty estimation can also help predict if additional data would be beneficial for learning and improving the estimate of the rewards. Finally, while we do not experiment with this in this paper, we expect that modeling this ambiguity will be helpful for reinforcement learning problems, where it can be used to aid in exploration.

While recognizing and accounting for ambiguity is an important aspect of the few-shot learning problem, it is challenging to model when scaling to high-dimensional data, large function approximators, and multimodal task structure. Representing distributions over functions is relatively straightforward when using simple function approximators, such as linear functions, and has been done extensively in early few-shot learning approaches using Bayesian models [39, 7]. 

*First two authors contributed equally.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

But this problem becomes substantially more challenging when reasoning over high-dimensional function approximators such as deep neural networks, since explicitly representing expressive distributions over thousands or millions of parameters is often intractable. As a result, recent more scalable approaches to few-shot learning have focused on acquiring deterministic learning algorithms that disregard ambiguity over the underlying function. Can we develop an approach that has the benefits of both classes of few-shot learning methods – scalability and uncertainty awareness? To do so, we build upon tools in amortized variational inference to develop a probabilistic meta-learning approach.

In particular, our method builds on model-agnostic meta-learning (MAML) [9], a few-shot meta-learning algorithm that uses gradient descent to adapt the model at meta-test time to a new few-shot task, and trains the model parameters at meta-training time to enable rapid adaptation, essentially optimizing for a neural network initialization that is well-suited for few-shot learning. MAML can be shown to retain the generality of black-box meta-learners such as RNNs [8], while being applicable to standard neural network architectures. Our approach extends MAML to model a distribution over prior model parameters, which leads to an appealingly simple stochastic adaptation procedure that injects noise into gradient descent at meta-test time. The meta-training procedure then optimizes for this simple inference process to produce samples from an approximate model posterior.

The primary contribution of this paper is a reframing of MAML as a graphical model inference problem, where variational inference can provide us with a principled and natural mechanism for modeling uncertainty. 
Our approach enables sampling multiple potential solutions to a few-shot learning problem at meta-test time, and our experiments show that this ability can be used to sample multiple possible regressors for an ambiguous regression problem, as well as multiple possible classifiers for ambiguous few-shot attribute classification tasks. We further show how this capability to represent uncertainty can be used to inform data acquisition in a few-shot active learning problem.

2 Related Work

Hierarchical Bayesian models are a long-standing approach for few-shot learning that naturally allow for the ability to reason about uncertainty over functions [39, 7, 25, 43, 12, 4, 41]. While these approaches have been demonstrated on simple few-shot image classification datasets [24], they have yet to scale to more complex problems, such as the experiments in this paper. A number of works have approached the problem of few-shot learning from a meta-learning perspective [35, 19], including black-box [33, 5, 42] and optimization-based approaches [31, 9]. While these approaches scale to large-scale image datasets [40] and visual reinforcement learning problems [28], they typically lack the ability to reason about uncertainty.

Our work is most related to methods that combine deep networks and probabilistic methods for few-shot learning [6, 15, 23]. One approach that considers hierarchical Bayesian models for few-shot learning is the neural statistician [6], which uses an explicit task variable to model task distributions. Our method is fully model-agnostic, and directly samples model weights for each task for any network architecture. Our experiments show that our approach improves on MAML [9], which outperforms the model by Edwards and Storkey [6]. 
Other work that considers model uncertainty in the few-shot learning setting is the LLAMA method [15], which also builds on the MAML algorithm. LLAMA makes use of a local Laplace approximation for modeling the task parameters (post-update parameters), which introduces the need to approximate a high-dimensional covariance matrix. We instead propose a method that approximately infers the pre-update parameters, which we make tractable through a choice of approximate posterior parameterized by gradient operations.

Bayesian neural networks [27, 18, 29, 1] have been studied extensively as a way to incorporate uncertainty into deep networks. Although exact inference in Bayesian neural networks is impractical, approximations based on backpropagation and sampling [16, 32, 20, 2] have been effective in incorporating uncertainty into the weights of generic networks. Our approach differs from these methods in that we explicitly train a hierarchical Bayesian model over weights, where a posterior task-specific parameter distribution is inferred at meta-test time conditioned on a learned weight prior and a (few-shot) training set, while conventional Bayesian neural networks directly learn only the posterior weight distribution for a single task. Our method draws on amortized variational inference methods [22, 21, 36] to make this possible, but the key modification is that the model and inference networks share the same parameters. The resulting method corresponds structurally to a Bayesian version of model-agnostic meta-learning [9].

Figure 1: Graphical models corresponding to our approach. The original graphical model (left) is transformed into the center model after performing inference over φ_i. 
We find it beneficial to introduce additional dependencies of the prior on the training data to compensate for using the MAP estimate to approximate p(φ_i), as shown on the right.

3 Preliminaries

In the meta-learning problem setting that we consider, the goal is to learn models that can learn new tasks from small amounts of data. To do so, meta-learning algorithms require a set of meta-training and meta-testing tasks drawn from some distribution p(T). The key assumption of learning-to-learn is that the tasks in this distribution share common structure that can be exploited for faster learning of new tasks. Thus, the goal of the meta-learning process is to discover that structure. In this section, we will introduce notation and overview the model-agnostic meta-learning (MAML) algorithm [9].

Meta-learning algorithms proceed by sampling data from a given task, and splitting the sampled data into a set of a few datapoints, D^tr, used for training the model, and a set of datapoints, D^test, for measuring whether or not training was effective. This second dataset is used to measure few-shot generalization and to drive meta-training of the learning procedure. The MAML algorithm trains for few-shot generalization by optimizing for a set of initial parameters θ such that one or a few steps of gradient descent on D^tr achieves good performance on D^test. Specifically, MAML performs the following optimization:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\text{tr}}_{\mathcal{T}_i}), \mathcal{D}^{\text{test}}_{\mathcal{T}_i}) = \min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}(\phi_i, \mathcal{D}^{\text{test}}_{\mathcal{T}_i})$$

where φ_i is used to denote the parameters updated by gradient descent and where the loss corresponds to the negative log likelihood of the data. 
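The MAML objective above can be made concrete with a minimal numpy sketch for a linear regression model with a single inner gradient step; the function names and the quadratic loss are illustrative choices, not the paper's implementation:

```python
import numpy as np

def loss_and_grad(theta, X, y):
    """Mean squared-error loss of a linear model X @ theta and its gradient."""
    err = X @ theta - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def maml_outer_loss(theta, tasks, alpha=0.01):
    """Sum over tasks of L(theta - alpha * grad L(theta, D_tr), D_test)."""
    total = 0.0
    for X_tr, y_tr, X_te, y_te in tasks:
        _, g = loss_and_grad(theta, X_tr, y_tr)
        phi = theta - alpha * g                  # inner adaptation step: phi_i
        test_loss, _ = loss_and_grad(phi, X_te, y_te)
        total += test_loss
    return total
```

Meta-training then minimizes `maml_outer_loss` with respect to the initialization `theta`, differentiating through the inner step.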
In particular, in the case of supervised classification with inputs {x_j}, their corresponding labels {y_j}, and a classifier f_θ, we will denote the negative log likelihood of the data under the classifier as

$$\mathcal{L}(\theta, \mathcal{D}) = -\sum_{(x_j, y_j) \in \mathcal{D}} \log p(y_j | x_j, \theta).$$

This corresponds to the cross-entropy loss function.

4 Method

Our goal is to build a meta-learning method that can handle the uncertainty and ambiguity that occurs when learning from small amounts of data, while scaling to highly expressive function approximators such as neural networks. To do so, we set up a graphical model for the few-shot learning problem. In particular, we want a hierarchical Bayesian model that includes random variables for the prior distribution over function parameters, θ, the distribution over parameters for a particular task, φ_i, and the task training and test datapoints. This graphical model is illustrated in Figure 1 (left), where tasks are indexed over i and datapoints are indexed over j. We will use the shorthand x^tr_i, y^tr_i, x^test_i, y^test_i to denote the sets of datapoints {x^tr_{i,j} | ∀j}, {y^tr_{i,j} | ∀j}, {x^test_{i,j} | ∀j}, {y^test_{i,j} | ∀j}, and D^tr_i, D^test_i to denote {x^tr_i, y^tr_i} and {x^test_i, y^test_i}.

4.1 Gradient-Based Meta-Learning with Variational Inference

In the graphical model in Figure 1, the predictions for each task are determined by the task-specific model parameters φ_i. At meta-test time, these parameters are influenced by the prior p(φ_i|θ), as well as by the observed training data x^tr, y^tr. The test inputs x^test are also observed, but the test outputs y^test, which need to be predicted, are not observed. Note that φ_i is thus independent of x^test, but not of x^tr, y^tr. Therefore, posterior inference over φ_i must take into account both the evidence (training set) x^tr, y^tr and the prior imposed by p(θ) and p(φ_i|θ). Conventional MAML can be interpreted as approximating maximum a posteriori inference under a simplified model where p(θ) is a delta function, and inference is performed by running gradient descent on log p(y^tr|x^tr, φ_i) for a fixed number of iterations starting from φ^0_i = E[θ] [15]. The corresponding distribution p(φ_i|θ) is approximately Gaussian, with a mean that depends on the step size and number of gradient steps. When p(θ) is not deterministic, we must make a further approximation to account for the random variable θ.

One way we can do this is by using structured variational inference. In structured variational inference, we approximate the distribution over the hidden variables θ and φ_i for each task with some approximate distribution q_i(θ, φ_i). There are two reasonable choices we can make for q_i(θ, φ_i). First, we can approximate it as a product of independent marginals, according to q_i(θ, φ_i) = q_i(θ)q_i(φ_i). However, this approximation does not permit uncertainty to propagate effectively from θ to φ_i. A more expressive approximation is the structured variational approximation q_i(θ, φ_i) = q_i(θ)q_i(φ_i|θ). We can further avoid storing a separate variational distribution q_i(φ_i|θ) and q_i(θ) for each task T_i by employing an amortized variational inference technique [22, 21, 36], where we instead set q_i(φ_i|θ) = q_ψ(φ_i|θ, x^tr_i, y^tr_i, x^test_i, y^test_i), where q_ψ is defined by some function approximator with parameters ψ that takes x^tr_i, y^tr_i as input, and the same q_ψ is used for all tasks. Similarly, we can define q_i(θ) as q_ψ(θ|x^tr_i, y^tr_i, x^test_i, y^test_i). We can now write down the variational lower bound on the log-likelihood as

$$\log p(y^{\text{test}}_i | x^{\text{test}}_i, x^{\text{tr}}_i, y^{\text{tr}}_i) \geq \mathbb{E}_{\theta, \phi_i \sim q_\psi}\big[\log p(y^{\text{tr}}_i | x^{\text{tr}}_i, \phi_i) + \log p(y^{\text{test}}_i | x^{\text{test}}_i, \phi_i) + \log p(\phi_i | \theta) + \log p(\theta)\big] + \mathcal{H}(q_\psi(\phi_i | \theta, x^{\text{tr}}_i, y^{\text{tr}}_i, x^{\text{test}}_i, y^{\text{test}}_i)) + \mathcal{H}(q_\psi(\theta | x^{\text{tr}}_i, y^{\text{tr}}_i, x^{\text{test}}_i, y^{\text{test}}_i)).$$

The likelihood terms on the first line can be evaluated efficiently: given a sample θ, φ_i ∼ q(θ, φ_i|x^tr_i, y^tr_i, x^test_i, y^test_i), the training and test likelihoods simply correspond to the loss of the network with parameters φ_i. The prior p(θ) can be chosen to be Gaussian, with a learned mean and (diagonal) covariance to provide flexibility in choosing the prior parameters. This corresponds to a Bayesian version of the MAML algorithm. We will define these parameters as μ_θ and σ²_θ. Lastly, p(φ_i|θ) must be chosen. This choice is more delicate. One way to ensure a tractable likelihood is to use a Gaussian with mean θ. This choice is reasonable, because it encourages φ_i to stay close to the prior parameters θ, but we will see in the next section how a more expressive implicit conditional can be obtained using gradient descent, resulting in a procedure that more closely resembles the original MAML algorithm while still modeling the uncertainty. Lastly, we must choose a form for the inference networks q_ψ(φ_i|θ, x^tr_i, y^tr_i, x^test_i, y^test_i) and q_ψ(θ|x^tr_i, y^tr_i, x^test_i, y^test_i). They must be chosen so that their entropies on the second line of the above equation are tractable. 
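For a diagonal Gaussian variational distribution, these entropy terms have a simple closed form; a minimal sketch (the function name is ours):

```python
import numpy as np

def diag_gaussian_entropy(var):
    """H(N(mu, diag(var))) = 0.5 * sum(log(2*pi*e*var)); independent of the mean mu."""
    var = np.asarray(var, dtype=float)
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var))
```

The entropy depends only on the variances, so maximizing it pushes the variational covariance away from collapsing to a point estimate.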
Furthermore, note that both of these distributions model very high-dimensional random variables: a deep neural network can have hundreds of thousands or millions of parameters. So while we can use an arbitrary function approximator, we would like to find a scalable solution.

One convenient solution is to allow q_ψ to reuse the learned mean of the prior μ_θ. We observe that adapting the parameters with gradient descent is a good way to update them to a given training set x^tr_i, y^tr_i and test set x^test_i, y^test_i, a design decision similar to one made by Fortunato et al. [11]. We propose an inference network of the form

$$q_\psi(\theta | x^{\text{tr}}_i, y^{\text{tr}}_i, x^{\text{test}}_i, y^{\text{test}}_i) = \mathcal{N}\big(\mu_\theta + \gamma_q \nabla_{\mu_\theta} \log p(y^{\text{tr}}_i | x^{\text{tr}}_i, \mu_\theta) + \gamma_q \nabla_{\mu_\theta} \log p(y^{\text{test}}_i | x^{\text{test}}_i, \mu_\theta);\ v_q\big),$$

where v_q is a learned (diagonal) covariance, and the mean has an additional parameter beyond μ_θ, which is a "learning rate" vector γ_q that is pointwise multiplied with the gradient. While this choice may at first seem arbitrary, there is a simple intuition: the inference network should produce a sample of θ that is close to the posterior p(θ|x^tr_i, y^tr_i, x^test_i, y^test_i). A reasonable way to arrive at a value of θ close to this posterior is to adapt it to both the training set and test set.² Note that this is only done during meta-training. 
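A sample from this inference network is just a pointwise-scaled gradient step on μ_θ plus reparameterized Gaussian noise; the sketch below assumes hypothetical callables returning the log-likelihood gradients at μ_θ:

```python
import numpy as np

def sample_theta(mu_theta, gamma_q, v_q, grad_loglik_tr, grad_loglik_test, rng):
    """Draw theta ~ q_psi: gradient step on mu_theta, then add noise with variance v_q.

    grad_loglik_tr / grad_loglik_test are placeholder callables standing in for
    grad_mu log p(y | x, mu) on the training and test sets.
    """
    mean = mu_theta + gamma_q * (grad_loglik_tr(mu_theta) + grad_loglik_test(mu_theta))
    eps = rng.normal(size=np.shape(mu_theta))   # reparameterization trick
    return mean + np.sqrt(v_q) * eps
```

Because the noise is reparameterized, the sample is differentiable with respect to μ_θ, γ_q, and v_q.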
It remains to choose q_ψ(φ_i|θ, x^tr_i, y^tr_i, x^test_i, y^test_i), which can also be formulated as a conditional Gaussian with mean given by applying gradient descent.

Although this variational distribution is substantially more compact in terms of parameters than a separate neural network, it only provides estimates of the posterior during meta-training. At meta-test time, we must obtain the posterior p(φ_i|x^tr_i, y^tr_i, x^test_i) without access to y^test_i. We can train a separate set of inference networks to perform this operation, potentially also using gradient descent within the inference network. However, these networks do not receive any gradient information during meta-training, and may not work well in practice. In the next section we propose an even simpler and more practical approach that uses only a single inference network during meta-training, and none during meta-testing.

²In practice, we can use multiple gradient steps for the mean, but we omit this for notational simplicity.

Algorithm 1 Meta-training, differences from MAML in red
Require: p(T): distribution over tasks
1: initialize Θ := {μ_θ, σ²_θ, v_q, γ_p, γ_q}
2: while not done do
3:   Sample batch of tasks T_i ∼ p(T)
4:   for all T_i do
5:     D^tr, D^test = T_i
6:     Evaluate ∇_{μ_θ} L(μ_θ, D^test)
7:     Sample θ ∼ q = N(μ_θ − γ_q ∇_{μ_θ} L(μ_θ, D^test), v_q)
8:     Evaluate ∇_θ L(θ, D^tr)
9:     Compute adapted parameters with gradient descent: φ_i = θ − α∇_θ L(θ, D^tr)
10:    Let p(θ|D^tr) = N(μ_θ − γ_p ∇_{μ_θ} L(μ_θ, D^tr), σ²_θ)
11:  Compute ∇_Θ ( Σ_{T_i} L(φ_i, D^test) + D_KL(q(θ|D^test) || p(θ|D^tr)) )
12:  Update Θ using Adam

Algorithm 2 Meta-testing
Require: training data D^tr_T for new task T
Require: learned Θ
1: Sample θ from the prior p(θ|D^tr)
2: Evaluate ∇_θ L(θ, D^tr)
3: Compute adapted parameters with gradient descent: φ = θ − α∇_θ L(θ, D^tr)

4.2 Probabilistic Model-Agnostic Meta-Learning Approach with Hybrid Inference

To formulate a simpler variational meta-learning procedure, we recall the probabilistic interpretation of MAML: as discussed by Grant et al. [15], MAML can be interpreted as approximate inference for the posterior p(y^test_i | x^tr_i, y^tr_i, x^test_i) according to

$$p(y^{\text{test}}_i | x^{\text{tr}}_i, y^{\text{tr}}_i, x^{\text{test}}_i) = \int p(y^{\text{test}}_i | x^{\text{test}}_i, \phi_i)\, p(\phi_i | x^{\text{tr}}_i, y^{\text{tr}}_i, \theta)\, d\phi_i \approx p(y^{\text{test}}_i | x^{\text{test}}_i, \phi^\star_i), \quad (1)$$

where we use the maximum a posteriori (MAP) value φ⋆_i. It can be shown that, for likelihoods that are Gaussian in φ_i, gradient descent for a fixed number of iterations using x^tr_i, y^tr_i corresponds exactly to maximum a posteriori inference under a Gaussian prior p(φ_i|θ) [34]. In the case of non-Gaussian likelihoods, the equivalence is only locally approximate, and the exact form of the prior p(φ_i|θ) is intractable. However, in practice this implicit prior can actually be preferable to an explicit (and simple) Gaussian prior, since it incorporates the rich nonlinear structure of the neural network parameter manifold, and produces good performance in practice [9, 15]. We can interpret this MAP approximation as inferring an approximate posterior on φ_i of the form p(φ_i|x^tr_i, y^tr_i, θ) ≈ δ(φ_i = φ⋆_i), where φ⋆_i is obtained via gradient descent on the training set x^tr_i, y^tr_i starting from θ. 
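The meta-test procedure of Algorithm 2, combined with the data-dependent prior of Section 4.3, can be sketched in a few lines; note that the sketch performs ascent on the training log-likelihood, which is equivalent to descent on the loss L in the pseudocode, and all names are illustrative:

```python
import numpy as np

def platipus_meta_test(mu_theta, sigma_sq, gamma_p, alpha, grad_loglik_tr, n_steps, rng):
    """Sample theta ~ N(mu_theta + gamma_p * grad, diag(sigma_sq)), then take
    MAP gradient steps on the training log-likelihood to obtain phi."""
    mean = mu_theta + gamma_p * grad_loglik_tr(mu_theta)
    theta = mean + np.sqrt(sigma_sq) * rng.normal(size=np.shape(mu_theta))
    phi = theta
    for _ in range(n_steps):
        phi = phi + alpha * grad_loglik_tr(phi)   # ascent on log p = descent on L
    return phi
```

Calling this repeatedly with fresh noise yields the multiple sampled models (classifiers or regressors) used in the experiments.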
Incorporating this approximate inference procedure transforms the graphical model in Figure 1 (a) into the one in Figure 1 (b), where there is now a factor over p(φ_i|x^tr_i, y^tr_i, θ). While this is a crude approximation to the likelihood, it provides us with an empirically effective and simple tool that greatly simplifies the variational inference procedure described in the previous section, in the case where we aim to model a distribution over the global parameters p(θ). After using gradient descent to estimate p(φ_i | x^tr_i, y^tr_i, θ), the graphical model is transformed into the model shown in the center of Figure 1. Note that, in this new graphical model, the global parameters θ are independent of x^tr, y^tr and are independent of x^test when y^test is not observed. Thus, we can now write down a variational lower bound for the logarithm of the approximate likelihood, which is given by

$$\log p(y^{\text{test}}_i | x^{\text{test}}_i, x^{\text{tr}}_i, y^{\text{tr}}_i) \geq \mathbb{E}_{\theta \sim q_\psi}\big[\log p(y^{\text{test}}_i | x^{\text{test}}_i, \phi^\star_i) + \log p(\theta)\big] + \mathcal{H}(q_\psi(\theta | x^{\text{test}}_i, y^{\text{test}}_i)).$$

In this bound, we essentially perform approximate inference via MAP on φ_i to obtain p(φ_i|x^tr_i, y^tr_i, θ), and use the variational distribution for θ only. Note that q_ψ(θ|x^test_i, y^test_i) is not conditioned on the training set x^tr_i, y^tr_i since θ is independent of it in the transformed graphical model. Analogously to the previous section, the inference network is given by

$$q_\psi(\theta | x^{\text{test}}_i, y^{\text{test}}_i) = \mathcal{N}\big(\mu_\theta + \gamma_q \nabla \log p(y^{\text{test}}_i | x^{\text{test}}_i, \mu_\theta);\ v_q\big).$$

To evaluate the variational lower bound during training, we can use the following procedure: first, we evaluate the mean by starting from μ_θ and taking one (or more) gradient steps on log p(y^test_i | x^test_i, θ_current), where θ_current starts at μ_θ. We then add noise with variance v_q, which is made differentiable via the reparameterization trick [22]. We then take additional gradient steps on the training likelihood log p(y^tr_i | x^tr_i, θ_current). This accounts for the MAP inference procedure on φ_i. Training of μ_θ, σ²_θ, and v_q is performed by backpropagating gradients through this entire procedure with respect to the variational lower bound, which includes a term for the likelihood log p(y^test_i | x^test_i, φ⋆_i) and the KL-divergence between the sample θ ∼ q_ψ and the prior p(θ). This meta-training procedure is detailed in Algorithm 1.

At meta-test time, the inference procedure is much simpler. The test labels are not available, so we simply sample θ ∼ p(θ) and perform MAP inference on φ_i using the training set, which corresponds to gradient steps on log p(y^tr_i | x^tr_i, θ_current), where θ_current starts at the sampled θ. This meta-testing procedure is detailed in Algorithm 2.

4.3 Adding Additional Dependencies

In the transformed graphical model, the training data x^tr_i, y^tr_i and the prior θ are conditionally independent. However, since we have only a crude approximation to p(φ_i | x^tr_i, y^tr_i, θ), this independence often doesn't actually hold. 
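The KL-divergence term used in the meta-training objective of Algorithm 1, D_KL(q(θ|D^test) || p(θ|D^tr)), is closed-form when both distributions are diagonal Gaussians; a small sketch:

```python
import numpy as np

def diag_gaussian_kl(mu_q, var_q, mu_p, var_p):
    """D_KL(N(mu_q, diag var_q) || N(mu_p, diag var_p)), summed over dimensions."""
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
```

Because both means are gradient-step functions of μ_θ, this term is differentiable with respect to all of the meta-parameters Θ.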
We can allow the model to compensate for this approximation by additionally conditioning the learned prior p(θ) on the training data. In this case, the learned "prior" has the form p(θ_i|x^tr_i, y^tr_i), where θ_i is now task-specific, but with global parameters μ_θ and σ²_θ. We thus obtain the modified graphical model in Figure 1 (c). Similarly to the inference network q_ψ, we parameterize the learned prior as follows:

$$p(\theta_i | x^{\text{tr}}_i, y^{\text{tr}}_i) = \mathcal{N}\big(\mu_\theta + \gamma_p \nabla \log p(y^{\text{tr}}_i | x^{\text{tr}}_i, \mu_\theta);\ \sigma^2_\theta\big).$$

With this new form for the distribution over θ, the variational training objective uses the likelihood term log p(θ_i|x^tr_i, y^tr_i) in place of log p(θ), but otherwise is left unchanged. At test time, we sample θ ∼ p(θ|x^tr_i, y^tr_i) by first taking gradient steps on log p(y^tr_i|x^tr_i, θ_current), where θ_current is initialized at μ_θ, and then adding noise with variance σ²_θ. Then, we proceed as before, performing MAP inference on φ_i by taking additional gradient steps on log p(y^tr_i|x^tr_i, θ_current) initialized at the sample θ. In our experiments, we find that this more expressive distribution often leads to better performance.

5 Experiments

The goal of our experimental evaluation is to answer the following questions: (1) can our approach enable sampling from the distribution over potential functions underlying the training data? (2) does our approach improve upon the MAML algorithm when there is ambiguity over the class of functions? and (3) can our approach scale to deep convolutional networks? We study two illustrative toy examples and a realistic ambiguous few-shot image classification problem. 
For both experimental domains, we compare MAML to our probabilistic approach. We will refer to our version of MAML as a PLATIPUS (Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning), due to its unusual combination of two approximate inference methods: amortized inference and MAP. Both PLATIPUS and MAML use the same neural network architecture and the same number of inner gradient steps. We additionally provide a comparison on the MiniImagenet benchmark and specify the hyperparameters in the supplementary appendix.

Illustrative 5-shot regression. In this 1D regression problem, different tasks correspond to different underlying functions. Half of the functions are sinusoids, and half are lines, such that the task distribution is clearly multimodal. The sinusoids have amplitude and phase uniformly sampled from the ranges [0.1, 5] and [0, π], and the lines have slope and intercept sampled in the range [−3, 3]. The input domain is uniform on [−5, 5], and Gaussian noise with a standard deviation of 0.3 is added to the labels. We trained both MAML and PLATIPUS for 5-shot regression. In Figure 2, we show the qualitative performance of both methods, where the ground-truth underlying function is shown in gray and the datapoints in D^tr are shown as purple triangles. We show the function f_{φ_i} learned by MAML in black. For PLATIPUS, we sample 10 sets of parameters from p(φ_i|θ) and plot the resulting functions in different colors. In the top row, we can see that PLATIPUS allows the model to effectively reason over the set of functions underlying the provided datapoints, with increased variance in parts of the function where there is more uncertainty. 
Further, we see that PLATIPUS is able to capture the multimodal structure, as the curves are all linear or sinusoidal.

A particularly useful application of uncertainty estimates in few-shot learning is estimating when more data would be helpful. In particular, seeing a large variance in a particular part of the input space suggests that more data would be helpful for learning the function in that part of the input space. On the bottom of Figure 2, we show the results for a single task at meta-test time with increasing numbers of training datapoints. Even though the model was only trained on training set sizes of 5 datapoints, we observe that PLATIPUS is able to effectively reduce its uncertainty as more and more datapoints are available. This suggests that the uncertainty provided by PLATIPUS can be used for approximately gauging when more data would be helpful for learning a new task.

Figure 2: Samples from PLATIPUS trained for 5-shot regression, shown as colored dotted lines. The tasks consist of regressing to sinusoid and linear functions, shown in gray. MAML, shown in black, is a deterministic procedure and hence learns a single function, rather than reasoning about the distribution over potential functions. As seen on the bottom row, even though PLATIPUS is trained for 5-shot regression, it can effectively reason over its uncertainty when provided variable numbers of datapoints at test time (left vs. right).

Figure 3: Qualitative examples from the active learning experiment, where the 5 provided datapoints are from a small region of the input space (shown as purple triangles), and the model actively asks for labels for new datapoints (shown as blue circles) by choosing datapoints with the largest variance across samples. The model is able to effectively choose points that lead to accurate predictions with only a few extra datapoints.

Active learning with regression. 
To further evaluate the benefit of modeling ambiguity, we now consider an active learning experiment. In particular, the model can choose the datapoints that it wants labels for, with the goal of reaching good performance with a minimal number of additional datapoints. We performed this evaluation in the simple regression setting described previously. Models were given five initial datapoints within a constrained region of the input space. Then, each model selects up to 5 additional datapoints to be labeled. PLATIPUS chose each datapoint sequentially, choosing the point with maximal variance across the sampled regressors; MAML selected datapoints randomly, as it has no mechanism to model ambiguity. As seen in Figure 4, PLATIPUS is able to reduce its regression error to a much greater extent than MAML when given one to three additional queries. We show qualitative results in Figure 3.

Figure 4: Active learning performance on regression after up to 5 selected datapoints. PLATIPUS can use its uncertainty estimates to quickly decrease the error, while selecting datapoints randomly with MAML leads to slower learning.

Figure 5: Samples from PLATIPUS for 1-shot classification, shown as colored dotted lines. The 2D classification tasks all involve circular decision boundaries of varying size and center, shown in gray. MAML, shown in black, is a deterministic procedure and hence learns a single function, rather than reasoning about the distribution over potential functions.

Illustrative 1-shot 2D classification. Next, we study a simple binary classification task, where there is a particularly large amount of ambiguity surrounding the underlying function: learning to learn from a single positive example.
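The maximal-variance selection rule used in the active learning comparison above amounts to a few lines of code. A minimal sketch, with illustrative names and toy stand-ins for the sampled regressors:

```python
import numpy as np

def select_query(sampled_models, candidates):
    """Return the candidate input with maximal predictive variance across
    the sampled regressors -- the point where the samples disagree most."""
    preds = np.stack([m(candidates) for m in sampled_models])  # (models, points)
    return candidates[np.argmax(preds.var(axis=0))]

# Toy check: three "samples" that agree at x = 0 and diverge as |x| grows,
# so the most informative query lies at the edge of the input domain.
models = [lambda x, s=s: s * x for s in (-1.0, 0.0, 1.0)]
candidates = np.linspace(-5.0, 5.0, 101)
query = select_query(models, candidates)
```

In the actual experiment, `sampled_models` would be the regressors sampled from PLATIPUS conditioned on the current training set, and the chosen point is labeled and added before re-sampling.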
In this setting, the tasks consist of classifying datapoints in 2D within the range [0, 5] with a circular decision boundary, where points inside the decision boundary are positive and points outside are negative. Different tasks correspond to different locations and radii of the decision boundary, sampled uniformly at random from the ranges [1.0, 4.0] and [0.1, 2.0], respectively. Following Grant et al. [14], we train both MAML and PLATIPUS with Dtr consisting of a single positive example and Dtest consisting of both positive and negative examples. We plot the results using the same scheme as before, except that we plot the decision boundary (rather than the regression function) and visualize the single positive datapoint with a green plus. As seen in Figure 5, PLATIPUS captures a broad distribution over possible decision boundaries, all of which are roughly circular, while MAML provides only a single decision boundary of average size.

Ambiguous image classification. The ambiguity illustrated in the previous settings is common in real-world tasks where images can share multiple attributes. We study an ambiguous extension of the CelebA attribute classification task. Our meta-training dataset is formed by sampling two attributes at random to form a positive class and taking the same number of random examples without either attribute to form the negative class. To evaluate the ability to capture multiple decision boundaries while simultaneously obtaining good performance, we evaluate our method as follows: we sample a test set of three attributes and a corresponding set of images with those attributes. Since the tasks involve classifying images that have two attributes, this task is ambiguous, and there are three possible combinations of two attributes that explain the training set.
We sample models from our prior as described in Section 4 and assign each sampled model to one of the three possible tasks based on its log-likelihood. If each of the three possible tasks is assigned a nonzero number of samples, the model effectively covers all three possible modes that explain the ambiguous training set. We measure coverage and accuracy from this protocol: the coverage score indicates the average number of tasks (between 1 and 3) that receive at least one sample for each ambiguous training set, and the accuracy score is the average number of correct classifications on these tasks (according to the sampled models assigned to them). A highly random method will achieve good coverage but poor accuracy, while a deterministic method will have a coverage of 1. We additionally compute the log-likelihood across the ambiguous tasks, which compares each method's ability to model all of the "modes". As is standard in amortized variational inference (e.g., with VAEs), we put a multiplier β in front of the KL-divergence against the prior [17] in Algorithm 1. We find that larger values result in more diverse samples, at a modest cost in performance, and therefore report two different values of β to illustrate this tradeoff.

Our results are summarized in Table 1 and Figure 6. Our method attains better log-likelihood and comparable accuracy relative to standard MAML. More importantly, deterministic MAML only ever captures one mode for each ambiguous task, where the maximum is three; our method captures closer to two modes on average. The qualitative analysis in Figure 6 illustrates³ an example ambiguous training set, example images for the three possible two-attribute pairs that can correspond to this training set, and the classifications made by different sampled classifiers trained on the ambiguous training set.
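The coverage score in this protocol reduces to a simple assignment-and-count. A minimal sketch with hypothetical log-likelihood scores (in the experiment, these come from evaluating each sampled classifier on the candidate tasks):

```python
import numpy as np

def coverage(sample_loglik):
    """sample_loglik[i, t] is the log-likelihood of sampled model i on
    candidate task t. Each sample is assigned to its highest-likelihood task;
    coverage is the number of candidate tasks receiving at least one sample."""
    assignments = np.argmax(sample_loglik, axis=1)
    return len(set(assignments.tolist()))

# Toy example: 4 sampled models scored against the 3 candidate attribute pairs.
loglik = np.array([
    [-1.0, -5.0, -6.0],   # sample 0 explains task 0 best
    [-4.0, -0.5, -7.0],   # sample 1 explains task 1 best
    [-3.0, -0.9, -8.0],   # sample 2 explains task 1 best
    [-6.0, -5.0, -0.2],   # sample 3 explains task 2 best
])
num_covered = coverage(loglik)  # all three explanations are covered here
```

A deterministic method produces identical samples, so every row has the same argmax and coverage collapses to 1, matching the behavior described above.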
Note that the different samples each pay attention to different attributes, indicating that PLATIPUS is effective at capturing the different modes of the task.

Figure 6: Sampled classifiers for an ambiguous meta-test task. In the meta-test training set (a), PLATIPUS observes five positives that share three attributes, and five negatives. A classifier that uses any two attributes can correctly classify the training set. On the right (b), we show the three possible two-attribute tasks that the training set can correspond to, and illustrate the labels (positive indicated by purple border) predicted by the best sampled classifier for that task. We see that different samples can effectively capture the three possible explanations, with some samples paying attention to hats (2nd and 3rd columns) and others not (1st column).

Ambiguous CelebA (5-shot)              Accuracy         Coverage (max=3)   Average NLL
MAML                                   89.00 ± 1.78%    1.00 ± 0.00        0.73 ± 0.06
MAML + noise                           84.30 ± 1.60%    1.89 ± 0.04        0.68 ± 0.05
PLATIPUS (ours) (KL weight = 0.05)     88.34 ± 1.06%    1.59 ± 0.03        0.67 ± 0.05
PLATIPUS (ours) (KL weight = 0.15)     87.80 ± 1.03%    1.94 ± 0.04        0.56 ± 0.04

Table 1: Our method covers almost twice as many tasks as MAML, with comparable accuracy. MAML + noise is a method that adds noise to the gradient but does not perform variational inference; this improves coverage, but results in lower accuracy and average log-likelihood. We bold results above the highest confidence-interval lower bound.

³Additional qualitative results and code can be found at https://sites.google.com/view/probabilistic-maml/

6 Discussion and Future Work

We introduced an algorithm for few-shot meta-learning that enables simple and effective sampling of models for new tasks at meta-test time. Our algorithm, PLATIPUS, adapts to new tasks by running gradient descent with injected noise.
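The adaptation procedure can be illustrated on a toy one-parameter regression. This is a schematic of noise-injected gradient descent, not the paper's released implementation; all names are ours, and for simplicity the noise is placed on the initial parameters before the inner gradient steps:

```python
import numpy as np

def adapt(mu_theta, log_sigma, grad_loss, x_tr, y_tr, rng,
          inner_lr=0.05, num_steps=5):
    """Schematic meta-test adaptation: perturb the learned parameter mean
    with Gaussian noise of learned scale, then take a few ordinary gradient
    steps on the few-shot training set. Each call yields one sampled model."""
    theta = mu_theta + np.exp(log_sigma) * rng.standard_normal(np.shape(mu_theta))
    for _ in range(num_steps):
        theta = theta - inner_lr * grad_loss(theta, x_tr, y_tr)
    return theta

# Toy instance: fit y = w * x with squared error; d/dw mean((w*x - y)^2).
def grad_loss(w, x, y):
    return 2.0 * np.mean(x * (w * x - y))

rng = np.random.default_rng(0)
x_tr = np.array([1.0, 2.0, 3.0])
y_tr = 2.0 * x_tr
w_samples = [adapt(0.0, -1.0, grad_loss, x_tr, y_tr, rng) for _ in range(10)]
# Distinct samples (different noise draws), all pulled toward w = 2.
```

Repeated calls give distinct but plausible task models, which is the behavior exploited throughout the experiments above.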
During meta-training, the model parameters are optimized with respect to a variational lower bound on the likelihood for the meta-training tasks, so as to enable this simple adaptation procedure to produce approximate samples from the model posterior when conditioned on a few-shot training set. This approach has a number of benefits. The adaptation procedure is exceedingly simple, and the method can be applied to any standard model architecture. The algorithm introduces a modest number of additional parameters: besides the initial model weights, we must learn a variance on each parameter for the inference network and prior, and the number of parameters scales only linearly with the number of model weights. Our experimental results show that our method can be used to effectively sample diverse solutions to both regression and classification tasks at meta-test time, including with task families that have multimodal task distributions. We additionally showed how our approach can be applied in settings where uncertainty can directly guide data acquisition, leading to better few-shot active learning.

Although our approach is simple and broadly applicable, it has potential limitations that could be addressed in future work. First, the current form of the method provides a relatively impoverished estimator of posterior variance, which might be less effective at gauging uncertainty in settings where different tasks have different degrees of ambiguity. In such settings, making the variance estimator dependent on the few-shot training set might produce better results, and investigating how to do this in a parameter-efficient manner would be an interesting direction for future work.
Another exciting direction for future research would be to study how our approach could be applied in RL settings for acquiring structured, uncertainty-guided exploration strategies in meta-RL problems.

Acknowledgments

We thank Marvin Zhang and Dibya Ghosh for feedback on an early draft of this paper. This research was supported by an NSF Graduate Research Fellowship, NSF IIS-1651843, the Office of Naval Research, and NVIDIA.

References

[1] D. Barber and C. M. Bishop. Ensemble learning for multi-layer networks. In Neural Information Processing Systems (NIPS), 1998.

[2] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

[3] D. A. Braun, C. Mehring, and D. M. Wolpert. Structure learning in action. Behavioural Brain Research, 2010.

[4] H. Daumé III. Bayesian multitask learning with latent hierarchies. In Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

[5] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[6] H. Edwards and A. Storkey. Towards a neural statistician. In International Conference on Learning Representations (ICLR), 2017.

[7] L. Fei-Fei et al. A Bayesian approach to unsupervised one-shot learning of object categories. In Conference on Computer Vision and Pattern Recognition (CVPR), 2003.

[8] C. Finn and S. Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.

[9] C. Finn, P. Abbeel, and S. Levine.
Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.

[10] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.

[11] M. Fortunato, C. Blundell, and O. Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.

[12] J. Gao, W. Fan, J. Jiang, and J. Han. Knowledge transfer via multiple model local structure mapping. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2008.

[13] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.

[14] E. Grant, C. Finn, J. Peterson, J. Abbott, S. Levine, T. Darrell, and T. Griffiths. Concept acquisition through meta-learning. In NIPS Workshop on Cognitively Informed Artificial Intelligence, 2017.

[15] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations (ICLR), 2018.

[16] A. Graves. Practical variational inference for neural networks. In Neural Information Processing Systems (NIPS), 2011.

[17] I. Higgins, L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed, and A. Lerchner. Early visual concept learning with unsupervised deep learning. International Conference on Learning Representations (ICLR), 2017.

[18] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Conference on Computational Learning Theory, 1993.

[19] S. Hochreiter, A. Younger, and P. Conwell. Learning to learn using gradient descent. International Conference on Artificial Neural Networks (ICANN), 2001.

[20] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference.
The Journal of Machine Learning Research, 2013.

[21] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Neural Information Processing Systems (NIPS), 2016.

[22] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[23] A. Lacoste, T. Boquet, N. Rostamzadeh, B. Oreshki, W. Chung, and D. Krueger. Deep prior. arXiv preprint arXiv:1712.05016, 2017.

[24] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.

[25] N. D. Lawrence and J. C. Platt. Learning to learn with the informative vector machine. In International Conference on Machine Learning (ICML), page 65, 2004.

[26] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

[27] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 1992.

[28] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. International Conference on Learning Representations (ICLR), 2018.

[29] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[30] A. Nichol and J. Schulman. Reptile: A scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999, 2018.

[31] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

[32] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[33] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap.
Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.

[34] R. J. Santos. Equivalence of regularization and truncated iteration for general ill-posed problems. Linear Algebra and its Applications, 1996.

[35] J. Schmidhuber. Evolutionary principles in self-referential learning. PhD thesis, Institut für Informatik, Technische Universität München, 1987.

[36] R. Shu, H. H. Bui, S. Zhao, M. J. Kochenderfer, and S. Ermon. Amortized inference regularization. arXiv preprint arXiv:1805.08913, 2018.

[37] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In Neural Information Processing Systems (NIPS), 2017.

[38] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. arXiv preprint arXiv:1711.06025, 2017.

[39] J. B. Tenenbaum. A Bayesian framework for concept learning. PhD thesis, Massachusetts Institute of Technology, 1999.

[40] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Neural Information Processing Systems (NIPS), 2016.

[41] J. Wan, Z. Zhang, J. Yan, T. Li, B. D. Rao, S. Fang, S. Kim, S. L. Risacher, A. J. Saykin, and L. Shen. Sparse Bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in Alzheimer's disease. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[42] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In European Conference on Computer Vision (ECCV), 2016.

[43] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks.
In International Conference on Machine Learning (ICML), 2005.