{"title": "Reconciling meta-learning and continual learning with online mixtures of tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 9122, "page_last": 9133, "abstract": "Learning-to-learn or meta-learning leverages data-driven inductive bias to increase the efficiency of learning on a novel task. This approach encounters difficulty when transfer is not advantageous, for instance, when tasks are considerably dissimilar or change over time. We use the connection between gradient-based meta-learning and hierarchical Bayes to propose a Dirichlet process mixture of hierarchical Bayesian models over the parameters of an arbitrary parametric model such as a neural network. In contrast to consolidating inductive biases into a single set of hyperparameters, our approach of task-dependent hyperparameter selection better handles latent distribution shift, as demonstrated on a set of evolving, image-based, few-shot learning benchmarks.", "full_text": "Reconciling meta-learning and continual learning with online mixtures of tasks\n\nGhassen Jerfel*\ngj47@duke.edu\nDuke University\n\nErin Grant*\neringrant@berkeley.edu\nUC Berkeley\n\nThomas L. Griffiths\ntomg@princeton.edu\nPrinceton University\n\nKatherine Heller\nkheller@stat.duke.edu\nDuke University\n\nAbstract\n\nLearning-to-learn or meta-learning leverages data-driven inductive bias to increase the efficiency of learning on a novel task. This approach encounters difficulty when transfer is not advantageous, for instance, when tasks are considerably dissimilar or change over time. We use the connection between gradient-based meta-learning and hierarchical Bayes to propose a Dirichlet process mixture of hierarchical Bayesian models over the parameters of an arbitrary parametric model such as a neural network.\n
In contrast to consolidating inductive biases into a single set of hyperparameters, our approach of task-dependent hyperparameter selection better handles latent distribution shift, as demonstrated on a set of evolving, image-based, few-shot learning benchmarks.\n\n1 Introduction\n\nMeta-learning algorithms aim to increase the efficiency of learning by treating task-specific learning episodes as examples from which to generalize [47]. The central assumption of a meta-learning algorithm is that some tasks are inherently related and so inductive transfer can improve sample efficiency and generalization [8, 9, 5]. In learning a single set of domain-general hyperparameters that parameterize a metric space [53] or an optimizer [40, 14], recent meta-learning algorithms make the assumption that tasks are equally related, and therefore non-adaptive, mutual transfer is appropriate. This assumption has been cemented in recent few-shot learning benchmarks, which comprise a set of tasks generated in a uniform manner [e.g., 53, 14].\n\nHowever, the real world often presents scenarios in which an agent must decide what degree of transfer is appropriate. In some cases, a subset of tasks are more strongly related to each other, and so non-uniform transfer provides a strategic advantage. On the other hand, transfer in the presence of dissimilar or outlier tasks worsens generalization performance [44, 12]. Moreover, when the underlying task distribution is non-stationary, inductive transfer to previously observed tasks should exhibit graceful degradation to address the catastrophic forgetting problem [28]. In these settings, the consolidation of all inductive biases into a single set of hyperparameters is not well-posed to deal with changing or diverse tasks.\n
In contrast, in order to account for this degree of task heterogeneity, humans detect and adapt to novel contexts by attending to relationships between tasks [10].\n\n*Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this work, we learn a mixture of hierarchical models that allows a meta-learner to adaptively select over a set of learned parameter initializations for gradient-based adaptation to a new task. The method is equivalent to clustering task-specific parameters in the hierarchical model induced by recasting gradient-based meta-learning as hierarchical Bayes [21] and generalizes the model-agnostic meta-learning (MAML) algorithm introduced in [14]. By treating the assignment of task-specific parameters to clusters as latent variables, we can directly detect similarities between tasks on the basis of the task-specific likelihood, which may be parameterized by an expressive model such as a neural network. Our approach, therefore, alleviates the need for explicit geometric or probabilistic modeling assumptions about the weights of a complex parametric model and provides a scalable method to regulate information transfer between episodes.\n\nWe additionally consider the setting of a non-stationary or evolving task distribution, which necessitates a meta-learning method that possesses adaptive complexity. We translate stochastic point estimation in an infinite mixture [39] over model parameters into a gradient-based meta-learning algorithm that is compatible with any differentiable likelihood model and requires no distributional assumptions.\n
We demonstrate the unexplored ability of non-parametric priors over neural network parameters to automatically detect and adapt to task distribution shift in a naturalistic image dataset, addressing the non-trivial setting of task-agnostic continual learning in which the task change is unobserved [cf. task-aware settings such as 28].\n\n2 Gradient-based meta-learning as hierarchical Bayes\n\nSince our approach is grounded in the probabilistic formulation of meta-learning as hierarchical Bayes [4], our approach can be applied to any probabilistic meta-learner. In this work, we focus on model-agnostic meta-learning (MAML) [14], a gradient-based meta-learning approach that estimates global parameters to be shared among task-specific models as an initialization for a few steps of gradient descent. MAML admits a natural interpretation as parameter estimation in a hierarchical probabilistic model, where the learned initialization acts as data-driven regularization for the estimation of task-specific parameters \hat{\phi}_j.\n\nIn particular, [21] cast MAML as posterior inference for task-specific parameters \phi_j given some samples of task-specific data x_{j_{1:N}} and a prior over \phi_j that is induced by the early stopping of an iterative optimization procedure; truncation at K steps of gradient descent on the negative log-likelihood -\log p(x_{j_{1:N}} \mid \phi_j) starting from \phi_j^{(0)} = \theta can then be understood as mode estimation of the posterior p(\phi_j \mid x_{j_{1:N}}, \theta). The mode estimates \hat{\phi}_j = \phi_j^{(0)} + \alpha \sum_{k=1}^{K} \nabla \log p(x_{j_{1:N}} \mid \phi_j^{(k-1)}) are then combined to evaluate the marginal likelihood for each task as\n\np(x_{j_{N+1:N+M}} \mid \theta) = \int p(x_{j_{N+1:N+M}} \mid \phi_j) p(\phi_j \mid \theta) \, d\phi_j \approx p(x_{j_{N+1:N+M}} \mid \hat{\phi}_j),   (1)\n\nwhere x_{j_{N+1:N+M}} is another set of samples from the jth task.\n
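To make the truncated-descent view concrete, here is a minimal NumPy sketch of the mode estimate and the approximate marginal likelihood of Eq. (1) for a hypothetical linear-Gaussian task model; the model, step size, step count, and data are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Hypothetical 1-D linear-Gaussian task: log p(y | x, phi) is proportional to
# -(y - x.phi)^2, so gradient ascent on the log-likelihood is least-squares descent.

def mode_estimate(theta, x, y, alpha=0.1, K=5):
    """phi_hat = theta + alpha * sum_k grad log p(x_{1:N} | phi^{(k-1)}):
    K truncated gradient steps from the meta-learned initialization theta."""
    phi = theta.copy()
    for _ in range(K):
        resid = y - x @ phi
        grad_log_lik = x.T @ resid / len(y)  # gradient of the Gaussian log-likelihood
        phi = phi + alpha * grad_log_lik
    return phi

def neg_log_lik(phi, x, y):
    """Negative log-likelihood on held-out samples: the validation loss whose
    exponential approximates the marginal likelihood in Eq. (1)."""
    return 0.5 * np.mean((y - x @ phi) ** 2)

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])                          # a single task's parameters
x_train = rng.normal(size=(10, 2)); y_train = x_train @ w_true   # support set x_{1:N}
x_val = rng.normal(size=(10, 2));   y_val = x_val @ w_true       # query set x_{N+1:N+M}

theta = np.zeros(2)                                     # stand-in meta-learned initialization
phi_hat = mode_estimate(theta, x_train, y_train)
# The mode estimate should explain the held-out samples better than theta alone.
assert neg_log_lik(phi_hat, x_val, y_val) < neg_log_lik(theta, x_val, y_val)
```

In the full algorithm, the outer loop would then differentiate this validation loss with respect to theta across tasks to obtain the empirical Bayes point estimate.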
A training dataset can then be summarized in an empirical Bayes point estimate of \theta computed by gradient-based optimization of the joint marginal likelihood in (1) across tasks, so that the likelihood of a datapoint sampled from a new task can be computed using only \theta and without storing the task-specific parameters.\n\n3 Improving meta-learning by modeling latent task structure\n\nIf the task distribution is heterogeneous, assuming a single parameter initialization \theta for gradient-based meta-learning is not suitable because it is unlikely that the point estimate computed by a few steps of gradient descent will sufficiently adapt the task-specific parameters to a diversity of tasks. Moreover, explicitly estimating relatedness between tasks has the potential to aid the efficacy of a meta-learning algorithm by modulating both positive and negative transfer [52, 60, 45, 61], and by identifying outlier tasks that require a more significant degree of adaptation [56, 23]. Nonetheless, defining an appropriate notion of task relatedness is a difficult problem in the high-dimensional parameter or activation space of models such as neural networks.\n\nUsing the probabilistic interpretation of Section 2, we deal with the variability in the tasks by assuming that each set of task-specific parameters \phi_j is drawn from a mixture of base distributions, each of which is parameterized by a hyperparameter \theta^{(\ell)}. Accordingly, we capture task relatedness by estimating the likelihood of assigning each task to a mixture component based simply on the task-specific negative log-likelihood after a few steps of gradient-based adaptation.\n
The result is a scalable meta-learning algorithm that jointly learns task-specific cluster assignments and model parameters, and is capable of modulating the transfer of information across tasks by clustering together related task-specific parameter settings.\n\nAlgorithm 1: Stochastic gradient-based EM for finite and infinite mixtures(dataset D, meta-learning rate \beta, adaptation rate \alpha, temperature \tau, initial cluster count L_0, meta-batch size J, training batch size N, validation batch size M, adaptation iteration count K, global prior G_0)\n  Initialize cluster count L \leftarrow L_0 and meta-level parameters \theta^{(1)}, ..., \theta^{(L)} \sim G_0\n  while not converged do\n    Draw tasks T_1, ..., T_J \sim p_D(T)\n    for j in 1, ..., J do\n      Draw task-specific datapoints, x_{j_1}, ..., x_{j_{N+M}} \sim p_{T_j}(x)\n      Draw a parameter initialization for a new cluster from the global prior, \theta^{(L+1)} \sim G_0\n      for \ell in {1, ..., L, L+1} do\n        Initialize \hat{\phi}_j^{(\ell)} \leftarrow \theta^{(\ell)}\n        Compute task-specific mode estimate, \hat{\phi}_j^{(\ell)} \leftarrow \hat{\phi}_j^{(\ell)} + \alpha \sum_k \nabla \log p(x_{j_{1:N}} \mid \hat{\phi}_j^{(\ell)})\n      Compute assignment of tasks to clusters, \gamma_j \leftarrow E-STEP(x_{j_{1:N}}, \hat{\phi}_j^{(1:L)})\n    Update each component \ell in 1, ..., L: \theta^{(\ell)} \leftarrow \theta^{(\ell)} + \beta \cdot M-STEP({x_{j_{N+1:N+M}}, \hat{\phi}_j^{(\ell)}, \gamma_j}_{j=1}^{J})\n    Summarize {\theta^{(1)}, ...} to update global prior G_0\n  return {\theta^{(1)}, ...}\n\nSubroutine 2: E-STEP and M-STEP for a finite mixture\n  E-STEP({x_{j_i}}_{i=1}^{N}, {\hat{\phi}_j^{(\ell)}}_{\ell=1}^{L}):\n    return \tau-softmax_\ell(\sum_i \log p(x_{j_i} \mid \hat{\phi}_j^{(\ell)}))\n  M-STEP({x_{j_i}}_{i=1}^{M}, \hat{\phi}_j^{(\ell)}, \gamma_j):\n    return \nabla_\theta [\sum_{j,i} \gamma_j \log p(x_{j_i} \mid \hat{\phi}_j^{(\ell)})]\n\nTop: Algorithm 1: Stochastic gradient-based expectation maximization (EM) for probabilistic clustering of task-specific parameters in a meta-learning setting.\n
Bottom: Subroutine 2: The E-STEP and M-STEP for a finite mixture of hierarchical Bayesian models implemented as gradient-based meta-learners.\n\nFormally, let z_j be the categorical latent variable indicating the cluster assignment of each task-specific parameter \phi_j. Direct maximization of the mixture model likelihood is a combinatorial optimization problem that can grow intractable. This intractability is equally problematic for the posterior distribution over the cluster assignment variables z_j and the task-specific parameters \phi_j, which are both treated as latent variables in the probabilistic formulation of meta-learning. A scalable approximation involves representing the conditional distribution for each latent variable with a maximum a posteriori (MAP) estimate. In our meta-learning setting of a mixture of hierarchical Bayes (HB) models, this suggests an augmented expectation maximization (EM) procedure [13] alternating between an E-STEP that computes an expectation of the task-to-cluster assignments z_j, which itself involves the computation of a conditional mode estimate for the task-specific parameters \phi_j, and an M-STEP that updates the hyperparameters \theta^{(1:L)} (see Subroutine 2).\n\nTo ensure scalability, we use the minibatch variant of stochastic optimization [43] in both the E-STEP and the M-STEP; such approaches to EM are motivated by a view of the algorithm as optimizing a single free energy at both the E-STEP and the M-STEP [37]. In particular, for each task j and cluster \ell, we follow the gradients to minimize the negative log-likelihood on the training data points x_{j_{1:N}}, using the cluster parameters \theta^{(\ell)} as initialization. This allows us to obtain a modal point estimate of the task-specific parameters \hat{\phi}_j^{(\ell)}.\n
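Subroutine 2's E-STEP reduces to a temperature-scaled softmax over the summed per-cluster log-likelihoods of the support set. A minimal NumPy sketch (the log-likelihood values and temperatures below are placeholders, not values from the paper):

```python
import numpy as np

def tau_softmax(logits, tau=1.0):
    """Tempered softmax: higher tau flattens the distribution (softer assignments)."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def e_step(log_liks_per_cluster, tau=1.0):
    """E-STEP sketch: log_liks_per_cluster[l] = sum_i log p(x_ji | phi_hat_j^(l)).
    Returns the responsibilities gamma_j over clusters; the M-STEP would then
    weight each cluster's MAML-style outer gradient by these responsibilities."""
    return tau_softmax(log_liks_per_cluster, tau)

# A task whose support data is far more likely under cluster 0:
gamma = e_step([-10.0, -50.0, -60.0], tau=1.0)
assert gamma[0] > 0.99
# Raising the temperature softens the assignment, allowing more uniform transfer:
gamma_soft = e_step([-10.0, -50.0, -60.0], tau=20.0)
assert gamma_soft[0] < gamma[0]
```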
The E-STEP in Subroutine 2 leverages the connection between gradient-based meta-learning and HB [21] and the differentiability of our clustering procedure to employ the task-specific parameters to compute the posterior probability of cluster assignment. Accordingly, based on the likelihood of the same training data points under the model parameterized by \hat{\phi}_j^{(\ell)}, we compute the cluster assignment probabilities as\n\n\gamma_j^{(\ell)} := p(z_j = \ell \mid x_{j_{1:N}}, \theta^{(1:L)}) \propto \int p(x_{j_{1:N}} \mid \phi_j) p(\phi_j \mid \theta^{(\ell)}) \, d\phi_j \approx p(x_{j_{1:N}} \mid \hat{\phi}_j^{(\ell)}).   (2)\n\nThe cluster means \theta^{(\ell)} are then updated by gradient descent on the validation loss in the M-STEP in Subroutine 2; this M-STEP is analogous to the MAML algorithm in [14] with the addition of mixing weights \gamma_j^{(\ell)}.\n\nNote that, unlike other recent approaches to probabilistic clustering [e.g., 3], we adhere to the episodic meta-learning setup for both training and testing since only the task support set x_{j_{1:N}} is used to compute both the point estimate \hat{\phi}_j^{(\ell)} and the cluster responsibilities \gamma_j^{(\ell)}. See Algorithm 1 for the full algorithm, whose high-level structure is shared with the non-parametric variant of our method detailed in Section 5.\n\nTable 1: Meta-test set accuracy on the miniImageNet 5-way, 1- and 5-shot classification benchmarks from [53] among methods using a comparable architecture (the 4-layer convolutional network from [53]). For methods on which we report results in later experiments, we additionally report the total number of parameters optimized by the meta-learning algorithm.\na Results reported by [40]. b We report test accuracy for models matching train and test “shot” and “way”.\n
c We report test accuracy for a comparable base (task-specific network) architecture.\n\nModel | Num. param. | 1-shot (%) | 5-shot (%)\nmatching network [53] a | -- | 43.56 ± 0.84 | 55.31 ± 0.73\nmeta-learner LSTM [40] | -- | 43.44 ± 0.77 | 60.60 ± 0.71\nprototypical networks [49] b | -- | 46.61 ± 0.78 | 65.77 ± 0.70\nMAML [14] | -- | 48.70 ± 1.84 | 63.11 ± 0.92\nMT-net [30] | 38,907 | 51.70 ± 1.84 | --\nPLATIPUS [15] | 65,546 | 50.13 ± 1.86 | --\nVERSA [20] c | 807,938 | 48.53 ± 1.84 | --\nOur method: 2 components | 65,546 | 49.60 ± 1.50 | 64.60 ± 0.92\nOur method: 3 components | 98,319 | 51.20 ± 1.52 | 65.00 ± 0.96\nOur method: 4 components | 131,092 | 50.49 ± 1.46 | 64.78 ± 1.43\nOur method: 5 components | 163,865 | 51.46 ± 1.68 | --\n\n4 Experiment: miniImageNet few-shot classification\n\nClustering task-specific parameters provides a way for a meta-learner to deal with task heterogeneity as each cluster can be associated with a subset of the tasks that would benefit most from mutual transfer. While we do not expect existing tasks to present a significant degree of heterogeneity given the uniform sampling assumptions behind their design, we nevertheless conduct an experiment to validate that our method gives an improvement on a standard benchmark for few-shot learning.\n\nWe apply Algorithm 1 with Subroutine 2 and L ∈ {2, 3, 4, 5} components to the 1-shot and 5-shot, 5-way, miniImageNet few-shot classification benchmarks [53]; Appendix C.2.1 contains additional experimental details. We demonstrate in Table 1 that a mixture of meta-learners improves the performance of gradient-based meta-learning on this task for any number of components. However, the performance of the parametric mixture does not improve monotonically with the number of components L.\n
This leads us to the development of non-parametric clustering for continual meta-learning, where enforcing specialization to subgroups of tasks and increasing model complexity is, in fact, necessary to preserve performance on prior tasks due to significant heterogeneity.\n\n5 Scalable online mixtures for task-agnostic continual learning\n\nThe mixture of meta-learners developed in Section 3 addresses a drawback of meta-learning approaches such as MAML that consolidate task-general information into a single set of hyperparameters. However, the method adds another dimension to model selection in the form of identifying the correct number of mixture components. While this may be resolved by cross-validation if the dataset is static and therefore the number of components can remain fixed, adhering to a fixed number of components throughout training is not appropriate in the non-stationary regime, where the underlying task distribution changes as different types of tasks are presented sequentially in a continual learning setting. In this regime, it is important to incrementally introduce more components that can each specialize to the distribution of tasks observed at the time of spawning.\n\nTo address this, we derive a scalable stochastic estimation procedure to compute the expectation of task-to-cluster assignments (E-STEP) for a growing number of task clusters in a non-parametric mixture model [39] called the Dirichlet process mixture model (DPMM). The formulation of the DPMM that is most appropriate for incremental learning is the sequential-draws formulation that corresponds to an instantiation of the Chinese restaurant process (CRP) [39]. A CRP prior over z_j allows some probability to be assigned to a new mixture component while the task identities are inferred in a sequential manner, and has therefore been key to recent online and stochastic learning of the DPMM [31]. A draw from a CRP proceeds as follows: For a\n
A draw from a CRP proceeds as follows: For a\nsequence of tasks, the \ufb01rst task is assigned to the \ufb01rst cluster and the jth subsequent task is then\nassigned to the `th cluster with probability\n\np zj = ` | z1:j1,\u21e3 =( n(`)/n + \u21e3\n\n\u21e3/n + \u21e3\n\nfor ` \uf8ff L\nfor ` = L + 1 ,\n\n(3)\n\n4\n\n\fj\n\nE-STEP( xj1:N , \u02c6(1:L)\n\n, concentration \u21e3, threshold \u270f)\nDPMM log-likelihood for all ` in 1, . . . , L, \u21e2(`)\nDPMM log-likelihood for new component, \u21e2(L+1)\nDPMM assignments, j \u2327 -softmax(\u21e2(1)\nif (L+1)\n\n>\u270f then\n\nj\n\nj\n\nelse\n\nExpand the model by incrementing L L + 1\nRenormalize j \u2327 -softmax(\u21e2(1)\n, . . . ,\u21e2 (L)\n)\n\nj\n\nj\n\nreturn j\n\nj Pi log p( xji | \u02c6(`)\n\n Pi log p( xji | \u02c6(L+1)\n\nj\n, . . . ,\u21e2 (L+1)\n\n)\n\nj\n\nj\n\nj\n\n) + log n(`)\n\n) + log \u21e3\n\ni=1, \u02c6(`)\n\nM-STEP( {xji}M\nreturn r\u2713[Pj,i j log p( xji | \u02c6(`)\n\n, j, concentration \u21e3)\n\nj\n\nj\n\n) + log n(`)]\n\nSubroutine 3: The E-STEP and M-STEP for an in\ufb01nite mixture of hierarchical Bayesian models.\n\nwhere L is the number of non-empty clusters, n(`) is the number of tasks already occupying a cluster\n`, and \u21e3 is a \ufb01xed positive concentration parameter. The prior probability associated with a new\nmixture component is therefore p( zj = L + 1 | z1:j1,\u21e3 ).\nIn a similar spirit to Section 3, we develop a stochastic EM procedure for the estimation of the latent\ntask-speci\ufb01c parameters 1:J and the meta-level parameters \u2713(1:L) in the DPMM, which allows\nthe number of observed task clusters to grow in an online manner with the diversity of the task\ndistribution. 
While computation of the mode estimate of the task-specific parameters \phi_j is mostly unchanged from the finite variant, the estimation of the cluster assignment variables z in the E-STEP requires revisiting the Gibbs conditional distributions due to the potential addition of a new cluster at each step. For a DPMM, the conditional distributions for z_j are\n\np(z_j = \ell \mid x_{j_{1:M}}, z_{1:j-1}) \propto n^{(\ell)} \int \int p(x_{j_{1:M}} \mid \phi_j^{(\ell)}) p(\phi_j^{(\ell)} \mid \theta) \, d\phi_j \, dG_\ell(\theta) for \ell \le L;  \zeta \int \int p(x_{j_{1:M}} \mid \phi_j^{(0)}) p(\phi_j^{(0)} \mid \theta) \, d\phi_j \, dG_0(\theta) for \ell = L + 1,   (4)\n\nwith G_0 as the base measure or global prior over the components of the CRP, and G_\ell as the prior over each cluster's parameters, initialized with a draw from a Gaussian centered at G_0 with a fixed variance and updated over time using summary statistics from the set of active components {\theta^{(0)}, ..., \theta^{(L)}}.\n\nTaking the logarithm of the posterior over task-to-cluster assignments z_j in (4) and using a mode estimate \hat{\phi}_j^{(\ell)} for task-specific parameters \phi_j as drawn from the \ell th cluster gives the E-STEP in Subroutine 3. We may also omit the prior term \log p(\hat{\phi}_j^{(\ell)} \mid \theta^{(\ell)}) as it arises as an implicit prior resulting from truncated gradient descent, as explained in Section 3 of [21].\n\n6 Experiments: Task-agnostic continual few-shot regression & classification\n\nBy treating the assignment of tasks to clusters as latent variables, the algorithm of Section 5 can adapt to a changing distribution of tasks, without any external information to signal distribution shift (i.e., in a task-agnostic manner). Here, we present our main experimental results on both a novel synthetic regression benchmark as well as a novel evolving variant of miniImageNet, and confirm the algorithm's ability to adapt to distribution shift by spawning a newly specialized cluster.\n\nHigh-capacity baselines.\n
As an ablation, we compare to the non-uniform parametric mixture proposed in Section 3 with the number of components fixed at the total number of task distributions in the dataset (3). We also consider a uniform parametric mixture in which each component receives equal assignments; this can also be seen as the non-uniform mixture in the infinite-temperature (\tau) limit. Note that our meta-learner has a lower capacity than these two baselines for most of the training procedure, as it may decide to expand its capacity past one component only when the task distribution changes. Finally, for the large-scale experiment in Section 6.2, we compare with three recent meta-learning algorithms that report improved performance on the standard miniImageNet benchmark of Section 4, but are not explicitly posed to address the continual learning setting of evolving tasks: MT-net [30], PLATIPUS [15], and VERSA [20].\n\nFigure 4: The diverse set of periodic functions used for few-shot regression in Section 6.1: (a) polynomial, (b) sinusoid, (c) sawtooth.\n\nFigure 5: Artistic filters (b-d) applied to miniImageNet (a, plain) to ensure non-homogeneity of tasks in Section 6.2: (b) blur, (c) night, (d) pencil.\n\n6.1 Continual few-shot regression\n\nWe first consider an explanatory experiment in which three regression tasks are presented sequentially with no overlap. For input x sampled uniformly from [-5, 5], each regression task is generated, in a similar spirit to the sinusoidal regression setup in [14], from one of a set of simple but distinct one-dimensional functions (polynomial, Figure 4a; sinusoid wave, Figure 4b; and sawtooth wave, Figure 4c). For the experiment in Figure 6 and Figure 7, we presented the polynomial tasks for 4000 iterations, followed by sinusoid tasks for 3000 iterations, and finally sawtooth tasks.\n
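Purely for illustration, a generator for the three one-dimensional task families might look like the following sketch; all functional forms and parameter ranges below are assumptions for exposition, not the benchmark's actual parameters:

```python
import numpy as np

def sample_task(family, rng):
    """Return a random function from one of the three hypothetical task families.
    Coefficient/amplitude/period ranges are illustrative assumptions only."""
    if family == "polynomial":
        coeffs = rng.uniform(-1.0, 1.0, size=3)          # assumed quadratic
        return lambda x: np.polyval(coeffs, x)
    if family == "sinusoid":
        amp, phase = rng.uniform(0.1, 5.0), rng.uniform(0.0, np.pi)
        return lambda x: amp * np.sin(x + phase)
    if family == "sawtooth":
        amp, period = rng.uniform(0.1, 5.0), rng.uniform(1.0, 4.0)
        return lambda x: amp * 2.0 * ((x / period) % 1.0 - 0.5)
    raise ValueError(family)

rng = np.random.default_rng(0)
for family in ("polynomial", "sinusoid", "sawtooth"):
    f = sample_task(family, rng)
    x = rng.uniform(-5.0, 5.0, size=10)   # inputs sampled uniformly from [-5, 5]
    y = f(x)                              # a few-shot episode's regression targets
    assert y.shape == (10,)
```

Presenting families in sequence (polynomial, then sinusoid, then sawtooth) reproduces the kind of latent distribution shift the experiment is designed to induce.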
Additional details on the experimental setup can be found in Appendix C.2.2.\n\nResults: Distribution shift detection. The cluster responsibilities in Figure 7 on the meta-test dataset of tasks, from each of the three regression types in Figure 4, indicate that the non-parametric algorithm recognizes a change in the task distribution and spawns a new cluster at iteration 4000 and promptly after iteration 7000. Each newly created cluster is specialized to the task distribution observed at the time of spawning and remains as such throughout training, since the majority of assignments for each type of regression remains under a given cluster from the time of its introduction.\n\nResults: Improved generalization and slower degradation of performance. We investigate the progression of the meta-test mean-squared error (MSE) for the three regression task distributions in Figure 6. We first note the clear advantage of non-uniform assignment both in improved generalization, when testing on the active task distribution, and in slower degradation, when testing on previous distributions. This is due to the ability of these methods to modulate the transfer of information in order to limit negative transfer. In contrast, the uniform method cannot selectively adapt specific clusters to be responsible for any given task, and thus inevitably suffers from catastrophic forgetting.\n\nThe adaptive capacity of our non-parametric method allows it to spawn clusters that specialize to newly observed tasks. Accordingly, even if the overall capacity is lower than that of the comparable non-uniform parametric method, our method achieves similar or better generalization at any given training iteration. More importantly, specialization allows our method to better modulate information transfer as the clusters are better differentiated. Consequently, each cluster does not account for many assignments from more than a single task distribution throughout training.\n
Therefore, we observed a significantly slower rate of degradation of the MSE on previous task distributions as new tasks are introduced. This is especially evident from the performance on the first task in Figure 6.\n\n6.2 Continual few-shot classification\n\nNext, we consider an evolving variant of the miniImageNet few-shot classification task. In this variant, one of a set of artistic filters is applied to the images during the meta-training procedure to simulate a changing distribution of few-shot classification tasks. For the experiment in Figure 8 and Figure 9, we first train using images with a “blur” filter (Figure 5b) for 7500 iterations, then with a “night” filter (Figure 5c) for another 7500 iterations, and finally with a “pencil” filter (Figure 5d). Additional details on the experimental setup can be found in Appendix C.2.3.\n\nResults: Meta-test accuracy. In Figure 8, we report the evolution of the meta-test accuracy for two variants of our non-parametric meta-learner in comparison to the parametric baselines introduced in Section 6, high-capacity baselines. The task-agnostic variant is the core algorithm described in previous sections, as used for the regression tasks.\n
The task-aware variant augments the core algorithm with a cool-down period that prevents over-spawning for the duration of a training phase. This requires some knowledge of the duration which is external to the meta-learner, thus the task-aware nomenclature (note that this does not correspond to a true oracle, as we do not enforce spawning of a cluster; see Appendix D.1 for further details).\n\n[Figure 6: three panels (rows), one per task type: Task 1 (Polynomial), Task 2 (Sinusoid), Task 3 (Sawtooth); training phases: Phase 1 (Poly), Phase 2 (Sinusoid), Phase 3 (Sawtooth); curves: Uniform Mixture [Ablation], Nonuniform Mixture [Ablation], Non-parametric Mixture [Ours]; y-axis: meta-test loss (MSE); x-axis: meta-training iteration, 0-9000.]\n\nFigure 6: Results on the evolving dataset of few-shot regression tasks (lower is better). Each panel (row) presents, for a specific task type (polynomial, sinusoid, or sawtooth), the average meta-test set error of each method over cumulative number of few-shot episodes.\n
We additionally report the degree of loss in backward transfer (i.e., catastrophic forgetting) to the tasks in each meta-test set in the legend; all methods but the non-parametric method experience a large degree of catastrophic forgetting during an inactive phase.\n\n[Figure 7: three panels (rows), one per task type: Task 1 (Polynomial), Task 2 (Sinusoid), Task 3 (Sawtooth); training phases: Phase 1 (Poly), Phase 2 (Sinusoid), Phase 3 (Sawtooth); curves: Cluster 1, Cluster 2, Cluster 3; y-axis: meta-test responsibility, 0-1; x-axis: meta-training iteration number, 0-9000.]\n\nFigure 7: Task-specific per-cluster meta-test responsibilities \gamma^{(\ell)} for both active and unspawned clusters. Higher responsibility implies greater specialization of a particular cluster (color) to a particular task distribution (row).\n\nIt is clear from Figure 8 that neither of our algorithms suffers from catastrophic forgetting to the same degree as the parametric baselines. In fact, at the end of training, both of our methods outperform all the parametric baselines on the first and second task.\n\nResults: Specialization. Given the higher capacity of the parametric baselines and the inherent degree of similarity between the filtered miniImageNet task distributions (unlike the regression tasks in the previous section), the parametric baselines perform better on each task distribution during its active phase. However, they quickly suffer from degradation once the task distribution shifts. Our approach does not suffer from this phenomenon and can handle non-stationarity owing to the credit assignment of a single task distribution to a specialized cluster.\n
This specialization is illustrated in Figure 9, where we track the evolution of the average cluster responsibilities on the meta-test dataset from each of the three miniImageNet few-shot classification tasks. Each cluster is specialized so as to acquire the majority of a single task distribution's test set assignments, despite the degree of similarity between tasks originating from the same source (miniImageNet). We observed this difficulty with the non-monotone improvement of parametric clustering as a function of components in Section 4.\n\n7 Related Work\n\nMeta-learning. In this work, we show how changes to the hierarchical Bayesian model assumed in meta-learning [21, Fig. 1(a)] can be realized as changes to a meta-learning algorithm. In contrast, follow-up approaches to improving the performance of meta-learning algorithms [e.g., 30, 15, 20] do not change the underlying probabilistic model; what differs is the inference procedure used to infer values of the global (shared across tasks) and local (task-specific) parameters; for example, [20] consider feedforward conditioning while [15] employ variational inference.\n
Due to consolidation into one set of global parameters shared uniformly across tasks, none of these methods inherently accommodates heterogeneity or non-stationarity.\n\n[Figure 8: three panels (rows), one per task type: Task 1 (Blur), Task 2 (Night), Task 3 (Pencil); training phases: Phase 1 (Blur), Phase 2 (Night), Phase 3 (Pencil); y-axis: meta-test accuracy (%), 30-40; x-axis: meta-training iteration, 0-17500; legend with catastrophic forgetting (CF): MT-net [2018, 30] CF: 8.46%; PLATIPUS [2018, 15] CF: 8.60%; VERSA [2019, 20] CF: 7.89%; Uniform [Ablation] CF: 8.83%; Task-agnostic [Ours] CF: 1.12%; Task-aware [U.B.] CF: 0.768%.]\n\nFigure 8: Results on the evolving dataset of filtered miniImageNet few-shot classification tasks (higher is better). Each panel (row) presents, for a specific task type (filter), the average meta-test set accuracy over cumulative number of few-shot episodes. We additionally report the degree of loss in backward transfer (catastrophic forgetting, CF) in the legend. This is calculated for each method as the average drop in accuracy on the first two tasks at the end of training (lower is better; U.B.: upper bound).\n\n[Figure 9: three panels (rows), one per task type: Task 1 (Blur), Task 2 (Night), Task 3 (Pencil); training phases: Phase 1 (Blur), Phase 2 (Night), Phase 3 (Pencil); curves: Cluster 1, Cluster 2, Cluster 3; y-axis: meta-test responsibility, 0-1; x-axis: meta-training iteration, 0-17500.]\n\nFigure 9: Task-specific per-cluster meta-test responsibilities \gamma^{(\ell)} for both active and unspawned clusters. Higher responsibility implies greater specialization of a particular cluster (color) to a particular task distribution (row).\n\nContinual learning.\n
Techniques developed to address the catastrophic forgetting problem in continual learning, such as elastic weight consolidation (EWC) [28], synaptic intelligence (SI) [58], variational continual learning (VCL) [38], and the online Laplace approximation [42], require access to an explicit delineation between tasks that acts as a catalyst to grow model size; we refer to such methods as task-aware. In contrast, our non-parametric algorithm tackles the task-agnostic setting, in which the meta-learner recognizes a latent shift in the task distribution and adapts accordingly.

8 Conclusion

Meta-learning is a source of learned inductive bias. Occasionally, this inductive bias is harmful because the experience gained from solving a task does not transfer. Here, we present an approach that allows a probabilistic meta-learner to explicitly modulate the amount of transfer between tasks, as well as to adapt its parameter dimensionality when the underlying task distribution evolves. We formulate this as probabilistic inference in a mixture model that defines a clustering of task-specific parameters. To ensure scalability, we make use of the recent connection between gradient-based meta-learning and hierarchical Bayes [21] to perform approximate maximum a posteriori (MAP) inference in both a finite and an infinite mixture model. Our work is a first step towards more realistic settings of diverse task distributions and, crucially, task-agnostic continual learning. The approach stands to benefit from orthogonal improvements in posterior inference beyond MAP estimation (e.g., variational inference [27], the Laplace approximation [32], or stochastic gradient Markov chain Monte Carlo [33]), as well as from scaling up the neural network architecture.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning.
In the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4(May):83–99, 2003.

[3] M. Bauer, M. Rojas-Carulla, J. B. Świątkowski, B. Schölkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.

[4] J. Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39, 1997.

[5] J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12(1):149–198, 2000.

[6] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society. Series B (Methodological), pages 259–302, 1986.

[7] D. M. Blei, M. I. Jordan, et al. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, 2006.

[8] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the 10th International Conference on Machine Learning (ICML), 1993.

[9] R. Caruana. Multitask learning. In Learning to Learn, pages 95–133. Springer, 1998.

[10] A. G. Collins and M. J. Frank. Cognitive control over learning: Creating, clustering, and generalizing task-set structure. Psychological Review, 120(1):190, 2013.

[11] H. Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), pages 135–142, 2009.

[12] T. Deleu and Y. Bengio. The effects of negative adaptation in model-agnostic meta-learning. arXiv preprint arXiv:1812.02159, 2018.

[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
Series B (Methodological), pages 1–38, 1977.

[14] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[15] C. Finn, K. Xu, and S. Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 9516–9527, 2018.

[16] J. Gao, W. Fan, J. Jiang, and J. Han. Knowledge transfer via multiple model local structure mapping. In Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 283–291. ACM, 2008.

[17] S. J. Gershman and D. M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1):1–12, 2012.

[18] Z. Ghahramani and M. J. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems, pages 449–455, 2000.

[19] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[20] J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. Turner. Meta-learning probabilistic inference for prediction. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.

[21] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[22] K. Greff, S. van Steenkiste, and J. Schmidhuber. Neural expectation maximization. In Advances in Neural Information Processing Systems, pages 6694–6704, 2017.

[23] S. Gupta, D. Phung, and S. Venkatesh. Factorial multi-task learning: A Bayesian nonparametric approach. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 657–665, 2013.

[24] T. Heskes.
Solving a huge number of similar tasks: A combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the 15th International Conference on Machine Learning (ICML), 1998.

[25] M. C. Hughes, E. Fox, and E. B. Sudderth. Effective split-merge Monte Carlo methods for nonparametric models of sequential data. In Advances in Neural Information Processing Systems, pages 1295–1303, 2012.

[26] S. Jain and R. M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158–182, 2004.

[27] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[28] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[29] N. D. Lawrence and J. C. Platt. Learning to learn with the informative vector machine. In Proceedings of the 21st International Conference on Machine Learning (ICML), page 65, 2004.

[30] Y. Lee and S. Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[31] D. Lin. Online learning of nonparametric mixture models via sequential variational approximation. In Advances in Neural Information Processing Systems (NIPS), pages 395–403, 2013.

[32] D. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[33] N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.

[34] P. Müller and D. R. Insua. Issues in Bayesian analysis of neural network models. Neural Computation, 10(3):749–770, 1998.

[35] R. M. Neal.
Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[36] R. M. Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.

[37] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.

[38] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[39] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, pages 554–560, 2000.

[40] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

[41] Y. P. Raykov, A. Boukouvalas, F. Baig, and M. A. Little. What to do when k-means clustering fails: A simple yet principled alternative algorithm. PLoS ONE, 11(9):e0162259, 2016.

[42] H. Ritter, A. Botev, and D. Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, pages 3738–3748, 2018.

[43] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[44] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich. To transfer or not to transfer. In NIPS 2005 Workshop on Transfer Learning, volume 898, pages 1–4, 2005.

[45] A. J. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.

[46] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba.
Learning with hierarchical-deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1958–1971, 2013.

[47] J. Schmidhuber. Evolutionary Principles in Self-Referential Learning. PhD thesis, Institut für Informatik, Technische Universität München, 1987.

[48] D. Sculley. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pages 1177–1178. ACM, 2010.

[49] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS) 30, 2017.

[50] N. Srivastava and R. R. Salakhutdinov. Discriminative transfer learning with tree-based priors. In Advances in Neural Information Processing Systems (NIPS), pages 2094–2102, 2013.

[51] A. Tank, N. Foti, and E. Fox. Streaming variational inference for Bayesian nonparametric mixture models. In Artificial Intelligence and Statistics, pages 968–976, 2015.

[52] S. Thrun. Discovering structure in multiple learning tasks: The TC algorithm. In Proceedings of the 13th International Conference on Machine Learning (ICML), 1996.

[53] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS) 29, pages 3630–3638, 2016.

[54] J. Wan, Z. Zhang, J. Yan, T. Li, B. D. Rao, S. Fang, S. Kim, S. L. Risacher, A. J. Saykin, and L. Shen. Sparse Bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in Alzheimer's disease. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 940–947. IEEE, 2012.

[55] M. Welling and K. Kurihara. Bayesian k-means as a "maximization-expectation" algorithm. In Proceedings of the 2006 SIAM International Conference on Data Mining, pages 474–478. SIAM, 2006.

[56] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors.
Journal of Machine Learning Research, 8(Jan):35–63, 2007.

[57] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 1012–1019, 2005.

[58] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[59] Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 20(1):45–57, 2001.

[60] Y. Zhang and J. G. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems, pages 2550–2558, 2010.

[61] Y. Zhang and D.-Y. Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.