{"title": "Incremental Few-Shot Learning with Attention Attractor Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5275, "page_last": 5285, "abstract": "Machine learning classifiers are often trained to recognize a set of pre-defined classes. However, in many applications, it is often desirable to have the flexibility of learning additional concepts, with limited data and without re-training on the full training set. This paper addresses this problem, incremental few-shot learning, where a regular classification network has already been trained to recognize a set of base classes, and several extra novel classes are being considered, each with only a few labeled examples. After learning the novel classes, the model is then evaluated on the overall classification performance on both base and novel classes. To this end, we propose a meta-learning model, the Attention Attractor Network, which regularizes the learning of novel classes. In each episode, we train a set of new weights to recognize novel classes until they converge, and we show that the technique of recurrent back-propagation can back-propagate through the optimization process and facilitate the learning of these parameters. We demonstrate that the learned attractor network can help recognize novel classes while remembering old classes without the need to review the original training set, outperforming various baselines.", "full_text": "Incremental Few-Shot Learning with Attention\n\nAttractor Networks\n\nMengye Ren1,2,3, Renjie Liao1,2,3, Ethan Fetaya1,2, Richard S. Zemel1,2\n\n1University of Toronto, 2Vector Institute, 3Uber ATG\n{mren, rjliao, ethanf, zemel}@cs.toronto.edu\n\nAbstract\n\nMachine learning classi\ufb01ers are often trained to recognize a set of pre-de\ufb01ned\nclasses. However, in many applications, it is often desirable to have the \ufb02exibility\nof learning additional concepts, with limited data and without re-training on the\nfull training set. 
This paper addresses this problem, incremental few-shot learning,\nwhere a regular classi\ufb01cation network has already been trained to recognize a\nset of base classes, and several extra novel classes are being considered, each\nwith only a few labeled examples. After learning the novel classes, the model is\nthen evaluated on the overall classi\ufb01cation performance on both base and novel\nclasses. To this end, we propose a meta-learning model, the Attention Attractor\nNetwork, which regularizes the learning of novel classes. In each episode, we\ntrain a set of new weights to recognize novel classes until they converge, and\nwe show that the technique of recurrent back-propagation can back-propagate\nthrough the optimization process and facilitate the learning of these parameters.\nWe demonstrate that the learned attractor network can help recognize novel classes\nwhile remembering old classes without the need to review the original training set,\noutperforming various baselines.\n\n1\n\nIntroduction\n\nThe availability of large scale datasets with detailed annotation, such as ImageNet [30], played a\nsigni\ufb01cant role in the recent success of deep learning. The need for such a large dataset is however a\nlimitation, since its collection requires intensive human labor. This is also strikingly different from\nhuman learning, where new concepts can be learned from very few examples. One line of work\nthat attempts to bridge this gap is few-shot learning [16, 36, 33], where a model learns to output a\nclassi\ufb01er given only a few labeled examples of the unseen classes. 
While this is a promising line of work, its practical usability is a concern, because few-shot models only focus on learning novel classes, ignoring the fact that many common classes are readily available in large datasets.
An approach that aims to enjoy the best of both worlds, the ability to learn from large datasets for common classes with the flexibility of few-shot learning for others, is incremental few-shot learning [9]. This combines incremental learning, where we want to add new classes without catastrophic forgetting [20], with few-shot learning, where the new classes, unlike the base classes, have only a small number of examples. One use case to illustrate the problem is a visual aid system. Most objects of interest are common to all users, e.g., cars, pedestrian signals; however, users would also like to augment the system with additional personalized items or important landmarks in their area. Such a system needs to be able to learn new classes from few examples, without harming the performance on the original classes, and typically without access to the dataset used to train the original classes.
In this work we present a novel method for incremental few-shot learning in which, during meta-learning, we optimize a regularizer that reduces catastrophic forgetting during the incremental few-shot learning stage. Our proposed regularizer is inspired by attractor networks [42] and can be thought of as a memory of the base classes, adapted to the new classes. We also show how this regularizer can be optimized, using recurrent back-propagation [18, 1, 25] to back-propagate through the few-shot optimization

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Our proposed attention attractor network for incremental few-shot learning. During pretraining we learn the base class weights W_a and the feature extractor CNN backbone. In the meta-learning stage, a few-shot episode is presented.
The support set only contains novel classes,\nwhereas the query set contains both base and novel classes. We learn an episodic classi\ufb01er network\nthrough an iterative solver, to minimize cross entropy plus an additional regularization term predicted\nby the attention attractor network by attending to the base classes. The attention attractor network is\nmeta-learned to minimize the expected query loss. During testing an episodic classi\ufb01er is learned in\nthe same way.\n\nstage. Finally, we show empirically that our proposed method can produce state-of-the-art results in\nincremental few-shot learning on mini-ImageNet [36] and tiered-ImageNet [29] tasks.\n\n2 Related Work\n\nRecently, there has been a surge in interest in few-shot learning [16, 36, 33, 17], where a model\nfor novel classes is learned with only a few labeled examples. One family of approaches for few-\nshot learning, including Deep Siamese Networks [16], Matching Networks [36] and Prototypical\nNetworks [33], follows the line of metric learning. In particular, these approaches use deep neural\nnetworks to learn a function that maps the input space to the embedding space where examples\nbelonging to the same category are close and those belonging to different categories are far apart.\nRecently, [8] proposes a graph neural networks based method which captures the information\npropagation from the labeled support set to the query set. [29] extends Prototypical Networks to\nleverage unlabeled examples while doing few-shot learning. Despite their simplicity, these methods\nare very effective and often competitive with the state-of-the-art.\nAnother class of approaches aims to learn models which can adapt to the episodic tasks. In particular,\n[27] treats the long short-term memory (LSTM) as a meta learner such that it can learn to predict\nthe parameter update of a base learner, e.g., a convolutional neural network (CNN). 
MAML [7] instead learns the hyperparameters or the initial parameters of the base learner by back-propagating through the gradient descent steps. [31] uses a read/write augmented memory, and [21] combines soft attention with temporal convolutions, which enables retrieval of information from past episodes.
The methods described above belong to the general class of meta-learning models. First proposed in [32, 23, 35], meta-learning is a machine learning paradigm where the meta-learner tries to improve the base learner using the learning experiences from multiple tasks. Meta-learning methods typically learn the update policy yet lack an overall learning objective in the few-shot episodes. Furthermore, they could potentially suffer from short-horizon bias [41] if at test time the model is trained for longer steps. To address this problem, [4] proposes to use fast-convergent models like logistic regression (LR), which can be back-propagated through via a closed-form update rule. Compared to [4], our proposed method using recurrent back-propagation [18, 1, 25] is more general, as it does not require a closed-form update, and the inner-loop solver can employ any existing continuous optimizer.
Our work is also related to incremental learning, a setting where information arrives continuously while prior knowledge needs to be transferred. A key challenge is catastrophic forgetting [20, 19], i.e., the model forgets the learned knowledge.
Various memory-based models have since been proposed, which store training examples explicitly [28, 34, 5, 24], regularize the parameter updates [15], or learn a generative model [13]. However, in these studies, incremental learning typically starts from scratch, and usually performs worse than a regular model that is trained with all available classes together, since it needs to learn a good representation while dealing with catastrophic forgetting.
Incremental few-shot learning is also known as low-shot learning. To leverage a good representation, [10, 37, 9] start off with a network pre-trained on a set of base classes, and try to augment the classifier with a batch of new classes that have not been seen during training. [10] proposes the squared gradient magnitude loss, which makes the classifier learned from the low-shot examples have a smaller gradient value when learning on all examples. [37] proposes prototypical matching networks, a combination of prototypical networks and matching networks. The paper also adds hallucination, which generates new examples. [9] proposes an attention-based model which generates weights for novel categories. They also promote the use of cosine similarity between feature representations and weight vectors to classify images.
In contrast, during each few-shot episode, we directly learn a classifier network that is randomly initialized and solved till convergence, unlike [9], which directly outputs the prediction. Since the model cannot see base class data within the support set of each few-shot learning episode, it is challenging to learn a classifier that jointly classifies both base and novel categories. Towards this end, we propose to add a learned regularizer, which is predicted by a meta-network, the "attention attractor network". The network is learned by differentiating through the few-shot learning optimization iterations.
We found that using an iterative solver with the learned regularizer significantly improves the classifier model on the task of incremental few-shot learning.

3 Model

In this section, we first define the setup of incremental few-shot learning, and then we introduce our new model, the Attention Attractor Network, which attends to the set of base classes according to the few-shot training data by using the attractor regularizing term. Figure 1 illustrates the high-level model diagram of our method.

3.1 Incremental Few-Shot Learning

The outline of our meta-learning approach to incremental few-shot learning is: (1) we learn a fixed feature representation and a classifier on a set of base classes; (2) in each training and testing episode we train a novel-class classifier with our meta-learned regularizer; (3) we optimize our meta-learned regularizer on combined novel and base class classification, adapting it to perform well in conjunction with the base classifier. Details of these stages follow.

Pretraining Stage: We learn a base model for the regular supervised classification task on a dataset D_a = {(x_{a,i}, y_{a,i})}_{i=1}^{N_a}, where x_{a,i} is the i-th example from dataset D_a and its labeled class y_{a,i} ∈ {1, 2, ..., K}. The purpose of this stage is to learn both a good base classifier and a good representation. The parameters of the base classifier are learned in this stage and will be fixed after pretraining. We denote the parameters of the top fully connected layer of the base classifier by W_a ∈ R^{D×K}, where D is the dimension of our learned representation.

Incremental Few-Shot Episodes: A few-shot dataset D_b is presented, from which we can sample few-shot learning episodes E. Note that this can be the same data source as the pretraining dataset D_a, but sampled episodically. For each N-shot K'-way episode, there are K' novel classes disjoint from the base classes.
Each novel class has N and M images in the support set S_b and the query set Q_b, respectively. Therefore, we have E = (S_b, Q_b), with S_b = {(x^S_{b,i}, y^S_{b,i})}_{i=1}^{N×K'} and Q_b = {(x^Q_{b,i}, y^Q_{b,i})}_{i=1}^{M×K'}, where y_{b,i} ∈ {K+1, ..., K+K'}. S_b and Q_b can be regarded as this episode's training and validation sets. In each episode we learn a classifier on the support set S_b whose learnable parameters W_b are called the fast weights, as they are only used during this episode. To evaluate the performance on a joint prediction of both base and novel classes, i.e., a (K+K')-way classification, a mini-batch Q_a = {(x_{a,i}, y_{a,i})}_{i=1}^{M×K} sampled from D_a is also added to Q_b to form Q_{a+b} = Q_a ∪ Q_b. This means that the learning algorithm, which only has access to samples from the novel classes in S_b, is evaluated on the joint query set Q_{a+b}.

Meta-Learning Stage: In meta-training, we iteratively sample few-shot episodes E and try to learn the meta-parameters in order to minimize the joint prediction loss on Q_{a+b}. In particular, we design a regularizer R(·, θ) such that the fast weights are learned via minimizing the loss ℓ(W_b, S_b) + R(W_b, θ), where ℓ(W_b, S_b) is typically the cross-entropy loss for few-shot classification. The meta-learner tries to learn meta-parameters θ such that the optimal fast weights W*_b w.r.t. the above loss function perform well on Q_{a+b}. In our model, meta-parameters θ are encapsulated in our attention attractor network, which produces regularizers for the fast weights in the few-shot learning objective.

Joint Prediction on Base and Novel Classes: We now introduce the details of our joint prediction framework performed in each few-shot episode.
First, we construct an episodic classifier, e.g., a logistic regression (LR) model or a multi-layer perceptron (MLP), which takes the learned image features as inputs and classifies them according to the few-shot classes.
During training on the support set S_b, we learn the fast weights W_b via minimizing the following regularized cross-entropy objective, which we call the episodic objective:

    L^S(W_b, θ) = −(1/NK') Σ_{i=1}^{NK'} Σ_{c=K+1}^{K+K'} y^S_{b,i,c} log ŷ^S_{b,i,c} + R(W_b, θ).    (1)

This is a general formulation and the specific functional form of the regularization term R(W_b, θ) will be specified later. The predicted output is obtained via ŷ^S_{b,i} = softmax([W_a^⊤ x_{b,i}, h(x_{b,i}; W_b)]), where h is our classification network and W_b are the fast weights in the network. In the case of LR, h is a linear model: h(x_{b,i}; W_b) = W_b^⊤ x_{b,i}; h can also be an MLP for more expressive power.
During testing on the query set Q_{a+b}, in order to predict both base and novel classes, we directly augment the softmax with the fixed base class weights W_a: ŷ^Q_i = softmax([W_a^⊤ x_i, h(x_i; W*_b)]), where W*_b are the optimal parameters that minimize the regularized classification objective in Eq. (1).

3.2 Attention Attractor Networks

Directly learning the few-shot episode, e.g., by setting R(W_b, θ) to zero or a simple weight decay, can cause catastrophic forgetting of the base classes. This is because W_b, which is trained to maximize the correct novel class probability, can dominate the base classes in the joint prediction. In this section, we introduce the Attention Attractor Network to address this problem.
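Before detailing the regularizer, the augmented softmax and the episodic objective of Eq. (1) can be sketched in a few lines of numpy. This is a minimal sketch assuming the linear (LR) case for h, with illustrative names and shapes, not the released implementation; the regularizer is passed in as an opaque callable.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_predict(x, W_a, W_b):
    """Joint (K + K')-way prediction: concatenate the frozen base logits
    W_a^T x with the episodic logits h(x; W_b) = W_b^T x (the LR case)."""
    logits = np.concatenate([x @ W_a, x @ W_b], axis=-1)
    return softmax(logits)

def episodic_loss(x_s, y_s, W_a, W_b, reg):
    """Regularized cross-entropy on the support set, as in Eq. (1).
    y_s holds indices into the K' novel classes (offset by K = #base)."""
    K = W_a.shape[1]
    p = joint_predict(x_s, W_a, W_b)
    nll = -np.log(p[np.arange(len(y_s)), K + y_s]).mean()
    return nll + reg(W_b)
```

In this sketch only `W_b` would be optimized per episode; `W_a` stays fixed from pretraining, mirroring the slow/fast weight split described above.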
The key feature of our attractor network is the regularization term R(W_b, θ):

    R(W_b, θ) = Σ_{k'=1}^{K'} (W_{b,k'} − u_{k'})^⊤ diag(exp(γ)) (W_{b,k'} − u_{k'}),    (2)

where u_{k'} is the so-called attractor and W_{b,k'} is the k'-th column of W_b. This sum of squared Mahalanobis distances from the attractors adds a bias to the learning signal arriving solely from novel classes. Note that for a classifier such as an MLP, one can extend this regularization term in a layer-wise manner. Specifically, one can have separate attractors per layer, and the number of attractors equals the output dimension of that layer.
To ensure that the model performs well on base classes, the attractors u_{k'} must contain some information about examples from base classes. Since we cannot directly access these base examples, we propose to use the slow weights to encode such information. Specifically, each base class has a learned attractor vector U_k stored in the memory matrix U = [U_1, ..., U_K]. It is computed as U_k = f_φ(W_{a,k}), where f is an MLP whose learnable parameters are φ. For each novel class k', its classifier is regularized towards its attractor u_{k'}, which is a weighted sum of the U_k vectors. Intuitively, the weighting is an attention mechanism where each novel class attends to the base classes according to the level of interference, i.e.,
how prediction of the new class k' causes forgetting of base class k.
For each class in the support set, we compute the cosine similarity between the average representation of the class and the base weights W_a, then normalize using a softmax function:

    a_{k',k} = exp(τ A((1/N) Σ_j h_j 1[y_{b,j} = k'], W_{a,k})) / Σ_k exp(τ A((1/N) Σ_j h_j 1[y_{b,j} = k'], W_{a,k})),    (3)

where A is the cosine similarity function, h_j are the representations of the inputs in the support set S_b, and τ is a learnable temperature scalar. a_{k',k} encodes a normalized pairwise attention matrix between the novel classes and the base classes. The attention vector is then used to compute a linear weighted sum of entries in the memory matrix U: u_{k'} = Σ_k a_{k',k} U_k + U_0, where U_0 is an embedding vector and serves as a bias for the attractor.
Our design takes inspiration from attractor networks [22, 42], where for each base class one learns an "attractor" that stores the relevant memory regarding that class. We call our full model "dynamic attractors", as they may vary with each episode even after meta-learning. In contrast, if we only have the bias term U_0, i.e., a single attractor which is shared by all novel classes, it will not change after meta-learning from one episode to the other. We call this model variant the "static attractor".
In summary, our meta-parameters θ include φ, U_0, γ and τ, which is on the same scale as the number of parameters in W_a. It is important to note that R(W_b, θ) is convex w.r.t. W_b. Therefore, if we use the LR model as the classifier, the overall training objective on episodes in Eq. (1) is convex, which implies that the optimum W*_b(θ, S_b) is guaranteed to be unique and achievable. Here we emphasize that the optimal parameters W*_b are functions of the parameters θ and the few-shot samples S_b.
During meta-learning, θ are updated to minimize the expected loss on the query set Q_{a+b}, which contains both base and novel classes, averaging over all few-shot learning episodes:

    min_θ E_E[L^Q(θ, S_b)] = E_E[ −Σ_{j=1}^{M(K+K')} Σ_{c=1}^{K+K'} y_{j,c} log ŷ_{j,c}(θ, S_b) ],    (4)

where the predicted class is ŷ_j(θ, S_b) = softmax([W_a^⊤ x_j, h(x_j; W*_b(θ, S_b))]).

Algorithm 1 Meta-Learning for Incremental Few-Shot Learning
Require: θ_0, D_a, D_b, h
Ensure: θ
 1: θ ← θ_0;
 2: for t = 1 ... T do
 3:   {(x^S_b, y^S_b)}, {(x^Q_b, y^Q_b)} ← GetEpisode(D_b);
 4:   {x^Q_{a+b}, y^Q_{a+b}} ← GetMiniBatch(D_a) ∪ {(x^Q_b, y^Q_b)};
 5:   repeat
 6:     L^S ← −(1/NK') Σ_i y^S_{b,i} log ŷ^S_{b,i} + R(W_b; θ);
 7:     W_b ← OptimizerStep(W_b, ∇_{W_b} L^S);
 8:   until W_b converges
 9:   ŷ^Q_{a+b,j} ← softmax([W_a^⊤ x^Q_{a+b,j}, h(x^Q_{a+b,j}; W_b)]);
10:   L^Q ← −(1/(2NK')) Σ_j y^Q_{a+b,j} log ŷ^Q_{a+b,j};
11:   // Back-propagate through the above optimization via RBP
12:   // A dummy gradient descent step
13:   W'_b ← W_b − α ∇_{W_b} L^S;
14:   J ← ∂W'_b/∂W_b; v ← ∂L^Q/∂W_b; g ← v;
15:   repeat
16:     v ← J^⊤ v − ε v; g ← g + v;
17:   until g converges
18:   θ ← OptimizerStep(θ, g^⊤ ∂W'_b/∂θ);
19: end for

Table 1: Comparison of our proposed model with other methods

Method       | Few-shot learner           | Episodic objective             | Attention mechanism
Imprint [26] | Prototypes                 | N/A                            | N/A
LwoF [9]     | Prototypes + base classes  | N/A                            | Attention on base classes
Ours         | A fully trained classifier | Cross entropy on novel classes | Attention on learned attractors

3.3 Learning via Recurrent Back-Propagation

As there is no closed-form solution to the episodic objective (the optimization problem in Eq. (1)), in each episode we need to minimize L^S to obtain W*_b through an iterative optimizer. The question is how to efficiently compute ∂W*_b/∂θ, i.e., how to back-propagate through the optimization. One option is to unroll the iterative optimization process in the computation graph and use back-propagation through time (BPTT) [38]. However, the number of iterations for a gradient-based optimizer to converge can be on the order of thousands, and BPTT can be computationally prohibitive. Another way is to use truncated BPTT [39] (T-BPTT), which optimizes for T steps of gradient-based optimization and is commonly used in meta-learning problems.
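To make the attention attractor concrete, the computations in Eqs. (2) and (3) can be sketched in numpy as below. Shapes and names are illustrative assumptions; the MLP f_φ that produces the memory matrix U from W_a is omitted, and U is taken as given.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity A(a, b) between two vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def attention_attractors(h_s, y_s, W_a, U, U0, tau):
    """Eq. (3): each novel class k' attends to the K base classes.
    h_s: [N*K', D] support features; y_s: novel labels in {0..K'-1};
    W_a: [D, K] base weights; U: [K, D] per-base-class attractor memory;
    U0: [D] shared bias attractor; tau: temperature."""
    K, Kp = W_a.shape[1], int(y_s.max()) + 1
    u = np.zeros((Kp, U.shape[1]))
    for kp in range(Kp):
        proto = h_s[y_s == kp].mean(axis=0)          # class-average representation
        sims = np.array([tau * cosine(proto, W_a[:, k]) for k in range(K)])
        a = np.exp(sims - sims.max())
        a /= a.sum()                                 # softmax attention a_{k',k}
        u[kp] = a @ U + U0                           # attractor u_{k'}
    return u

def attractor_reg(W_b, u, log_gamma):
    """Eq. (2): sum of squared Mahalanobis distances of the columns of
    W_b ([D, K']) to their attractors u ([K', D]), scaled by exp(gamma)."""
    diff = W_b.T - u
    return float((diff ** 2 * np.exp(log_gamma)).sum())
```

The regularizer vanishes exactly when each fast-weight column sits at its attractor, which is the "pull" behavior the attractor interpretation suggests.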
However, when T is small the training objective can be significantly biased.
Alternatively, the recurrent back-propagation (RBP) algorithm [1, 25, 18] allows us to back-propagate through the fixed point efficiently, without unrolling the computation graph and storing intermediate activations. Consider a vanilla gradient descent process on W_b with step size α. The difference between two steps can be written as Φ(W^(t)_b) = W^(t)_b − F(W^(t)_b), where F(W^(t)_b) = W^(t+1)_b = W^(t)_b − α∇L^S(W^(t)_b). Since Φ(W*_b(θ)) is identically zero as a function of θ, using the implicit function theorem we have ∂W*_b/∂θ = (I − J^⊤_{F,W*_b})^{−1} ∂F/∂θ, where J_{F,W*_b} denotes the Jacobian matrix of the mapping F evaluated at W*_b. Algorithm 1 outlines the key steps for learning the episodic objective using RBP in the incremental few-shot learning setting. Note that the RBP algorithm implicitly inverts (I − J^⊤) by computing a matrix-inverse vector product, and has the same time complexity as truncated BPTT given the same number of unrolled steps, but RBP does not have to store intermediate activations.
Damped Neumann RBP. To compute the matrix-inverse vector product (I − J^⊤)^{−1}v, [18] propose to use the Neumann series: (I − J^⊤)^{−1}v = Σ_{n=0}^{∞} (J^⊤)^n v ≡ Σ_{n=0}^{∞} v^(n). Note that J^⊤v can be computed by standard back-propagation. However, directly applying the Neumann RBP algorithm sometimes leads to numerical instability. Therefore, we propose to add a damping term 0 < ε < 1 to I − J^⊤. This results in the following update: ṽ^(n) = (J^⊤ − εI)^n v.
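The damped Neumann iteration above can be sketched as follows. This is a minimal sketch: `jvp` stands in for a function computing J^⊤v (in an autodiff framework, one backward pass), and the damping means the series converges to (I − J^⊤ + εI)^{−1}v rather than the undamped inverse.

```python
import numpy as np

def damped_neumann_rbp(jvp, v, eps=0.1, n_steps=20):
    """Approximate g = (I - J^T + eps*I)^{-1} v with the damped Neumann
    series: v^(n+1) = (J^T - eps*I) v^(n), g = sum_n v^(n).
    `jvp(v)` must return J^T v (one standard back-propagation pass)."""
    g, cur = v.copy(), v.copy()
    for _ in range(n_steps):
        cur = jvp(cur) - eps * cur  # next damped Neumann term
        g = g + cur                 # accumulate the series
    return g
```

For example, with the scalar map J^⊤v = 0.5v and ε = 0.1, each term shrinks by a factor of 0.4, so the accumulated sum approaches v/0.6, matching the closed-form (1 + 0.1 − 0.5)^{−1}.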
In practice, we found the damping term with ε = 0.1 helps alleviate the issue significantly.

4 Experiments

We experiment on two few-shot classification datasets, mini-ImageNet and tiered-ImageNet. Both are subsets of ImageNet [30], with image sizes reduced to 84 × 84 pixels. We also modified the datasets to accommodate the incremental few-shot learning settings.¹

4.1 Datasets

• mini-ImageNet Proposed by [36], mini-ImageNet contains 100 object classes and 60,000 images. We used the splits proposed by [27], where training, validation, and testing have 64, 16 and 20 classes respectively.
• tiered-ImageNet Proposed by [29], tiered-ImageNet is a larger subset of ILSVRC-12. It features a categorical split among training, validation, and testing subsets. The categorical split means that classes that belong to the same high-level category, e.g., "working dog" and "terrier" or some other dog breed, are not split between training, validation and test. This is a harder task, but one that more strictly evaluates generalization to new classes. It is also an order of magnitude larger than mini-ImageNet.

4.2 Experiment setup

We use a standard ResNet backbone [11] to learn the feature representation through supervised training. For mini-ImageNet experiments, we follow [21] and use a modified version of ResNet-10. For tiered-ImageNet, we use the standard ResNet-18 [11], but replace all batch normalization [12] layers with group normalization [40], as there is a large distributional shift from training to testing in tiered-ImageNet due to the categorical splits. We use standard data augmentation, with random crops and horizontal flips. We use the same pretrained checkpoint as the starting point for meta-learning.
In the meta-learning stage as well as the final evaluation, we sample a few-shot episode from D_b, together with a regular mini-batch from D_a. The base class images are added to the query set of the few-shot episode. The base and novel classes are maintained in equal proportion in our experiments. For all the experiments, we consider 5-way classification with 1 or 5 support examples (i.e., shots). In the experiments, we use a query set of size 25 × 2 = 50.
We use L-BFGS [43] to solve the inner loop of our models to make sure W_b converges. We use the ADAM [14] optimizer for meta-learning with a learning rate of 1e-3, which decays by a factor of 10 after 4,000 steps, for a total of 8,000 steps.

¹Code released at: https://github.com/renmengye/inc-few-shot-attractor-public

Table 2: mini-ImageNet 64+5-way results

Model         | 1-shot Acc. ↑ | Δ ↓    | 5-shot Acc. ↑ | Δ ↓
ProtoNet [33] | 42.73 ± 0.15  | -20.21 | 57.05 ± 0.10  | -31.72
Imprint [26]  | 41.10 ± 0.20  | -22.49 | 44.68 ± 0.23  | -27.68
LwoF [9]      | 52.37 ± 0.20  | -13.65 | 59.90 ± 0.20  | -14.18
Ours          | 54.95 ± 0.30  | -11.84 | 63.04 ± 0.30  | -10.66

Table 3: tiered-ImageNet 200+5-way results

Model         | 1-shot Acc. ↑ | Δ ↓    | 5-shot Acc. ↑ | Δ ↓
ProtoNet [33] | 30.04 ± 0.21  | -29.54 | 41.38 ± 0.28  | -26.39
Imprint [26]  | 39.13 ± 0.15  | -22.26 | 53.60 ± 0.18  | -16.35
LwoF [9]      | 52.40 ± 0.33  | -8.27  | 62.63 ± 0.31  | -6.72
Ours          | 56.11 ± 0.33  | -6.11  | 65.52 ± 0.31  | -4.48

Δ = average decrease in accuracy caused by joint prediction within base and novel classes (Δ = ½(Δ_a + Δ_b)); ↑ (↓) indicates higher (lower) is better.
We \ufb01x recurrent backpropagation to 20 iterations and\n\u0001 = 0.1.\nWe study two variants of the classi\ufb01er network. The \ufb01rst is a logistic regression model with a single\nweight matrix Wb. The second is a 2-layer fully connected MLP model with 40 hidden units in the\nmiddle and tanh non-linearity. To make training more ef\ufb01cient, we also add a shortcut connection in\nour MLP, which directly links the input to the output. In the second stage of training, we keep all\nbackbone weights frozen and only train the meta-parameters \u03b8.\n\n4.3 Evaluation metrics\n\nWe consider the following evaluation metrics: 1) overall accuracy on individual query sets and the\njoint query set (\u201cBase\u201d, \u201cNovel\u201d, and \u201cBoth\u201d); and 2) decrease in performance caused by joint\nprediction within the base and novel classes, considered separately (\u201c\u2206a\u201d and \u201c\u2206b\u201d). Finally we take\nthe average \u2206 = 1\n\n2 (\u2206a + \u2206b) as a key measure of the overall decrease in accuracy.\n\n4.4 Comparisons\n\nWe implemented and compared to three methods. First, we adapted Prototypical Networks [33]\nto incremental few-shot settings. For each base class we store a base representation, which is the\naverage representation (prototype) over all images belonging to the base class. During the few-shot\nlearning stage, we again average the representation of the few-shot classes and add them to the bank\nof base representations. Finally, we retrieve the nearest neighbor by comparing the representation of\na test image with entries in the representation store. In summary, both Wa and Wb are stored as the\naverage representation of all images seen so far that belong to a certain class. 
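The adapted Prototypical Networks baseline described above amounts to storing class-mean representations and classifying by nearest prototype. A minimal numpy sketch of that baseline (illustrative names, not the authors' implementation):

```python
import numpy as np

def build_prototypes(feats, labels, n_classes):
    """Average representation (prototype) per class; in the adapted
    ProtoNet baseline both W_a and W_b are stored this way."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])

def nearest_prototype(query, protos):
    """Classify a query feature by retrieving the nearest stored prototype."""
    dists = np.linalg.norm(protos - query, axis=1)
    return int(dists.argmin())
```

At few-shot time, prototypes of the novel classes would simply be appended to the stored base prototypes before nearest-neighbor retrieval.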
We also compare to the following methods:
• Weights Imprinting ("Imprint") [26]: the base weights Wa are learned regularly through supervised pre-training, and Wb are computed using prototypical averaging.
• Learning without Forgetting ("LwoF") [9]: Similar to [26], Wb are computed using prototypical averaging. In addition, Wa is finetuned during episodic meta-learning. We implemented the most advanced variant proposed in the paper, which involves a class-wise attention mechanism. This model is the previous state-of-the-art method on incremental few-shot learning, and has better performance compared to other low-shot models [37, 10].

4.5 Results

We first evaluate our vanilla approach on the standard few-shot classification benchmark, where no base classes are present in the query set. Our vanilla model consists of a pretrained CNN and a single-layer logistic regression with weight decay, learned from scratch; this model performs on par with other competitive meta-learning approaches (1-shot 55.40 ± 0.51, 5-shot 70.17 ± 0.46). Note that our model uses the same backbone architecture as [21] and [9], and is directly comparable with their results. Similar findings of strong results using simple logistic regression on few-shot classification benchmarks were also recently reported in [6]. Our full model performs similarly to the vanilla model on pure few-shot benchmarks, and the full table is available in the Supplementary Materials.

Next, we compare our models to other methods on incremental few-shot learning benchmarks in Tables 2 and 3. On both benchmarks, our best performing model shows a significant margin over the prior works that predict the prototype representation without using an iterative optimization [33, 26, 9].

Table 4: Ablation studies on mini-ImageNet

            1-shot                     5-shot
            Acc. ↑          ∆ ↓        Acc. ↑          ∆ ↓
LR          52.74 ± 0.24   -13.95      60.34 ± 0.20   -10.44
LR +S       53.63 ± 0.30   -12.53      62.50 ± 0.30    -6.88
LR +A       55.31 ± 0.32   -11.72      63.00 ± 0.29    -6.07
MLP         49.36 ± 0.29   -16.78      60.85 ± 0.29   -10.61
MLP +S      54.46 ± 0.31   -11.74      62.79 ± 0.31    -6.28
MLP +A      54.95 ± 0.30   -11.84      63.04 ± 0.30    -6.11

Table 5: Ablation studies on tiered-ImageNet

            1-shot                     5-shot
            Acc. ↑          ∆ ↓        Acc. ↑          ∆ ↓
LR          48.84 ± 0.23   -13.60      62.08 ± 0.20    -8.00
LR +S       55.36 ± 0.32   -11.29      65.53 ± 0.30    -4.68
LR +A       55.98 ± 0.32   -10.80      65.58 ± 0.29    -4.39
MLP         41.22 ± 0.35   -12.62      62.70 ± 0.31    -7.44
MLP +S      56.16 ± 0.32   -10.77      65.80 ± 0.31    -4.58
MLP +A      56.11 ± 0.33   -10.66      65.52 ± 0.31    -4.48

"+S" stands for static attractors, and "+A" for attention attractors.

Figure 2: Learning the proposed model using truncated BPTT vs. RBP. Models are evaluated with 1-shot (left) and 5-shot (right) 64+5-way episodes, with varying numbers of gradient descent steps.

4.6 Ablation studies

To understand the effectiveness of each part of the proposed model, we consider the following variants:
• Vanilla ("LR", "MLP") optimizes a logistic regression or an MLP network at each few-shot episode, with a weight decay regularizer.
• Static attractor ("+S") learns a fixed attractor center u and attractor slope γ shared across all classes.
• Attention attractor ("+A") learns the full attention attractor model. For MLP models, the weights below the final layer are controlled by attractors predicted by the average representation across all the episodes. fφ is an MLP with one hidden layer of 50 units.

Tables 4 and 5 show the ablation experiment results.
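As a rough illustration of the "+S" variant above, the episodic objective can add a quadratic penalty pulling each column of the fast weights toward a learned center u with per-dimension slopes exp(γ). This is a hedged NumPy sketch of that idea only; the exact parameterization of the attractor energy in our model may differ, and all names and shapes here are illustrative.

```python
import numpy as np

def attractor_penalty(W, u, log_gamma):
    """Quadratic attractor regularizer.

    Each column of the fast weights W (shape D x K, one column per novel
    class) is pulled toward the attractor center u (shape D) with positive
    per-dimension slopes exp(log_gamma).  A *static* attractor shares
    (u, log_gamma) across all classes; the attention attractor would instead
    predict a per-class center from the base-class weights.
    """
    gamma = np.exp(log_gamma)            # slopes, guaranteed positive
    diff = W - u[:, None]                # (D, K) deviations from the center
    return float(np.sum(gamma[:, None] * diff ** 2))

# Toy check: fast weights sitting exactly at the center incur zero penalty.
D, K = 3, 5
u = np.ones(D)
log_gamma = np.zeros(D)                  # slopes = exp(0) = 1
W_at_center = np.tile(u[:, None], (1, K))
print(attractor_penalty(W_at_center, u, log_gamma))  # 0.0
```

During meta-learning, (u, log_gamma) would be treated as meta-parameters θ and updated through the converged inner optimization, which is where recurrent back-propagation comes in.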
In all cases, the learned regularization function shows better performance than a manually set weight decay constant on the classifier network, both in jointly predicting base and novel classes and in suffering less degradation relative to individual prediction. On mini-ImageNet, our attention attractors have a clear advantage over static attractors. Formulating the classifier as an MLP network is slightly better than the linear models in our experiments. Although the final performance is similar, our RBP-based algorithm has the flexibility of adding a fast episodic model with more capacity. Unlike [4], we do not rely on an analytic form of the gradients of the optimization process.

Comparison to truncated BPTT (T-BPTT)  An alternative way to learn the regularizer is to unroll the inner optimization for a fixed number of steps in a differentiable computation graph, and then back-propagate through time. Truncated BPTT is a popular learning algorithm in many recent meta-learning approaches [2, 27, 7, 34, 3]. As shown in Figure 2, the performance of T-BPTT-learned models is comparable to ours; however, when solved to convergence at test time, the performance of T-BPTT models drops significantly. This is expected, as they are only guaranteed to work well for a certain number of steps, and fail to learn a good regularizer.

Figure 3: Visualization of a 5-shot 64+5-way episode using PCA. Left (a, Ours): our attractor model learns to "pull" prototypes (large colored circles) towards base class weights (white circles); we visualize the trajectories during episodic training. Right (b): dynamic few-shot learning without forgetting (LwoF) [9].
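To make the T-BPTT vs. RBP contrast concrete, the sketch below differentiates through a *converged* inner optimization on a scalar toy problem, using the truncated Neumann-series form of recurrent back-propagation in the spirit of [18]. All constants are illustrative, and the scalar inner objective is a stand-in for the episodic few-shot loss; nothing here reproduces the paper's actual model.

```python
import numpy as np

# Toy inner problem: f(w) = 0.5*(a + lam)*w**2 - b*w, where lam plays the
# role of a meta-parameter (e.g. a regularizer strength).  Gradient descent
# on f converges to the fixed point w* = b / (a + lam).
a, b, lam, eta, target = 2.0, 1.0, 0.5, 0.1, 0.3

# 1) Solve the inner problem to (near) convergence, as our model does.
w = 0.0
for _ in range(500):
    w = w - eta * ((a + lam) * w - b)      # update map Phi(w)

# 2) RBP: differentiate through the fixed point via a Neumann series,
#    dw*/dlam = (1 - dPhi/dw)^(-1) * dPhi/dlam = sum_k J^k * v.
J = 1.0 - eta * (a + lam)                  # dPhi/dw at the fixed point
v = -eta * w                               # dPhi/dlam at the fixed point
dw_dlam, term = 0.0, v
for _ in range(50):                        # truncated Neumann series
    dw_dlam += term
    term *= J

# 3) Meta-gradient of an outer loss L = 0.5*(w* - target)**2 w.r.t. lam.
meta_grad = (w - target) * dw_dlam

# Compare against the analytic answer dw*/dlam = -b / (a + lam)**2.
analytic = (b / (a + lam) - target) * (-b / (a + lam) ** 2)
print(abs(meta_grad - analytic) < 1e-6)    # True
```

Unlike T-BPTT, this computation only needs the Jacobian of the update map at the fixed point, so its cost does not grow with the number of inner steps; the trade-off is that the Neumann series only converges when the inner dynamics are contractive (|J| < 1 here).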
While an early-stopped T-BPTT model can do equally well, in practice it is hard to tell when to stop; for the RBP model, in contrast, running the full episodic training to convergence is fast since the number of support examples is small.

Visualization of attractor dynamics  We visualize attractor dynamics in Figure 3. Our learned attractors pull the fast weights towards the base class weights. In comparison, [9] only modifies the prototypes slightly.

Varying the number of base classes  While the framework proposed in this paper cannot be directly applied to class-incremental continual learning, as there is no module for memory consolidation, we can simulate the continual learning process by varying the number of base classes, to see how the proposed models are affected by different stages of continual learning. Figure 4 shows that the learned regularizers consistently improve over baselines with weight decay only. The overall accuracy increases from 50 to 150 base classes due to better representations in the backbone network, and drops at 200 classes due to a more challenging classification task.

Figure 4: Results on tiered-ImageNet with {50, 100, 150, 200} base classes.

5 Conclusion

Incremental few-shot learning, the ability to jointly predict based on a set of pre-defined concepts as well as additional novel concepts, is an important step towards making machine learning models more flexible and usable in everyday life. In this work, we propose an attention attractor model, which regulates a per-episode training objective by attending to the set of base classes. We show that our iterative model, which solves the few-shot objective to convergence, is better than baselines that do one-step inference, and that recurrent back-propagation is an effective and modular tool for learning in a general meta-learning setting, whereas truncated back-propagation through time fails to learn functions that converge well.
Future directions of this work include sequential iterative learning of few-shot novel concepts, and hierarchical memory organization.

Acknowledgment  Supported by NSERC and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

References

[1] L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the 1st International Conference on Neural Networks, volume 2, pages 609–618. IEEE, 1987.

[2] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems 29 (NIPS), 2016.

[3] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. MetaReg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.

[4] L. Bertinetto, J. F. Henriques, P. H. S. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. CoRR, abs/1805.08136, 2018.

[5] F. M. Castro, M. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari. End-to-end incremental learning.
In European Conference on Computer Vision (ECCV), 2018.

[6] W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang. A closer look at few-shot classification. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.

[7] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[8] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[9] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[10] B. Hariharan and R. B. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

[13] R. Kemker and C. Kanan. FearNet: Brain-inspired model for incremental learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

[15] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.
Proceedings of the National Academy of Sciences (PNAS), page 201611835, 2017.

[16] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

[17] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society (CogSci), 2011.

[18] R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. Yoon, X. Pitkow, R. Urtasun, and R. S. Zemel. Reviving and improving recurrent back-propagation. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[19] J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.

[20] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.

[21] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[22] M. C. Mozer. Attractor networks. The Oxford Companion to Consciousness, pages 86–89, 2009.

[23] D. K. Naik and R. Mammone. Meta-neural networks that learn by learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), 1992.

[24] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[25] F. J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229, 1987.

[26] H. Qi, M.
Brown, and D. G. Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[27] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

[28] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[29] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[31] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. One-shot learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

[32] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Diplomarbeit, Technische Universität München, München, 1987.

[33] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30 (NIPS), 2017.

[34] P. Sprechmann, S. M. Jayakumar, J. W. Rae, A. Pritzel, A. P. Badia, B. Uria, O. Vinyals, D. Hassabis, R. Pascanu, and C. Blundell. Memory-based parameter adaptation. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[35] S. Thrun.
Lifelong learning algorithms. In Learning to Learn, pages 181–209. Springer, 1998.

[36] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29 (NIPS), 2016.

[37] Y. Wang, R. B. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[38] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[39] R. J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490–501, 1990.

[40] Y. Wu and K. He. Group normalization. In European Conference on Computer Vision (ECCV), 2018.

[41] Y. Wu, M. Ren, R. Liao, and R. B. Grosse. Understanding short-horizon bias in stochastic meta-optimization. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[42] R. S. Zemel and M. C. Mozer. Localist attractor networks. Neural Computation, 13(5):1045–1064, 2001.

[43] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.