{"title": "TADAM: Task dependent adaptive metric for improved few-shot learning", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 731, "abstract": "Few-shot learning has become essential for producing models that generalize from few examples. In this work, we identify that metric scaling and metric task conditioning are important to improve the performance of few-shot algorithms. Our analysis reveals that simple metric scaling completely changes the nature of few-shot algorithm parameter updates. Metric scaling provides improvements up to 14% in accuracy for certain metrics on the mini-Imagenet 5-way 5-shot classification task. We further propose a simple and effective way of conditioning a learner on the task sample set, resulting in learning a task-dependent metric space. Moreover, we propose and empirically test a practical end-to-end optimization procedure based on auxiliary task co-training to learn a task-dependent metric space. The resulting few-shot learning model based on the task-dependent scaled metric achieves state of the art on mini-Imagenet. We confirm these results on another few-shot dataset that we introduce in this paper based on CIFAR100.", "full_text": "TADAM: Task dependent adaptive metric for\n\nimproved few-shot learning\n\nBoris N. Oreshkin\n\nElement AI\n\nboris@elementai.com\n\nPau Rodriguez\n\nElement AI, CVC-UAB\n\npau.rodriguez@elementai.com\n\nallac@elementai.com\n\nAlexandre Lacoste\n\nElement AI\n\nAbstract\n\nFew-shot learning has become essential for producing models that generalize\nfrom few examples. In this work, we identify that metric scaling and metric task\nconditioning are important to improve the performance of few-shot algorithms.\nOur analysis reveals that simple metric scaling completely changes the nature of\nfew-shot algorithm parameter updates. 
Metric scaling provides improvements\nup to 14% in accuracy for certain metrics on the mini-Imagenet 5-way 5-shot\nclassi\ufb01cation task. We further propose a simple and effective way of conditioning a\nlearner on the task sample set, resulting in learning a task-dependent metric space.\nMoreover, we propose and empirically test a practical end-to-end optimization\nprocedure based on auxiliary task co-training to learn a task-dependent metric\nspace. The resulting few-shot learning model based on the task-dependent scaled\nmetric achieves state of the art on mini-Imagenet. We con\ufb01rm these results on\nanother few-shot dataset that we introduce in this paper based on CIFAR100.\n\n1\n\nIntroduction\n\nHumans can learn to identify new categories from few examples, even from a single one [2]. Few-shot\nlearning has recently attracted signi\ufb01cant attention [33, 28, 29, 24, 17, 16], as it aims to produce\nmodels that can generalize from small amounts of labeled data. In the few-shot setting, one aims to\nlearn a model that extracts information from a set of support examples (sample set) to predict the\nlabels of instances from a query set. Recently, this problem has been reframed into the meta-learning\nframework [22], i.e. the model is trained so that given a sample set or task, produces a classi\ufb01er for\nthat speci\ufb01c task. Thus, the model is exposed to different tasks (or episodes) during the training\nphase, and it is evaluated on a non-overlapping set of new tasks [33].\nTwo recent approaches have attracted signi\ufb01cant attention in the few-shot learning domain: Matching\nNetworks [33], and Prototypical Networks [28]. In both approaches, the sample set and the query set\nare embedded with a neural network, and nearest neighbor classi\ufb01cation is used given a metric in the\nembedded space. Since then, the problem of learning the most suitable metric for few-shot learning\nhas been of interest to the \ufb01eld [33, 28, 29, 17, 16]. 
Learning a metric space in the context of few-shot\nlearning generally implies identifying a suitable similarity measure (e.g. cosine or Euclidean), a\nfeature extractor mapping raw inputs onto similarity space (e.g. convolutional stack for images or\nLSTM stack for text), a cost function to drive the parameter updates, and a training scheme (often\nepisodic). Although the individual components in this list have been explored, the relationships\nbetween them have not received considerable attention.\nIn the current work we aim to close this gap. We show that taking into account the interaction\nbetween the identi\ufb01ed components leads to signi\ufb01cant improvements in the few-shot generalization.\nIn particular, we show that a non-trivial interaction between the similarity metric and the cost function\ncan be exploited to improve the performance of a given similarity metric via scaling. Using this\nmechanism we close more than the 10% gap in performance between the cosine similarity and\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fthe Euclidean distance reported in [28]. Even more importantly, we extend the very notion of the\nmetric space by making it task dependent via conditioning the feature extractor on the speci\ufb01c task.\nHowever, learning such a space is in general more challenging than learning a static one. Hence,\nwe \ufb01nd a solution in exploiting the interaction between the conditioned feature extractor and the\ntraining procedure based on auxiliary co-training on a simpler task. Our proposed few-shot learning\narchitecture based on task-dependent scaled metric achieves superior performance on two challenging\nfew-shot image classi\ufb01cation datasets. It shows up to 8.5% absolute accuracy improvement over the\nbaseline (Snell et al. 
[28]), and 4.8% over the state-of-the-art [17] on the 5-shot, 5-way mini-Imagenet\nclassification task, reaching 76.7% of accuracy, which is the best-reported accuracy on this dataset.\n\n1.1 Background\nWe consider the episodic M-shot, K-way classification scenario. In this scenario, a learning algorithm\nis provided with a sample set $S = \\{(x_i, y_i)\\}_{i=1}^{MK}$ consisting of M examples for each of K classes and\na query set $Q = \\{(x_i, y_i)\\}_{i=1}^{q}$ for a task to be solved within a given episode. The sample set provides\nthe task information via observations $x_i \\in \\mathbb{R}^{D_x}$ and their respective class labels $y_i \\in \\{1, \\ldots, K\\}$.\nGiven the information in the sample set S, the learning algorithm is able to classify individual\nsamples from the query set Q. Next, we define a similarity measure $d : \\mathbb{R}^{D_z \\times D_z} \\to \\mathbb{R}$. Note that\nd does not have to satisfy the classical metric properties (non-negativity, symmetry, subadditivity)\nto be useful in the context of few-shot learning. The dimensionality of metric input, $D_z$, will most\nnaturally be related to the size of embedding created by a (deep) feature extractor $f_\\phi : \\mathbb{R}^{D_x} \\to \\mathbb{R}^{D_z}$,\nparameterized by $\\phi$, mapping x to z. Here $\\phi \\in \\mathbb{R}^{D_\\phi}$ is a list of parameters defining $f_\\phi$, e.g. a list of\nweights in a neural network. The set of representations $(f_\\phi(x_i), y_i), \\forall (x_i, y_i) \\in S$ can directly be\nused to solve the few-shot learning classification problem by association. For example, Matching\nnetworks [33] use sample-wise attention mechanism to perform kernel label regression. Instead,\nSnell et al. [28] defined a feature representation $c_k$ for each class k as the mean over embeddings\nbelonging to $S_k$: $c_k = \\frac{1}{K}\\sum_{x_i \\in S_k} f_\\phi(x_i)$. 
To learn $\\phi$, they minimize $-\\log p_\\phi(y = k|x)$ using the\nsoftmax over prototypes $c_k$ to define the likelihood: $p_\\phi(y = k|x) = \\mathrm{softmax}(-d(f_\\phi(x), c_k))$.\n\n1.2 Summary of contributions\nMetric Scaling: To our knowledge, this is the first study to (i) propose metric scaling to improve\nperformance of few-shot algorithms, (ii) mathematically analyze its effects on objective function\nupdates and (iii) empirically demonstrate its positive effects on few-shot performance.\nTask Conditioning: We use a task encoding network to extract a task representation based on the\ntask\u2019s sample set. This is used to influence the behavior of the feature extractor through FILM [19].\nAuxiliary task co-training: We show that co-training the feature extraction on a conventional\nsupervised classification task reduces training complexity and provides better generalization.\n\n1.3 Related work\nThree main approaches for solving the few-shot classification problem can be identified in the\nliterature. The first one, which is used in this work, is the meta-learning approach, i.e. learning a\nmodel that, given a task (set of labeled data), produces a classifier that generalizes across all tasks\n[31, 25]. This is the case of Matching Networks [33], which optionally use a Recurrent Neural\nNetwork (RNN) to accumulate information about a given task. In MAML [6], the parameters of an\narbitrary learner model are optimized so that they can be quickly adapted to a particular task. In\n\u201cOptimization as a model\u201d [22], a learner model is adapted to a new episodic task by a recurrent\nmeta-learner producing efficient parameter updates. A more general approach was proposed by Santoro\net al. [24], where the meta-learner is trained to represent entries from a sample set in an external\nmemory. Similarly, adaResNet [17] uses memory and the sample set to produce shift coefficients on\nthe neuron activations of the query set classifier. 
Many recent approaches focus on learning a metric\non the episodic feature space. Prototypical networks [28] use a feed-forward neural network to embed\nthe task examples and perform nearest neighbor classification with the class centroids. The relation\nnetwork approach by Sung et al. [29] introduces a separate learnable similarity metric. SNAIL\n[16] uses an explicit attention mechanism applicable both to supervised and to the sequence based\nreinforcement learning tasks. It has also been shown that these approaches benefit from leveraging\nunlabeled and simulated data [23, 34].\nA second approach aims to maximize the distance between examples from different classes [10].\nSimilarly, in [7], a contrastive loss function is used to learn to project data onto a manifold that is\ninvariant to deformations in the input space. In the same vein, in [5, 26, 30], triplet loss is used\nfor learning a representation for few-shot learning. The attentive recurrent comparators [27] go\nbeyond classical siamese approaches and use a recurrent architecture to learn to perform pairwise\ncomparisons and predict if the compared examples belong to the same class.\nThe third approach relies on Bayesian modeling of the prior distribution of the different categories\nlike in Li et al. [15], Bauer et al. [1], or Lake et al. [13], Edwards and Storkey [4], Lacoste et al. [12]\nwho rely on hierarchical Bayesian modeling.\nAs for task conditioning, [3, 18, 19] proposed conditional batch normalization for style transfer and\nvisual reasoning. Differently, we modify the conditioning scheme to adapt it to few-shot learning,\nintroducing $\\gamma_0$, $\\beta_0$ priors, and auxiliary co-training. In the few-shot learning context, task conditioning\nideas can be traced back to [33], although in an implicit form as there is no notion of task embedding.\nIn our work, we explicitly introduce a task representation (see Fig. 1) computed as the mean of the task\nclass centroids (task prototypes). 
This is much simpler than individual sample level LSTM/attention\nmodels in [33]. Conditioning in [33] is applied as a postprocessing of the output of a \ufb01xed feature\nextractor. We propose to condition the feature extractor by predicting its own batch normalization\nparameters thus making feature extractor behaviour task-dynamic without cumbersome \ufb01ne-tuning\non support set. In order to train the task conditioned architecture we use multitask training with\na usual 64-way classi\ufb01cation task. Even though auxiliary co-training is bene\ufb01cial for learning in\ngeneral, \u201clittle is known on when multitask learning works and whether there are data characteristics\nthat help to determine its success\u201d [20]. We show that combining task conditioning and auxiliary\nco-training is bene\ufb01cial in the context of few-shot learning.\nThe scaling and temperature adjustment in the softmax was discussed by Hinton et al. [9] in the\ncontext of model distillation. We propose to use it in the context of the few-shot learning scenario\nand provide novel theoretical and empirical results quantifying the effects of scaling parameter.\nThe rest of the paper is organized as follows. Section 2 describes our contributions in detail. Section 3\nhighlights the importance of each contribution via an ablation study. The study is performed over two\ndifferent benchmarks in the regime of 1-shot, 5-shot and 10-shot learning to verify if conclusions hold\nacross different setups. Finally, Section 4 concludes the paper and outlines future research directions.\n\n2 Model Description\n\n2.1 Metric Scaling\nSnell et al. [28] using approach described in detail in Section 1.1 found that the Euclidean distance\noutperformed the cosine distance used in Vinyals et al. [33]. 
We hypothesize that the improvement\ncould be directly attributed to the interaction of the different scaling of the metrics with the softmax.\nMoreover, the dimensionality of the output is known to have a direct impact on the output scale\neven for the Euclidean distance [32]. Hence, we propose to scale the distance metric by a learnable\ntemperature, $\\alpha$, $p_{\\phi,\\alpha}(y = k|x) = \\mathrm{softmax}(-\\alpha d(z, c_k))$, to enable the model to learn the best regime\nfor each similarity metric, thus improving the performances of all metrics. To further understand the\nrole of $\\alpha$, we analyze the class-wise cross-entropy loss function, $J_k(\\phi, \\alpha)$ (note that the total loss is simply $J(\\phi, \\alpha) = \\sum_k J_k(\\phi, \\alpha)$):\n\n$$J_k(\\phi, \\alpha) = \\sum_{x_i \\in Q_k} \\left[ \\alpha d(f_\\phi(x_i), c_k) + \\log \\sum_j \\exp(-\\alpha d(f_\\phi(x_i), c_j)) \\right], \\tag{1}$$\n\nwhere $Q_k = \\{(x_i, y_i) \\in Q : y_i = k\\}$ is the query set corresponding to the class k. Its gradient,\nwhich is used to update parameters is given by the following expression:\n\n$$\\frac{\\partial}{\\partial\\phi} J_k(\\phi, \\alpha) = \\alpha \\sum_{x_i \\in Q_k} \\left[ \\frac{\\partial}{\\partial\\phi} d(f_\\phi(x_i), c_k) - \\frac{\\sum_j \\exp(-\\alpha d(f_\\phi(x_i), c_j)) \\frac{\\partial}{\\partial\\phi} d(f_\\phi(x_i), c_j)}{\\sum_j \\exp(-\\alpha d(f_\\phi(x_i), c_j))} \\right]. \\tag{2}$$\n\nAt first glance, the effect of $\\alpha$ on the expression of the derivative is twofold: (i) an overall scaling,\nand (ii) regulating the sharpness of weighting in the second term inside the brackets on the RHS.\nBelow we explore the behavior of the $\\alpha$-normalized gradient (the effect of $\\alpha$-related gradient scaling is trivial) in the limits $\\alpha \\to 0$ and $\\alpha \\to \\infty$.\n\nFigure 1: Proposed few-shot architecture. Blocks with shared parameters have dashed border.\n\nLemma 1 (Metric scaling). 
If the following assumptions hold:\n\nA1: $d(f_\\phi(x), c_k) \\neq d(f_\\phi(x'), c_k), \\forall k, x \\neq x' \\in Q_k$; A2: $\\left| \\frac{\\partial}{\\partial\\phi} d(f_\\phi(x), c) \\right| < \\infty, \\forall x, c, \\phi$,\n\nthen it is true that:\n\n$$\\lim_{\\alpha \\to 0} \\frac{1}{\\alpha} \\frac{\\partial}{\\partial\\phi} J_k(\\phi, \\alpha) = \\sum_{x_i \\in Q_k} \\left[ \\frac{K-1}{K} \\frac{\\partial}{\\partial\\phi} d(f_\\phi(x_i), c_k) - \\frac{1}{K} \\sum_{j \\neq k} \\frac{\\partial}{\\partial\\phi} d(f_\\phi(x_i), c_j) \\right], \\tag{3}$$\n\n$$\\lim_{\\alpha \\to \\infty} \\frac{1}{\\alpha} \\frac{\\partial}{\\partial\\phi} J_k(\\phi, \\alpha) = \\sum_{x_i \\in Q_k} \\left[ \\frac{\\partial}{\\partial\\phi} d(f_\\phi(x_i), c_k) - \\frac{\\partial}{\\partial\\phi} d(f_\\phi(x_i), c_{j^*_i}) \\right]; \\tag{4}$$\n\nwhere $j^*_i = \\arg\\min_j d(f_\\phi(x_i), c_j)$.\n\nProof. Please refer to Appendix A.\n\nFrom Eq. (3), it is clear that for small $\\alpha$ values, the first term minimizes the embedding distance\nbetween query samples and their corresponding prototypes. The second term maximizes the embedding\ndistance between the samples and the prototypes of the non-belonging categories. For large $\\alpha$\nvalues (Eq. (4)), the first term is the same as in Eq. (3); while the second term maximizes the distance\nof the sample with the closest wrongly assigned prototype $c_{j^*_i}$ (if any). If $j^*_i = k$ (no error), the\nderivative contribution of the point $x_i$ is zero. This is equivalent to learning only from the hardest\nexamples resulting in association errors. Thus, the two different regimes of $\\alpha$ favor either minimizing\nthe overlap of the sample distributions or correcting cluster assignments sample-wise.\nThe large $\\alpha$ regime is more directly related to resolving the few-shot classification errors. At the\nsame time, the update strategy generated in this regime has a drawback. As the optimization proceeds\nand the classification accuracy increases, the number of incorrectly classified samples reduces on\naverage, and this leads to the reduction in the average effective batch size (more samples generate\nzero derivatives). Therefore, our hypothesis is that there is an optimal value of scaling parameter $\\alpha$\nfor a given combination of dataset, metric and task. 
Section 3.4 empirically demonstrates that the\noptimal value of $\\alpha$ indeed exists and it can be e.g. cross-validated on a validation set.\n\nTable 1: mini-Imagenet (Vinyals et al. [33]), 5-way classification results. \u2020Our re-implementation.\n\nMethod | 1-shot | 5-shot | 10-shot\nMeta Nets [22] | 43.4 | 60.6 | -\nMatching Networks [33] | 46.6 | 60.0 | -\nMAML [6] | 48.7 | 63.1 | -\nProto Nets [28] | 49.4 | 68.2 | 74.3\u2020\nRelation Net [29] | 50.4 | 65.3 | -\nSNAIL [16] | 55.7 | 68.9 | -\nDiscriminative k-shot [1] | 56.3 | 73.9 | 78.5\nadaResNet [17] | 56.9 | 71.9 | -\nOurs | 58.5 | 76.7 | 80.8\n\n2.2 Task conditioning\nUp until now we assumed the feature extractor $f_\\phi(\\cdot)$ to be task-independent. A dynamic task-conditioned\nfeature extractor should be better suited for finding correct associations between given\nsample set class representations and query samples; this is implicitly done by Vinyals et al. [33]\nwith a bidirectional LSTM as a postprocessing of a fixed feature extractor. Differently, we explicitly\ndefine a dynamic feature extractor $f_\\phi(x, \\Gamma)$, where $\\Gamma$ is the set of parameters predicted from a task\nrepresentation such that the performance of $f_\\phi(x, \\Gamma)$ is optimized given the task sample set S. This\nis related to the FILM conditioning layer [19] and conditional batch normalization [3, 18] of the form\n$h_{\\ell+1} = \\gamma \\odot h_\\ell + \\beta$, where $\\gamma$ and $\\beta$ are scaling and shift vectors applied to the layer $h_\\ell$. Concretely,\nwe propose to use the mean of the class prototypes as the task representation, $\\bar{c} = \\frac{1}{K}\\sum_k c_k$, encode\nit with a task embedding network (TEN), and predict layer-level element-wise scale and shift vectors\n$\\gamma$, $\\beta$ for each convolutional layer in the feature extractor (see Figures 1 and 2 in the Supplementary\nMaterials, Section S1). The task representation defined as the mean of task class centroids (i) reduces\nthe dimensionality of the TEN input and (ii) replaces expensive RNN/CNN/attention modeling. 
On\nthe other hand, it is an effective way to cluster tasks. Tasks having larger number of similar classes in\ncommon will tend to cluster closer in the task representation space.\nOur implementation of the TEN (see Supplementary Materials, Section S1 for more details) uses\ntwo separate fully connected residual networks to generate vectors $\\gamma$, $\\beta$. Following the terminology\nin [18], the $\\gamma$ parameter is learned in the delta regime, i.e. predicting deviation from unity. The\nmost critical component in being able to successfully train the TEN was the addition of the scalar L2\npenalized post-multipliers $\\gamma_0$ and $\\beta_0$. They limit the effect of $\\gamma$ (and $\\beta$) by encoding a prior belief\nthat all components of $\\gamma$ (and $\\beta$) should be simultaneously close to zero for a given layer unless\ntask conditioning provides a significant information gain for this layer. Mathematically, this can be\nexpressed as $\\beta = \\beta_0 g_\\theta(\\bar{c})$ and $\\gamma = \\gamma_0 h_\\varphi(\\bar{c}) + 1$, where $g_\\theta$ and $h_\\varphi$ are predictors of $\\beta$ and $\\gamma$.\n\n2.3 Architecture\n\nThe overall proposed few-shot classification architecture is depicted in Fig. 1 (see Supplementary\nMaterials, Section S1 for more details). We employ ResNet-12 [8] as the backbone feature extractor.\nIt has 4 blocks of depth 3 with 3x3 kernels and shortcut connections. 2x2 max-pool is applied at\nthe end of each block. Convolutional layer depth starts with 64 filters and is doubled after every\nmax-pool. Note that this architecture is similar in spirit to architectures used in [1] and [17], but we\ndo not use any projection layers before or after the main backbone ResNet. On the first pass over\nsample set, the TEN predicts the values of $\\gamma$ and $\\beta$ parameters for each convolutional layer in the\nfeature extractor from the task representation. Next, the sample set and the query set are processed by\nthe feature extractor conditioned with the values of $\\gamma$ and $\\beta$ just generated. Both outputs are fed into\na similarity metric to find an association between class prototypes and query instances. 
The output of\nthe similarity metric is scaled by the scalar $\\alpha$ and is fed into a softmax layer.\n\n2.4 Auxiliary task co-training\n\nThe TEN (Section 2.2) introduces additional complexity into the architecture via task conditioning\nlayers inserted after the convolutional and batch norm blocks. We empirically observed that simultaneously\noptimizing convolutional filters and the TEN is overly challenging. We solved the problem by\nauxiliary co-training with an additional logit head (the normal 64-way classification in the mini-Imagenet\ncase). The auxiliary task is sampled with a probability that is annealed over episodes. We annealed it\nusing an exponential decay schedule of the form $0.9^{\\lfloor 20t/T \\rfloor}$, where T is the total number of training\nepisodes and t is the episode index. The initial auxiliary task selection probability was cross-validated to\nbe 0.9 and the number of decay steps was chosen to be 20. We observed significant positive effects\nfrom the auxiliary task co-training (please refer to Section 3.4). The same positive effects were not\nobserved with simple pre-training of the feature extractor. We attribute this to the regularization\neffects achieved via back-propagating auxiliary task gradients together with those of the main task.\nIt is of interest to note that the few-shot co-training with an auxiliary classification task is related to\ncurriculum learning [24]. The auxiliary classification problem could be considered a part of a simpler\ncurriculum that helps the learner acquire the minimal skill level necessary before tackling harder\nfew-shot classification tasks. Being effective at feature extraction (i.e. at task representation) forms a\n\u201cprerequisite\u201d for being effective at re-conditioning features based on the representation of a given task.\n\n3 Experimental Results\n\nTable 1 presents our key result in the context of the existing state of the art. 
The first five rows show\napproaches that use the same feature extractor as [33], i.e. four stacked convolutional layers of 64\nfilters (32 in [22, 6] to avoid overfitting). In the following rows we include models like the one we\npropose, which is based on ResNet [8]. Concretely, SNAIL [16], adaResNet [17], and our architecture\nuse four residual blocks of three stacked 3 \u00d7 3 convolutional layers, each block followed by max\npooling. Differently, the feature extractor proposed in [1] is based on a ResNet-34 architecture with a\nreduced number of features.\nAs can be seen, the proposed algorithm significantly improves over the existing state-of-the-art\nresults on the mini-Imagenet dataset. In the rest of the section we address the following research\nquestions: (i) can metric scaling improve few-shot classification results? (Sections 3.2 and 3.4), (ii)\nwhat are the contributions of each component of our proposed architecture? (Section 3.4), (iii) can\ntask conditioning improve few-shot classification results and how important is it at different feature\nextractor depths? (Sections 3.3 and 3.4), and (iv) can auxiliary classification task co-training improve\naccuracy on the few-shot classification task? (Section 3.4).\n\n3.1 Experimental setup and datasets\n\nThe details of the experimental and training setup are provided in Supplementary Materials, Section S3.\nNote that we focused on mini-Imagenet [33] and Fewshot-CIFAR100 (introduced below) instead of\nOmniglot [14, 33, 28] as the former ones are more challenging, and the error rate is more sensitive to\nmodel improvements.\nmini-Imagenet. The mini-Imagenet dataset was proposed by Vinyals et al. [33]. It has 100 classes,\nwith 600 images of size 84 \u00d7 84 per class. Each task is generated by sampling 5 classes uniformly and\n5 training samples per class; the remaining images from the 5 classes are used as query images to\ncompute accuracy. 
To perform meta-validation and meta-test on unseen tasks (and classes), we isolate\n16 and 20 classes from the original set of 100, leaving 64 classes for the training tasks. We use exactly\nthe same train/validation/test split as the one suggested by Ravi and Larochelle [22].\nFewshot-CIFAR100. We introduce a new image based dataset based on CIFAR100 [11] for few-shot\nlearning. We will refer to it as FC100. The main motivation for introducing this new dataset is\nto validate that the main results appearing in the experimental section generalize well beyond the\nmini-Imagenet. The secondary motivation is that the FC100 is suited for faster few-shot scenario\nprototyping than the mini-Imagenet and it presents a more challenging few-shot learning problem,\nbecause of reduced image size. On top of that, we propose a class split in FC100 to minimize the\ninformation overlap between splits to make it signi\ufb01cantly more challenging than e.g. Omniglot. The\noriginal CIFAR100 dataset consists of 32 \u21e5 32 color images belonging to 100 different classes, 600\nimages per class. The 100 classes are further grouped into 20 superclasses. We split the dataset by\nsuperclass, rather than by individual class to minimize the information overlap. Thus the train split\ncontains 60 classes belonging to 12 superclasses, the validation and test contain 20 classes belonging\nto 5 superclasses each. The exact class split is provided in Supplementary Materials, Section S2. The\ntasks are sampled uniformly at random within train, validation and test subsets. Therefore, each task\nwith high probability contains samples belonging to classes from several superclasses.\n\n3.2 On the similarity metric\n\nWe re-implemented prototypical networks [28], and use the Euclidean and the cosine similarity to\ntest the effects of scaling (see Section 2). We closely follow the experimental setup de\ufb01ned by Snell\net al. [28] (same feature extractor and training procedure). 
The scaling parameter $\\alpha$ used on the last\nrow was cross-validated on the validation set. Results are presented in Table 2.\n\nTable 2: Average classification accuracy in percent with 95% confidence interval. 5-shot, 5-way\nclassification task. The three last rows correspond to our implementation, first with Euclidean distance,\nsecond with cosine distance, and third with the scaled cosine distance.\n\nMethod | mini-Imagenet, 5-way train | mini-Imagenet, 20-way train | FC100, 5-way train | FC100, 20-way train\nProto Nets [28] | 65.8 \u00b1 0.7 | 68.2 \u00b1 0.7 | N/A | N/A\nProto Nets | 67.7 \u00b1 0.2 | 68.9 \u00b1 0.3 | 51.1 \u00b1 0.2 | 50.3 \u00b1 0.3\nPrototypical Cosine | 54.5 \u00b1 1.1 | 53.9 \u00b1 0.6 | 40.9 \u00b1 0.6 | 37.1 \u00b1 1.9\nPrototypical Cosine Scaled | 68.2 \u00b1 0.8 | 68.1 \u00b1 0.7 | 51.0 \u00b1 0.6 | 49.6 \u00b1 0.5\n\nFigure 2: Distribution of the absolute values of the TEN scaling and bias parameters $\\gamma_0$ and $\\beta_0$ across\nlayers of ResNet feature extractor. X-axes depict layer number in both subplots. Higher convolutional\nlayers are located closer to the final softmax layer. Panels: (a) results on mini-Imagenet; (b) results on FC100.\n\nAs can be seen in row two of Table 2, our re-implementation of Proto Nets [28] obtained slightly\nbetter performance (68.9% and 67.7%) in 20-way and 5-way training scenarios respectively by\nincreasing the number of training steps from 20K to 40K (footnote 3).\nImportantly, we confirm the hypothesis that the improvement attributed to the Euclidean distance\nin [28] was due to a scaling effect. Namely, we show that the scaled cosine similarity matches\nvery closely the performance of the Euclidean metric, with an improvement of 14 percentage points\non the mini-Imagenet (similar results on FC100) over the non-scaled version. 
In order to control\nfor the potential effect that the scaling parameter $\\alpha$ may have on the learning rate as indicated by\nEquation (2), training was performed using multiple initial learning rates (covering the range between\n0.0005 and 0.01), obtaining similar accuracy each time. Hereinafter, we report the results with\nthe Euclidean metric for brevity, since the cosine produces similar results. Moreover, since the\nprototypical approach with Euclidean distance as well as with the scaled cosine are close and both\nare superior to [33], we base our results on [28].\n\n3.3 TEN importance across layers\n\nWe hypothesized in Section 2.2 that the TEN conditioning should not be equally important at all\ndepths. Fig. 2 depicts the boxplot of the empirical observations of the learned TEN post-multipliers\n$\\gamma_0$ and $\\beta_0$ (footnote 4) at different depths of the feature extractor. We can see that for the multiplier $\\gamma$, the absolute\nvalue of its scale $\\gamma_0$ tends to increase as we approach the softmax layer. Interestingly, peaks can be\nobserved every 3 layers (layers 3, 6, 9, 12). The peaks correspond to the location of the convolutional\nlayers preceding the max-pool layers. For the bias parameter $\\beta_0$, the only layer having a large absolute\nvalue of its scale is the last layer, before the softmax. We attribute the observed pattern to the fact\nthat the shallower layers in the feature extractor tend to be less task-specific than the deeper layers.\nFollowing this intuition, we performed experiments in which we (i) kept the TEN injection solely in\nlayers preceding the max pool and (ii) kept the TEN injection only in the very last layer. Interestingly,\nwe saw that TEN layers with small weight still provide some positive contribution, although most of\nthe contribution is indeed provided by the layers preceding the max pool operation.\n\n3 With 20K steps it was possible to recover the exact original performance reported in Snell et al. [28], which\nis not included in Table 2 for the sake of brevity.\n\n4 Larger absolute values of $\\gamma_0$ and $\\beta_0$ imply a larger influence of their respective TEN layers.\n\n
Table 3: Average classification accuracy (%) with 95% confidence interval on the 5-way classification\ntask, and training with the Euclidean distance. The scale parameter is cross-validated on the validation\nset. $\\alpha$: metric scaling. AT: auxiliary co-training. TC: task conditioning with TEN. X marks an enabled component.\n\n$\\alpha$ | AT | TC | mini-Imagenet 1-shot | mini-Imagenet 5-shot | mini-Imagenet 10-shot | FC100 1-shot | FC100 5-shot | FC100 10-shot\n- | - | - | 56.5 \u00b1 0.4 | 74.2 \u00b1 0.2 | 78.6 \u00b1 0.4 | 37.8 \u00b1 0.4 | 53.3 \u00b1 0.5 | 58.7 \u00b1 0.4\nX | - | - | 56.8 \u00b1 0.3 | 75.7 \u00b1 0.2 | 79.6 \u00b1 0.4 | 38.0 \u00b1 0.3 | 54.0 \u00b1 0.5 | 59.8 \u00b1 0.3\nX | X | - | 58.0 \u00b1 0.3 | 75.6 \u00b1 0.4 | 80.0 \u00b1 0.3 | 39.0 \u00b1 0.4 | 54.7 \u00b1 0.5 | 60.4 \u00b1 0.4\nX | - | X | 54.4 \u00b1 0.3 | 74.6 \u00b1 0.3 | 78.7 \u00b1 0.4 | 37.8 \u00b1 0.2 | 54.0 \u00b1 0.7 | 58.8 \u00b1 0.3\nX | X | X | 58.5 \u00b1 0.3 | 76.7 \u00b1 0.3 | 80.8 \u00b1 0.3 | 40.1 \u00b1 0.4 | 56.1 \u00b1 0.4 | 61.6 \u00b1 0.5\n\nFigure 3: Metric scale parameter $\\alpha$ cross-validation results. Panels: (a) scaled Euclidean, mini-Imagenet; (b) scaled Euclidean, FC100; (c) scaled Euclidean with TEN, mini-Imagenet; (d) scaled Euclidean with TEN, FC100.\n\n3.4 Ablation study\n\nIn this section, we study the impact in generalization accuracy of the scaling, task conditioning,\nauxiliary co-training, and the feature extractor. Results are summarized in Table 3.\nFirst, we validated the hypothesis that there is an optimal value of the metric scaling parameter ($\\alpha$)\nfor a given combination of dataset and metric, which is reflected in the inverse U-shape of the curves\nin Fig. 
3.\nSecond, we studied the effects of the task conditioning described in Section 2.2. No improvement\nwas observed for the task-conditioned ResNet-12 without auxiliary co-training (see Table 3). We\nobserved that learning useful features for the TEN and the main feature extractor at the same time\nis hard, and that training tends to get stuck in local extrema. The problem is solved by co-training on the auxiliary task\nof predicting Imagenet labels using an additional fully-connected layer with softmax (see Section\n2.4). In effect, we observed that auxiliary co-training provides two benefits: (i) making the initial\nconvergence easier, and (ii) providing regularization on the few-shot learning task by forcing the\nfeature extractor to perform well on two decoupled tasks. The latter benefit can only be observed\nwhen the feature extraction unit is sufficiently decoupled on the main task and the auxiliary task via\nthe use of TEN (the feature extractor output is additionally adjusted on the target task using FiLM).\nAs can be seen in the last row of Tables 1 and 3, our model trained with TEN and auxiliary\nco-training outperforms all the baselines and achieves state-of-the-art results.\n\n4 Conclusions and Future Work\n\nWe proposed, analyzed, and empirically validated several improvements in the domain of few-shot\nlearning. We showed that the scaled cosine similarity performs on par with the Euclidean distance,\nunlike its unscaled counterpart. In fact, based on our results, we argue that the scaling factor is a\nnecessary standard component of any few-shot learning algorithm relying on a similarity metric\nand the cross-entropy loss function. This is especially important in the context of finding new, more\neffective similarity measures for few-shot learning. Moreover, our theoretical analysis demonstrated\nthat simply scaling the similarity metric results in completely different regimes of parameter updates\nwhen using softmax and categorical cross-entropy. 
We also identified that the optimal performance\nis achieved in between two asymptotic regimes of the softmax. This poses the research question of\nexplicitly designing loss functions and \u03b1 schedules that are optimal for few-shot learning. We further\nproposed task representation conditioning as a way to improve the performance of a feature extractor\non the few-shot classification task. In this context, designing more powerful task representations, for\nexample, based on higher-order statistics of class embeddings, looks like a very promising avenue for\nfuture work. The experimental results obtained on two independent challenging datasets demonstrated\nthat the proposed approach significantly improves over existing results and achieves state-of-the-art performance\non the few-shot image classification task.\n\nAppendix\n\nA Proof of Lemma 1\n\nFirst, consider the case $\\alpha \\to 0$. Denoting $z_i = f_\\phi(x_i)$ we have:\n\n$$\n\\begin{aligned}\n\\lim_{\\alpha \\to 0} \\frac{1}{\\alpha} \\frac{\\partial}{\\partial \\phi} J_k(\\phi, \\alpha) &= \\sum_{x_i \\in Q_k} \\left[ \\frac{\\partial}{\\partial \\phi} d(z_i, c_k) - \\lim_{\\alpha \\to 0} \\frac{\\sum_j \\exp(-\\alpha d(z_i, c_j)) \\frac{\\partial}{\\partial \\phi} d(z_i, c_j)}{\\sum_j \\exp(-\\alpha d(z_i, c_j))} \\right] \\\\\\\\\n&= \\sum_{x_i \\in Q_k} \\left[ \\frac{\\partial}{\\partial \\phi} d(z_i, c_k) - \\frac{1}{K} \\sum_j \\frac{\\partial}{\\partial \\phi} d(z_i, c_j) \\right] \\\\\\\\\n&= \\sum_{x_i \\in Q_k} \\left[ \\frac{K-1}{K} \\frac{\\partial}{\\partial \\phi} d(z_i, c_k) - \\frac{1}{K} \\sum_{j \\neq k} \\frac{\\partial}{\\partial \\phi} d(z_i, c_j) \\right].\n\\end{aligned}\n$$\n\nSecond, consider the case $\\alpha \\to \\infty$:\n\n$$\n\\begin{aligned}\n\\lim_{\\alpha \\to \\infty} \\frac{1}{\\alpha} \\frac{\\partial}{\\partial \\phi} J_k(\\phi, \\alpha) &= \\sum_{x_i \\in Q_k} \\left[ \\frac{\\partial}{\\partial \\phi} d(z_i, c_k) - \\sum_j \\lim_{\\alpha \\to \\infty} \\frac{\\exp(-\\alpha d(z_i, c_j)) \\frac{\\partial}{\\partial \\phi} d(z_i, c_j)}{\\sum_\\ell \\exp(-\\alpha d(z_i, c_\\ell))} \\right] \\\\\\\\\n&= \\sum_{x_i \\in Q_k} \\left[ \\frac{\\partial}{\\partial \\phi} d(z_i, c_k) - \\sum_j \\lim_{\\alpha \\to \\infty} \\frac{\\frac{\\partial}{\\partial \\phi} d(z_i, c_j)}{1 + \\sum_{\\ell \\neq j} \\exp(-\\alpha [d(z_i, c_\\ell) - d(z_i, c_j)])} \\right].\n\\end{aligned}\n$$\n\nIt is obvious that whenever at least one of the exponential terms in the denominator in the expression\nabove has positive rate, corresponding to the case $\\exists \\ell \\neq j : [d(z_i, c_\\ell) - d(z_i, c_j)] < 0$, the ratio\nconverges to zero as $\\alpha \\to \\infty$ under assumption A2. The only case when the limit is non-zero is\nwhen $c_j$ is the prototype closest to the query point $x_i$. If we define the index of this prototype as\n$j^*_i = \\arg\\min_j d(z_i, c_j)$, then the following holds: $\\forall \\ell \\neq j^*_i : [d(z_i, c_\\ell) - d(z_i, c_{j^*_i})] > 0$, leading\n(under additional assumption A1) to:\n\n$$\n\\lim_{\\alpha \\to \\infty} \\frac{1}{1 + \\sum_{\\ell \\neq j^*_i} \\exp(-\\alpha [d(z_i, c_\\ell) - d(z_i, c_{j^*_i})])} = 1.\n$$\n\nTherefore, (4) follows.\n\nAcknowledgements\n\nThe authors acknowledge the support of the Spanish project TIN2015-65464-R (MINECO/FEDER) and the\n2016FI B 01163 grant of Generalitat de Catalunya. The authors would like to thank Nicolas Chapados,\nAdam Salvail and Rachel Samson as well as anonymous reviewers for their careful reading of the\nmanuscript and for providing constructive feedback and valuable suggestions.\n\nReferences\n[1] M. Bauer, M. Rojas-Carulla, J. B. \u015awi\u0105tkowski, B. Sch\u00f6lkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.\n[2] S. Carey and E. Bartlett. Acquiring a single new word. 1978.\n[3] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. ICLR, 2017.\n[4] H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.\n[5] M. Fink. Object classification from a single example utilizing class relevance metrics. In NIPS, pages 449\u2013456, 2005.\n[6] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126\u20131135, 2017.\n[7] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, pages 1735\u20131742. IEEE, 2006.\n[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, pages 770\u2013778, 2016.\n[9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015. URL http://arxiv.org/abs/1503.02531.\n[10] G. 
Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.\n[11] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.\n[12] A. Lacoste, T. Boquet, N. Rostamzadeh, B. Oreshkin, W. Chung, and D. Krueger. Deep prior. arXiv preprint arXiv:1712.05016, 2017.\n[13] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-shot learning by inverting a compositional causal process. In NIPS, pages 2526\u20132534, 2013.\n[14] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332\u20131338, 2015.\n[15] F.-F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. PAMI, 28(4):594\u2013611, 2006.\n[16] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.\n[17] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.\n[18] E. Perez, H. de Vries, F. Strub, V. Dumoulin, and A. C. Courville. Learning visual reasoning without strong priors. CoRR, abs/1707.03017, 2017.\n[19] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.\n[20] B. Plank and H. M. Alonso. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, pages 44\u201353, 2017.\n[21] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. In ICLR, 2018.\n[22] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2016.\n[23] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. 
Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.\n[24] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In M. F. Balcan and K. Q. Weinberger, editors, ICML, volume 48 of Proceedings of Machine Learning Research, pages 1842\u20131850, New York, New York, USA, 20\u201322 Jun 2016. PMLR.\n[25] J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement. Machine Learning, 28(1):105\u2013130, 1997.\n[26] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815\u2013823, 2015.\n[27] P. Shyam, S. Gupta, and A. Dukkipati. Attentive recurrent comparators. In ICML, pages 3173\u20133181, 2017.\n[28] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4080\u20134090, 2017.\n[29] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.\n[30] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In CVPR, pages 2746\u20132754, 2015.\n[31] S. Thrun. Lifelong learning algorithms. In Learning to learn, pages 181\u2013209. Springer, 1998.\n[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 6000\u20136010, 2017.\n[33] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, pages 3630\u20133638, 2016.\n[34] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-Shot Learning from Imaginary Data. 
In CVPR, 2018.", "award": [], "sourceid": 414, "authors": [{"given_name": "Boris", "family_name": "Oreshkin", "institution": "Element AI"}, {"given_name": "Pau", "family_name": "Rodr\u00edguez L\u00f3pez", "institution": "CVC UAB"}, {"given_name": "Alexandre", "family_name": "Lacoste", "institution": "Element AI"}]}