{"title": "Few-Shot Learning Through an Information Retrieval Lens", "book": "Advances in Neural Information Processing Systems", "page_first": 2255, "page_last": 2265, "abstract": "Few-shot learning refers to understanding new concepts from only a few examples. We propose an information retrieval-inspired approach for this problem that is motivated by the increased importance of maximally leveraging all the available information in this low-data regime. We define a training objective that aims to extract as much information as possible from each training batch by effectively optimizing over all relative orderings of the batch points simultaneously. In particular, we view each batch point as a `query' that ranks the remaining ones based on its predicted relevance to them and we define a model within the framework of structured prediction to optimize mean Average Precision over these rankings. Our method achieves impressive results on the standard few-shot classification benchmarks while also being capable of few-shot retrieval.", "full_text": "Few-Shot Learning Through an Information\n\nRetrieval Lens\n\nEleni Trianta\ufb01llou\nUniversity of Toronto\n\nVector Institute\n\nRichard Zemel\n\nUniversity of Toronto\n\nVector Institute\n\nRaquel Urtasun\n\nUniversity of Toronto\n\nVector Institute\n\nUber ATG\n\nAbstract\n\nFew-shot learning refers to understanding new concepts from only a few examples.\nWe propose an information retrieval-inspired approach for this problem that is\nmotivated by the increased importance of maximally leveraging all the available\ninformation in this low-data regime. We de\ufb01ne a training objective that aims to\nextract as much information as possible from each training batch by effectively\noptimizing over all relative orderings of the batch points simultaneously. 
In particular, we view each batch point as a \u2018query\u2019 that ranks the remaining ones based\non its predicted relevance to them and we de\ufb01ne a model within the framework\nof structured prediction to optimize mean Average Precision over these rankings.\nOur method achieves impressive results on the standard few-shot classi\ufb01cation\nbenchmarks while also being capable of few-shot retrieval.\n\n1\n\nIntroduction\n\nRecently, the problem of learning new concepts from only a few labelled examples, referred to\nas few-shot learning, has received considerable attention [1, 2]. More concretely, K-shot N-way\nclassi\ufb01cation is the task of classifying a data point into one of N classes, when only K examples\nof each class are available to inform this decision. This is a challenging setting that necessitates\ndifferent approaches from the ones commonly employed when the labelled data of each new concept\nis abundant. Indeed, many recent success stories of machine learning methods rely on large datasets\nand suffer from over\ufb01tting in the face of insuf\ufb01cient data. It is, however, neither realistic nor preferable to\nalways expect many examples for learning a new class or concept, rendering few-shot learning an\nimportant problem to address.\nWe propose a model for this problem that aims to extract as much information as possible from each\ntraining batch, a capability that is of increased importance when the available data for learning each\nclass is scarce. Towards this goal, we formulate few-shot learning in information retrieval terms: each\npoint acts as a \u2018query\u2019 that ranks the remaining ones based on its predicted relevance to them. 
We are\nthen faced with the choice of a ranking loss function and a computational framework for optimization.\nWe choose to work within the framework of structured prediction and we optimize mean Average\nPrecision (mAP) using a standard Structural SVM (SSVM) [3], as well as a Direct Loss Minimization\n(DLM) [4] approach. We argue that the objective of mAP is especially suited for the low-data regime\nof interest since it allows us to fully exploit each batch by simultaneously optimizing over all relative\norderings of the batch points. Figure 1 provides an illustration of this training objective.\nOur contribution is therefore to adopt an information retrieval perspective on the problem of few-shot\nlearning; we posit that a model is prepared for the sparse-labels setting by being trained in a manner\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Best viewed in color. Illustration of our training objective. Assume a batch of 6 points: G1,\nG2 and G3 of class \"green\", Y1 and Y2 of \"yellow\", and another point. We show in columns 1-5\nthe predicted rankings for queries G1, G2, G3, Y1 and Y2, respectively. Our learning objective is to\nmove the 6 points in positions that simultaneously maximize the Average Precision (AP) of the 5\nrankings. For example, the AP of G1\u2019s ranking would be optimal if G2 and G3 had received the two\nhighest ranks, and so on.\n\nthat fully exploits the information in each batch. We also introduce a new form of a few-shot learning\ntask, \u2018few-shot retrieval\u2019, where given a \u2018query\u2019 image and a pool of candidates all coming from\npreviously-unseen classes, the task is to \u2018retrieve\u2019 all relevant (identically labelled) candidates for the\nquery. 
We achieve results competitive with the state of the art on the standard few-shot classi\ufb01cation\nbenchmarks and show superiority over a strong baseline in the proposed few-shot retrieval problem.\n\n2 Related Work\n\nOur approach to few-shot learning heavily relies on learning an informative similarity metric, a goal\nthat has been extensively studied in the area of metric learning. This can be thought of as learning\na mapping of objects into a space where their relative positions are indicative of their similarity\nrelationships. We refer the reader to a survey of metric learning [5] and merely touch upon a few\nrepresentative methods here.\nNeighborhood Component Analysis (NCA) [6] learns a metric aiming at high performance in nearest\nneighbour classi\ufb01cation. Large Margin Nearest Neighbor (LMNN) [7] refers to another approach for\nnearest neighbor classi\ufb01cation which constructs triplets and employs a contrastive loss to move the\n\u2018anchor\u2019 of each triplet closer to the similarly-labelled point and farther from the dissimilar one by at\nleast a prede\ufb01ned margin.\nMore recently, various methods have emerged that harness the power of neural networks for metric\nlearning. These methods vary in terms of loss functions but have in common a mechanism for the\nparallel and identically-parameterized embedding of the points that will inform the loss function.\nSiamese and triplet networks are commonly-used variants of this family that operate on pairs and\ntriplets, respectively. Example applications include signature veri\ufb01cation [8] and face veri\ufb01cation\n[9, 10]. NCA and LMNN have also been extended to their deep variants [11] and [12], respectively.\nThese methods often employ hard-negative mining strategies for selecting informative constraints\nfor training [10, 13]. 
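To make the margin-based triplet objective described above concrete, here is a minimal pure-Python sketch (illustrative helper names, not any of the cited implementations): the loss is zero once the anchor is closer to the similarly-labelled point than to the dissimilar one by at least the margin.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: penalize unless the anchor is closer to the
    positive than to the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# A satisfied triplet (positive far closer than negative) incurs zero loss:
loss = triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0], margin=1.0)  # 0.0
```

Hard-negative mining, mentioned above, amounts to preferentially sampling triplets for which this loss is still positive.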
A drawback of siamese and triplet networks is that they are local, in the sense\nthat their loss function concerns pairs or triplets of training examples, guiding the learning process\nto optimize the desired relative positions of only two or three examples at a time. The myopia of\nthese local methods introduces drawbacks that are re\ufb02ected in their embedding spaces. [14] propose\na method to address this by using higher-order information.\nWe also learn a similarity metric in this work, but our approach is speci\ufb01cally tailored for few-shot\nlearning. Other metric learning approaches for few-shot learning include [15, 1, 16, 17]. [15] employs\na deep convolutional neural network that is trained to correctly predict pairwise similarities. Attentive\nRecurrent Comparators [16] also perform pairwise comparisons but form the representation of the\npair through a sequence of glimpses at the two points that comprise it via a recurrent neural network.\nWe note that these pairwise approaches do not offer a natural mechanism to solve K-shot N-way tasks\nfor K > 1 and focus on one-shot learning, whereas our method tackles the more general few-shot\nlearning problem. Matching Networks [1] aim to \u2018match\u2019 the training setup to the evaluation trials of\nK-shot N-way classi\ufb01cation: they divide each sampled training \u2018episode\u2019 into disjoint support and\nquery sets and backpropagate the classi\ufb01cation error of each query point conditioned on the support\nset. Prototypical Networks [17] also perform episodic training, and use the simple yet effective\nmechanism of representing each class by the mean of its examples in the support set, constructing a\n\n2\n\n\f\u2018prototype\u2019 in this way that each query example will be compared with. Our approach can be thought\nof as constructing all such query/support sets within each batch in order to fully exploit it.\nAnother family of methods for few-shot learning is based on meta-learning. 
Some representative\nwork in this category includes [2, 18]. These approaches present models that learn how to use the\nsupport set in order to update the parameters of a learner model in such a way that it can generalize to\nthe query set. Meta-Learner LSTM [2] learns an update rule with which a learner can be adapted to solve new tasks,\nwhereas Model-Agnostic Meta-Learner (MAML) [18] learns an initialization from which a learner can be\nsuccessfully adapted to a new task with only a few gradient steps. Finally, [19] presents a method that uses an external memory\nmodule that can be integrated into models for remembering rarely occurring events in a life-long\nlearning setting. They also demonstrate competitive results on few-shot classi\ufb01cation.\n\n3 Background\n\n3.1 Mean Average Precision (mAP)\n\nConsider a batch B of points: X = \{x_1, x_2, \dots, x_N\} and denote by c_j the class label of the point x_j. Let \mathrm{Rel}_{x_1} = \{x_j \in B : c_1 = c_j\} be the set of points that are relevant to x_1, determined in a binary fashion according to class membership. Let O_{x_1} denote the ranking based on the predicted similarity between x_1 and the remaining points in B, so that O_{x_1}[j] stores x_1\u2019s jth most similar point. Precision at j in the ranking O_{x_1}, denoted by \mathrm{Prec@}j_{x_1}, is the proportion of points that are relevant to x_1 within the j highest-ranked ones. The Average Precision (AP) of this ranking is then computed by averaging the precisions at j over all positions j in O_{x_1} that store relevant points:\n\nAP^{x_1} = \frac{1}{|\mathrm{Rel}_{x_1}|} \sum_{j \in \{1, \dots, |B|-1\} : O_{x_1}[j] \in \mathrm{Rel}_{x_1}} \mathrm{Prec@}j_{x_1}, \quad \text{where} \quad \mathrm{Prec@}j_{x_1} = \frac{|\{k \le j : O_{x_1}[k] \in \mathrm{Rel}_{x_1}\}|}{j}\n\nFinally, mean Average Precision (mAP) calculates the mean AP across batch points:\n\n\mathrm{mAP} = \frac{1}{|B|} \sum_{i \in \{1, \dots, |B|\}} AP^{x_i}\n\n3.2 Structural Support Vector Machine (SSVM)\n\nStructured prediction refers to a family of tasks with inter-dependent structured output variables\nsuch as trees, graphs, and sequences, to name just a few [3]. 
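The AP and mAP quantities defined in Section 3.1 can be computed directly from per-query binary relevance vectors; a small illustrative sketch (not the authors' code) follows, where each input list gives the 0/1 relevance flags of a query's ranking ordered from most- to least-similar:

```python
def average_precision(ranked_relevance):
    """AP of one query's ranking, given 0/1 relevance flags ordered from
    most- to least-similar (the ranking O_x in the text)."""
    n_rel = sum(ranked_relevance)
    if n_rel == 0:
        return 0.0
    ap, hits = 0.0, 0
    for j, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / j  # Prec@j, accumulated only at relevant positions
    return ap / n_rel

def mean_average_precision(rankings):
    """mAP: mean of the per-query APs across the batch."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# A query whose two relevant points sit at ranks 1 and 3 gets
# AP = (1/1 + 2/3) / 2 = 5/6.
```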
Our proposed learning objective,\nwhich involves producing a ranking over a set of candidates, also falls into this category, so we adopt\nstructured prediction as our computational framework. SSVM [3] is an ef\ufb01cient method for these\ntasks with the advantage of being tunable to custom task loss functions. More concretely, let X\nand Y denote the spaces of inputs and structured outputs, respectively. Assume a scoring function\nF(x, y; w) depending on some weights w, and a task loss L(y_{GT}, \hat{y}) incurred when predicting \hat{y}\nwhen the ground truth is y_{GT}. The margin-rescaled SSVM optimizes an upper bound of the task loss\nformulated as:\n\n\min_w \; \mathbb{E}\left[ \max_{\hat{y} \in \mathcal{Y}} \left\{ L(y_{GT}, \hat{y}) - F(x, y_{GT}; w) + F(x, \hat{y}; w) \right\} \right]\n\nThe loss gradient can then be computed as:\n\n\nabla_w L(y) = \nabla_w F(\mathcal{X}, y_{hinge}, w) - \nabla_w F(\mathcal{X}, y_{GT}, w), \quad \text{with} \quad y_{hinge} = \arg\max_{\hat{y} \in \mathcal{Y}} \left\{ F(\mathcal{X}, \hat{y}, w) + L(y_{GT}, \hat{y}) \right\} \quad (1)\n\n3.3 Direct Loss Minimization (DLM)\n\n[4] proposed a method that directly optimizes the task loss of interest instead of an upper bound of it.\nIn particular, they provide a perceptron-like weight update rule that they prove corresponds to the\ngradient of the task loss. [20] present a theorem that equips us with the corresponding weight update\nrule for the task loss in the case of nonlinear models, where the scoring function is parameterized by\na neural network. Since we make use of their theorem, we include it below for completeness.\nLet D = \{(x, y)\} be a dataset composed of input x \in \mathcal{X} and output y \in \mathcal{Y} pairs. Let F(\mathcal{X}, y, w) be\na scoring function which depends on the input, the output and some parameters w \in \mathbb{R}^A.\n\nTheorem 1 (General Loss Gradient Theorem from [20]). 
When given a \ufb01nite set \mathcal{Y}, a scoring function\nF(\mathcal{X}, y, w), a data distribution, as well as a task loss L(y, \hat{y}), then, under some mild regularity\nconditions, the direct loss gradient has the following form:\n\n\nabla_w L(y, y_w) = \pm \lim_{\epsilon \to 0} \frac{1}{\epsilon} \left( \nabla_w F(\mathcal{X}, y_{direct}, w) - \nabla_w F(\mathcal{X}, y_w, w) \right) \quad (2)\n\nwith:\n\ny_w = \arg\max_{\hat{y} \in \mathcal{Y}} F(\mathcal{X}, \hat{y}, w) \quad \text{and} \quad y_{direct} = \arg\max_{\hat{y} \in \mathcal{Y}} \left\{ F(\mathcal{X}, \hat{y}, w) \pm \epsilon L(y, \hat{y}) \right\}\n\nThis theorem presents us with two options for the gradient update, henceforth the positive and negative\nupdate, obtained by choosing the + or \u2212 of the \u00b1 respectively. [4] and [20] provide an intuitive view\nfor each one. In the case of the positive update, y_{direct} can be thought of as the \u2018worst\u2019 solution since\nit corresponds to the output value that achieves high score while producing high task loss. In this\ncase, the positive update encourages the model to move away from the bad solution y_{direct}. On the\nother hand, when performing the negative update, y_{direct} represents the \u2018best\u2019 solution: one that does\nwell both in terms of the scoring function and the task loss. The model is hence encouraged in this\ncase to adjust its weights towards the direction of the gradient of this best solution\u2019s score.\nIn a nutshell, this theorem provides us with the weight update rule for the optimization of a custom\ntask loss, provided that we de\ufb01ne a scoring function and procedures for performing standard and\nloss-augmented inference.\n\n3.4 Relationship between DLM and SSVM\n\nAs also noted in [4], the positive update of direct loss minimization strongly resembles that of the\nmargin-rescaled structural SVM [3], which also yields a loss-informed weight update rule. 
This\ngradient computation differs from that of the direct loss minimization approach only in that, while\nSSVM considers the score of the ground truth F(\mathcal{X}, y_{GT}, w), direct loss minimization considers the\nscore of the current prediction F(\mathcal{X}, y_w, w). The computation of y_{hinge} strongly resembles that of\ny_{direct} in the positive update. Indeed, SSVM\u2019s training procedure also encourages the model to move\naway from weights that produce the \u2018worst\u2019 solution y_{hinge}.\n\n3.5 Optimizing for Average Precision (AP)\n\nIn the following section we adapt and extend a method for optimizing AP [20].\nGiven a query point, the task is to rank N points x = (x_1, \dots, x_N) with respect to their relevance\nto the query, where a point is relevant if it belongs to the same class as the query and irrelevant\notherwise. Let P and N be the sets of \u2018positive\u2019 (i.e. relevant) and \u2018negative\u2019 (i.e. irrelevant) points,\nrespectively. The output ranking is represented as y_{ij} pairs where \forall i, j, y_{ij} = 1 if i is ranked higher\nthan j and y_{ij} = -1 otherwise, and \forall i, y_{ii} = 0. De\ufb01ne y = (\dots, y_{ij}, \dots) to be the collection of all\nsuch pairwise rankings.\nThe scoring function that [20] used is borrowed from [21] and [22]:\n\nF(x, y, w) = \frac{1}{|P||N|} \sum_{i \in P, j \in N} y_{ij} \left( \phi(x_i, w) - \phi(x_j, w) \right)\n\nwhere \phi(x_i, w) can be interpreted as the learned similarity between x_i and the query.\n[20] devise a dynamic programming algorithm to perform loss-augmented inference in this setting,\nwhich we make use of but omit for brevity.\n\n4 Few-Shot Learning by Optimizing mAP\n\nIn this section, we present our approach for few-shot learning that optimizes mAP. We extend the\nwork of [20] that optimizes for AP in order to account for all possible choices of query among the\nbatch points. 
This is not a straightforward extension, as it requires ensuring that optimizing the AP of\none query\u2019s ranking does not harm the AP of another query\u2019s ranking.\nIn what follows we de\ufb01ne a mathematical framework for this problem and we show that we can treat\neach query independently without sacri\ufb01cing correctness, therefore allowing us to learn to optimize all\nrelative orderings within each batch ef\ufb01ciently in parallel. We then demonstrate how we can use the\nframeworks of SSVM and DLM for optimization of mAP, producing two variants of our method\nhenceforth referred to as mAP-SSVM and mAP-DLM, respectively.\nSetup: Let B be a batch of points: B = \{x_1, x_2, \dots, x_N\} belonging to C different classes. Each\nclass c \in \{1, 2, \dots, C\} de\ufb01nes the positive set P^c containing the points that belong to c and the\nnegative set N^c containing the rest of the points. We denote by c_i the class label of the ith point.\nWe represent the output rankings as a collection of y^i_{kj} variables, where y^i_{kj} = 1 if k is ranked\nhigher than j in i\u2019s ranking, y^i_{kj} = -1 if j is ranked higher than k in i\u2019s ranking, and y^i_{kk} = 0. For\nconvenience we combine these comparisons for each query i in y^i = (\dots, y^i_{kj}, \dots).\nLet f(x, w) be the embedding function, parameterized by a neural network, and \phi(x_1, x_2, w) the\ncosine similarity of points x_1 and x_2 in the embedding space given by w:\n\n\phi(x_1, x_2, w) = \frac{f(x_1, w) \cdot f(x_2, w)}{|f(x_1, w)||f(x_2, w)|}\n\n\phi(x_i, x_j, w) is typically referred to in the literature as the score of a siamese network.\nWe consider for each query i the function\n\nF^i(\mathcal{X}, y^i, w) = \frac{1}{|P^{c_i}||N^{c_i}|} \sum_{k \in P^{c_i} \setminus i} \sum_{j \in N^{c_i}} y^i_{kj} \left( \phi(x_i, x_k, w) - \phi(x_i, x_j, w) \right)\n\nWe then compose the scoring function by summing over all queries: F(\mathcal{X}, y, w) = \sum_{i \in B} F^i(\mathcal{X}, y^i, w).\nFurther, for each query i \in B, we let p^i = rank(y^i) \in \{0, 1\}^{|P^{c_i}| + |N^{c_i}|} be a vector obtained by\nsorting the y^i_{kj}\u2019s \forall k \in P^{c_i} \setminus i, j \in N^{c_i}, such that for a point g \neq i, p^i_g = 1 if g is relevant for query i\nand p^i_g = 0 otherwise. Then the AP loss for the ranking induced by some query i is de\ufb01ned as:\n\nL^i_{AP}(p^i, \hat{p}^i) = 1 - \frac{1}{|P^{c_i}|} \sum_{j : \hat{p}^i_j = 1} \mathrm{Prec@}j\n\nwhere \mathrm{Prec@}j is the percentage of relevant points among the top-ranked j and p^i and \hat{p}^i denote the\nground-truth and predicted binary relevance vectors for query i, respectively. We de\ufb01ne the mAP loss\nto be the average AP loss over all query points.\nInference: We proof-sketch in the supplementary material that inference can be performed ef\ufb01ciently\nin parallel, as we can decompose the problem of optimizing the orderings induced by the different\nqueries into optimizing each ordering separately. Speci\ufb01cally, for a query i of class c the computation\nof the y^i_{kj}\u2019s, \forall k \in P^c \setminus i, j \in N^c, can happen independently of the computation of the y^{i'}_{k'j'}\u2019s for\nsome other query i' \neq i. We are thus able to optimize the ordering induced by each query point\nindependently of those induced by the other queries. 
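As an illustration of the per-query score and of the independence just argued, the following pure-Python sketch (hypothetical code, not the authors' implementation; in the paper the embeddings come from the learned network f(x, w)) computes the per-query score over the ranking variables, together with the ranking variables that maximize it via a simple sign rule, since each variable enters the score independently:

```python
import math

def cosine(u, v):
    """Cosine similarity phi between two embedded points."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def query_score(query_emb, pos_embs, neg_embs, y):
    """Per-query score F^i: similarity margins between each positive k and
    negative j, weighted by the ranking variables y[(k, j)] and normalized
    by the number of positive/negative pairs."""
    total = 0.0
    for k, pe in enumerate(pos_embs):
        for j, ne in enumerate(neg_embs):
            total += y[(k, j)] * (cosine(query_emb, pe) - cosine(query_emb, ne))
    return total / (len(pos_embs) * len(neg_embs))

def standard_inference(query_emb, pos_embs, neg_embs):
    """Maximizer of the per-query score: rank positive k above negative j
    exactly when it is the more similar of the two to the query."""
    return {(k, j): 1 if cosine(query_emb, pe) - cosine(query_emb, ne) > 0 else -1
            for k, pe in enumerate(pos_embs)
            for j, ne in enumerate(neg_embs)}
```

Because every pair (k, j) contributes an independent term, taking the sign pair-by-pair maximizes the whole per-query score, which is exactly why each query's inference can run in parallel.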
For query i, positive point k and negative point j, the solution of standard inference is y^i_w = \arg\max_{y^i} F^i(\mathcal{X}, y^i, w) and can be computed as follows:\n\ny^i_{w,kj} = \begin{cases} 1, & \text{if } \phi(x_i, x_k, w) - \phi(x_i, x_j, w) > 0 \\ -1, & \text{otherwise} \end{cases} \quad (3)\n\nLoss-augmented inference for query i is de\ufb01ned as\n\ny^i_{direct} = \arg\max_{\hat{y}^i} \left\{ F^i(\mathcal{X}, \hat{y}^i, w) \pm \epsilon L^i(y^i, \hat{y}^i) \right\} \quad (4)\n\nand can be performed via a run of the dynamic programming algorithm of [20]. We can then combine\nthe results of all the independent inferences to compute the overall scoring functions\n\nF(\mathcal{X}, y_w, w) = \sum_{i \in B} F^i(\mathcal{X}, y^i_w, w) \quad \text{and} \quad F(\mathcal{X}, y_{direct}, w) = \sum_{i \in B} F^i(\mathcal{X}, y^i_{direct}, w) \quad (5)\n\nFinally, we de\ufb01ne the ground-truth output value y_{GT}. For any query i and distinct points m, n \neq i\nwe set y^i_{GT,mn} = 1 if m \in P^{c_i} and n \in N^{c_i}, y^i_{GT,mn} = -1 if n \in P^{c_i} and m \in N^{c_i}, and y^i_{GT,mn} = 0\notherwise.\n\nAlgorithm 1 Few-Shot Learning by Optimizing mAP\n\nInput: A batch of points X = \{x_1, \dots, x_N\} of C different classes and \forall c \in \{1, \dots, C\} the sets P^c and N^c.\n\nInitialize w\nif using mAP-SSVM then\n    Set y^i_{GT} = ONES(|P^{c_i}|, |N^{c_i}|), \forall i = 1, \dots, N\nend if\nrepeat\n    if using mAP-DLM then\n        Standard inference: Compute y^i_w, \forall i = 1, \dots, N as in Equation 3\n    end if\n    Loss-augmented inference: Compute y^i_{direct}, \forall i = 1, \dots, N via the DP algorithm of [20] as in Equation 4.\n    In the case of mAP-SSVM, always use the positive update option and set \epsilon = 1\n    Compute F(\mathcal{X}, y_{direct}, w) as in Equation 5\n    if using mAP-DLM then\n        Compute F(\mathcal{X}, y_w, w) as in Equation 5\n        Compute the gradient \nabla_w L(y, y_w) as in Equation 2\n    else if using mAP-SSVM then\n        Compute F(\mathcal{X}, y_{GT}, w) as in Equation 6\n        Compute the gradient \nabla_w L(y, y_w) as in Equation 1 (using y_{direct} in the place of y_{hinge})\n    end if\n    Perform the weight update rule with stepsize \u03b7: w \u2190 w \u2212 \u03b7 \nabla_w L(y, y_w)\nuntil stopping criteria\n\nWe note that by construction of our scoring function de\ufb01ned above, we will only have to compute\ny^i_{kj}\u2019s where k and i belong to the same class c_i and j is a point from another class. Because of this, we\nset the y^i_{GT} for each query i to be an appropriately-sized matrix of ones: y^i_{GT} = ones(|P^{c_i}|, |N^{c_i}|).\nThe overall score of the ground truth is then\n\nF(\mathcal{X}, y_{GT}, w) = \sum_{i \in B} F^i(\mathcal{X}, y^i_{GT}, w) \quad (6)\n\nOptimizing mAP via SSVM and DLM: We have now de\ufb01ned all the necessary components to\ncompute the gradient update as speci\ufb01ed by the General Loss Gradient Theorem of [20] in Equation 2\nor as de\ufb01ned by the Structural SVM in Equation 1. For clarity, Algorithm 1 describes this process,\noutlining the two variants of our approach for few-shot learning, namely mAP-DLM and mAP-SSVM.\n\n5 Evaluation\n\nIn what follows, we describe our training setup, the few-shot learning tasks of interest, the datasets we\nuse, and our experimental results. Through our experiments, we aim to evaluate the few-shot retrieval\nability of our method and additionally to compare our model to competing approaches for few-shot\nclassi\ufb01cation. 
For this, we have updated our tables to include very recent work published\nconcurrently with ours in order to provide the reader with a complete view of the state-of-the-art on\nfew-shot learning. Finally, we also aim to investigate experimentally our model\u2019s aptness for learning\nfrom little data via its training objective that is designed to fully exploit each training batch.\nControlling the in\ufb02uence of loss-augmented inference on the loss gradient We found empirically\nthat for the positive update of mAP-DLM and for mAP-SSVM, it is bene\ufb01cial to introduce a\nhyperparameter \u03b1 that controls the contribution of the loss-augmented F(\mathcal{X}, y_{direct}, w) relative to that\nof F(\mathcal{X}, y_w, w) in the case of mAP-DLM, or F(\mathcal{X}, y_{GT}, w) in the case of mAP-SSVM. The updated\nrules that we use in practice for training mAP-DLM and mAP-SSVM, respectively, are:\n\n\nabla_w L(y, y_w) = \pm \lim_{\epsilon \to 0} \frac{1}{\epsilon} \left( \alpha \nabla_w F(\mathcal{X}, y_{direct}, w) - \nabla_w F(\mathcal{X}, y_w, w) \right)\n\nand\n\n\nabla_w L(y) = \alpha \nabla_w F(\mathcal{X}, y_{direct}, w) - \nabla_w F(\mathcal{X}, y_{GT}, w)\n\nWe refer the reader to the supplementary material for more details concerning this hyperparameter.\n\n                             Classi\ufb01cation                          Retrieval\n                             1-shot           5-shot                1-shot\n                             5-way   20-way   5-way   20-way        5-way   20-way\nSiamese                      98.8    95.5     -       -             98.6    95.7\nMatching Networks [1]        98.1    93.8     98.9    98.5          -       -\nPrototypical Networks [17]   98.8    96.0     99.7    98.9          -       -\nMAML [18]                    98.7    95.8     99.9    98.9          -       -\nConvNet w/ Memory [19]       98.4    95.0     99.6    98.6          -       -\nmAP-SSVM (ours)              98.6    95.2     99.6    98.6          98.6    95.7\nmAP-DLM (ours)               98.8    95.4     99.6    98.6          98.7    95.8\n\nTable 1: Few-shot learning results on Omniglot (averaged over 1000 test episodes). 
We report accuracy for the\nclassi\ufb01cation and mAP for the retrieval tasks.\n\nFew-shot Classi\ufb01cation and Retrieval Tasks Each K-shot N-way classi\ufb01cation \u2018episode\u2019 is con-\nstructed as follows: N evaluation classes and 20 images from each one are selected uniformly at\nrandom from the test set. For each class, K out of the 20 images are randomly chosen to act as the\n\u2018representatives\u2019 of that class. The remaining 20 \u2212 K images of each class are then to be classi\ufb01ed\namong the N classes. This poses a total of (20 \u2212 K)N classi\ufb01cation problems. Following the\nstandard procedure, we repeat this process 1000 times when testing on Omniglot and 600 times for\nmini-ImageNet in order to compute the results reported in tables 1 and 2.\nWe also designed a similar one-shot N-way retrieval task, where to form each episode we select N\nclasses at random and 10 images per class, yielding a pool of 10N images. Each of these 10N images\nacts as a query and ranks all remaining (10N - 1) images. The goal is to retrieve all 9 relevant images\nbefore any of the (10N - 10) irrelevant ones. We measure the performance on this task using mAP.\nNote that since this is a new task, there are no publicly available results for the competing few-shot\nlearning methods.\nOur Algorithm for K-shot N-way classi\ufb01cation Our model classi\ufb01es image x into class c =\narg maxi AP i(x), where AP i(x) denotes the average precision of the ordering that image x assigns\nto the pool of all KN representatives assuming that the ground truth class for image x is i. This\nmeans that when computing AP i(x), the K representatives of class i will have a binary relevance of\n1 while the K(N \u2212 1) representatives of the other classes will have a binary relevance of 0. 
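The classification rule just described can be sketched as follows, assuming the query's similarities to the KN representatives have already been computed by the learned metric (illustrative code, not the authors' implementation): the query is assigned to the class whose representatives, treated as the relevant set, yield the highest AP.

```python
def classify_by_ap(sims, rep_labels, classes):
    """Classify a query into the class c maximizing AP^c: the AP of the
    representatives' ranking when class-c representatives are relevant.
    `sims[r]` is the query's learned similarity to representative r."""
    order = sorted(range(len(sims)), key=lambda r: -sims[r])  # most similar first
    def ap(c):
        hits, total = 0, 0.0
        for j, r in enumerate(order, start=1):
            if rep_labels[r] == c:
                hits += 1
                total += hits / j  # Prec@j at each relevant position
        return total / hits if hits else 0.0
    return max(classes, key=ap)
```

For K = 1 this reduces to nearest-representative classification, as noted in the text.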
Note that\nin the one-shot learning case where K = 1 this amounts to classifying x into the class whose (single)\nrepresentative is most similar to x according to the model\u2019s learned similarity metric.\nWe note that the siamese model does not naturally offer a procedure for exploiting all K representatives\nof each class when making the classi\ufb01cation decision for some reference. Therefore we omit few-shot\nlearning results for siamese when K > 1 and examine this model only in the one-shot case.\nTraining details We use the same embedding architecture for all of our models for both Omniglot and\nmini-ImageNet. This architecture mimics that of [1] and consists of 4 identical blocks stacked upon\neach other. Each of these blocks consists of a 3x3 convolution with 64 \ufb01lters, batch normalization\n[23], a ReLU activation, and 2x2 max-pooling. We resize the Omniglot images to 28x28, and the\nmini-ImageNet images to 3x84x84, therefore producing a 64-dimensional feature vector for each\nOmniglot image and a 1600-dimensional one for each mini-ImageNet image. We use ADAM [24]\nfor training all models. We refer the reader to the supplementary for more details.\nOmniglot The Omniglot dataset [25] is designed for testing few-shot learning methods. This dataset\nconsists of 1623 characters from 50 different alphabets, with each character drawn by 20 different\ndrawers. Following [1], we use 1200 characters as training classes and the remaining 423 for\nevaluation while we also augment the dataset with random rotations by multiples of 90 degrees. The\nresults for this dataset are shown in Table 1. Both mAP-SSVM and mAP-DLM are trained with\n\u03b1 = 10, and for mAP-DLM the positive update was used. We used |B| = 128 and N = 16 for our\nmodels and the siamese. Overall, we observe that many methods perform very similarly on few-shot\nclassi\ufb01cation on this dataset, ours being among the top-performing ones. 
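As a sanity check on the embedding architecture described above, the output dimensionality follows from four rounds of 2x2 max-pooling (assuming the 3x3 convolutions preserve spatial size, an assumption consistent with the reported 64- and 1600-dimensional features):

```python
def embedding_dim(side, n_blocks=4, n_filters=64):
    """Feature dimensionality after the 4-block embedding network: each block
    applies a (size-preserving) 3x3 convolution with 64 filters followed by
    2x2 max-pooling, which halves the spatial side (floored)."""
    for _ in range(n_blocks):
        side = side // 2
    return n_filters * side * side

# 28x28 Omniglot images:      28 -> 14 -> 7 -> 3 -> 1, giving 64-dim features
# 84x84 mini-ImageNet images: 84 -> 42 -> 21 -> 10 -> 5, giving 1600-dim features
```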
Further, we perform equally\nwell or better than the siamese network in few-shot retrieval. We\u2019d like to emphasize that the siamese\nnetwork is a tough baseline to beat, as can be seen from its performance in the classi\ufb01cation tasks,\nwhere it outperforms recent few-shot learning methods.\nmini-ImageNet mini-ImageNet refers to a subset of the ILSVRC-12 dataset [26] that was used as\na benchmark for testing few-shot learning approaches in [1]. This dataset contains 60,000 84x84\ncolor images and constitutes a signi\ufb01cantly more challenging benchmark than Omniglot. In order to\ncompare our method with the state-of-the-art on this benchmark, we adapt the splits introduced in [2],\nwhich contain a total of 100 classes out of which 64 are used for training, 16 for validation and 20 for\ntesting.\n\n                                 Classi\ufb01cation (5-way)                  Retrieval (1-shot)\n                                 1-shot            5-shot               5-way             20-way\nBaseline Nearest Neighbors*      41.08 \u00b1 0.70 %   51.04 \u00b1 0.65 %     -                 -\nMatching Networks* [1]           43.40 \u00b1 0.78 %   51.09 \u00b1 0.71 %     -                 -\nMatching Networks FCE* [1]       43.56 \u00b1 0.84 %   55.31 \u00b1 0.73 %     -                 -\nMeta-Learner LSTM* [2]           43.44 \u00b1 0.77 %   60.60 \u00b1 0.71 %     -                 -\nPrototypical Networks [17]       49.42 \u00b1 0.78 %   68.20 \u00b1 0.66 %     -                 -\nMAML [18]                        48.70 \u00b1 1.84 %   63.11 \u00b1 0.92 %     -                 -\nSiamese                          48.42 \u00b1 0.79 %   -                    51.24 \u00b1 0.57 %   22.66 \u00b1 0.13 %\nmAP-SSVM (ours)                  50.32 \u00b1 0.80 %   63.94 \u00b1 0.72 %     52.85 \u00b1 0.56 %   23.87 \u00b1 0.14 %\nmAP-DLM (ours)                   50.28 \u00b1 0.80 %   63.70 \u00b1 0.70 %     52.96 \u00b1 0.55 %   23.68 \u00b1 0.13 %\n\nTable 2: Few-shot learning results on miniImageNet (averaged over 600 test episodes and reported with 95%\ncon\ufb01dence intervals). We report accuracy for the classi\ufb01cation and mAP for the retrieval tasks. *Results reported\nby [2].\n\n
We train our models on the training set and use the validation set for monitoring performance.\nTable 2 reports the performance of our method and recent competing approaches on this benchmark.\nAs for Omniglot, the results of both versions of our method are obtained with \u03b1 = 10, and with the\npositive update in the case of mAP-DLM. We used |B| = 128 and N = 8 for our models and the\nsiamese. We also borrow the baseline reported in [2] for this task, which corresponds to performing\nnearest-neighbors on top of the learned embeddings. Our method yields impressive results here,\noutperforming recent approaches tailored for few-shot learning either via deep-metric learning such\nas Matching Networks [1] or via meta-learning such as Meta-Learner LSTM [2] and MAML [18] in\nfew-shot classi\ufb01cation. We set the new state of the art for 1-shot 5-way classi\ufb01cation. Further, our\nmodels are superior to the strong baseline of the siamese network in the few-shot retrieval tasks.\nCUB We also experimented on the Caltech-UCSD Birds (CUB) 200-2011 dataset [27], where we\noutperform the siamese network as well. More details can be found in the supplementary.\nLearning Ef\ufb01ciency We examine our method\u2019s learning ef\ufb01ciency via comparison with a siamese\nnetwork. For fair comparison of these models, we create the training batches in a way that enforces\nthat they have the same amount of information available for each update: each training batch B\nis formed by sampling N classes uniformly at random and |B| examples from these classes. The\nsiamese network is then trained on all possible pairs from these sampled points. Figure 2 displays the\nperformance of our model and the siamese on different metrics on Omniglot and mini-ImageNet. 
The first two rows show the performance of our two variants and the siamese in the few-shot classification (left) and few-shot retrieval (right) tasks, for various levels of difficulty as regulated by the different values of N. The first row corresponds to Omniglot and the second to mini-ImageNet. We observe that even when both methods converge to comparable accuracy or mAP values, our method learns faster, especially when the 'way' of the evaluation task is larger, making the problem harder.

In the third row of Figure 2, we examine the few-shot learning performance of our model and of the all-pairs siamese, both trained with N = 8 but with different |B|. We note that for a given N, a larger batch size implies a larger 'shot'. For example, for N = 8, |B| = 64 results in, on average, 8 examples of each class in each batch (8-shot), whereas |B| = 16 results in 2-shot on average. We observe that, especially when the 'shot' is smaller, there is a clear advantage in using our method over the all-pairs siamese. Therefore, it indeed appears to be the case that the fewer examples we are given per class, the more we can benefit from our structured objective that simultaneously optimizes all relative orderings. Further, mAP-DLM can reach higher performance overall with smaller batch sizes (thus smaller 'shot') than the siamese, indicating that our method's training objective indeed efficiently exploits the batch examples and showing promise for learning from less data.

Discussion It is interesting to experimentally compare methods that have pursued different paths in addressing the challenge of few-shot learning. In particular, the methods we compare against each other in our tables include deep metric learning approaches, such as ours, the siamese network, Prototypical Networks and Matching Networks, as well as meta-learning methods such as Meta-Learner LSTM [2] and MAML [18].
Further, [19] has a metric-learning flavor but employs external memory as a vehicle for remembering representations of rarely-observed classes.

Figure 2: Few-shot learning performance (on unseen validation classes). Each point represents the average performance across 100 sampled episodes. Top row: Omniglot. Second and third rows: mini-ImageNet.

The experimental results suggest that there is no clear winner among these categories, and all of these directions are worth exploring further.

Overall, our model performs on par with the state-of-the-art results on the classification benchmarks, while also offering the capability of few-shot retrieval, where it exhibits superiority over a strong baseline. Regarding the comparison between mAP-DLM and mAP-SSVM, we remark that they mostly perform similarly to each other on the benchmarks considered. We have not observed in this case a significant win for directly optimizing the loss of interest, as offered by mAP-DLM, as opposed to minimizing an upper bound of it.

6 Conclusion

We have presented an approach to few-shot learning that strives to fully exploit the available information in the training batches, a capability that is critically important in the low-data regime of few-shot learning. We have proposed to achieve this by defining an information retrieval-based training objective that simultaneously optimizes all relative orderings of the points in each training batch. We experimentally support our claims about learning efficiency and present promising results on two standard few-shot learning datasets. An interesting future direction is to reason not only about how to best exploit the information within each batch, but also about how to create training batches so as to best leverage the information in the training set. Furthermore, we leave it as future work to explore alternative information retrieval metrics, instead of mAP, as training objectives for few-shot learning (e.g.
ROC curve, discounted cumulative gain, etc.).

References

[1] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[2] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, volume 1, page 6, 2017.

[3] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Sep):1453–1484, 2005.

[4] Tamir Hazan, Joseph Keshet, and David A. McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pages 1594–1602, 2010.

[5] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.

[6] Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pages 513–520, 2005.

[7] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 1473–1480, 2005.

[8] Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.

[9] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification.
In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 539–546. IEEE, 2005.

[10] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[11] Ruslan Salakhutdinov and Geoffrey E. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, volume 11, 2007.

[12] Renqiang Min, David A. Stanley, Zineng Yuan, Anthony Bonner, and Zhaolei Zhang. A deep non-linear feature mapping for large-margin kNN classification. In Ninth IEEE International Conference on Data Mining (ICDM'09), pages 357–366. IEEE, 2009.

[13] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.

[14] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Learnable structured clustering framework for deep metric learning. arXiv preprint arXiv:1612.01213, 2016.

[15] Gregory Koch. Siamese neural networks for one-shot image recognition. Thesis, University of Toronto, 2015.

[16] Pranav Shyam, Shubham Gupta, and Ambedkar Dukkipati. Attentive recurrent comparators. arXiv preprint arXiv:1703.00767, 2017.

[17] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.

[18] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[19] Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events.
arXiv preprint arXiv:1703.03129, 2017.

[20] Yang Song, Alexander G. Schwing, Richard S. Zemel, and Raquel Urtasun. Training deep neural networks via direct loss minimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 2169–2177, 2016.

[21] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271–278. ACM, 2007.

[22] Pritish Mohapatra, C. V. Jawahar, and M. Pawan Kumar. Efficient optimization for average precision SVM. In Advances in Neural Information Processing Systems, pages 2312–2320, 2014.

[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[24] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[25] Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. One shot learning of simple visual concepts. In CogSci, volume 172, page 2, 2011.

[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[27] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset.
2011.