{"title": "Adaptive Cross-Modal Few-shot Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4847, "page_last": 4857, "abstract": "Metric-based meta-learning techniques have successfully been applied to few-shot classification problems. In this paper, we propose to leverage cross-modal information to enhance metric-based few-shot learning methods.\nVisual and semantic feature spaces have different structures by definition. For certain concepts, visual features might be richer and more discriminative than text ones. While for others, the inverse might be true. Moreover, when the support from visual information is limited in image classification, semantic representations (learned from unsupervised text corpora) can provide strong prior knowledge and context to help learning. Based on these two intuitions, we propose a mechanism that can adaptively combine information from both modalities according to new image categories to be learned. Through a series of experiments, we show that by this adaptive combination of the two modalities, our model outperforms current uni-modality few-shot learning methods and modality-alignment methods by a large margin on all benchmarks and few-shot scenarios tested. Experiments also show that our model can effectively adjust its focus on the two modalities. The improvement in performance is particularly large when the number of shots is very small.", "full_text": "Adaptive Cross-Modal Few-shot Learning\n\nChen Xing\u2217\n\nCollege of Computer Science,\n\nNankai University, Tianjin, China\n\nElement AI, Montreal, Canada\n\nNegar Rostamzadeh\n\nElement AI, Montreal, Canada\n\nBoris N. Oreshkin\n\nPedro O. Pinheiro\n\nElement AI, Montreal, Canada\n\nElement AI, Montreal, Canada\n\nAbstract\n\nMetric-based meta-learning techniques have successfully been applied to few-\nshot classi\ufb01cation problems. In this paper, we propose to leverage cross-modal\ninformation to enhance metric-based few-shot learning methods. Visual and se-\nmantic feature spaces have different structures by de\ufb01nition. For certain concepts,\nvisual features might be richer and more discriminative than text ones. While\nfor others, the inverse might be true. Moreover, when the support from visual\ninformation is limited in image classi\ufb01cation, semantic representations (learned\nfrom unsupervised text corpora) can provide strong prior knowledge and context\nto help learning. Based on these two intuitions, we propose a mechanism that\ncan adaptively combine information from both modalities according to new im-\nage categories to be learned. Through a series of experiments, we show that by\nthis adaptive combination of the two modalities, our model outperforms current\nuni-modality few-shot learning methods and modality-alignment methods by a\nlarge margin on all benchmarks and few-shot scenarios tested. Experiments also\nshow that our model can effectively adjust its focus on the two modalities. The\nimprovement in performance is particularly large when the number of shots is very\nsmall.\n\n1\n\nIntroduction\n\nDeep learning methods have achieved major advances in areas such as speech, language and vi-\nsion [25]. These systems, however, usually require a large amount of labeled data, which can be\nimpractical or expensive to acquire. Limited labeled data lead to over\ufb01tting and generalization issues\nin classical deep learning approaches. 
On the other hand, existing evidence suggests that the human visual system is capable of operating effectively in the small-data regime: humans can learn new concepts from very few samples by leveraging prior knowledge and context [23, 30, 46]. The problem of learning new concepts from a small number of labeled data points is usually referred to as few-shot learning [1, 6, 27, 22] (FSL).

Most approaches addressing few-shot learning are based on the meta-learning paradigm [43, 3, 52, 13], a class of algorithms and models focusing on learning how to (quickly) learn new concepts. Meta-learning approaches work by learning a parameterized function that embeds a variety of learning tasks and can generalize to new ones. Recent progress in few-shot image classification has primarily been made in the context of unimodal learning. In contrast to this, employing data from another modality can help when the data in the original modality is limited. For example, strong evidence supports the hypothesis that language helps toddlers recognize new visual objects [15, 45]. This suggests that semantic features from text can be a powerful source of information in the context of few-shot image classification.

∗Work done while interning at Element AI. Contact through: xingchen1113@gmail.com

Figure 1: Concepts have different visual and semantic feature spaces. (Left) Some categories (e.g., ping-pong ball vs. egg, Komondor vs. mop) may have similar visual features and dissimilar semantic features. (Right) Others (e.g., cat, chair) can share the same semantic label but have very distinct visual features. Our method adaptively exploits both modalities to improve classification performance in the low-shot regime.

Exploiting an auxiliary modality (e.g., attributes, unlabeled text corpora) to help image classification when data from the visual modality is limited has mostly been driven by zero-shot learning [24, 36] (ZSL). ZSL aims at recognizing categories whose instances have not been seen during training. In contrast to few-shot learning, there is no small set of labeled samples from the original modality to help recognize new categories. Therefore, most approaches consist of aligning the two modalities during training. Through this modality alignment, the modalities are mapped together and forced to have the same semantic structure. This way, knowledge from the auxiliary modality is transferred to the visual side for new categories at test time [9].

However, visual and semantic feature spaces have heterogeneous structures by definition. For certain concepts, visual features might be richer and more discriminative than textual ones, while for others the inverse might be true. Figure 1 illustrates this remark. Moreover, when the number of support images on the visual side is very small, the information provided by this modality tends to be noisy and local. On the contrary, semantic representations (learned from large unsupervised text corpora) can act as more general prior knowledge and context to help learning. Therefore, instead of aligning the two modalities (to transfer knowledge to the visual modality), for few-shot learning, in which information is provided from both modalities at test time, it is better to treat them as two independent knowledge sources and adaptively exploit both modalities according to different scenarios. 
Towards this end, we\npropose Adaptive Modality Mixture Mechanism (AM3), an approach that adaptively and selectively\ncombines information from two modalities, visual and semantic, for few-shot learning.\n\nAM3 is built on top of metric-based meta-learning approaches. These approaches perform classi\ufb01-\ncation by comparing distances in a learned metric space (from visual data). On the top of that, our\nmethod also leverages text information to improve classi\ufb01cation accuracy. AM3 performs classi\ufb01ca-\ntion in an adaptive convex combination of the two distinctive representation spaces with respect to\nimage categories. With this mechanism, AM3 can leverage the bene\ufb01ts from both spaces and adjust\nits focus accordingly. For cases like Figure 1(Left), AM3 focuses more on the semantic modality to\nobtain general context information. While for cases like Figure 1(Right), AM3 focuses more on the\nvisual modality to capture rich local visual details to learn new concepts.\n\nOur main contributions can be summarized as follows: (i) we propose adaptive modality mixture\nmechanism (AM3) for cross-modal few-shot classi\ufb01cation. AM3 adapts to few-shot learning better\nthan modality-alignment methods by adaptively mixing the semantic structures of the two modalities.\n(ii) We show that our method achieves considerable boost in performance over different metric-based\nmeta-learning approaches. (iii) AM3 outperforms by a considerable margin current (single-modality\nand cross-modality) state of the art in few-shot classi\ufb01cation on different datasets and different\nnumber of shots. (iv) We perform quantitative investigations to verify that our model can effectively\nadjust its focus on the two modalities according to different scenarios.\n\n2 Related Work\n\nFew-shot learning. Meta-learning has a prominent history in machine learning [43, 3, 52]. Due\nto advances in representation learning methods [11] and the creation of new few-shot learning\ndatasets [22, 53], many deep meta-learning approaches have been applied to address the few-shot\nlearning problem . These methods can be roughly divided into two main types: metric-based and\ngradient-based approaches.\n\nMetric-based approaches aim at learning representations that minimize intra-class distances while\nmaximizing the distance between different classes. These approaches rely on an episodic training\n\n2\n\n\fframework: the model is trained with sub-tasks (episodes) in which there are only a few training\nsamples for each category. For example, matching networks [53] follows a simple nearest neighbour\nframework. In each episode, it uses an attention mechanism (over the encoded support) as a similarity\nmeasure for one-shot classi\ufb01cation.\n\nIn prototypical networks [47], a metric space is learned where embeddings of queries of one category\nare close to the centroid (or prototype) of supports of the same category, and far away from centroids\nof other classes in the episode. Due to the simplicity and good performance of this approach, many\nmethods extended this work. For instance, Ren et al. [39] propose a semi-supervised few-shot learning\napproach and show that leveraging unlabeled samples outperform purely supervised prototypical\nnetworks. Wang et al. 
[54] propose to augment the support set by generating hallucinated examples.\nTask-dependent adaptive metric (TADAM) [35] relies on conditional batch normalization [5] to\nprovide task adaptation (based on task representations encoded by visual features) to learn a task-\ndependent metric space.\n\nGradient-based meta-learning methods aim at training models that can generalize well to new tasks\nwith only a few \ufb01ne-tuning updates. Most these methods are built on top of model-agnostic meta-\nlearning (MAML) framework [7]. Given the universality of MAML, many follow-up works were\nrecently proposed to improve its performance on few-shot learning [33, 21]. Kim et al. [18] and\nFinn et al. [8] propose a probabilistic extension to MAML trained with variational approximation.\nConditional class-aware meta-learning (CAML) [16] conditionally transforms embeddings based on\na metric space that is trained with prototypical networks to capture inter-class dependencies. Latent\nembedding optimization (LEO) [41] aims to tackle MAML\u2019s problem of only using a few updates\non a low data regime to train models in a high dimensional parameter space. The model employs\na low-dimensional latent model embedding space for update and then decodes the actual model\nparameters from the low-dimensional latent representations. This simple yet powerful approach\nachieves current state of the art result in different few-shot classi\ufb01cation benchmarks. Other meta-\nlearning approaches for few-shot learning include using memory architecture to either store exemplar\ntraining samples [42] or to directly encode fast adaptation algorithm [38]. Mishra et al. [32] use\ntemporal convolution to achieve the same goal.\n\nCurrent approaches mentioned above rely solely on visual features for few-shot classi\ufb01cation. Our\ncontribution is orthogonal to current metric-based approaches and can be integrated into them to\nboost performance in few-shot image classi\ufb01cation.\n\nZero-shot learning. Current ZSL methods rely mostly on visual-auxiliary modality alignment [9,\n58]. In these methods, samples for the same class from the two modalities are mapped together so\nthat the two modalities obtain the same semantic structure. There are three main families of modality\nalignment methods: representation space alignment, representation distribution alignment and data\nsynthetic alignment.\n\nRepresentation space alignment methods either map the visual representation space to the semantic\nrepresentation space [34, 48, 9], or map the semantic space to the visual space [59]. Distribution\nalignment methods focus on making the alignment of the two modalities more robust and balanced to\nunseen data [44]. ReViSE [14] minimizes maximum mean discrepancy (MMD) of the distributions\nof the two representation spaces to align them. CADA-VAE [44] uses two VAEs [19] to embed\ninformation for both modalities and align the distribution of the two latent spaces. Data synthetic\nmethods rely on generative models to generate image or image feature as data augmentation [60, 57,\n31, 54] for unseen data to train the mapping function for more robust alignment.\n\nZSL does not have access to any visual information when learning new concepts. Therefore, ZSL\nmodels have no choice but to align the two modalities. This way, during test the image query can be\ndirectly compared to auxiliary information for classi\ufb01cation [59]. 
Few-shot learning, on the other hand, has access to a small number of support images in the original modality at test time. This makes alignment methods from ZSL seem unnecessary and too rigid for FSL. For few-shot learning, it would be better if we could preserve the distinct structures of both modalities and adaptively combine them for classification according to different scenarios. In Section 4 we show that by doing so, AM3 outperforms directly applying modality-alignment methods to few-shot learning by a large margin.

3 Method

In this section, we explain how AM3 adaptively leverages text data to improve few-shot image classification. We start with a brief explanation of episodic training for few-shot learning and a summary of prototypical networks, followed by the description of the proposed adaptive modality mixture mechanism.

3.1 Preliminaries

3.1.1 Episodic Training

Few-shot learning models are trained on a labeled dataset D_train and tested on D_test. The class sets are disjoint between D_train and D_test. The test set has only a few labeled samples per category. Most successful approaches rely on an episodic training paradigm: the few-shot regime faced at test time is simulated by sampling small subsets of the large labeled set D_train during training.

In general, models are trained on K-shot, N-way episodes. Each episode e is created by first sampling N categories from the training set and then sampling two sets of images from these categories: (i) the support set S_e = \{(s_i, y_i)\}_{i=1}^{N \times K} containing K examples for each of the N categories, and (ii) the query set Q_e = \{(q_j, y_j)\}_{j=1}^{Q} containing different examples from the same N categories.

Episodic training for few-shot classification is achieved by minimizing, for each episode, the loss of the prediction on samples in the query set, given the support set. The model is a parameterized function and the loss is the negative log-likelihood of the true class of each query sample:

\mathcal{L}(\theta) = \mathbb{E}_{(S_e, Q_e)} \Big[ -\sum_{t=1}^{Q_e} \log p_\theta(y_t \mid q_t, S_e) \Big] ,    (1)

where (q_t, y_t) \in Q_e and S_e are, respectively, the sampled query and support set at episode e, and \theta denotes the parameters of the model.
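To make the episodic setup concrete, the following sketch builds one K-shot, N-way episode from a labeled pool. It is a minimal NumPy illustration under our own naming assumptions (the function and variable names are hypothetical; it is not taken from the released implementation):

import numpy as np

def sample_episode(images, labels, n_way=5, k_shot=1, n_query=15, rng=None):
    # Sample one N-way, K-shot episode: a support set with k_shot examples per class
    # and a query set with n_query examples per class, using episode-local labels 0..n_way-1.
    rng = np.random.default_rng() if rng is None else rng
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for episode_label, c in enumerate(classes):
        idx = rng.permutation(np.where(labels == c)[0])
        support_x.append(images[idx[:k_shot]])
        query_x.append(images[idx[k_shot:k_shot + n_query]])
        support_y += [episode_label] * k_shot
        query_y += [episode_label] * n_query
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))

The episodic loss of Equation 1 is then the average negative log-likelihood of the model's predictions on the query set, estimated over many such sampled episodes.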
3.1.2 Prototypical Networks

We build our model on top of metric-based meta-learning methods. We choose prototypical network [47] for explaining our model due to its simplicity. We note, however, that the proposed method can potentially be applied to any metric-based approach.

Prototypical networks use the support set to compute a centroid (prototype) for each category (in the sampled episode), and query samples are classified based on the distance to each prototype. The model is a convolutional neural network [26] f : \mathbb{R}^{n_v} \to \mathbb{R}^{n_p}, parameterized by \theta_f, that learns an n_p-dimensional space where samples of the same category are close and those of different categories are far apart.

For every episode e, each embedding prototype p_c (of category c) is computed by averaging the embeddings of all support samples of class c:

p_c = \frac{1}{|S_e^c|} \sum_{(s_i, y_i) \in S_e^c} f(s_i) ,    (2)

where S_e^c \subset S_e is the subset of support belonging to class c.

The model produces a distribution over the N categories of the episode based on a softmax [4] over the (negative) distances d of the embedding of the query q_t (from category c) to the embedded prototypes:

p(y = c \mid q_t, S_e, \theta) = \frac{\exp(-d(f(q_t), p_c))}{\sum_k \exp(-d(f(q_t), p_k))} .    (3)

We consider d to be the Euclidean distance. The model is trained by minimizing Equation 1 and the parameters are updated with stochastic gradient descent.
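As a concrete illustration of Equations 2 and 3, the sketch below computes class prototypes from embedded support samples and classifies embedded queries with a softmax over negative Euclidean distances. It is a minimal NumPy sketch under our own naming assumptions, not the authors' implementation:

import numpy as np

def compute_prototypes(support_emb, support_y, n_way):
    # Equation 2: average the embedded support samples of each class.
    return np.stack([support_emb[support_y == c].mean(axis=0) for c in range(n_way)])

def classify_queries(query_emb, prototypes):
    # Equation 3: softmax over negative Euclidean distances to the prototypes.
    dists = np.linalg.norm(query_emb[:, None, :] - prototypes[None, :, :], axis=-1)
    logits = -dists
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs  # shape [num_queries, n_way]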
\n\nwc\n\n(null)\n(null)\n(null)\n(null)\n\nFigure 2: (Left) Adaptive modality mixture model. The \ufb01nal category prototype is a convex combi-\nnation of the visual and the semantic feature representations. The mixing coef\ufb01cient is conditioned\non the semantic label embedding. (Right) Qualitative example of how AM3 works. Assume query\nsample q has category i. (a) The closest visual prototype to the query sample q is pj . (b) The semantic\nprototypes. (c) The mixture mechanism modify the positions of the prototypes, given the semantic\nembeddings. (d) After the update, the closest prototype to the query is now the one of the category i,\ncorrecting the classi\ufb01cation.\n\nIn zero-shot learning, where no visual information is given at test time (that is, the support set is\nvoid), algorithms need to solely rely on an auxiliary (e.g., text) modality. On the other extreme, when\nthe number of labeled image samples is large, neural network models tend to ignore the auxiliary\nmodality as it is able to generalize well with large number of samples [20].\n\nFew-shot learning scenario \ufb01ts in between these two extremes. Thus, we hypothesize that both\nvisual and semantic information can be useful for few-shot learning. Moreover, given that visual\nand semantic spaces have different structures, it is desirable that the proposed model exploits both\nmodalities adaptively, given different scenarios. For example, when it meets objects like \u2018ping-pong\nballs\u2019 which has many visually similar counterparts, or when the number of shots is very small from\nthe visual side, it relies more on text modality to distinguish them.\n\nIn AM3, we augment metric-based FSL methods to incorporate language structure learned by a word-\nembedding model W (pre-trained on unsupervised large text corpora), containing label embeddings\nof all categories in Dtrain \u222a Dtest. In our model, we modify the prototype representation of each\ncategory by taking into account their label embeddings.\n\nMore speci\ufb01cally, we model the new prototype representation as a convex combination of the two\nmodalities. That is, for each category c, the new prototype is computed as:\n\np\n\n\u2032\nc = \u03bbc \u00b7 pc + (1 \u2212 \u03bbc) \u00b7 wc ,\n\n(4)\nwhere \u03bbc is the adaptive mixture coef\ufb01cient (conditioned on the category) and wc = g(ec) is a\ntransformed version of the label embedding for class c. The representation ec is the pre-trained\nword embedding of label c from W. This transformation g : Rnw \u2192 Rnp , parameterized by \u03b8g, is\nimportant to guarantee that both modalities lie on the space Rnp of the same dimension and can be\ncombined. The coef\ufb01cient \u03bbc is conditioned on category and calculated as follows:\n\n\u03bbc =\n\n1\n\n1 + exp(\u2212h(wc))\n\n,\n\n(5)\n\nwhere h is the adaptive mixing network, with parameters \u03b8h. Figure 2(left) illustrates the proposed\nmodel. The mixing coef\ufb01cient \u03bbc can be conditioned on different variables. In Appendix F we show\nhow performance changes when the mixing coef\ufb01cient is conditioned on different variables.\n\nThe training procedure is similar to that of the original prototypical networks. 
The training procedure is similar to that of the original prototypical networks. However, the distances d (used to calculate the distribution over classes for every image query) are now between the query and the cross-modal prototype p'_c:

p_\theta(y = c \mid q_t, S_e, W) = \frac{\exp(-d(f(q_t), p'_c))}{\sum_k \exp(-d(f(q_t), p'_k))} ,    (6)

where \theta = \{\theta_f, \theta_g, \theta_h\} is the set of parameters. Once again, the model is trained by minimizing Equation 1. Note that in this case the probability is also conditioned on the word embeddings W.

Figure 2(right) illustrates an example of how the proposed method works. Algorithm 1, in the supplementary material, shows the pseudocode for calculating the episode loss. We chose prototypical networks [47] for explaining our model due to their simplicity. We note, however, that AM3 can potentially be applied to any metric-based approach that calculates prototypical embeddings p_c for categories. As shown in the next section, we apply AM3 to both ProtoNets and TADAM [35]. TADAM is a task-dependent metric-based few-shot learning method, which currently performs best among all metric-based FSL methods.

4 Experiments

In this section we compare our model, AM3, with three different types of baselines: uni-modality few-shot learning methods, modality-alignment methods, and metric-based extensions of modality-alignment methods. We show that AM3 outperforms the state of the art of each family of baselines. We also verify the adaptiveness of AM3 through quantitative analysis.

4.1 Experimental Setup

We conduct our main experiments with two widely used few-shot learning datasets: miniImageNet [53] and tieredImageNet [39]. We also experiment on CUB-200 [55], a widely used zero-shot learning dataset. We evaluate on this dataset to provide a more direct comparison with modality-alignment methods, since most modality-alignment methods have no published results on few-shot datasets. We use GloVe [37] to extract the word embeddings for the category labels of the two image few-shot learning datasets. The embeddings are trained on large unsupervised text corpora.

More details about the three datasets can be found in Appendix B.

Baselines. We compare AM3 with three families of methods. The first is uni-modality few-shot learning methods such as MAML [7], LEO [41], Prototypical Nets [47] and TADAM [35]; LEO achieves the current state of the art among uni-modality methods. The second is modality-alignment methods; among them, CADA-VAE [44] has the best published results on both zero- and few-shot learning. To better extend modality-alignment methods to the few-shot setting, we also apply the metric-based loss and the episodic training of ProtoNets on their visual side to build a visual representation space that better fits the few-shot scenario. This yields the third family of baselines: modality-alignment methods extended to metric-based FSL.

Details of baseline implementations can be found in Appendix C.

AM3 Implementation. We test AM3 with two backbone metric-based few-shot learning methods: ProtoNets and TADAM. In our experiments, we use the stronger ProtoNets implementation of [35], which we call ProtoNets++. Prior to AM3, TADAM achieved the current state of the art among all metric-based few-shot learning methods. For details on network architectures, training and evaluation procedures, see Appendix D. Source code is released at https://github.com/ElementAI/am3.
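As an aside on the setup above, GloVe label embeddings can be obtained by scanning the published embedding text file, in which each line holds a token followed by its vector. The helper below is a hypothetical sketch: the file name, the zero-vector fallback for out-of-vocabulary words, and the averaging of multi-word labels are our assumptions, not details taken from the paper:

import numpy as np

def load_label_embeddings(class_labels, glove_path="glove.6B.300d.txt", dim=300):
    # Collect the individual words appearing in the class labels (e.g., "ping-pong ball").
    wanted = {w for label in class_labels for w in label.lower().split()}
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in wanted:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    embeddings = []
    for label in class_labels:
        words = [vectors.get(w, np.zeros(dim, dtype=np.float32)) for w in label.lower().split()]
        embeddings.append(np.mean(words, axis=0))   # average word vectors for multi-word labels
    return np.stack(embeddings)                     # shape [num_classes, dim]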
4.2 Results

Table 1 and Table 2 show classification accuracy on miniImageNet and on tieredImageNet, respectively. We draw several conclusions from these experiments. First, AM3 outperforms its backbone methods by a large margin in all cases tested. This indicates that, when properly employed, the text modality can very effectively boost performance in a metric-based few-shot learning framework.

Second, AM3 (with the TADAM backbone) achieves results superior to the current state of the art (both single-modality FSL and modality-alignment methods). The margin in performance is particularly remarkable in the 1-shot scenario. The margin of AM3 over uni-modality methods grows as the number of shots shrinks, indicating that the less visual content is available, the more important semantic information is for classification. Moreover, the margin of AM3 over modality-alignment methods is also larger with a smaller number of shots, indicating that the adaptiveness of AM3 is more effective when the visual modality provides less information. A more detailed analysis of the adaptiveness of AM3 is provided in Section 4.3.

Model                                    Test Accuracy
                                         5-way 1-shot     5-way 5-shot     5-way 10-shot

Uni-modality few-shot learning baselines
Matching Network [53]                    43.56 ± 0.84%    55.31 ± 0.73%    -
Prototypical Network [47]                49.42 ± 0.78%    68.20 ± 0.66%    74.30 ± 0.52%
Discriminative k-shot [2]                56.30 ± 0.40%    73.90 ± 0.30%    78.50 ± 0.00%
Meta-Learner LSTM [38]                   43.44 ± 0.77%    60.60 ± 0.71%    -
MAML [7]                                 48.70 ± 1.84%    63.11 ± 0.92%    -
ProtoNets w Soft k-Means [39]            50.41 ± 0.31%    69.88 ± 0.20%    -
SNAIL [32]                               55.71 ± 0.99%    68.80 ± 0.92%    -
CAML [16]                                59.23 ± 0.99%    72.35 ± 0.71%    -
LEO [41]                                 61.76 ± 0.08%    77.59 ± 0.12%    -

Modality alignment baselines
DeViSE [9]                               37.43 ± 0.42%    59.82 ± 0.39%    66.50 ± 0.28%
ReViSE [14]                              43.20 ± 0.87%    66.53 ± 0.68%    72.60 ± 0.66%
CBPL [29]                                58.50 ± 0.82%    75.62 ± 0.61%    -
f-CLSWGAN [57]                           53.29 ± 0.82%    72.58 ± 0.27%    73.49 ± 0.29%
CADA-VAE [44]                            58.92 ± 1.36%    73.46 ± 1.08%    76.83 ± 0.98%

Modality alignment baselines extended to metric-based FSL framework
DeViSE-FSL                               56.99 ± 1.33%    72.63 ± 0.72%    76.70 ± 0.53%
ReViSE-FSL                               57.23 ± 0.76%    73.85 ± 0.63%    77.21 ± 0.31%
f-CLSWGAN-FSL                            58.47 ± 0.71%    72.23 ± 0.45%    76.90 ± 0.38%
CADA-VAE-FSL                             61.59 ± 0.84%    75.63 ± 0.52%    79.57 ± 0.28%

AM3 and its backbones
ProtoNets++                              56.52 ± 0.45%    74.28 ± 0.20%    78.31 ± 0.44%
AM3-ProtoNets++                          65.21 ± 0.30%    75.20 ± 0.27%    78.52 ± 0.28%
TADAM [35]                               58.56 ± 0.39%    76.65 ± 0.38%    80.83 ± 0.37%
AM3-TADAM                                65.30 ± 0.49%    78.10 ± 0.36%    81.57 ± 0.47%

Table 1: Few-shot classification accuracy on the test split of miniImageNet. Results in the top block use only visual features. Modality alignment baselines are shown in the middle and our results (and their backbones) in the bottom block.

Model                                    Test Accuracy
                                         5-way 1-shot     5-way 5-shot

Uni-modality few-shot learning baselines
MAML† [7]                                51.67 ± 1.81%    70.30 ± 0.08%
Proto. Nets with Soft k-Means [39]       53.31 ± 0.89%    72.69 ± 0.74%
Relation Net† [50]                       54.48 ± 0.93%    71.32 ± 0.78%
Transductive Prop. Nets [28]             54.48 ± 0.93%    71.32 ± 0.78%
LEO [41]                                 66.33 ± 0.05%    81.44 ± 0.09%

Modality alignment baselines
DeViSE [9]                               49.05 ± 0.92%    68.27 ± 0.73%
ReViSE [14]                              52.40 ± 0.46%    69.92 ± 0.59%
CADA-VAE [44]                            58.92 ± 1.36%    73.46 ± 1.08%

Modality alignment baselines extended to metric-based FSL framework
DeViSE-FSL                               61.78 ± 0.43%    77.17 ± 0.81%
ReViSE-FSL                               62.77 ± 0.31%    77.27 ± 0.42%
CADA-VAE-FSL                             63.16 ± 0.93%    78.86 ± 0.31%

AM3 and its backbones
ProtoNets++                              58.47 ± 0.64%    78.41 ± 0.41%
AM3-ProtoNets++                          67.23 ± 0.34%    78.95 ± 0.22%
TADAM [35]                               62.13 ± 0.31%    81.92 ± 0.30%
AM3-TADAM                                69.08 ± 0.47%    82.58 ± 0.31%

Table 2: Few-shot classification accuracy on the test split of tieredImageNet. Results in the top block use only visual features. Modality alignment baselines are shown in the middle and our results (and their backbones) in the bottom block. †deeper net, evaluated in [28].

Finally, it is also worth noting that all modality-alignment baselines obtain a significant performance improvement when extended to the metric-based, episodic, few-shot learning framework. However, most of the modality-alignment methods (original and extended) perform worse than the current state-of-the-art uni-modality few-shot learning method. This indicates that although modality alignment is effective for cross-modal transfer in ZSL, it does not fit the few-shot scenario well. One possible reason is that, when the two modalities are aligned, some information from both sides can be lost, because two distinct structures are forced to align.

We also conducted few-shot learning experiments on CUB-200, a popular ZSL dataset, to better compare with the published results of modality-alignment methods. All the conclusions discussed above hold true on CUB-200. Moreover, we also conduct ZSL and generalized FSL experiments to verify the importance of the proposed adaptive mechanism. Results on this dataset are shown in Appendix E.

4.3 Adaptiveness Analysis

We argue that the adaptive mechanism is the main reason for the performance boosts observed in the previous section. We design an experiment to quantitatively verify that the adaptive mechanism of AM3 can adjust its focus on the two modalities reasonably and effectively.

Figure 3: (a) Accuracy vs. number of shots: comparison of AM3 and its corresponding backbone (ProtoNets++ and TADAM) for different numbers of shots. (b) λ vs. number of shots: average value of λ (over the whole validation set) for different numbers of shots, considering both backbones.

Figure 3(a) shows the accuracy of our model compared to the two backbones tested (ProtoNets++ and TADAM) on miniImageNet for 1-10 shot scenarios. It is clear from the plots that the gap between AM3 and the corresponding backbone shrinks as the number of shots increases. Figure 3(b) shows the mean and standard deviation (over the whole validation set) of the mixing coefficient λ for different shots and backbones.

First, we observe that the mean of λ correlates with the number of shots. This means that AM3 weighs the text modality more (and the visual one less) as the number of shots (hence, the number of visual data points) decreases. 
This trend suggests that AM3 can automatically adjust its focus more to text\nmodality to help classi\ufb01cation when information from the visual side is very low. Second, we can\nalso observe that the variance of \u03bb (shown in Figure 3(b)) correlates with the performance gap of\nAM3 and its backbone methods (shown in Figure 3(a)). When the variance of \u03bb decreases with the\nincrease of number of shots, the performance gap also shrinks. This indicates that the adaptiveness of\nAM3 on category level plays a very important role for the performance boost.\n\n5 Conclusion\n\nIn this paper, we propose a method that can adaptively and effectively leverage cross-modal informa-\ntion for few-shot classi\ufb01cation. The proposed method, AM3, boosts the performance of metric-based\napproaches by a large margin on different datasets and settings. Moreover, by leveraging unsupervised\ntextual data, AM3 outperforms state of the art on few-shot classi\ufb01cation by a large margin. The\ntextual semantic features are particularly helpful on the very low (visual) data regime (e.g. one-shot).\nWe also conduct quantitative experiments to show that AM3 can reasonably and effectively adjust its\nfocus on the two modalities.\n\nReferences\n\n[1] E. Bart and S. Ullman. Cross-generalization: learning novel classes from a single example by\n\nfeature replacement. In CVPR, 2005.\n\n8\n\n\f[2] Matthias Bauer, Mateo Rojas-Carulla, Jakub Bartlomiej Swikatkowski, Bernhard Scholkopf,\nand Richard E Turner. Discriminative k-shot learning using probabilistic models. In NIPS\nBayesian Deep Learning, 2017.\n\n[3] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a\nsynaptic learning rule. In Conference on Optimality in Biological and Arti\ufb01cial Networks, 1992.\n\n[4] John Bridle. Probabilistic interpretation of feedforward classi\ufb01cation network outputs with\nrelationships to statistical pattern recognition. Neurocomputing: Algorithms, Architectures and\nApplications, 1990.\n\n[5] Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron\n\nCourville, and Yoshua Bengio. Feature-wise transformations. Distill, 2018.\n\n[6] Michael Fink. Object classi\ufb01cation from a single example utilizing class relevance metrics. In\n\nNIPS, 2005.\n\n[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adapta-\n\ntion of deep networks. In ICML, 2017.\n\n[8] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In\n\nNeurIPS, 2018.\n\n[9] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al.\n\nDevise: A deep visual-semantic embedding model. In NIPS, 2013.\n\n[10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse recti\ufb01er neural networks. In\n\nAISTATS, 2011.\n\n[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.\n\n[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\n\nrecognition. In CVPR, 2016.\n\n[13] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient\n\ndescent. In ICANN, 2001.\n\n[14] Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. Learning robust visual-\n\nsemantic embeddings. In CVPR, 2017.\n\n[15] R Jackendoff. On beyond zebra: the relation of linguistic and visual information. 
Cognition,\n\n1987.\n\n[16] Xiang Jiang, Mohammad Havaei, Farshid Varno, Gabriel Chartrand, Nicolas Chapados, and\n\nStan Matwin. Learning to learn with conditional class dependencies. In ICLR, 2019.\n\n[17] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, H\u00e9rve J\u00e9gou, and Tomas\nMikolov. Fasttext.zip: Compressing text classi\ufb01cation models. arXiv preprint arXiv:1612.03651,\n2016.\n\n[18] Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn.\n\nBayesian model-agnostic meta-learning. In NeurIPS, 2018.\n\n[19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.\n\n[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep\n\nconvolutional neural networks. In NIPS, 2012.\n\n[21] Alexandre Lacoste, Thomas Boquet, Negar Rostamzadeh, Boris Oreshki, Wonchang Chung,\n\nand David Krueger. Deep prior. NIPS workshop, 2017.\n\n[22] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning\n\nof simple visual concepts. In Annual Meeting of the Cognitive Science Society, 2011.\n\n[23] Barbara Landau, Linda B Smith, and Susan S Jones. The importance of shape in early lexical\n\nlearning. Cognitive development, 1988.\n\n[24] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. Zero-data learning of new tasks. In\n\nAAAI, 2008.\n\n[25] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.\n\n[26] Yann Lecun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied\n\nto document recognition. In Proceedings of the IEEE, 1998.\n\n9\n\n\f[27] Fei-Fei Li, Rob Fergus, and Pietro Perona. One-shot learning of object categories. PAMI, 2006.\n\n[28] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, and Yi Yang. Transductive propagation\n\nnetwork for few-shot learning. ICLR, 2019.\n\n[29] Zhiwu Lu, Jiechao Guan, Aoxue Li, Tao Xiang, An Zhao, and Ji-Rong Wen. Zero and\nfew shot learning with semantic feature synthesis and competitive learning. arXiv preprint\narXiv:1810.08332, 2018.\n\n[30] Ellen M Markman. Categorization and naming in children: Problems of induction. MIT Press,\n\n1991.\n\n[31] Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A Murthy. A generative model\nfor zero shot learning using conditional variational autoencoders. In CVPR Workshops, 2018.\n\n[32] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive\n\nmeta-learner. In ICLR, 2018.\n\n[33] Alex Nichol, Joshua Achiam, and John Schulman. On \ufb01rst-order meta-learning algorithms.\n\narXiv, 2018.\n\n[34] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea\nFrome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of\nsemantic embeddings. ICLR, 2014.\n\n[35] Boris N Oreshkin, Alexandre Lacoste, and Pau Rodriguez. Tadam: Task dependent adaptive\n\nmetric for improved few-shot learning. In NeurIPS, 2018.\n\n[36] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. Zero-shot learning\n\nwith semantic output codes. In NIPS, 2009.\n\n[37] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word\n\nrepresentation. In EMNLP, 2014.\n\n[38] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR,\n\n2017.\n\n[39] Mengye Ren, Eleni Trianta\ufb01llou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenen-\nbaum, Hugo Larochelle, and Richard S Zemel. 
Meta-learning for semi-supervised few-shot\nclassi\ufb01cation. In ICLR, 2018.\n\n[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng\nHuang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual\nrecognition challenge. IJCV, 2015.\n\n[41] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon\nOsindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.\n\n[42] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap.\n\nMeta-learning with memory-augmented neural networks. In ICML, 2016.\n\n[43] Jurgen Schmidhuber. Evolutionary principles in self-referential learning. on learning now to\nlearn: The meta-meta-meta...-hook. Diploma thesis, Technische Universitat Munchen, Germany,\n1987.\n\n[44] Edgar Sch\u00f6nfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. General-\n\nized zero-and few-shot learning via aligned variational autoencoders. CVPR, 2019.\n\n[45] Linda Smith and Michael Gasser. The development of embodied cognition: Six lessons from\n\nbabies. Arti\ufb01cial life, 2005.\n\n[46] Linda B Smith and Lauren K Slone. A developmental approach to machine learning? Frontiers\n\nin psychology, 2017.\n\n[47] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In\n\nNIPS, 2017.\n\n[48] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning\n\nthrough cross-modal transfer. In NIPS, 2013.\n\n[49] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.\n\nDropout: A simple way to prevent neural networks from over\ufb01tting. JMLR, 2014.\n\n10\n\n\f[50] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales.\n\nLearning to compare: Relation network for few-shot learning. In CVPR, 2018.\n\n[51] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of\n\ninitialization and momentum in deep learning. In ICML, 2013.\n\n[52] Sebastian Thrun. Lifelong learning algorithms. Kluwer Academic Publishers, 1998.\n\n[53] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks\n\nfor one shot learning. In NIPS, 2016.\n\n[54] Yu-Xiong Wang, Ross B. Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning\n\nfrom imaginary data. In CVPR, 2018.\n\n[55] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD\n\nBirds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.\n\n[56] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning-a\n\ncomprehensive evaluation of the good, the bad and the ugly. PAMI, 2018.\n\n[57] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks\n\nfor zero-shot learning. In CVPR, 2018.\n\n[58] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and the\n\nugly. In CVPR, 2017.\n\n[59] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot\n\nlearning. In CVPR, pages 2021\u20132030, 2017.\n\n[60] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative\n\nadversarial approach for zero-shot learning from noisy texts. 
In CVPR, 2018.\n\n11\n\n\f", "award": [], "sourceid": 2692, "authors": [{"given_name": "Chen", "family_name": "Xing", "institution": "Montreal Institute of Learning Algorithms"}, {"given_name": "Negar", "family_name": "Rostamzadeh", "institution": "Elemenet AI"}, {"given_name": "Boris", "family_name": "Oreshkin", "institution": "Element AI"}, {"given_name": "Pedro", "family_name": "O. Pinheiro", "institution": "Element AI"}]}