{"title": "Learning to Model the Tail", "book": "Advances in Neural Information Processing Systems", "page_first": 7029, "page_last": 7039, "abstract": "We describe an approach to learning from long-tailed, imbalanced datasets that are prevalent in real-world settings. Here, the challenge is to learn accurate \"few-shot'' models for classes in the tail of the class distribution, for which little data is available. We cast this problem as transfer learning, where knowledge from the data-rich classes in the head of the distribution is transferred to the data-poor classes in the tail. Our key insights are as follows. First, we propose to transfer meta-knowledge about learning-to-learn from the head classes. This knowledge is encoded with a meta-network that operates on the space of model parameters, that is trained to predict many-shot model parameters from few-shot model parameters. Second, we transfer this meta-knowledge in a progressive manner, from classes in the head to the \"body'', and from the \"body'' to the tail. That is, we transfer knowledge in a gradual fashion, regularizing meta-networks for few-shot regression with those trained with more training data. This allows our final network to capture a notion of model dynamics, that predicts how model parameters are likely to change as more training data is gradually added. We demonstrate results on image classification datasets (SUN, Places, and ImageNet) tuned for the long-tailed setting, that significantly outperform common heuristics, such as data resampling or reweighting.", "full_text": "Learning to Model the Tail\n\nYu-Xiong Wang\n\nDeva Ramanan\n\nRobotics Institute, Carnegie Mellon University\n{yuxiongw,dramanan,hebert}@cs.cmu.edu\n\nMartial Hebert\n\nAbstract\n\nWe describe an approach to learning from long-tailed, imbalanced datasets that\nare prevalent in real-world settings. 
Here, the challenge is to learn accurate \u201cfew-\nshot\u201d models for classes in the tail of the class distribution, for which little data\nis available. We cast this problem as transfer learning, where knowledge from\nthe data-rich classes in the head of the distribution is transferred to the data-poor\nclasses in the tail. Our key insights are as follows. First, we propose to transfer\nmeta-knowledge about learning-to-learn from the head classes. This knowledge is\nencoded with a meta-network that operates on the space of model parameters, that\nis trained to predict many-shot model parameters from few-shot model parameters.\nSecond, we transfer this meta-knowledge in a progressive manner, from classes\nin the head to the \u201cbody\u201d, and from the \u201cbody\u201d to the tail. That is, we transfer\nknowledge in a gradual fashion, regularizing meta-networks for few-shot regression\nwith those trained with more training data. This allows our \ufb01nal network to capture\na notion of model dynamics, that predicts how model parameters are likely to\nchange as more training data is gradually added. We demonstrate results on\nimage classi\ufb01cation datasets (SUN, Places, and ImageNet) tuned for the long-tailed\nsetting, that signi\ufb01cantly outperform common heuristics, such as data resampling\nor reweighting.\n\n1 Motivation\n\nDeep convolutional neural networks (CNNs) have revolutionized the landscape of visual recognition,\nthrough the ability to learn \u201cbig models\u201d with hundreds of millions of parameters [1, 2, 3, 4]. Such\nmodels are typically learned with arti\ufb01cially balanced datasets [5, 6, 7], in which objects of different\nclasses have approximately evenly distributed, very large number of human-annotated images. In\nreal-world applications, however, visual phenomena follow a long-tailed distribution as shown in\nFig. 
1, in which the number of training examples per class varies signi\ufb01cantly from hundreds or\nthousands for head classes to as few as one for tail classes [8, 9, 10].\nLong-tail: Minimizing the skewed distribution by collecting more tail examples is a notoriously\ndif\ufb01cult task when constructing datasets [11, 6, 12, 10]. Even those datasets that are balanced along\none dimension still tend to be imbalanced in others [13]; e.g., balanced scene datasets still contain\nlong-tail sets of objects [14] or scene subclasses [8]. This intrinsic long-tail property poses a multitude\nof open challenges for recognition in the wild [15], since the models will be largely dominated by\nthose few head classes while degraded for many other tail classes. Rebalancing training data [16, 17]\nis the most widespread state-of-the-art solution, but this is heuristic and suboptimal \u2014 it merely\ngenerates redundant data through over-sampling or loses critical information through under-sampling.\nHead-to-tail knowledge transfer: An attractive alternative is to transfer knowledge from data-rich\nhead classes to data-poor tail classes. While transfer learning from a source to target task is a well\nstudied problem [18, 19], by far the most common approach is \ufb01ne-tuning a model pre-trained on the\nsource task [20]. In the long-tailed setting, this fails to provide any noticeable improvement since\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f(b) Knowledge transfer from head to tail classes.\n\n(a) Long-tail distribution on the SUN-397 dataset.\nFigure 1: Head-to-tail knowledge transfer in model space for long-tail recognition. Fig. 1a shows\nthe number of examples by scene class on SUN-397 [14], a representative dataset that follows an\nintrinsic long-tailed distribution. In Fig. 
1b, from the data-rich head classes (e.g., living rooms), we\nintroduce a meta-learner F to learn the model dynamics \u2014 a series of transformations (denoted\nas solid lines) that represents how few k-shot models \u03b8k start from \u03b81 and gradually evolve to the\nunderlying many-shot models \u03b8\u2217 trained from large sets of samples. The model parameters \u03b8 are\nvisualized as points in the \u201cdual\u201d model (parameter) space. We leverage the model dynamics as prior\nknowledge to facilitate recognizing tail classes (e.g., libraries) by hallucinating their model evolution\ntrajectories (denoted as dashed lines).\n\npre-training on the head is quite similar to training on the unbalanced long-tailed dataset (which is\ndominated by the head) [10].\nTransferring meta-knowledge: Inspired by the recent work on meta-learning [21, 22, 23, 24, 25, 26],\nwe instead transfer meta-level knowledge about learning to learn from the head classes. Speci\ufb01cally,\nwe make use of the approach of [21], which describes a method for learning from small datasets (the\n\u201cfew-shot\u201d learning problem) through estimating a generic model transformation. To do so, [21]\nlearns a meta-level network that operates on the space of model parameters, which is speci\ufb01cally\ntrained to regress many-shot model parameters (trained on large datasets) from few-shot model\nparameters (trained on small datasets). Our meta-level regressor, which we call MetaModelNet, is\ntrained on classes from the head of the distribution and then applied to those from the tail. As an\nillustrative example in Fig. 1, consider learning scene classi\ufb01ers on a long-tailed dataset with many\nliving-rooms but few outside libraries. We learn both many-shot and few-shot living-room models\n(by subsampling the training data as needed), and train a regressor that maps between the two. 
We\ncan then apply the regressor on few-shot models of libraries learned from the tail.\nProgressive transfer: The above description suggests that we need to split up a long-tailed training\nset into a distinct set of source classes (the head) and target classes (the tail). This is most naturally\ndone by thresholding the number of training examples per class. But what is the correct threshold?\nA high threshold might result in a meta-network that simply acts as an identity function, returning\nthe input set of model parameters. This certainly would not be useful to apply on few-shot models.\nSimilarly, a low threshold may not be useful when regressing from many-shot models. Instead,\nwe propose a \u201ccontinuous\u201d strategy that builds multiple regressors across a (logarithmic) range of\nthresholds (e.g., 1-shot, 2-shot, 4-shot regressors, etc.), corresponding to different head-tail splits.\nImportantly, these regressors can be ef\ufb01ciently implemented with a single, chained MetaModelNet\nthat is naturally regularized with residual connections, such that the 2-shot regressor need only predict\nmodel parameters that are fed into the 4-shot regressor, and so on (until the many-shot regressor\nthat defaults to the identity). By doing so, MetaModelNet encodes a trajectory over the space of\nmodel parameters that captures their evolution with increasing sample sizes, as shown in Fig. 1b.\nInterestingly, such a network is naturally trained in a progressive manner from the head towards the\ntail, effectively capturing the gradual dynamics of transferring meta-knowledge from data-rich to\ndata-poor regimes.\nModel dynamics: It is natural to ask what kind of dynamics are learned by MetaModelNet \u2014 how\ncan one consistently predict how model parameters will change with more training data? 
We posit that the network learns to capture implicit data augmentation — for example, given a 1-shot model trained with a single image, the network may learn to implicitly add rotations of that single image. But rather than explicitly creating data, MetaModelNet predicts their impact on the learned model parameters. Interestingly, past work tends to apply the same augmentation strategies across all input classes. But perhaps different classes should be augmented in different ways — e.g., churches may be viewed from consistent viewpoints and should not be augmented with out-of-plane rotations. MetaModelNet learns class-specific transformations that are smooth across the space of models — e.g., classes with similar model parameters tend to transform in similar ways (see Fig. 1b and Fig. 4 for more details).

Our contributions are three-fold. (1) We analyze the dynamics of how model parameters evolve when given access to more training examples. (2) We show that a single meta-network, based on deep residual learning, can learn to accurately predict such dynamics. (3) We train such a meta-network on long-tailed datasets through a recursive approach that gradually transfers meta-knowledge learned from the head to the tail, significantly improving long-tail recognition on a broad range of tasks.

2 Related Work

A widespread yet suboptimal strategy is to resample and rebalance training data in the presence of the long tail, either by sampling examples from the rare classes more frequently [16, 17], or reducing the number of examples from the common classes [27].
The former generates redundancy and quickly runs into the problem of over-fitting to the rare classes, whereas the latter loses critical information contained within the large-sample sets. An alternative practice is to introduce additional weights for different classes, which, however, makes optimization of the models very difficult in large-scale recognition scenarios [28].

Our underlying assumption that model parameters across different classes share similar dynamics is somewhat common in meta-learning [21, 22, 25]. While [22, 25] consider the dynamics during stochastic gradient descent (SGD) optimization, we address the dynamics as more training data is gradually made available. In particular, the model regression network from [21] empirically shows a generic nonlinear transformation from small-sample to large-sample models for different types of feature spaces and classifier models. We extend [21] for long-tail recognition by introducing a single network that can model transformations across different sample sizes. To train such a network, we introduce recursive algorithms for head-to-tail transfer learning and architectural modifications based on deep residual networks (that ensure that transformations of large-sample models default to the identity).

Our approach is broadly related to different meta-learning concepts such as learning-to-learn, transfer learning, and multi-task learning [29, 30, 18, 31]. Such approaches tend to learn shared structures from a set of relevant tasks and generalize to novel tasks. Specifically, our approach is inspired by early work on parameter prediction that modifies the weights of one network using another [32, 33, 34, 35, 36, 37, 38, 26, 39]. Such techniques have also been recently explored in the context of regressing classifier weights from training samples [40, 41, 42].
From an optimization perspective,\nour approach is related to work on learning to optimize, which replaces hand-designed update rules\n(e.g., SGD) with a learned update rule [22, 24, 25].\nThe most related formulation is that of one/few-shot learning [43, 44, 45, 46, 47, 48, 21, 49, 35,\n50, 23, 51, 52, 53, 54, 55, 56]. Past work has explored strategies of using the common knowledge\ncaptured among a set of one-shot learning tasks during meta-training for a novel one-shot learning\nproblem [52, 25, 35, 53]. These techniques, however, are typically developed for a \ufb01xed set of\nfew-shot tasks, in which each class has the same, \ufb01xed number of training samples. They appear\ndif\ufb01cult to generalize to novel tasks with a wide range of sample sizes, the hallmark of long-tail\nrecognition.\n\n3 Head-to-Tail Meta-Knowledge Transfer\n\nGiven a long-tail recognition task of interest and a base recognition model such as a deep CNN, our\ngoal is to transfer knowledge from the data-rich head to the data-poor tail classes. As shown in Fig. 1,\nknowledge is represented as trajectories in model space that capture the evolution of parameters with\nmore and more training examples. We train a meta-learner (MetaModelNet) to learn such model\ndynamics from head classes, and then \u201challucinate\u201d the evolution of parameters for the tail classes.\nTo simplify exposition, we \ufb01rst describe the approach for a \ufb01xed split of our training dataset into a\nhead and tail. We then generalize the approach to multiple splits.\n\n3\n\n\f(a) Learning a sample-size dependent transformation.\n\n(b) Structure of residual blocks.\n\nFigure 2: MetaModelNet architecture for learning model dynamics. We instantiate MetaModelNet\nas a deep residual network with residual blocks i = 0, 1, . . . , N in Fig. 
2a, which accepts few-shot model parameters θ (trained on small datasets across a logarithmic range of sample sizes k, k = 2^i) as (multiple) inputs and regresses them to many-shot model parameters θ* (trained on large datasets) as output. The skip connections ensure the identity regularization. Fi denotes the meta-learner that transforms (regresses) k-shot θ to θ*. Fig. 2b shows the structure of the residual blocks. Note that the meta-learners Fi for different k are derived from this single, chained meta-network, with nested circles (subnetworks) corresponding to Fi.

3.1 Fixed-size model transformations

Let us write Ht for the "head" training set of (x, y) data-label pairs constructed by assembling those classes for which there exist more than t training examples. We will use Ht to learn a meta-network that maps few-shot model parameters to many-shot parameters, and then apply this network on few-shot models from the tail classes. To do so, we closely follow the model regression framework from [21], but introduce notation that will be useful later. Let us write a base learner as g(x; θ), a feedforward function g(·) that processes an input sample x given parameters θ. We first learn a set of "optimal" model parameters θ* by tuning g on Ht with a standard loss function. We also learn few-shot models by randomly sampling a smaller fixed number of examples per class from Ht. We then train a meta-network F(·) to map or regress the few-shot parameters to θ*.

Parameters: In principle, F(·) applies to model parameters from multiple CNN layers. Directly regressing parameters from all layers is, however, difficult to do because of the large number of parameters. For example, recent similar methods for meta-learning tend to restrict themselves to smaller toy networks [22, 25].
For now, we focus on parameters from the last fully-connected layer for a single class — e.g., θ ∈ R^4096 for an AlexNet architecture. This allows us to learn regressors that are shared across classes (as in [21]), and so can be applied to any individual test class. This is particularly helpful in the long-tailed setting, where the number of classes in the tail tends to outnumber the head. Later we will show that (nonlinear) fine-tuning of the "entire network" during head-to-tail transfer can further improve performance.

Loss function: The meta-network F(·) is itself parameterized with weights w. The objective function for each class is:

Σ_{θ ∈ kShot(Ht)} { ||F(θ; w) − θ*||² + λ Σ_{(x,y) ∈ Ht} loss( g(x; F(θ; w)), y ) }.   (1)

The final loss is averaged over all the head classes and minimized with respect to w. Here, kShot(Ht) is the set of few-shot models learned by subsampling k examples per class from Ht, and loss refers to the performance loss used to train the base network (e.g., cross-entropy). λ > 0 is the regularization parameter used to control the trade-off between the two terms. [21] found that the performance loss was useful to learn regressors that maintained high accuracy on the base task. This formulation can be viewed as an extension of those in [21, 25]. With only the performance loss, Eqn. (1) reduces to the loss function in [25]. When the performance loss is evaluated on the subsampled set, Eqn. (1) reduces to the loss function in [21].

Training: What should be the value of k, for the k-shot models being trained?
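Before turning to that question, the objective in Eqn. (1) can be made concrete with a small sketch. This is not the paper's code: the linear form of F, the logistic base learner g, and all shapes and data are illustrative assumptions chosen so the two terms of the objective are easy to see.

```python
# Toy sketch of Eqn. (1): squared regression error to theta* plus a
# lambda-weighted performance loss of the regressed model on the head set.
# Assumptions (not from the paper): F(theta; w) = W @ theta, and
# g(x; theta) = sigmoid(theta . x) with a logistic performance loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(W, few_shot_thetas, theta_star, head_x, head_y, lam):
    """Eqn. (1) for one class: sum over theta in kShot(Ht)."""
    total = 0.0
    for theta in few_shot_thetas:          # theta in kShot(Ht)
        theta_hat = W @ theta              # F(theta; w)
        reg = np.sum((theta_hat - theta_star) ** 2)
        p = sigmoid(head_x @ theta_hat)    # g(x; F(theta; w))
        perf = -np.mean(head_y * np.log(p) + (1 - head_y) * np.log(1 - p))
        total += reg + lam * perf
    return total

rng = np.random.default_rng(0)
d = 8
theta_star = rng.normal(size=d)                    # many-shot parameters
few_shot = [theta_star + 0.5 * rng.normal(size=d) for _ in range(3)]
x = rng.normal(size=(32, d))                       # head data Ht
y = (sigmoid(x @ theta_star) > 0.5).astype(float)

# With W = I and lam = 0, the objective is just the summed squared
# distances between the few-shot models and theta*.
W = np.eye(d)
val = objective(W, few_shot, theta_star, x, y, lam=0.0)
expected = sum(np.sum((t - theta_star) ** 2) for t in few_shot)
assert np.isclose(val, expected)
```

Minimizing this over w (here, the entries of W) trades off matching θ* directly against keeping the regressed model accurate on the head data, as the λ term intends.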
One might be tempted to set k = t, but this implies that there will be some head classes near the cutoff that have only t training examples, implying θ and θ* will be identical. To ensure that a meaningful mapping is learned, we set

k = t/2.

In other terms, we intentionally learn very-few-shot models to ensure that target model parameters are sufficiently more general.

3.2 Recursive residual transformations

We wish to apply the above module on all possible head-tail splits of a long-tailed training set. To do so, we extend the above approach in three crucial ways:

• (Sample-size dependency) Generate a sequence of different meta-learners Fi, each tuned for a specific k, where k = k(i) is an increasing function of i (that will be specified shortly). Through a straightforward extension, prior work on model regression [21] learns a single fixed meta-learner for all the k-shot regression tasks.

• (Identity regularization) Ensure that the meta-learner defaults to the identity function for large i: Fi → I as i → ∞.

• (Compositionality) Compose meta-learners out of each other: ∀ i < j, Fi(θ) = Fj(Fij(θ)), where Fij is the regressor that maps between k(i)-shot and k(j)-shot models.

Here we dropped the explicit dependence of F(·) on w for notational simplicity. These observations emphasize the importance of (1) the identity regularization and (2) sample-size dependent regressors for long-tailed model transfer. We operationalize these extensions with a recursive residual network:

Fi(θ) = Fi+1(θ + f(θ; wi)),   (2)

where f denotes a residual block parameterized by wi and visualized in Fig. 2b. Inspired by [57, 21], f consists of batch normalization (BN) and leaky ReLU as pre-activation, followed by fully-connected weights.
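The recursion in Eqn. (2) can be sketched in a few lines. This is a simplified stand-in, not the trained network: batch normalization is omitted, each block is a single linear weight with leaky-ReLU pre-activation, and the block count and dimensions are toy values.

```python
# Sketch of the chained residual meta-network of Eqn. (2):
#   F_i(theta) = F_{i+1}(theta + f(theta; w_i)),  with F beyond the last
# block defaulting to the identity. BN is omitted in this sketch.
import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def residual_block(theta, w):
    # Pre-activation (leaky ReLU) followed by a fully-connected weight.
    return w @ leaky_relu(theta)

def meta_model_net(theta, weights, i):
    """Apply F_i: feed a k(i)-shot model into block i and run the chain
    through the last block, mirroring Fig. 2a."""
    for w in weights[i:]:
        theta = theta + residual_block(theta, w)  # Eqn. (2)
    return theta

d, n_blocks = 4, 3
# Zero residual weights make every block, and hence every F_i, an identity,
# illustrating the identity regularization provided by skip connections:
zero_weights = [np.zeros((d, d)) for _ in range(n_blocks)]
theta = np.arange(d, dtype=float)
out = meta_model_net(theta, zero_weights, i=0)
assert np.allclose(out, theta)
```

Feeding a model in at a later block index i applies fewer residual corrections, which is how the single chain realizes every sample-size-dependent regressor Fi at once.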
By construction, each residual block transforms an input k(i)-shot model to a k(i + 1)-shot model. The final MetaModelNet can be efficiently implemented through a chained network of N + 1 residual blocks, as shown in Fig. 2a. By feeding in a few-shot model at a particular block, we can derive any meta-learner Fi from the central underlying chain.

3.3 Training

Given the network structure defined above, we now describe an efficient method for training based on two insights. (1) The recursive definition of MetaModelNet suggests a recursive strategy for training. We begin with the last block and train it with the largest threshold (e.g., those few classes in the head with many examples). The associated k-shot regressor should be easy to learn because it is similar to an identity mapping. Given the learned parameters for the last block, we then train the next-to-last block, and so on. (2) Inspired by the general observation that recognition performance improves on a logarithmic scale as the number of training samples increases [8, 9, 58], we discretize blocks accordingly, to be tuned for 1-shot, 2-shot, 4-shot, ... recognition. In terms of notation, we write the recursive training procedure as follows. We iterate over blocks i from N to 0, and for each i:

• Using Eqn. (1), train parameters of the residual block wi on the head split Ht with k-shot model regression, where k = 2^i and t = 2k = 2^(i+1).

The above "back-to-front" training procedure works because whenever block i is trained, all subsequent blocks (i + 1, . . . , N) have already been trained. In practice, rather than holding all subsequent blocks fixed, it is natural to fine-tune them while training block i. One approach might be fine-tuning them on the current k = 2^i-shot regression task being considered at iteration i.
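The back-to-front schedule above amounts to a simple enumeration, sketched below. The class names and example counts are made-up toy values; only the (i, k, t) bookkeeping follows the text.

```python
# Sketch of the back-to-front training schedule of Section 3.3: iterate
# blocks i = N, ..., 0 with k = 2^i shots and head threshold t = 2k.
def training_schedule(n):
    """Yield (block index i, shots k, head threshold t) from back to front."""
    for i in range(n, -1, -1):
        k = 2 ** i
        yield i, k, 2 * k

schedule = list(training_schedule(3))
assert schedule[0] == (3, 8, 16)   # last block first, largest threshold
assert schedule[-1] == (0, 1, 2)   # 1-shot regression trained last

# Head split Ht at each iteration: classes with more than t examples.
# (Toy counts; a true long-tailed dataset has hundreds of classes.)
class_counts = {"living_room": 500, "bedroom": 120, "library": 1}
for i, k, t in schedule:
    head = [c for c, n_ex in class_counts.items() if n_ex > t]
    assert "library" not in head  # a 1-example tail class never enters Ht
```

Note how the head split Ht shrinks as training proceeds toward the front blocks only in threshold, not in membership direction: smaller t admits more classes, so earlier-trained (large-t) blocks see only the best-populated head classes.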
But because MetaModelNet will be applied across a wide range of k, we fine-tune blocks in a multi-task manner across the current viable range of k = (2^i, 2^(i+1), . . . , 2^N) at each iteration i.

3.4 Implementation details

We learn the CNN models on the long-tailed recognition datasets in different scenarios: (1) using a CNN pre-trained on ILSVRC 2012 [1, 59, 60] as the off-the-shelf feature; (2) fine-tuning the pre-trained CNN; and (3) training a CNN from scratch. We use ResNet152 [4] for its state-of-the-art performance, and ResNet50 [4] and AlexNet [1] for their lower computational cost.

When training residual block i, we use the corresponding threshold t and obtain Ct head classes. We generate the Ct-way many-shot classifiers on Ht. For few-shot models, we learn Ct-way k-shot classifiers on random subsets of Ht. Through random sampling, we generate S model mini-batches, where each model mini-batch consists of Ct weight vector pairs. In addition, to minimize the loss function (1), we randomly sample 256 image-label pairs as a data mini-batch from Ht.

We then use Caffe [59] to train our MetaModelNet on the generated model and data mini-batches with standard SGD. λ is cross-validated. We use 0.01 as the negative slope for leaky ReLU. Computation is naturally divided into two stages: (1) training a collection of few/many-shot models and (2) learning MetaModelNet from those models. (2) is equivalent to progressively learning a nonlinear regressor. (1) can be made efficient because it is naturally parallelizable across models, and moreover, many models make use of only small training sets.

4 Experimental Evaluation

In this section, we explore the use of our MetaModelNet on long-tail recognition tasks. We begin with extensive evaluation of our approach on scene classification of the SUN-397 dataset [14], and address the meta-network variations and different design choices.
We then visualize and empirically analyze\nthe learned model dynamics. Finally, we evaluate on the challenging large-scale, scene-centric\nPlaces [7] and object-centric ImageNet datasets [5] and show the generality of our approach.\n\n4.1 Evaluation and analysis on SUN-397\n\nDataset and task: We start our evaluation by \ufb01ne-tuning a pre-trained CNN on SUN-397, a medium-\nscale, long-tailed dataset with 397 classes and 100\u20132,361 images per class [14]. To better analyze\ntrends due to skewed distributions, we carve out a more extreme version of the dataset. Following\nthe experimental setup in [61, 62, 63], we \ufb01rst randomly split the dataset into train, validation, and\ntest parts using 50%, 10%, and 40% of the data, respectively. The distribution of classes is uniform\nacross all the three parts. We then randomly discard 49 images per class for the train part, leading\nto a long-tailed training set with 1\u20131,132 images per class (median 47). Similarly, we generate a\nsmall long-tailed validation set with 1\u2013227 images per class (median 10), which we use for learning\nhyper-parameters. We also randomly sample 40 images per class for the test part, leading to a\nbalanced test set. We report 397-way multi-class classi\ufb01cation accuracy averaged over all classes.\n\n4.1.1 Comparison with state-of-the-art approaches\n\nWe \ufb01rst focus on \ufb01ne-tuning the classi\ufb01er module while freezing the representation module of a\npre-trained ResNet152 CNN model [4, 63] for its state-of-the-art performance. Using MetaModelNet,\nwe learn the model dynamics of the classi\ufb01er module, i.e., how the classi\ufb01er weight vectors change\nduring \ufb01ne-tuning. Following the design choices in Section 3.2, our MetaModelNet consists of 7\nresidual blocks. For few-shot models, we generate S = 1000 1-shot, S = 500 2-shot, and S = 200\n4-shot till 64-shot models from the head classes for learning MetaModelNet. 
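The few-shot model generation described above can be sketched as follows. The sampling procedure mirrors the text (S subsets of k examples per head class, with S = 1000 for 1-shot, 500 for 2-shot, and 200 for 4- through 64-shot); the class names and toy example pools are illustrative assumptions, and training the actual k-shot classifiers on each subset is elided.

```python
# Sketch of generating the S random k-shot training subsets per head class,
# from which k-shot classifier weights would then be trained.
import random

# S per k, following the counts stated in the text (4-shot..64-shot: 200).
S_PER_K = {1: 1000, 2: 500, **{2 ** i: 200 for i in range(2, 7)}}

def kshot_subsets(head_examples, k, s, seed=0):
    """Return s subsets, each holding k examples sampled per head class."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(s):
        subsets.append({c: rng.sample(ex, k) for c, ex in head_examples.items()})
    return subsets

# Toy head split: example indices per class (real pools are image lists).
head = {"living_room": list(range(500)), "kitchen": list(range(300))}
subs = kshot_subsets(head, k=2, s=5)
assert len(subs) == 5
assert all(len(v) == 2 for sub in subs for v in sub.values())
assert sorted(S_PER_K) == [1, 2, 4, 8, 16, 32, 64]
```

Because each subset is tiny, the per-subset classifier training referenced in stage (1) of Section 3.4 is cheap and embarrassingly parallel.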
At test time, given the weight vectors of all the classes learned through fine-tuning, we feed them as inputs to the different residual blocks according to the training sample size of the corresponding class. We then "hallucinate" the dynamics of these weight vectors and use the outputs of MetaModelNet to modify the parameters of the final recognition model as in [21].

Baselines: In addition to the "plain" baseline that fine-tunes on the target data following the standard practice, we compare against three state-of-the-art baselines that are widely used to address imbalanced distributions. (1) Over-sampling [16, 17], which uses balanced sampling via label shuffling as in [16, 17]. (2) Under-sampling [27], which reduces the number of samples per class to 47 at most (the median value). (3) Cost-sensitive [28], which introduces additional weights in the loss function for each class with inverse class frequency. For a fair comparison, fine-tuning is performed for around 60 epochs using SGD with an initial learning rate of 0.01, which is reduced by a factor of 10 around every 30 epochs. All the other hyper-parameters are the same for all approaches.

Method  | Plain [4] | Over-Sampling [16, 17] | Under-Sampling [27] | Cost-Sensitive [28] | MetaModelNet (Ours)
Acc (%) | 48.03     | 52.61                  | 51.72               | 52.37               | 57.34

Table 1: Performance comparison between our MetaModelNet and state-of-the-art approaches for long-tailed scene classification when fine-tuning the pre-trained ILSVRC ResNet152 on the SUN-397 dataset. We focus on learning the model dynamics of the classifier module while freezing the CNN representation module. By benefiting from the generic model dynamics learned from head classes, ours significantly outperforms all the baselines for long-tail recognition.

Figure 3: Detailed per-class performance comparison between our MetaModelNet and the state-of-the-art over-sampling approach for long-tailed scene classification on the SUN-397 dataset. X-axis: class index. Y-axis (left): per-class classification accuracy improvement relative to the plain baseline. Y-axis (right): number of training examples. Ours significantly improves for the few-shot tail classes.

Table 1 summarizes the performance comparison averaged over all classes and Fig. 3 details the per-class comparison. Table 1 shows that our MetaModelNet provides a promising way of encoding the shared structure across classes in model space. It outperforms existing approaches for long-tail recognition by a large margin. Fig. 3 shows that our approach significantly improves accuracy in the tail.

4.1.2 Ablation analysis

We now evaluate variations of our approach and provide ablation analysis. Similarly to Section 4.1.1, we use ResNet152 in the first two sets of experiments and only fine-tune the classifier module. In the last set of experiments, we use ResNet50 [4] for its lower computational cost and fine-tune through the entire network. Tables 2 and 3 summarize the results.

Sample-size dependent transformation and identity regularization: We compare to [21], which learns a single transformation for a variety of sample sizes and k-shot models and, importantly, learns a network without identity regularization. For a fair comparison, we consider a variant of MetaModelNet trained on a fixed head-tail split, selected by cross-validation. Table 2 shows that training for a fixed sample size and identity regularization provide a noticeable performance boost (2%).

Recursive class splitting: Adding multiple head-tail splits through recursion further improves accuracy by a small but noticeable amount (0.5% as shown in Table 2).
We posit that progressive\nknowledge transfer outperforms the traditional approach because ordering classes by frequency is a\nnatural form of curriculum learning.\nJoint feature \ufb01ne-tuning and model dynamics learning: We also explore (nonlinear) \ufb01ne-tuning\nof the \u201centire network\u201d during head-to-tail transfer by jointly learning the classi\ufb01er dynamics and the\nfeature representation using ResNet50. We explore two approaches as follows. (1) We \ufb01rst \ufb01ne-tune\n\n7\n\n050100150200250300350400Class index-40-20020406080Relative accuracy gain (%)020040060080010001200# OccurrencesMetaModelNet (Ours) Over-Sampling\fMethod Model Regression [21] MetaModelNet+Fix Split (Ours) MetaModelNet+ Recur Split (Ours)\nAcc (%)\n\n57.34\n\n56.86\n\n54.68\n\nTable 2: Ablation analysis of variations of our MetaModelNet. In a \ufb01xed head-tail split, ours outper-\nforms [21], showing the merit of learning a sample-size dependent transformation. By recursively\npartitioning the entire classes into different head-tail splits, our performance is further improved.\n\nScenario\nMethod\nAcc (%)\n\nPre-Trained Features\n\nPlain [4] MetaModelNet (Ours)\n46.90\n\n54.99\n\nPlain [4]\n49.40\n\nFine-Tuned Features (FT)\n\nFix FT + MetaModelNet (Ours) Recur FT + MetaModelNet (Ours)\n\n58.53\n\n58.74\n\nTable 3: Ablation analysis of joint feature \ufb01ne-tuning and model dynamics learning on a ResNet50\nbase network. Though results with pre-trained features underperform those with a deeper base\nnetwork (ResNet152, the default in our experiments), \ufb01ne-tuning such features signi\ufb01cantly improves\nresults, even outperforming the deeper base network. 
By progressively fine-tuning the representation during the recursive training of MetaModelNet, performance significantly improves from 54.99% (changing only the classifier weights) to 58.74% (changing the entire CNN).

the whole CNN on the entire long-tailed training dataset, and then learn the classifier dynamics using the fixed, fine-tuned representation. (2) During the recursive head-tail splitting, we fine-tune the entire CNN on the current head classes in H_t (while learning the many-shot parameters θ*), and then learn the classifier dynamics using the fine-tuned features. Table 3 shows that progressively learning the classifier dynamics while fine-tuning the features performs best.

4.2 Understanding model dynamics

Because model dynamics are highly nonlinear, a theoretical analysis is rather challenging and outside the scope of this work; here we provide some empirical analysis instead. When analyzing the "dual model (parameter) space", in which model parameters θ can be viewed as points, Fig. 4 shows that our MetaModelNet learns an approximately-smooth, nonlinear warping of this space that transforms (few-shot) input points to (many-shot) output points. For example, iceberg and mountain scene classes are more similar to each other than to bedrooms. This implies that few-shot iceberg and mountain scene models lie near each other in parameter space and, moreover, transform in similar ways (compared to bedrooms). A single meta-network hence encodes class-specific model transformations. We posit that the transformation may capture some form of (class-specific) data augmentation. Finally, we find that some properties of the learned transformations are quite class-agnostic and apply in general.
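This kind of dual-space analysis is easy to reproduce with standard dimensionality-reduction tools. Below is a minimal sketch in which synthetic parameter vectors stand in for actual few-shot and many-shot model weights; the drift-plus-noise dynamics are an assumption made purely for illustration.

```python
import numpy as np

def pca_project(X, k=2):
    """Project rows of X (one model's parameter vector per row)
    onto their top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
d = 256  # toy parameter dimensionality (e.g., 2048 for ResNet's last layer)
shots = np.array([1, 2, 5, 10, 20, 50, 100])
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
# Synthetic dynamics: parameters drift along a direction and grow in norm
# as the number of training examples increases.
thetas = np.stack([(0.5 + 0.3 * np.log(k)) * direction + 0.01 * rng.normal(size=d)
                   for k in shots])
coords = pca_project(thetas)  # 2-D points, e.g. for a scatter plot
```

Coloring the projected points by shot count (purple for 1-shot through red for many-shot, as in Fig. 4) traces out the trajectory of a model in parameter space as training data is added.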
Many-shot model parameters tend to have larger norms than few-shot ones (e.g., on SUN-397, the average norm of 1-shot models is 0.53; after transformation through MetaModelNet, the average norm of the output models becomes 1.36). This is consistent with the common empirical observation that classifier weights tend to grow with the amount of training data, suggesting that classifiers become more confident in their predictions.

4.3 Generalization to other tasks and datasets

We now focus on the more challenging, large-scale scene-centric Places [7] and object-centric ImageNet [5] datasets. While the previous experiments mainly addressed model dynamics when fine-tuning a pre-trained CNN, here we train AlexNet models [1] from scratch on the target tasks. Table 4 demonstrates the generality of our approach: MetaModelNet facilitates recognition on other long-tailed datasets with significantly different visual concepts and distributions.

Scene classification on the Places dataset: Places-205 [7] is a large-scale dataset that contains 2,448,873 training images approximately evenly distributed across 205 classes. To generate its long-tailed version and better analyze trends due to skewed distributions, we distribute it according to the distribution of SUN and carve out a more extreme version (p2, i.e., twice the slope in the log-log plot) of the Places training portion, leading to a long-tailed training set with 5–9,900 images per class (median 73). We use the provided validation portion as our test set, with 100 images per class.

Object classification on the ImageNet dataset: The ILSVRC 2012 classification dataset [5] contains 1,000 classes with 1.2 million training images (approximately balanced between the classes) and 50K validation images.
(a) PCA visualization. (b) t-SNE visualization.

Figure 4: Visualizing model dynamics. Recall that θ is a fixed-dimensional vector of model parameters (e.g., θ ∈ R^2048 when considering parameters from the last layer of ResNet). We visualize models as points in this "dual" space. Specifically, we examine the evolution of the parameters predicted by MetaModelNet under dimensionality reduction: PCA (Fig. 4a) or t-SNE [64] (Fig. 4b). Models from 1-shot (purple) to many-shot (red) are plotted in rainbow order. These visualizations show that MetaModelNet learns an approximately-smooth, nonlinear warping of this space that transforms (few-shot) input points to (many-shot) output points. PCA suggests that many-shot models tend to have larger norms, while t-SNE (which nonlinearly maps nearby points to stay close) suggests that semantically similar classes tend to be close and to transform in similar ways; e.g., the blue rectangle encompasses "room" classes while the red rectangle encompasses "wintry outdoor" classes.

Dataset    Places-205 [7]                   ILSVRC-2012 [5]
Method     Plain [1]  MetaModelNet (Ours)   Plain [1]  MetaModelNet (Ours)
Acc (%)    23.53      30.71                 68.85      73.46

Table 4: Performance comparison on long-tailed versions of the large-scale scene-centric Places [7] and object-centric ImageNet [5] datasets. Our MetaModelNet facilitates long-tail recognition across significantly different visual concepts and distributions.

There are 200 classes used for object detection, which are defined as higher-level classes of the original 1,000 classes.
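To make this kind of construction concrete, here is a generic sketch of carving a long-tailed subset out of a balanced dataset: per-class counts follow a power law, which appears as a straight line of slope -alpha in a log-log plot (doubling alpha gives a "2x slope" variant). The function name and parameters are illustrative assumptions; this does not reproduce the exact SUN-matched distribution used above.

```python
import numpy as np

def long_tail_counts(num_classes, n_max, n_min, alpha):
    """Per-class training-set sizes n_c proportional to rank^(-alpha),
    clipped to [n_min, n_max]; classes are ranked by frequency."""
    ranks = np.arange(1, num_classes + 1, dtype=float)
    counts = n_max * ranks ** (-alpha)
    return np.maximum(counts, n_min).round().astype(int)

# e.g., a 205-class long-tailed split with 5 to 9,900 images per class
counts = long_tail_counts(num_classes=205, n_max=9900, n_min=5, alpha=2.0)
# A balanced dataset is then subsampled so that class c keeps counts[c] images.
```

Sorting the resulting counts against class rank on log-log axes recovers the straight-line (power-law) shape, with the clipping at `n_min` forming the flat tail.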
Taking the ILSVRC 2012 classification dataset and merging the 1,000 classes into the 200 higher-level classes, we obtain a natural long-tailed distribution.

5 Conclusions

In this work we proposed a conceptually simple but powerful approach to the problem of long-tail recognition through knowledge transfer from the head to the tail of the class distribution. Our key insight is to represent the model dynamics through meta-learning, i.e., how a recognition model transforms and evolves during the learning process as it gradually encounters more training examples. To do so, we introduce a meta-network that learns to progressively transfer meta-knowledge from the head to the tail classes. We present several state-of-the-art results on benchmark datasets (SUN, Places, and ImageNet) tuned for the long-tailed setting, which significantly outperform common heuristics such as data resampling or reweighting.

Acknowledgments. We thank Liangyan Gui, Olga Russakovsky, Yao-Hung Hubert Tsai, and Ruslan Salakhutdinov for valuable and insightful discussions. This work was supported in part by ONR MURI N000141612007 and the U.S. Army Research Laboratory (ARL) under the Collaborative Technology Alliance Program, Cooperative Agreement W911NF-10-2-0016. DR was supported in part by the National Science Foundation (NSF) under grant number IIS-1618903, Google, and Facebook. We also thank NVIDIA for donating GPUs and the AWS Cloud Credits for Research program.

References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[7] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.
[8] X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. In CVPR, 2014.
[9] X. Zhu, C. Vondrick, C. C. Fowlkes, and D. Ramanan. Do we need more training data? IJCV, 119(1):76–92, 2016.
[10] G. Van Horn and P. Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.
[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[12] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.
[13] W. Ouyang, X. Wang, C. Zhang, and X. Yang. Factors in finetuning deep model for object detection with long-tail distribution. In CVPR, 2016.
[14] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva. SUN database: Exploring a large collection of scene categories. IJCV, 119(1):3–22, 2016.
[15] S. Bengio. Sharing representations for long tail computer vision problems. In ICMI, 2015.
[16] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 2016.
[17] Q. Zhong, C. Li, Y. Zhang, H. Sun, S. Yang, D. Xie, and S. Pu. Towards good practices for recognition & detection. In CVPR workshops, 2016.
[18] S. J. Pan and Q. Yang. A survey on transfer learning. TKDE, 22(10):1345–1359, 2010.
[19] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[20] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[21] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
[22] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.
[23] Y.-X. Wang and M. Hebert. Learning from small sample sets by combining unsupervised meta-training with CNNs. In NIPS, 2016.
[24] K. Li and J. Malik. Learning to optimize. In ICLR, 2017.
[25] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[26] A. Sinha, M. Sarkar, A. Mukherjee, and B. Krishnamurthy. Introspection: Accelerating neural network training by learning weight evolution. In ICLR, 2017.
[27] H. He and E. A. Garcia. Learning from imbalanced data. TKDE, 21(9):1263–1284, 2009.
[28] C. Huang, Y. Li, C. C. Loy, and X. Tang. Learning deep representation for imbalanced classification. In CVPR, 2016.
[29] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
[30] J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement. Machine Learning, 28(1):105–130, 1997.
[31] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[32] J. Schmidhuber. Evolutionary principles in self-referential learning. (On learning how to learn: The meta-meta-... hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
[33] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[34] J. Schmidhuber. A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, 1993.
[35] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
[36] D. Ha, A. Dai, and Q. V. Le. Hypernetworks. In ICLR, 2017.
[37] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[38] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, 2017.
[39] T. Munkhdalai and H. Yu. Meta networks. In ICML, 2017.
[40] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[41] J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
[42] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.
[43] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 28(4):594–611, 2006.
[44] Y.-X. Wang and M. Hebert. Model recommendation: Generating object detectors from few samples. In CVPR, 2015.
[45] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML workshops, 2015.
[46] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[47] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. One-shot learning with memory-augmented neural networks. In ICML, 2016.
[48] Y.-X. Wang and M. Hebert. Learning by transferring from unsupervised universal sources. In AAAI, 2016.
[49] Z. Li and D. Hoiem. Learning without forgetting. In ECCV, 2016.
[50] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
[51] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):669–688, 1993.
[52] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
[53] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[54] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong. Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35(1):112–125, 2018.
[55] D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, A. Lavin, and D. S. Phoenix. A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs. Science, 2017.
[56] E. Triantafillou, R. Zemel, and R. Urtasun. Few-shot learning through an information retrieval lens. In NIPS, 2017.
[57] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[58] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[59] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
[60] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[61] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In ECCV, 2014.
[62] M. Huh, P. Agrawal, and A. A. Efros. What makes ImageNet good for transfer learning? In NIPS workshops, 2016.
[63] Y.-X. Wang, D. Ramanan, and M. Hebert. Growing a brain: Fine-tuning by increasing model capacity. In CVPR, 2017.
[64] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.