{"title": "Maximum-Entropy Fine Grained Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 637, "page_last": 647, "abstract": "Fine-Grained Visual Classification (FGVC) is an important computer vision problem that involves small diversity within the different classes, and often requires expert annotators to collect data. Utilizing this notion of small visual diversity, we revisit Maximum-Entropy learning in the context of fine-grained classification, and provide a training routine that maximizes the entropy of the output probability distribution for training convolutional neural networks on FGVC tasks. We provide a theoretical as well as empirical justification of our approach, and achieve state-of-the-art performance across a variety of classification tasks in FGVC, that can potentially be extended to any fine-tuning task. Our method is robust to different hyperparameter values, amount of training data and amount of training label noise and can hence be a valuable tool in many similar problems.", "full_text": "Maximum Entropy Fine-Grained Classi\ufb01cation\n\nAbhimanyu Dubey Otkrist Gupta Ramesh Raskar Nikhil Naik\n\n{dubeya, otkrist, raskar, naik}@mit.edu\n\nMassachusetts Institute of Technology\n\nCambridge, MA, USA\n\nAbstract\n\nFine-Grained Visual Classi\ufb01cation (FGVC) is an important computer vision prob-\nlem that involves small diversity within the different classes, and often requires\nexpert annotators to collect data. Utilizing this notion of small visual diversity,\nwe revisit Maximum-Entropy learning in the context of \ufb01ne-grained classi\ufb01cation,\nand provide a training routine that maximizes the entropy of the output probability\ndistribution for training convolutional neural networks on FGVC tasks. 
We provide\na theoretical as well as empirical justi\ufb01cation of our approach, and achieve state-\nof-the-art performance across a variety of classi\ufb01cation tasks in FGVC, that can\npotentially be extended to any \ufb01ne-tuning task. Our method is robust to different\nhyperparameter values, amount of training data and amount of training label noise\nand can hence be a valuable tool in many similar problems.\n\n1\n\nIntroduction\n\nFor ImageNet [7] classi\ufb01cation and similar large-scale classi\ufb01cation tasks that span numerous diverse\nclasses and millions of images, strongly discriminative learning by minimizing the cross-entropy\nfrom the labels improves performance for convolutional neural networks (CNNs). Fine-grained\nvisual classi\ufb01cation problems differ from such large-scale classi\ufb01cation in two ways: (i) the classes\nare visually very similar to each other and are harder to distinguish between (see Figure 1a), and\n(ii) there are fewer training samples and therefore the training dataset might not be representative\nof the application scenario. Consider a technique that penalizes strongly discriminative learning,\nby preventing a CNN from learning a model that memorizes speci\ufb01c artifacts present in training\nimages in order to minimize the cross-entropy loss from the training set. This is helpful in \ufb01ne-\ngrained classi\ufb01cation: for instance, if a certain species of bird is mostly photographed against a\ndifferent background compared to other species, memorizing the background will lower generalization\nperformance while lowering training cross-entropy error, since the CNN will associate the background\nto the bird itself.\nIn this paper, we formalize this intuition and revisit the classical Maximum-Entropy regime, based on\nthe following underlying idea: the entropy of the probability logit vector produced by the CNN is a\nmeasure of the \u201cpeakiness\u201d or \u201ccon\ufb01dence\u201d of the CNN. 
Learning CNN models that have a higher value of output entropy will reduce the "confidence" of the classifier, leading to better generalization when training with limited, fine-grained training data. Our contributions can be listed as follows: (i) we formalize the notion of "fine-grained" vs. "large-scale" image classification based on a measure of diversity of the features, (ii) we derive bounds on the ℓ2 regularization of classifier weights based on this diversity and the entropy of the classifier, (iii) we provide uniform convergence bounds on estimating entropy from samples in terms of feature diversity, (iv) we formulate a fine-tuning objective function that obtains state-of-the-art performance on the five most commonly used FGVC datasets across six widely-used CNN architectures, and (v) we analyze the effect of Maximum-Entropy training over different hyperparameter values, amounts of training data, and amounts of training label noise to demonstrate that our method is consistently robust to all the above.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: (a) Samples from the CUB-200-2011 FGVC (top) and ImageNet (bottom) datasets. (b) Plot of the top 2 principal components (obtained from the ILSVRC training set on GoogLeNet pool5 features) on the ImageNet (red) and CUB-200-2011 (blue) validation sets. CUB-200-2011 data is concentrated with less diversity, as hypothesized.

2 Related Work

Maximum-Entropy Learning: The principle of Maximum-Entropy, proposed by Jaynes [16], is a classic idea in Bayesian statistics, and states that the probability distribution best representing the current state of knowledge is the one with the largest entropy, in the context of testable information (such as accuracy). 
This idea has been explored in different domains of science, from statistical mechanics [1] and Bayesian inference [12] to unsupervised learning [8] and reinforcement learning [29, 27]. Regularization methods that penalize minimum-entropy predictions have been explored in the context of semi-supervised learning [11], and in deterministic entropy annealing [36] for vector quantization. In the domain of machine learning, the regularization of the entropy of classifier weights has been used empirically [4, 42] and studied theoretically [37, 49].
In most treatments of the Maximum-Entropy principle in classification, emphasis has been given to the entropy of the weights of classifiers themselves [37]. In our formulation, we focus instead on the Maximum-Entropy principle applied to the prediction vectors. This formulation has been explored experimentally in the work of Pereyra et al. [33] for generic image classification. Our work builds on their analysis by providing a theoretical treatment of fine-grained classification problems, and justifies the application of Maximum-Entropy to target scenarios with limited diversity between classes and limited training data. Additionally, we obtain large improvements in fine-grained classification, which motivates the usage of the Maximum-Entropy training principle in the fine-tuning setting, opening up this idea to a much broader range of applied computer vision problems. We also note the related idea of label smoothing regularization [41], which tries to prevent the largest logit from becoming much larger than the rest and shows improved generalization in large-scale image classification problems.
Fine-Grained Classification: Fine-Grained Visual Classification (FGVC) has been an active area of interest in the computer vision community. Typical fine-grained problems include differentiating between animal and plant species, or between types of food. 
Since background context can act as a distraction in most cases of FGVC, there has been research in improving the attentional and localization capabilities of CNN-based algorithms. Bilinear pooling [25] is an instrumental method that combines pairwise local features to improve spatial invariance. This has been extended by Kernel Pooling [6], which uses higher-order interactions instead of the originally proposed dot products, and Compact Bilinear Pooling [9], which speeds up the bilinear pooling operation. Another approach to localization is the prediction of an affine transformation of the original image, as proposed by Spatial Transformer Networks [15]. Part-based Region CNNs [35] use region-wise attention to improve local features. Leveraging additional information such as pose and regions has also been explored [3, 46], along with robust image representations such as CNN filter banks [5], VLAD [17] and Fisher vectors [34]. Supplementing training data [21] and model averaging [30] have also yielded significant improvements.
The central theme among current approaches is to increase the diversity of relevant features that are used in classification, either by removing irrelevant information (such as background) through better localization or pooling, by supplementing features with part and pose information, or through more training data. Our method focuses on the classification task after obtaining features (and is hence compatible with existing approaches), by selecting the classifier that assumes the minimum information about the task by the principle of Maximum-Entropy. 
This approach is very useful in the context of fine-grained tasks, especially when fine-tuning from ImageNet CNN models that are already over-parameterized.

3 Method

In the case of Maximum-Entropy fine-tuning, we optimize the following objective:

θ* = arg min_θ Ê_{x∼D} [D_KL(ȳ(x) ‖ p(y|x; θ)) − γ H[p(y|x; θ)]]   (1)

where θ represents the model parameters, and is initialized using a model pretrained on a dataset such as ImageNet [7], and γ is a hyperparameter. The entropy can be understood as a measure of the "peakiness" or "indecisiveness" of the classifier in its prediction for the given input. For instance, if the classifier is strongly confident in its belief of a particular class k, then all the mass will be concentrated at class k, giving us an entropy of 0. Conversely, if a classifier is equally confused between all C classes, we will obtain a value of log(C) for the entropy, which is the maximum value it can take. In problems such as fine-grained classification, where samples that belong to different classes can be visually very similar, it is a reasonable idea to prevent the classifier from being too confident in its outputs (having low entropy), since the classes themselves are so similar.

3.1 Preliminaries

Consider the multi-class classification problem over C classes. The input domain is given by X ⊂ R^Z, with an accompanying probability density px(·) defined over X. The training data is given by N i.i.d. samples D = {x_1, ..., x_N} drawn from X. Each point x ∈ X has an associated label ȳ(x) = [0, ..., 1, ..., 0] ∈ R^C. 
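The objective in Eq. (1) is simply the training cross-entropy minus γ times the entropy of the predicted distribution. A minimal NumPy sketch of this loss, assuming plain arrays of logits and integer labels (the names and values here are illustrative, not the authors' released code):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def max_entropy_loss(logits, labels, gamma=1.0):
    """Cross-entropy from the labels minus gamma * entropy of the predictions.

    For one-hot labels the KL term of Eq. (1) reduces to cross-entropy, and
    subtracting gamma * H[p(y|x)] rewards high-entropy (less peaky) outputs.
    """
    p = softmax(logits)
    n = logits.shape[0]
    cross_entropy = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    return cross_entropy - gamma * entropy

# A confident (peaked) prediction has entropy near 0; an equally confused one
# has entropy log(C), the maximum value, as the text notes.
logits = np.array([[5.0, 0.0, 0.0],    # peaked towards class 0
                   [1.0, 1.0, 1.0]])   # uniform over C = 3 classes
labels = np.array([0, 1])
```

With γ > 0 the objective is strictly smaller than the plain cross-entropy whenever the predictions carry any entropy, which is what pushes the classifier away from over-confident outputs.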
We learn a CNN such that for each point in X, the CNN induces a conditional probability distribution over the C classes whose mode matches the label ȳ(x).
A CNN architecture consists of a series of convolutional and subsampling layers that culminate in an activation Φ(·), which is fed to a C-way classifier with weights w = {w_1, ..., w_C} such that:

p(y_i | x; w, Φ(·)) = exp(w_i^T Φ(x)) / Σ_{j=1}^{C} exp(w_j^T Φ(x))   (2)

During training, we learn the parameters w and the feature extractor Φ(·) (collectively referred to as θ) by minimizing the expected KL (Kullback-Leibler) divergence of the CNN conditional probability distribution from the true label vector over the training set D:

θ* = arg min_θ Ê_{x∼D} [D_KL(ȳ(x) ‖ p(y|x; θ))]   (3)

During fine-tuning, we learn a feature map Φ(·) from a large training set (such as ImageNet), discard the original classifier w (referred to from now on as w_S) and learn new weights w on the smaller dataset (note that the number of classes, and hence the shape of w, may also change for the new task). The entropy of the conditional probability distribution in Equation 2 is given by:

H[p(·|x; θ)] ≜ − Σ_{i=1}^{C} p(y_i|x; θ) log(p(y_i|x; θ))   (4)

To minimize the overall entropy of the classifier over a data distribution x ∼ px(·), we would be interested in the expected value of the entropy over the distribution:

E_{x∼px}[H[p(·|x; θ)]] = ∫_{x∼px} H[p(·|x; θ)] px(x) dx   (5)

Similarly, the empirical average of the conditional entropy over the training set D is:

Ê_{x∼D}[H[p(·|x; θ)]] = (1/N) Σ_{i=1}^{N} H[p(·|x_i; θ)]   (6)

To have high training accuracy, we do not need to learn a model that gives zero cross-entropy loss. Instead, we only require a classifier to output a conditional probability distribution whose arg max coincides with the correct class. Next, we show that for problems with low diversity, higher validation accuracy can be obtained with a higher entropy (and higher training cross-entropy). We now formalize the notion of diversity in feature vectors over a data distribution.

3.2 Diversity and Fine-Grained Visual Classification

We assume the pretrained n-dimensional feature map Φ(·) to be a multivariate mixture of m Gaussians, where m is unknown (and may be very large). Using an overall mean subtraction, we can re-center the Gaussian distribution to be zero-mean. Φ(x) for x ∼ px is then given by:

Φ(x) ∼ Σ_{i=1}^{m} α_i N(μ_i, Σ_i), where x ∼ px, α_i > 0 ∀i and E_{x∼px}[Φ(x)] = 0,   (7)

where the Σ_i are n-dimensional covariance matrices for each class i, and μ_i is the mean feature vector for class i. The zero mean implies that μ̄ = Σ_{i=1}^{m} α_i μ_i = 0. For this distribution, the equivalent covariance matrix can be given by:

Var[Φ(x)] = Σ_{i=1}^{m} α_i Σ_i + Σ_{i=1}^{m} α_i (μ_i − μ̄)(μ_i − μ̄)^T = Σ_{i=1}^{m} α_i (Σ_i + μ_i μ_i^T) ≜ Σ*   (8)

Now, the eigenvalues λ_1, ..., λ_n of the overall covariance matrix Σ* characterize the variance of the distribution across the n dimensions. Since Σ* is positive-definite, all eigenvalues are positive (this can be shown using the fact that each covariance matrix is itself positive-definite, and diag(μ_i μ_i^T)_k = (μ_i^k)^2 ≥ 0 ∀i, k). Thus, to describe the variance of the feature distribution we define Diversity.

Definition 1. Let the data distribution be px over space X, and the feature extractor be given by Φ(·). Then, the Diversity ν of the features is defined as:

ν(Φ, px) ≜ Σ_{i=1}^{n} λ_i, where {λ_1, ..., λ_n} satisfy det(Σ* − λ_i I_n) = 0

This definition of diversity is consistent with multivariate analysis, and is a common measure of the total variance of a data distribution [18]. Now, let p^L_x(·) denote the data distribution under a large-scale image classification task such as ImageNet, and let p^F_x(·) denote the data distribution under a fine-grained image classification task. We can then characterize fine-grained problems as data distributions p^F_x(·) that, for any feature extractor Φ(·), have the property:

ν(Φ, p^F_x) ≪ ν(Φ, p^L_x)   (9)

On plotting pretrained Φ(·) for both the ImageNet validation set and the validation set of CUB-200-2011 (a fine-grained dataset), we see that the CUB-200-2011 features are concentrated with a lower variance compared to the ImageNet training set (see Figure 1b), consistent with Equation 9. In the next section, we describe the connections of Maximum-Entropy with model selection in fine-grained classification.

3.3 Maximum-Entropy and Model Selection

By the Tikhonov regularization of a linear classifier [10], we would want to select w such that Σ_j ‖w_j‖₂² is small (ℓ2 regularization), to get higher generalization performance. This technique is also implemented in neural networks trained using stochastic gradient descent (SGD) by the process of "weight decay". 
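Since the trace of a matrix equals the sum of its eigenvalues, the Diversity ν of Definition 1 can be computed directly as the trace of the feature covariance matrix. A small sketch with synthetic features standing in for pooled CNN activations (the array shapes and scales are illustrative):

```python
import numpy as np

def diversity(features):
    """Diversity nu of Definition 1: the sum of eigenvalues of the covariance
    matrix Sigma* of the (mean-centered) features, i.e. its trace."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    return float(np.trace(cov))  # trace == sum of eigenvalues

rng = np.random.default_rng(0)
# Stand-ins for Eq. (9): fine-grained features are concentrated (low variance),
# large-scale features are spread out (high variance).
fine_grained = 0.1 * rng.standard_normal((1000, 8))
large_scale = 2.0 * rng.standard_normal((1000, 8))
```

On these stand-ins, `diversity(fine_grained)` is far smaller than `diversity(large_scale)`, mirroring the ν(Φ, p^F_x) ≪ ν(Φ, p^L_x) characterization.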
Several recent works on obtaining spectrally-normalized risk bounds for neural networks have demonstrated that the excess risk scales with the Frobenius norm of the weights [31, 2]. Our next result provides some insight into how fine-grained problems can potentially limit model selection, by analysing the best-case generalization gap (the difference between training and expected risk). We use the following result to lower-bound the norm of the weights ‖w‖₂ = √(Σ_{i=1}^{C} ‖w_i‖₂²) in terms of the expected entropy and the feature diversity:

Theorem 1. Let the final layer weights be denoted by w = {w_1, ..., w_C}, the data distribution be px over X, and the feature extractor be given by Φ(·). For the expected conditional entropy, the following holds true:

‖w‖₂ ≥ (log(C) − E_{x∼px}[H[p(·|x; θ)]]) / (2 √ν(Φ, px))

A full proof of Theorem 1 is included in the supplement. Let us consider the case when ν(Φ, px) is large (ImageNet classification). In this case, this lower bound is very weak and inconsequential. However, in the case of small ν(Φ, px) (fine-grained classification), the denominator is small, and this lower bound can subsequently limit the space of model selection, by only allowing models with large values of weights, leading to a larger best-case generalization gap (that is, when Theorem 1 holds with equality). We see that if the numerator is small, the diversity of the features has a smaller impact on limiting the model selection, and hence, it can be advantageous to maximize prediction entropy. 
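To get a feel for how the bound behaves, one can evaluate it numerically. The constants below (C = 200 classes, an expected entropy of 1.0 nat, and two stand-in diversity values) are illustrative, not measurements from the paper:

```python
import math

def weight_norm_lower_bound(num_classes, expected_entropy, nu):
    """Lower bound of Theorem 1: (log C - E[H]) / (2 * sqrt(nu))."""
    return (math.log(num_classes) - expected_entropy) / (2.0 * math.sqrt(nu))

# Low diversity (fine-grained regime) forces a much larger weight norm than
# high diversity (large-scale regime), so the constraint only bites when nu
# is small -- unless the numerator (log C minus the entropy) is also small.
bound_fine = weight_norm_lower_bound(200, 1.0, 0.05)   # nu small
bound_large = weight_norm_lower_bound(200, 1.0, 50.0)  # nu large
```

Raising the prediction entropy shrinks the numerator log(C) − E[H], which relaxes the constraint exactly as the text argues.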
We note that since this is a lower bound, the proof is primarily expository and we can only comment on best-case generalization performance.

More intuitively, however, it can be understood that problems that are fine-grained will often require more information to distinguish between classes, and regularizing the prediction entropy prevents creating models that memorize a lot of information about the training data, and thus can potentially benefit generalization. In this sense, using a Maximum-Entropy objective function is similar to an online calibration of neural network predictions [13], to account for fine-grained problems. Now, Theorem 1 involves the expected conditional entropy over the data distribution. However, during training we only have sample access to the data distribution, which we can use as a surrogate. It is essential to then ensure that the empirical estimate of the conditional entropy (from N training samples) is an accurate estimate of the true expected conditional entropy. The next result ensures that for large N, in a fine-grained classification problem, the sample estimate of the average conditional entropy is close to the expected conditional entropy.

Theorem 2. Let the final layer weights be denoted by w = {w_1, ..., w_C}, the data distribution be px over X, and the feature extractor be given by Φ(·). With probability at least 1 − δ > 1/2 and ‖w‖_∞ = max(‖w_1‖₂, ..., ‖w_C‖₂), we have:

|Ê_D[H[p(·|x; θ)]] − E_{x∼px}[H[p(·|x; θ)]]| ≤ ‖w‖_∞ (√((2/N) ν(Φ, px) log(4/δ)) + Õ(N^{−0.75}))

A full proof of Theorem 2 is included in the supplement. 
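The concentration behaviour in Theorem 2 is easy to see empirically: with a fixed classifier over simulated features, the empirical mean conditional entropy computed from a small sample fluctuates far more around its expectation than one computed from a large sample. A toy simulation (the Gaussian features and random weight matrix are assumptions of this sketch, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((4, 6))  # fixed classifier: 4-dim features, C = 6

def mean_entropy(features):
    """Empirical mean conditional entropy (Eq. 6) under a softmax classifier."""
    z = features @ W
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p = p / p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

# Surrogate for the true expectation: a very large sample.
reference = mean_entropy(rng.standard_normal((200_000, 4)))

# Estimation gaps at small N (averaged over repeats) versus one large N.
gaps_small = [abs(mean_entropy(rng.standard_normal((50, 4))) - reference)
              for _ in range(200)]
gap_large = abs(mean_entropy(rng.standard_normal((50_000, 4))) - reference)
```

The average small-N gap comes out much larger than the large-N gap, consistent with the O(1/√N) leading term of the bound.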
We see that as long as the diversity of features is small, and N is large, our estimate for the entropy will be close to the expected value. Using this result, we can express Theorem 1 in terms of the empirical mean conditional entropy.

Corollary 1. With probability at least 1 − δ > 1/2, the empirical mean conditional entropy follows:

‖w‖₂ ≥ (log(C) − Ê_{x∼D}[H[p(·|x; θ)]]) / ((2 − √((2/N) log(2/δ))) √ν(Φ, px) − Õ(N^{−0.75}))

A full proof of Corollary 1 is included in the supplement. We see that we recover the result from Theorem 1 as N → ∞. Corollary 1 shows that as long as the diversity of features is small, and N is large, the same conclusions drawn from Theorem 1 apply in the case of the empirical mean entropy as well. We will now proceed to describe the results obtained from maximum-entropy fine-grained classification.

4 Experiments

We perform all experiments using the PyTorch [32] framework over a cluster of NVIDIA Titan X GPUs. We now describe our results on benchmark datasets in fine-grained recognition and some ablation studies.

4.1 Fine-Grained Visual Classification

Maximum-Entropy training improves performance across five standard fine-grained datasets, with substantial gains in low-performing models. We obtain state-of-the-art results on all five datasets (Table 1 (A-E)). Since all these datasets are small, we report numbers averaged over 6 trials.

(A) CUB-200-2011 [44]
Prior work: STN [15] 84.10; Zhang et al. [47] 84.50; Lin et al. [24] 85.80; Cui et al. [6] 86.20.
Our results (baseline / MaxEnt (Δ)): GoogLeNet 68.19 / 74.37 (6.18); ResNet-50 75.15 / 80.37 (5.22); VGGNet16 73.28 / 77.02 (3.74); Bilinear CNN [25] 84.10 / 85.27 (1.17); DenseNet-161 84.21 / 86.54 (2.33).

(B) Cars [22]
Prior work: Wang et al. [45] 85.70; Liu et al. [26] 86.80; Lin et al. [24] 92.00; Cui et al. [6] 92.40.
Our results (baseline / MaxEnt (Δ)): GoogLeNet 84.85 / 87.02 (2.17); ResNet-50 91.52 / 93.85 (2.33); VGGNet16 80.60 / 83.88 (3.28); Bilinear CNN [25] 91.20 / 92.81 (1.61); DenseNet-161 91.83 / 93.01 (1.18).

(C) Aircrafts [28]
Prior work: Simon et al. [38] 85.50; Cui et al. [6] 86.90; LRBP [20] 87.30; Lin et al. [24] 88.50.
Our results (baseline / MaxEnt (Δ)): GoogLeNet 74.04 / 79.16 (5.12); ResNet-50 81.19 / 83.86 (2.67); VGGNet16 74.17 / 78.08 (3.91); Bilinear CNN [25] 84.10 / 86.12 (2.02); DenseNet-161 86.30 / 89.76 (3.46).

(D) NABirds [43]
Prior work: Branson et al. [3] 35.70; Van et al. [43] 75.00.
Our results (baseline / MaxEnt (Δ)): GoogLeNet 70.66 / 73.04 (2.38); ResNet-50 63.55 / 69.21 (5.66); VGGNet16 68.34 / 72.62 (4.28); Bilinear CNN [25] 80.90 / 82.66 (1.76); DenseNet-161 79.35 / 83.02 (3.67).

(E) Stanford Dogs [19]
Prior work: Zhang et al. [48] 80.43; Krause et al. [21] 80.60.
Our results (baseline / MaxEnt (Δ)): GoogLeNet 55.76 / 62.01 (6.25); ResNet-50 69.92 / 73.56 (3.64); VGGNet16 61.92 / 65.44 (3.52); Bilinear CNN [25] 82.13 / 83.18 (1.05); DenseNet-161 81.18 / 83.63 (2.45).

Table 1: Maximum-Entropy training (MaxEnt) obtains state-of-the-art performance on five widely-used fine-grained visual classification datasets (A-E). Improvement over the baseline model is reported as (Δ). All results averaged over 6 trials.

Classification Accuracy: First, we observe that Maximum-Entropy training obtains significant performance gains when fine-tuning from models trained on the ImageNet dataset (e.g., GoogLeNet [40], ResNet-50 [14]). For example, on the CUB-200-2011 dataset, standard fine-tuning of GoogLeNet gives an accuracy of 68.19%. Fine-tuning with Maximum-Entropy gives an accuracy of 74.37%, which is a large improvement, and it is persistent across datasets. Since a lot of fine-tuning tasks use general base models such as GoogLeNet and ResNet, this result is relevant to the large number of applications that involve fine-tuning on specialized datasets.
Maximum-Entropy classification also improves prediction performance for CNN architectures specifically designed for fine-grained visual classification. For instance, it improves the performance of the Bilinear CNN [25] on all 5 datasets and obtains state-of-the-art results, to the best of our knowledge. The gains are smaller, since these architectures improve diversity in the features by localization, and hence maximizing entropy is less crucial in this case. However, it is important to note that most pooling architectures [25] use a large model as a base-model (such as VGGNet [39]) and have an expensive pooling operation. 
Thus they are computationally very expensive, and infeasible for tasks that have resource constraints in terms of data and computation time.
Increase in Generality of Features: We hypothesize that Maximum-Entropy training will encourage the classifier to reduce the specificity of the features. To evaluate this hypothesis, we perform the eigendecomposition of the covariance matrix of the pool5 layer features of GoogLeNet trained on CUB-200-2011, and analyze the trend of the sorted eigenvalues (Figure 2a). We examine the features from CNNs with (i) no fine-tuning ("Basic"), (ii) regular fine-tuning, and (iii) fine-tuning with Maximum-Entropy.
For a feature matrix with large covariance between the features of different classes, we would expect the first few eigenvalues to be large, and the rest to diminish quickly, since fewer orthogonal components can summarize the data. Conversely, in a completely uncorrelated feature matrix, we would see a longer tail in the decreasing magnitudes of eigenvalues. Figure 2a shows that for the Basic features (with no fine-tuning), there is a fat tail in both training and test sets due to the presence of a large number of uncorrelated features. After fine-tuning on the training data, we observe a

Method                  CIFAR-10   Δ        CIFAR-100   Δ
GoogLeNet               84.16      -        70.24       -
MaxEnt + GoogLeNet      84.10      (-0.06)  73.50       (3.26)
DenseNet-121            92.19      -        75.01       -
MaxEnt + DenseNet-121   92.22      (0.03)   76.22       (1.21)

Table 2: Maximum Entropy obtains larger gains on the finer CIFAR-100 dataset as compared to CIFAR-10. 
Improvement over the baseline model is reported as (Δ).

Method               Random-ImageNet   Δ       Dogs-ImageNet   Δ
GoogLeNet            71.85             -       62.28           -
MaxEnt + GoogLeNet   72.20             (0.35)  64.91           (2.63)
ResNet-50            82.01             -       73.81           -
MaxEnt + ResNet-50   82.29             (0.28)  75.66           (1.86)

Table 3: Maximum Entropy obtains larger gains on a subset of ImageNet containing dog sub-classes versus a randomly chosen subset of the same size, which has higher visual diversity. Improvement over the baseline model (in cross-validation) is reported as (Δ).

reduction in the tail of the curve, implying that some generality in features has been introduced in the model through the fine-tuning. The test curve follows a similar decrease, justifying the increase in test accuracy. Finally, for Maximum-Entropy, we observe a substantial decrease in the width of the tail of eigenvalue magnitudes, suggesting a larger increase in generality of features in both training and test sets, which confirms our hypothesis.
Effect on Prediction Probabilities: For Maximum-Entropy training, the predicted logit vector is smoother, leading to a higher cross-entropy during both training and validation. We observe that the average value of the probability of the top predicted class decreases significantly with Maximum-Entropy, as predicted by the mathematical formulation (for γ = 1). On the CUB-200-2011 dataset with the GoogLeNet architecture, with Maximum-Entropy, the mean probability of the top class is 0.34, as compared to 0.77 without it. Moreover, the tail of probability values is fatter with Maximum-Entropy, as depicted in Figure 2b.

Figure 2: (a) Maximum-Entropy training encourages the network to reduce the specificity of the features, which is reflected in the longer tail of eigenvalues for the covariance matrix of pool5 GoogLeNet features for both training and test sets of CUB-200-2011. 
We plot the value of log(λ_i) for the i-th eigenvalue λ_i obtained after decomposition of the test set (dashed) and training set (solid) (for γ = 1). (b) For Maximum-Entropy training, the predicted logit vector is smoother with a fatter tail (GoogLeNet on CUB-200-2011).

4.2 Ablation Studies

CIFAR-10 and CIFAR-100: We evaluate Maximum-Entropy on the CIFAR-10 and CIFAR-100 datasets [23]. CIFAR-100 has the same set of images as CIFAR-10 but with finer category distinction in the labels, with each of its 20 "superclasses" containing five finer divisions, for 100 categories in total. Therefore, we expect (and observe) that Maximum-Entropy training provides stronger gains on CIFAR-100 as compared to CIFAR-10 across models (Table 2).

Method                       CUB-200-2011   Cars    Aircrafts   NABirds   Stanford Dogs
VGGNet16       MaxEnt        77.02          83.88   78.08       72.62     65.44
               LSR           70.03          81.45   75.06       69.28     63.06
ResNet-50      MaxEnt        80.37          93.85   83.86       69.21     73.56
               LSR           78.20          92.04   81.26       64.02     70.03
DenseNet-161   MaxEnt        86.54          93.01   89.76       83.02     83.63
               LSR           84.86          91.96   87.05       80.11     82.98

Table 4: Maximum-Entropy training obtains much larger gains on fine-grained visual classification as compared to Label Smoothing Regularization (LSR) [40].

Figure 3: (a) Classification performance is robust to the choice of γ over a large region, as shown here for CUB-200-2011 with the VGGNet-16 and BilinearCNN models. (b) Maximum-Entropy is more robust to increasing amounts of label noise (CUB-200-2011 on GoogLeNet with γ = 1). 
(c) Maximum-Entropy obtains higher validation performance despite a higher training cross-entropy loss.

ImageNet Ablation Experiment: To understand the effect of Maximum-Entropy training on datasets with more samples compared to the small fine-grained datasets, we create two synthetic datasets: (i) Random-ImageNet, which is formed by selecting 116K images from a random subset of 117 classes of ImageNet [7], and (ii) Dogs-ImageNet, which is formed by selecting all classes from ImageNet that have dogs as labels, and which has the same number of images and classes as Random-ImageNet. Dogs-ImageNet has less diversity compared to Random-ImageNet, and thus we expect the gains from Maximum-Entropy to be higher. On a 5-way cross-validation on both datasets, we observe higher gains on the Dogs-ImageNet dataset for two CNN models (Table 3).
Choice of Hyperparameter γ: An integral component of regularization is the choice of the weighting parameter. We find that performance is fairly robust to the choice of γ (Figure 3a). Please see the supplement for experiment-wise details.
Robustness to Label Noise: In this experiment, we gradually introduce label noise by randomly permuting a fraction of labels for increasing fractions of the total data. We follow the same evaluation protocol as in the previous experiment, and observe that Maximum-Entropy is more robust to label noise (Figure 3b).
Training Cross-Entropy and Validation Accuracy: We expect Maximum-Entropy training to provide higher accuracy at the cost of a higher training cross-entropy. In Figure 3c, we show that we achieve a higher validation accuracy when training with Maximum-Entropy despite the training cross-entropy loss converging to a higher value.
Comparison with Label-Smoothing Regularization: Label-Smoothing Regularization [40] penalizes the KL-divergence of the classifier outputs from the uniform distribution, and is also a method to prevent peaky distributions. 
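The two regularizers can be compared directly on a toy prediction. In the sketch below (an assumed NumPy setting; `eps` and `gamma` are illustrative values), LSR mixes the one-hot target with a uniform distribution before taking the cross-entropy, while the Maximum-Entropy objective subtracts an entropy bonus from the plain cross-entropy:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lsr_loss(logits, label, eps=0.1):
    """Cross-entropy against a label-smoothed target: (1 - eps) on the true
    class, eps spread uniformly over all classes."""
    p = softmax(logits)
    target = np.full(len(p), eps / len(p))
    target[label] += 1.0 - eps
    return float(-(target * np.log(p + 1e-12)).sum())

def maxent_loss(logits, label, gamma=1.0):
    """Plain cross-entropy minus gamma times the prediction entropy (Eq. 1)."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum()
    return float(-np.log(p[label] + 1e-12) - gamma * entropy)

# An over-confident prediction: both regularizers react to its peakiness,
# LSR by inflating the loss, MaxEnt by rewarding whatever entropy remains.
logits = np.array([8.0, 1.0, 0.5, 0.2])
```

Both reduce to the plain cross-entropy as eps → 0 or γ → 0; they differ in that LSR fixes the target mixture, while MaxEnt directly rewards the entropy of the model's own prediction.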
On comparing performance with Label-Smoothing Regularization, we found that Maximum-Entropy provides much larger gains on fine-grained recognition (see Table 4).

5 Discussion and Conclusion

Many real-world applications of computer vision models involve extensive fine-tuning on small, relatively imbalanced datasets with much smaller diversity in the training set compared to the large-scale models they are fine-tuned from; a notable example is fine-grained recognition. In this domain, Maximum-Entropy training provides an easy-to-implement and simple-to-understand training schedule that consistently improves performance. There are several extensions, however, that can be explored: explicitly enforcing a large diversity in the features through a different regularizer might be an interesting extension to this study, as well as potential extensions to large-scale problems by tackling clusters of diverse objects separately. We leave these as future work, with our results as a starting point.

Acknowledgements: We thank Ryan Farrell, Pei Guo, Xavier Boix, Dhaval Adjodah, Spandan Madan, and Ishaan Grover for their feedback on the project, and Google’s TensorFlow Research Cloud Program for providing TPU computing resources.

References

[1] Sumiyoshi Abe and Yuko Okamoto. Nonextensive statistical mechanics and its applications, volume 560. Springer Science & Business Media, 2001.

[2] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks.
In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.

[3] Steve Branson, Grant Van Horn, Serge Belongie, and Pietro Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.

[4] Yihua Chen, Eric K Garcia, Maya R Gupta, Ali Rahimi, and Luca Cazzanti. Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research, 10(Mar):747–776, 2009.

[5] Mircea Cimpoi, Subhransu Maji, and Andrea Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828–3836, 2015.

[6] Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, and Serge Belongie. Kernel pooling for convolutional neural networks. IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[8] Mario A. T. Figueiredo and Anil K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on pattern analysis and machine intelligence, 24(3):381–396, 2002.

[9] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 317–326, 2016.

[10] Gene H Golub, Per Christian Hansen, and Dianne P O’Leary. Tikhonov regularization and total least squares. SIAM Journal on Matrix Analysis and Applications, 21(1):185–194, 1999.

[11] Yves Grandvalet and Yoshua Bengio. Entropy regularization. In Semi-Supervised Learning. MIT Press, 2006.

[12] Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pages 53–74. Springer, 1988.

[13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks.
On calibration of modern neural networks.\n\narXiv preprint arXiv:1706.04599, 2017.\n\n[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778,\n2016.\n\n[15] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer\n\nnetworks. In Advances in Neural Information Processing Systems, pages 2017\u20132025, 2015.\n\n[16] Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.\n[17] Herve Jegou, Florent Perronnin, Matthijs Douze, Jorge S\u00e1nchez, Patrick Perez, and Cordelia Schmid.\nAggregating local image descriptors into compact codes. IEEE transactions on pattern analysis and\nmachine intelligence, 34(9):1704\u20131716, 2012.\n\n[18] Dag Jonsson. Some limit theorems for the eigenvalues of a sample covariance matrix. Journal of\n\nMultivariate Analysis, 12(1):1\u201338, 1982.\n\n[19] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for \ufb01ne-grained\n\nimage categorization: Stanford dogs.\n\n[20] Shu Kong and Charless Fowlkes. Low-rank bilinear pooling for \ufb01ne-grained classi\ufb01cation. IEEE Confer-\n\nence on Computer Vision and Pattern Recognition, pages 7025\u20137034, 2017.\n\n[21] Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James\nPhilbin, and Li Fei-Fei. The unreasonable effectiveness of noisy data for \ufb01ne-grained recognition. In\nEuropean Conference on Computer Vision, pages 301\u2013320. Springer, 2016.\n\n9\n\n\f[22] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for \ufb01ne-grained\ncategorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops,\npages 554\u2013561, 2013.\n\n[23] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. 
The cifar-10 dataset, 2014.

[24] Tsung-Yu Lin and Subhransu Maji. Improved bilinear pooling with cnns. arXiv preprint arXiv:1707.06772, 2017.

[25] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.

[26] Maolin Liu, Chengyue Yu, Hefei Ling, and Jie Lei. Hierarchical joint cnn-based models for fine-grained cars recognition. In International Conference on Cloud Computing and Security, pages 337–347. Springer, 2016.

[27] Yuping Luo, Chung-Cheng Chiu, Navdeep Jaitly, and Ilya Sutskever. Learning online alignments with continuous rewards policy gradient. arXiv preprint arXiv:1608.01281, 2016.

[28] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

[29] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[30] Mohammad Moghimi, Mohammad Saberian, Jian Yang, Li-Jia Li, Nuno Vasconcelos, and Serge Belongie. Boosted convolutional neural networks. In British Machine Vision Conference (BMVC), York, UK, 2016.

[31] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

[32] Adam Paszke and Soumith Chintala. Tensors and Dynamic neural networks in Python with strong GPU acceleration. https://github.com/pytorch. Accessed: [January 1, 2017].

[33] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton.
Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

[34] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the fisher kernel for large-scale image classification. Computer Vision–ECCV 2010, pages 143–156, 2010.

[35] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[36] Kenneth Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, 1998.

[37] John Shawe-Taylor and David Hardoon. Pac-bayes analysis of maximum entropy classification. In Artificial Intelligence and Statistics, pages 480–487, 2009.

[38] Marcel Simon, Erik Rodner, Yang Gao, Trevor Darrell, and Joachim Denzler. Generalized orderless pooling performs implicit salient matching. arXiv preprint arXiv:1705.00487, 2017.

[39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[40] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[41] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[42] Martin Szummer and Tommi Jaakkola. Partially labeled classification with markov random walks.
In\n\nAdvances in neural information processing systems, pages 945\u2013952, 2002.\n\n[43] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona,\nand Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The\n\ufb01ne print in \ufb01ne-grained dataset collection. In Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, pages 595\u2013604, 2015.\n\n[44] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd\n\nbirds-200-2011 dataset. 2011.\n\n[45] Yaming Wang, Jonghyun Choi, Vlad Morariu, and Larry S. Davis. Mining discriminative triplets of patches\nfor \ufb01ne-grained classi\ufb01cation. In The IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), June 2016.\n\n10\n\n\f[46] Ning Zhang, Ryan Farrell, and Trever Darrell. Pose pooling kernels for sub-category recognition. In\nComputer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3665\u20133672. IEEE,\n2012.\n\n[47] Xiaopeng Zhang, Hongkai Xiong, Wengang Zhou, Weiyao Lin, and Qi Tian. Picking deep \ufb01lter responses\nfor \ufb01ne-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition, pages 1134\u20131142, 2016.\n\n[48] Yu Zhang, Xiu-Shen Wei, Jianxin Wu, Jianfei Cai, Jiangbo Lu, Viet-Anh Nguyen, and Minh N Do. Weakly\nsupervised \ufb01ne-grained categorization with part-based image representation. IEEE Transactions on Image\nProcessing, 25(4):1713\u20131725, 2016.\n\n[49] Jun Zhu and Eric P Xing. Maximum entropy discrimination markov networks. 
Journal of Machine Learning Research, 10(Nov):2531–2569, 2009.