{"title": "Encoding High Dimensional Local Features by Sparse Coding Based Fisher Vectors", "book": "Advances in Neural Information Processing Systems", "page_first": 1143, "page_last": 1151, "abstract": "Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for image classification. Most, if not all, FVC implementations employ the Gaussian mixture model (GMM) to characterize the generation process of local features. This choice has shown to be sufficient for traditional low dimensional local features, e.g., SIFT; and typically, good performance can be achieved with only a few hundred Gaussian distributions. However, the same number of Gaussians is insufficient to model the feature space spanned by higher dimensional local features, which have become popular recently. In order to improve the modeling capacity for high dimensional features, it turns out to be inefficient and computationally impractical to simply increase the number of Gaussians. In this paper, we propose a model in which each local feature is drawn from a Gaussian distribution whose mean vector is sampled from a subspace. With certain approximation, this model can be converted to a sparse coding procedure and the learning/inference problems can be readily solved by standard sparse coding methods. By calculating the gradient vector of the proposed model, we derive a new fisher vector encoding strategy, termed Sparse Coding based Fisher Vector Coding (SCFVC). Moreover, we adopt the recently developed Deep Convolutional Neural Network (CNN) descriptor as a high dimensional local feature and implement image classification with the proposed SCFVC. 
Our experimental evaluations demonstrate that our method not only significantly outperforms the traditional GMM based Fisher vector encoding but also achieves the state-of-the-art performance in generic object recognition, indoor scene, and fine-grained image classification problems.", "full_text": "Encoding High Dimensional Local Features by Sparse\n\nCoding Based Fisher Vectors\n\nLingqiao Liu1, Chunhua Shen1,2, Lei Wang3, Anton van den Hengel1,2, Chao Wang3\n\n3 School of Computer Science and Software Engineering, University of Wollongong, Australia\n\n1 School of Computer Science, University of Adelaide, Australia\n\n2 ARC Centre of Excellence for Robotic Vision\n\nAbstract\n\nDeriving from the gradient vector of a generative model of local features, Fisher\nvector coding (FVC) has been identi\ufb01ed as an effective coding method for im-\nage classi\ufb01cation. Most, if not all, FVC implementations employ the Gaussian\nmixture model (GMM) to characterize the generation process of local features.\nThis choice has shown to be suf\ufb01cient for traditional low dimensional local fea-\ntures, e.g., SIFT; and typically, good performance can be achieved with only a few\nhundred Gaussian distributions. However, the same number of Gaussians is insuf-\n\ufb01cient to model the feature space spanned by higher dimensional local features,\nwhich have become popular recently. In order to improve the modeling capacity\nfor high dimensional features, it turns out to be inef\ufb01cient and computationally\nimpractical to simply increase the number of Gaussians.\nIn this paper, we propose a model in which each local feature is drawn from a\nGaussian distribution whose mean vector is sampled from a subspace. With cer-\ntain approximation, this model can be converted to a sparse coding procedure and\nthe learning/inference problems can be readily solved by standard sparse coding\nmethods. 
By calculating the gradient vector of the proposed model, we derive a new Fisher vector encoding strategy, termed Sparse Coding based Fisher Vector Coding (SCFVC). Moreover, we adopt the recently developed Deep Convolutional Neural Network (CNN) descriptor as a high dimensional local feature and implement image classification with the proposed SCFVC. Our experimental evaluations demonstrate that our method not only significantly outperforms the traditional GMM based Fisher vector encoding but also achieves state-of-the-art performance in generic object recognition, indoor scene, and fine-grained image classification problems.\n\n1 Introduction\n\nFisher vector coding is a coding method derived from the Fisher kernel [1], which was originally proposed to compare two samples induced by a generative model. Since its introduction to computer vision [2], many improvements and variants have been proposed. For example, in [3] the normalization of Fisher vectors is identified as an essential step to achieve good performance; in [4] the spatial information of local features is incorporated; in [5] the model parameters are learned through an end-to-end supervised training algorithm; and in [6] multiple layers of Fisher vector coding modules are stacked into a deep architecture. With these extensions, Fisher vector coding has been established as the state-of-the-art image classification approach.\nAlmost all of these methods share one common component: they all employ the Gaussian mixture model (GMM) as the generative model for local features. This choice has proved effective in modeling standard local features such as SIFT, which are often of low dimension. Usually, using a mixture of a few hundred Gaussians has been sufficient to guarantee good performance. 
Generally speaking, the distribution of local features can only be well captured by a Gaussian distribution within a local region due to the variety of local feature appearances, and thus the number of Gaussian mixtures needed is essentially determined by the volume of the feature space of local features.\nRecently, the choice of local features has gone beyond the traditional local patch descriptors such as SIFT or SURF [7], and higher dimensional local features such as the activation of a pre-trained deep neural network [8] or pooled coding vectors from a local region [9, 10] have demonstrated promising performance. The higher dimensionality and rich visual content captured by those features make the volume of their feature space much larger than that of traditional local features. Consequently, a much larger number of Gaussian mixtures will be needed in order to model the feature space accurately. However, this would lead to an explosion in the dimensionality of the resulting image representation and thus is usually computationally impractical.\nTo alleviate this difficulty, here we propose an alternative solution. We model the generation process of local features as randomly drawing features from a Gaussian distribution whose mean vector is randomly drawn from a subspace. With certain approximation, we convert this model to a sparse coding model and leverage an off-the-shelf solver to solve the learning and inference problems. With further derivation, this model leads to a new Fisher vector coding algorithm called Sparse Coding based Fisher Vector Coding (SCFVC). 
Moreover, we adopt the recently developed Deep Convolutional Neural Network to generate regional local features and apply the proposed SCFVC to these local features to build an image classification system.\nTo demonstrate its effectiveness in encoding high dimensional local features, we conduct a series of experiments on generic object, indoor scene and fine-grained image classification datasets. It is shown that our method not only significantly outperforms the traditional GMM based Fisher vector coding in encoding high dimensional local features but also achieves state-of-the-art performance in these image classification problems.\n\n2 Fisher vector coding\n\n2.1 General formulation\n\nGiven two samples generated from a generative model, their similarity can be evaluated by using a Fisher kernel [1]. The sample can take any form, including a vector or a vector set, as long as its generation process can be modeled. For a Fisher vector based image classification approach, the sample is a set of local features extracted from an image, which we denote as X = {x_1, x_2, \u00b7\u00b7\u00b7 , x_T}. Assuming each x_i is modeled by a p.d.f. P(x|\u03bb) and is drawn i.i.d., in the Fisher kernel a sample X can be described by the gradient vector over the model parameter \u03bb:\n\nG^X_\u03bb = \u2207_\u03bb log P(X|\u03bb) = \u2211_i \u2207_\u03bb log P(x_i|\u03bb).    (1)\n\nThe Fisher kernel is then defined as K(X, Y) = (G^X_\u03bb)^T F^{\u22121} G^Y_\u03bb, where F is the information matrix and is defined as F = E[G^X_\u03bb (G^X_\u03bb)^T]. In practice, the role of the information matrix is less significant and is often omitted for computational simplicity [3]. As a result, two samples can be directly compared by the linear kernel of their corresponding gradient vectors, which are often called Fisher vectors. 
From a bag-of-features model perspective, the evaluation of the Fisher kernel for two images can be seen as first calculating the gradient or Fisher vector of each local feature and then performing sum-pooling. In this sense, the Fisher vector calculation for each local feature can be seen as a coding method and we call it Fisher vector coding in this paper.\n\n2.2 GMM based Fisher vector coding and its limitation\n\nTo implement the Fisher vector coding framework introduced above, one needs to specify the distribution P(x|\u03bb). In the literature, most, if not all, works choose a GMM to model the generation process of x, which can be described as follows:\n\n\u2022 Draw a Gaussian model N(\u00b5_k, \u03a3_k) from the prior distribution P(k), k = 1, 2, \u00b7\u00b7\u00b7 , m.\n\u2022 Draw a local feature x from N(\u00b5_k, \u03a3_k).\n\nGenerally speaking, the distribution of x resembles a Gaussian distribution only within a local region of feature space. Thus, for a GMM, each Gaussian essentially models a small partition of the feature space and many of them are needed to depict the whole feature space. Consequently, the number of mixtures needed will be determined by the volume of the feature space. For the commonly used low dimensional local features, such as SIFT, it has been shown that it is sufficient to set the number of mixtures to a few hundred. However, for higher dimensional local features this number may be insufficient. This is because the volume of feature space usually increases quickly with the feature dimensionality. Consequently, the same number of mixtures will result in a coarser partition resolution and imprecise modeling.\nTo increase the partition resolution for higher dimensional feature space, one straightforward solution is to increase the number of Gaussians. 
However, it turns out that the partition resolution increases slowly (compared to our method, which will be introduced in the next section) with the number of mixtures. In other words, a much larger number of Gaussians will be needed and this will result in a Fisher vector whose dimensionality is too high to be handled in practice.\n\n3 Our method\n\n3.1 Infinite number of Gaussians mixture\n\nOur solution to this issue is to go beyond a fixed number of Gaussian distributions and use an infinite number of them. More specifically, we assume that a local feature is drawn from a Gaussian distribution with a randomly generated mean vector. The mean vector is a point on a subspace spanned by a set of bases (which can be complete or over-complete) and is indexed by a latent coding vector u. The detailed generation process is as follows:\n\n\u2022 Draw a coding vector u from a zero mean Laplacian distribution P(u) = (1/(2\u03bb)) exp(\u2212|u|/\u03bb).\n\u2022 Draw a local feature x from the Gaussian distribution N(Bu, \u03a3),\n\nwhere the Laplace prior for P(u) ensures the sparsity of the resulting Fisher vector, which can be helpful for coding. Essentially, the above process resembles a sparse coding model. To show this relationship, let\u2019s first write the marginal distribution of x:\n\nP(x) = \u222b_u P(x, u|B) du = \u222b_u P(x|u, B) P(u) du.    (2)\n\nThe above formulation involves an integral operator, which makes the likelihood evaluation difficult. To simplify the calculation, we use the point-wise maximum within the integral term to approximate the likelihood, that is,\n\nP(x) \u2248 P(x|u^\u2217, B) P(u^\u2217),  u^\u2217 = argmax_u P(x|u, B) P(u).    (3)\n\nBy assuming that \u03a3 = diag(\u03c3_1^2, \u00b7\u00b7\u00b7 , \u03c3_m^2) and setting \u03c3_1^2 = \u00b7\u00b7\u00b7 = \u03c3_m^2 = \u03c3^2 as a constant, the logarithm of P(x) is written as\n\nlog(P(x|B)) = min_u (1/\u03c3^2) \u2016x \u2212 Bu\u2016_2^2 + \u03bb\u2016u\u2016_1,    (4)\n\nwhich is exactly the objective value of a sparse coding problem. This relationship suggests that we can learn the model parameter B and infer the latent variable u by using off-the-shelf sparse coding solvers.\nOne question about the above method is how much improvement in partition resolution it achieves compared to simply increasing the number of models in a traditional GMM. To answer this question, we designed an experiment to compare these two schemes. In our experiment, the partition resolution is roughly measured by the average distance (denoted as d) between a feature and its closest mean vector in the GMM or the above model. The larger d is, the lower the partition resolution is. The comparison is shown in Figure 1. In Figure 1 (a), we increase the dimensionality of local features (footnote 1) and for each dimension we calculate d in a GMM model with 100 mixtures. As seen, d increases quickly with the feature dimensionality. In Figure 1 (b), we try to reduce d by introducing more mixture distributions in the GMM model. However, as can be seen, d drops slowly with the increase in the number of Gaussians. In contrast, with the proposed method, we can achieve much lower d by using only 100 bases. This result illustrates the advantage of our method.\n\nFigure 1: Comparison of two ways to increase the partition resolution. (a) For GMM, d (the average distance between a local feature and its closest mean vector) increases with the local feature dimensionality. Here the GMM is fixed at 100 Gaussians. (b) d is reduced in two ways: (1) simply increasing the number of Gaussian distributions in the mixture; (2) using the proposed generation process. 
As can be seen, the latter achieves much lower d even with a small number of bases.\n\n3.2 Sparse coding based Fisher vector coding\n\nOnce the generative model of local features is established, we can readily derive the corresponding Fisher coding vector by differentiating its log likelihood, that is,\n\nC(x) = \u2202 log(P(x|B)) / \u2202B = \u2202[(1/\u03c3^2) \u2016x \u2212 Bu^\u2217\u2016_2^2 + \u03bb\u2016u^\u2217\u2016_1] / \u2202B,  u^\u2217 = argmax_u P(x|u, B) P(u).    (5)\n\nNote that the differentiation involves u^\u2217, which implicitly interacts with B. To calculate this term, we notice that the sparse coding problem can be reformulated as a general quadratic programming problem by defining u^+ and u^\u2212 as the positive and negative parts of u, that is, the sparse coding problem can be rewritten as\n\nmin_{u^+, u^\u2212} \u2016x \u2212 B(u^+ \u2212 u^\u2212)\u2016_2^2 + \u03bb 1^T (u^+ + u^\u2212)  s.t. u^+ \u2265 0, u^\u2212 \u2265 0.    (6)\n\nBy further defining u\u2032 = (u^+, u^\u2212)^T, log(P(x|B)) can be expressed in the following general form,\n\nlog(P(x|B)) = L(B) = max_{u\u2032} u\u2032^T v(B) \u2212 (1/2) u\u2032^T P(B) u\u2032,    (7)\n\nwhere P(B) and v(B) are a matrix term and a vector term depending on B, respectively. The derivative of L(B) has been studied in [11]. According to Lemma 2 in [11], we can differentiate L(B) with respect to B as if u\u2032 did not depend on B. In other words, we can first calculate u\u2032 or equivalently u^\u2217 by solving the sparse coding problem and then obtain the Fisher vector \u2202 log(P(x|B)) / \u2202B as\n\n\u2202[(1/\u03c3^2) \u2016x \u2212 Bu^\u2217\u2016_2^2 + \u03bb\u2016u^\u2217\u2016_1] / \u2202B = (x \u2212 Bu^\u2217) u^{\u2217T}.    (8)\n\nFootnote 1: This is achieved by performing PCA on a 4096-dimensional CNN regional descriptor. For more details about the feature we used, please refer to Section 3.4.\n\n[Figure 1 plots: (a) d versus dimensionality of regional local features, for a GMM with 100 mixtures; (b) d versus number of Gaussian mixtures, for the GMM and the proposed model (with 100 bases).]\n\nTable 1: Comparison of results on Pascal VOC 2007. The lower part of this table lists some results reported in the literature. We only report the mean average precision over 20 classes. The average precision for each class is listed in Table 2.\n\nMethods | mean average precision | Comments\nSCFVC (proposed) | 76.9% | single scale, no augmented data\nGMMFVC | 73.8% | single scale, no augmented data\nCNNaug-SVM [8] | 77.2% | with augmented data, use CNN for whole image\nCNN-SVM [8] | 73.9% | no augmented data, use CNN for whole image\nNUS [13] | 70.5% | -\nGHM [14] | 64.7% | -\nAGS [15] | 71.1% | -\n\nNote that the Fisher vector expressed in Eq. (8) has an interesting form: it is simply the outer product between the sparse coding vector u^\u2217 and the reconstruction residual term (x \u2212 Bu^\u2217). In traditional sparse coding, only the kth dimension of a coding vector, u_k, is used to indicate the relationship between a local feature x and the kth basis. Here in the sparse coding based Fisher vector, the coding value u_k multiplied by the reconstruction residual is used to capture their relationship.\n\n3.3 Pooling and normalization\n\nFrom the i.i.d. assumption in Eq. (1), the Fisher vector of the whole image is (footnote 2)\n\n\u2202 log(P(I|B)) / \u2202B = \u2211_{x_i\u2208I} \u2202 log(P(x_i|B)) / \u2202B = \u2211_{x_i\u2208I} (x_i \u2212 Bu_i^\u2217) u_i^{\u2217T}.    (9)\n\nThis is equivalent to performing the sum-pooling for the extracted Fisher coding vectors. 
However, it has been observed [3, 12] that the image signature obtained using sum-pooling tends to over-emphasize the information from the background [3] or bursting visual words [12]. It is important to apply normalization when sum-pooling is used. In this paper, we apply intra-normalization [12] to normalize the pooled Fisher vectors. More specifically, we apply l2 normalization to the subvectors \u2211_{x_i\u2208I} (x_i \u2212 Bu_i^\u2217) u_{i,k}^\u2217 for all k, where k indicates the kth dimension of the sparse coding vector u_i^\u2217. Besides intra-normalization, we also utilize the power normalization as suggested in [3].\n\n3.4 Deep CNN based regional local features\n\nRecently, the middle-layer activation of a pre-trained deep CNN has been demonstrated to be a powerful image descriptor [8, 16]. In this paper, we employ this descriptor to generate a number of local features for an image. More specifically, an input image is first resized to 512\u00d7512 pixels and regions with the size of 227\u00d7227 pixels are cropped with a stride of 8 pixels. These regions are subsequently fed into the deep CNN and the activation of the sixth layer is extracted as the local feature for each region. In our implementation, we use the Caffe [17] package, which provides a deep CNN pre-trained on the ILSVRC2012 dataset and whose sixth-layer output is a 4096-dimensional vector. This strategy has recently demonstrated better performance than directly using deep CNN features for the whole image [16].\nOnce regional local features are extracted, we encode them using the proposed SCFVC method and generate an image level representation by sum-pooling and normalization. Certainly, our method is open to the choice of other high-dimensional local features. 
The reason for choosing deep CNN features in this paper is that by doing so we can demonstrate state-of-the-art image classification performance.\n\nFootnote 2: The vectorized form of \u2202 log(P(I|B)) / \u2202B is used as the image representation.\n\nTable 2: Comparison of results on Pascal VOC 2007 for each of 20 classes. Besides the proposed SCFVC and the GMMFVC baseline, the performance obtained by directly using CNN as global feature is also compared.\n\nMethod | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow\nSCFVC | 89.5 | 84.1 | 83.7 | 83.7 | 43.9 | 76.7 | 87.8 | 82.5 | 60.6 | 69.6\nGMMFVC | 87.1 | 80.6 | 80.3 | 79.7 | 42.8 | 72.2 | 87.4 | 76.1 | 58.6 | 64.0\nCNN-SVM | 88.5 | 81.0 | 83.5 | 82.0 | 42.0 | 72.5 | 85.3 | 81.6 | 59.9 | 58.5\n\nMethod | table | dog | horse | mbike | person | plant | sheep | sofa | train | TV\nSCFVC | 72.0 | 77.1 | 88.7 | 82.1 | 94.4 | 56.8 | 71.4 | 67.7 | 90.9 | 75.0\nGMMFVC | 66.9 | 75.1 | 84.9 | 81.2 | 93.1 | 53.1 | 70.8 | 66.2 | 87.9 | 71.3\nCNN-SVM | 66.5 | 77.8 | 81.8 | 78.8 | 90.2 | 54.8 | 71.1 | 62.6 | 87.2 | 71.8\n\nTable 3: Comparison of results on MIT-67. The lower part of this table lists some results reported in the literature.\n\nMethods | Classification Accuracy | Comments\nSCFVC (proposed) | 68.2% | with single scale\nGMMFVC | 64.3% | with single scale\nMOP-CNN [16] | 68.9% | with three scales\nVLAD level2 [16] | 65.5% | with single best scale\nCNN-SVM [8] | 58.4% | use CNN for whole image\nFV+Bag of parts [19] | 63.2% | -\nDPM [20] | 37.6% | -\n\n4 Experimental results\n\nWe conduct experimental evaluation of the proposed sparse coding based Fisher vector coding (SCFVC) on three large datasets: Pascal VOC 2007, MIT indoor scene-67 and Caltech-UCSD Birds-200-2011. These are commonly used evaluation benchmarks for generic object classification, scene classification and fine-grained image classification, respectively. 
The focus of these experiments is to examine whether the proposed SCFVC outperforms the traditional Fisher vector coding in encoding high dimensional local features.\n\n4.1 Experiment setup\n\nIn our experiments, the activations of the sixth layer of a pre-trained deep CNN are used as regional local features. PCA is applied to further reduce the regional local features from 4096 dimensions to 2000 dimensions. The number of Gaussian distributions and the codebook size for sparse coding are set to 100 throughout our experiments unless otherwise mentioned.\nFor the sparse coding, we use the algorithm in [18] to learn the codebook and perform the coding vector inference. For all experiments, a linear SVM is used as the classifier.\n\n4.2 Main results\n\nPascal-07 Pascal VOC 2007 contains 9963 images with 20 object categories, which form 20 binary (object vs. non-object) classification tasks. The use of deep CNN features has demonstrated state-of-the-art performance [8] on this dataset. In contrast to [8], here we use the deep CNN features as local features to model a set of image regions rather than as a global feature to model the whole image. The results of the proposed SCFVC and traditional Fisher vector coding, denoted as GMMFVC, are shown in Table 1 and Table 2. As can be seen from Table 1, the proposed SCFVC leads to superior performance over the traditional GMMFVC and outperforms it by about 3 percentage points (76.9% vs. 73.8%). By cross-referencing Table 2, it is clear that the proposed SCFVC outperforms GMMFVC in all 20 categories. Also, we notice that GMMFVC is merely comparable to the performance of directly using deep CNN as global features, namely, CNN-SVM in Table 1. Since both the proposed SCFVC and GMMFVC adopt deep CNN features as local features, this observation suggests that the advantage of using deep CNN features as local features can only be clearly demonstrated when an appropriate coding method, i.e., the proposed SCFVC, is employed. Note that to further boost the performance, one can adopt some additional approaches like introducing augmented data or combining multiple scales. Some of the methods compared in Table 1 have employed these approaches, and we have noted this in the Comments column to inform readers whether these methods are directly comparable to the proposed SCFVC. We do not pursue these approaches in this paper since the focus of our experiment is to compare the proposed SCFVC against traditional GMMFVC.\n\nTable 4: Comparison of results on Birds-200 2011. The lower part of this table lists some results reported in the literature.\n\nMethods | Classification Accuracy | Comments\nSCFVC (proposed) | 66.4% | with single scale\nGMMFVC | 61.7% | with single scale\nCNNaug-SVM [8] | 61.8% | with augmented data, use CNN for the whole image\nCNN-SVM [8] | 53.3% | no augmented data, use CNN as global features\nDPD+CNN+LogReg [21] | 65.0% | use part information\nDPD [22] | 51.0% | -\n\nFigure 2: The performance comparison of classification accuracy vs. local feature dimensionality for the proposed SCFVC and GMMFVC on MIT-67.\n\nMIT-67 MIT-67 contains 6700 images with 67 indoor scene categories. This dataset is quite challenging because the differences between some categories are very subtle. The comparison of classification results is shown in Table 3. Again, we observe that the proposed SCFVC significantly outperforms traditional GMMFVC. To the best of our knowledge, the best performance on this dataset is achieved in [16] by concatenating the features extracted from three different scales. The proposed method achieves comparable performance (68.2% vs. 68.9%) using only a single scale. We also tried to concatenate the image representation generated from the proposed SCFVC with the global deep CNN feature. 
The resulting performance can be as high as 70%, which is so far the best performance achieved on this dataset.\nBirds-200-2011 Birds-200-2011 contains 11788 images of 200 different bird species and is a commonly used benchmark for fine-grained image classification. The experimental results on this dataset are shown in Table 4. The advantage of SCFVC over GMMFVC is more pronounced on this dataset: SCFVC outperforms GMMFVC by over 4%. We also make two interesting observations: (1) GMMFVC even achieves comparable performance to the method of using the global deep CNN feature with augmented data, namely, CNNaug-SVM in Table 4. (2) Although we do not use any parts information (of birds), our method outperforms the result using parts information (DPD+CNN+LogReg in Table 4). These two observations suggest that using deep CNN features as local features is better for fine-grained problems and the proposed method can further boost its advantage.\n\n[Figure 2 plot: classification accuracy (%) versus dimensionality of regional local features, for SCFV and GMMFV.]\n\nTable 5: Comparison of results on MIT-67 with three different settings: (1) 100-basis codebook with 1000 dimensional local features, denoted as SCFV-100-1000D; (2) 400 Gaussian mixtures with 300 dimensional local features, denoted as GMMFV-400-300D; (3) 1000 Gaussian mixtures with 100 dimensional local features, denoted as GMMFV-1000-100D. They have the same/similar total image representation dimensionality.\n\nSCFV-100-1000D | GMMFV-400-300D | GMMFV-1000-100D\n68.1% | 64.0% | 60.8%\n\n4.3 Discussion\n\nIn the above experiments, the dimensionality of local features is fixed to 2000. But how does the proposed SCFVC compare with the traditional GMMFVC on lower dimensional features? 
To investigate this issue, we vary the dimensionality of the deep CNN features from 100 to 2000 and compare the performance of the two Fisher vector coding methods on MIT-67. The results are shown in Figure 2. As can be seen, for lower dimensionality (like 100), the two methods achieve comparable performance and in general both methods benefit from using higher dimensional features. However, for traditional GMMFVC, the performance gain obtained from increasing feature dimensionality is lower than that obtained by the proposed SCFVC. For example, from 100 to 1000 dimensions, the traditional GMMFVC only obtains a 4% performance improvement while our SCFVC achieves a 7% performance gain. This validates our argument that the proposed SCFVC is especially suited for encoding high dimensional local features.\nSince GMMFVC works well for lower dimensional features, how about reducing the higher dimensional local features to lower dimensions and using more Gaussian mixtures? Will it be able to achieve comparable performance to our SCFVC, which uses higher dimensional local features but a smaller number of bases? To investigate this issue, we also evaluate the classification performance on MIT-67 using 400 Gaussian mixtures with 300-dimensional local features and 1000 Gaussian mixtures with 100-dimensional local features. Thus the total dimensionality of these two image representations will be similar to that of our SCFVC, which uses 100 bases and 1000-dimensional local features. The comparison is shown in Table 5. As can be seen, the performance of these two settings is much inferior to the proposed one. This suggests that some discriminative information may have already been lost after the PCA dimensionality reduction and the discriminative power cannot be re-boosted by simply introducing more Gaussian distributions. 
This verifies the necessity of using high dimensional local features and justifies the value of the proposed method.\nIn general, the inference step in sparse coding can be slower than the membership assignment in the GMM model. However, the computational efficiency can be greatly improved by using an approximate sparse coding algorithm such as learned FISTA [23] or orthogonal matching pursuit [10]. Also, the proposed method can be easily generalized to several similar coding models, such as local linear coding [24]. In that case, the computational efficiency is almost identical (or even faster if approximated k-nearest neighbor algorithms are used) to the traditional GMMFVC.\n\n5 Conclusion\n\nIn this work, we study the use of Fisher vector coding to encode high-dimensional local features. Our main discovery is that traditional GMM based Fisher vector coding is not particularly well suited to modeling high-dimensional local features. As an alternative, we propose to use a generation process which allows the mean vector of a Gaussian distribution to be chosen from a point in a subspace. This model leads to a new Fisher vector coding method which is based on the sparse coding model. Combining it with the activation of the middle layer of a pre-trained CNN as high-dimensional local features, we build an image classification system and experimentally demonstrate that the proposed coding method is superior to the traditional GMM in encoding high-dimensional local features and can achieve state-of-the-art performance in three image classification problems.\nAcknowledgements This work was in part supported by Australian Research Council grants FT120100969, LP120200485, and the Data to Decisions Cooperative Research Centre. Correspondence should be addressed to C. Shen (email: chhshen@gmail.com).\n\nReferences\n[1] T. Jaakkola and D. Haussler, \u201cExploiting generative models in discriminative classifiers,\u201d in Proc. 
Adv.\n\nNeural Inf. Process. Syst., 1998, pp. 487\u2013493.\n\n[2] F. Perronnin and C. R. Dance, \u201cFisher kernels on visual vocabularies for image categorization,\u201d in Proc.\n\nIEEE Conf. Comp. Vis. Patt. Recogn., 2007.\n\n[3] F. Perronnin, J. S\u00b4anchez, and T. Mensink, \u201cImproving the Fisher kernel for large-scale image classi\ufb01ca-\n\ntion,\u201d in Proc. Eur. Conf. Comp. Vis., 2010.\n\n[4] J. Krapac, J. J. Verbeek, and F. Jurie, \u201cModeling spatial layout with \ufb01sher vectors for image categoriza-\n\ntion.\u201d in Proc. IEEE Int. Conf. Comp. Vis., 2011, pp. 1487\u20131494.\n\n[5] K. Simonyan, A. Vedaldi, and A. Zisserman, \u201cDeep \ufb01sher networks for large-scale image classi\ufb01cation,\u201d\n\nin Proc. Adv. Neural Inf. Process. Syst., 2013.\n\n[6] V. Sydorov, M. Sakurada, and C. H. Lampert, \u201cDeep \ufb01sher kernels\u2014end to end learning of the Fisher\n\nkernel GMM parameters,\u201d in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2014.\n\n[7] H. Bay, A. Ess, T. Tuytelaars, and L. J. V. Gool, \u201cSpeeded-up robust features (SURF),\u201d Computer Vision\n\n& Image Understanding, vol. 110, no. 3, pp. 346\u2013359, 2008.\n\n[8] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, \u201cCNN features off-the-shelf: an astounding\n\nbaseline for recognition,\u201d 2014, http://arxiv.org/abs/1403.6382.\n\n[9] S. Yan, X. Xu, D. Xu, S. Lin, and X. Li, \u201cBeyond spatial pyramids: A new feature extraction framework\nwith dense spatial sampling for image classi\ufb01cation,\u201d in Proc. Eur. Conf. Comp. Vis., 2012, pp. 473\u2013487.\n[10] L. Bo, X. Ren, and D. Fox, \u201cHierarchical matching pursuit for image classi\ufb01cation: Architecture and fast\n\nalgorithms,\u201d in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 2115\u20132123.\n\n[11] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, \u201cChoosing multiple parameters for support vector\n\nmachines,\u201d Machine Learning, vol. 46, no. 1\u20133, pp. 
131\u2013159, 2002.\n[12] R. Arandjelovi\u0107 and A. Zisserman, \u201cAll about VLAD,\u201d in Proc. IEEE Int. Conf. Comp. Vis., 2013.\n[13] Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, \u201cContextualizing object detection and classi\ufb01cation,\u201d in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011.\n[14] Q. Chen, Z. Song, Y. Hua, Z. Huang, and S. Yan, \u201cHierarchical matching with side information for image classi\ufb01cation,\u201d in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 3426\u20133433.\n[15] J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan, \u201cSubcategory-aware object classi\ufb01cation,\u201d in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 827\u2013834.\n[16] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, \u201cMulti-scale orderless pooling of deep convolutional activation features,\u201d in Proc. Eur. Conf. Comp. Vis., 2014.\n[17] Y. Jia, \u201cCaffe,\u201d 2014, https://github.com/BVLC/caffe.\n[18] H. Lee, A. Battle, R. Raina, and A. Y. Ng, \u201cEf\ufb01cient sparse coding algorithms,\u201d in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 801\u2013808.\n[19] C. Doersch, A. Gupta, and A. A. Efros, \u201cMid-level visual element discovery as discriminative mode seeking,\u201d in Proc. Adv. Neural Inf. Process. Syst., 2013.\n[20] M. Pandey and S. Lazebnik, \u201cScene recognition and weakly supervised object localization with deformable part-based models,\u201d in Proc. IEEE Int. Conf. Comp. Vis., 2011, pp. 1307\u20131314.\n[21] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, \u201cDeCAF: A deep convolutional activation feature for generic visual recognition,\u201d in Proc. Int. Conf. Mach. Learn., 2013.\n[22] N. Zhang, R. Farrell, F. Iandola, and T. Darrell, \u201cDeformable part descriptors for \ufb01ne-grained recognition and attribute prediction,\u201d in Proc. IEEE Int. Conf. Comp. Vis., December 2013.\n[23] K. Gregor and Y. 
LeCun, \u201cLearning fast approximations of sparse coding,\u201d in Proc. Int. Conf. Mach. Learn., 2010, pp. 399\u2013406.\n[24] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, \u201cLocality-constrained linear coding for image classi\ufb01cation,\u201d in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2010.\n", "award": [], "sourceid": 664, "authors": [{"given_name": "Lingqiao", "family_name": "Liu", "institution": "University of Adelaide"}, {"given_name": "Chunhua", "family_name": "Shen", "institution": "NICTA (National ICT Australia)"}, {"given_name": "Lei", "family_name": "Wang", "institution": "University of Wollongong"}, {"given_name": "Anton", "family_name": "van den Hengel", "institution": "University of Adelaide"}, {"given_name": "Chao", "family_name": "Wang", "institution": "University of Wollongong"}]}