{"title": "Graphical Gaussian Vector for Image Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1547, "page_last": 1555, "abstract": "This paper proposes a novel image representation called a Graphical Gaussian Vector, which is a counterpart of the codebook and local feature matching approaches. In our method, we model the distribution of local features as a Gaussian Markov Random Field (GMRF) which can efficiently represent the spatial relationship among local features. We consider the parameter of GMRF as a feature vector of the image. Using concepts of information geometry, proper parameters and a metric from the GMRF can be obtained. Finally we define a new image feature by embedding the metric into the parameters, which can be directly applied to scalable linear classifiers. Our method obtains superior performance over the state-of-the-art methods in the standard object recognition datasets and comparable performance in the scene dataset. As the proposed method simply calculates the local auto-correlations of local features, it is able to achieve both high classification accuracy and high efficiency.", "full_text": "Graphical Gaussian Vector for Image Categorization\n\nTatsuya Harada\n\nThe University of Tokyo/JST PRESTO\n7-3-1 Hongo Bunkyo-ku, Tokyo Japan\n\nharada@isi.imi.i.u-tokyo.ac.jp\n\nkuniyosh@isi.imi.i.u-tokyo.ac.jp\n\nYasuo Kuniyoshi\n\nThe University of Tokyo\n\n7-3-1 Hongo Bunkyo-ku, Tokyo Japan\n\nAbstract\n\nThis paper proposes a novel image representation called a Graphical Gaussian\nVector (GGV), which is a counterpart of the codebook and local feature matching\napproaches. We model the distribution of local features as a Gaussian Markov\nRandom Field (GMRF) which can ef\ufb01ciently represent the spatial relationship\namong local features. Using concepts of information geometry, proper parameters\nand a metric from the GMRF can be obtained. 
Then we define a new image feature by embedding the proper metric into the parameters, which can be directly applied to scalable linear classifiers. We show that the GGV achieves better performance than state-of-the-art methods on the standard object recognition datasets and comparable performance on the scene dataset.

1 Introduction

The Bag of Words (BoW) [7] is the de facto standard image feature for image categorization. In a BoW, each local feature is assigned to the nearest codeword and an image is represented by a histogram of the quantized features. Several approaches inspired by the BoW have been proposed in recent years [9], [23], [28], [27], [29]. While it is well established that using a large number of codewords improves classification performance, the drawback is that assigning local features to the nearest codeword is computationally expensive. To overcome this problem, some studies have proposed building an efficient image representation with a smaller number of codewords [22], [24]. Finding explicit correspondences between local features is another way of categorizing images besides using a BoW [4], [12], [26], and this approach has been improved by representing the spatial layout of local features as a graph [11], [2], [16], [8]. Explicit correspondences between features have an advantage over a BoW in that the information loss of vector quantization is avoided. However, the drawback of this approach is that identifying corresponding points with minimum distortion is also computationally expensive. 
Therefore, the aim of our research is to build an efficient image representation that uses neither codewords nor explicit correspondences between local features, while still achieving high classification accuracy.

Since the spatial layout of local features is important for an image to have semantic meaning, it is natural that embedding spatial information into an image feature improves classification performance [18], [5], [14], [17]. Several approaches take advantage of this fact, ranging from local (e.g., SIFT) to global (e.g., Spatial Pyramid). In contrast, we focus on the spatial layout of local features, which is mid-level spatial information.

In this paper, we model an image as a graph representing the spatial layout of local features and define a new image feature based on this graph, where a proper metric is embedded into the feature. We show that the new feature provides high classification accuracy, even with a linear classifier. Specifically, we model an image as a Gaussian Markov Random Field (GMRF) whose nodes correspond to local features and consider the GMRF parameters as the image feature. Although the GMRF is commonly used for image segmentation, it is rarely used in modern image categorization pipelines despite being an effective way of modeling the spatial layout. In order to extract the representative feature vector from the GMRF, the choice of coordinates for the parameters and of the metric between them must be made carefully. We define the proper coordinates and metric from an information geometry standpoint [1] and derive an optimal feature vector. The resultant feature vector is called a Graphical Gaussian Vector.

The contributions of this study are summarized as follows: 1) A novel and efficient image feature is developed by utilizing the GMRF as a tool for object categorization. 
2) This approach is implemented by developing the Graphical Gaussian Vector feature, which is based on the GMRF and on information geometry. 3) Using standard image categorization benchmarks, we demonstrate that the proposed feature performs better than state-of-the-art methods, even though it is not based on mainstream modules (such as codebooks or correspondences between local features). To the best of our knowledge, this is the first image feature for object categorization that utilizes the expectation parameters of the GMRF with its Fisher information metric and achieves a level of accuracy comparable to that of the codebook and local feature matching approaches.

2 Graphical Gaussian Vector

2.1 Overview of Proposed Method

Figure 1: Overview of image feature extraction based on a multivariate GMRF: (a) densely sampled local features x_i \in R^d; (b) multivariate Gaussian Markov Random Field (MGMRF); (c) PDF p(x; \xi); (d) feature vector \xi, the parameter of the MGMRF; (e) parameter space, where each distribution p(x; \xi_j) is a point on a manifold and distances between points are geodesic.

In this section, we present an overview of our method. Initially, local features \{x_i \in R^d\}_{i=1}^M are extracted using a dense sampling strategy (Fig. 1(a)). We then use a multivariate GMRF to model the spatial relationships among local features (Fig. 1(b)). The GMRF is represented as a graph G(V, E), whose vertices V and edges E correspond to local features and the dependent relationships between those features, respectively. 
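As a concrete illustration of this graph construction, the following is a minimal NumPy sketch; the grid size, feature dimensionality, and neighbor offsets are hypothetical choices of ours, not values from the paper.

```python
import numpy as np

# Hypothetical setup: a 10x10 grid of d-dimensional local features
# (e.g., PCA-reduced descriptors) sampled densely as in Fig. 1(a).
rng = np.random.default_rng(0)
H, W, d = 10, 10, 4
features = rng.standard_normal((H, W, d))

# A star-shaped neighbor structure: a reference node plus its right and
# lower neighbors, given by grid displacement vectors (a_0 = 0, a_1, a_2).
offsets = [(0, 0), (0, 1), (1, 0)]

def concat_vector(r, c):
    """Concatenate the local features on the vertices V at reference (r, c)."""
    return np.concatenate([features[r + dr, c + dc] for dr, dc in offsets])

x = concat_vector(3, 3)
print(x.shape)  # (n*d,) with n = 3 vertices and d = 4 -> (12,)
```

The dimensionality of the concatenated vector is n*d regardless of the image size, which is the property the overview relies on.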
Let the vector x be a concatenation of the local features in V and let \xi_j be the parameter of the GMRF of an image I_j; the image I_j can then be represented by a probability distribution p(x; \xi_j) of the GMRF (Fig. 1(c)). We consider the parameter \xi_j of the GMRF to be a feature vector of the image I_j (Fig. 1(d)). Assuming that \xi is a coordinate system, the whole probability distribution model can be considered as a manifold, where each probability distribution is represented as a point in that space (Fig. 1(e)). However, because the space spanned by the parameters of a probability distribution is not a Euclidean space, we have to be very careful when choosing parameters for the probability distribution and the metric among them. We make use of concepts from information geometry [1] and extract proper parameters and a metric from the GMRF. Finally, we define the new image feature by embedding the metric into the extracted parameters to build an image categorization system with a scalable linear classifier. In the following sections, we describe this process in more detail.

2.2 Image Model and Parameters

Given M local features \{x_i \in R^d\}_{i=1}^M, the aim is to model a probability distribution of the local features representing the spatial layout of the image using the multivariate GMRF G = (V, E). First, a vector x is built by concatenating the local features corresponding to the vertices V of the GMRF. Letting \{x_i\}_{i=1}^n be the local features we focus on, the concatenated vector is x = (x_1^\top \cdots x_n^\top)^\top (e.g., Fig. 1(b), where n = 9). Note that the dimensionality of x is nd and does not depend on the number of local features M, the image size, or the aspect ratio. However, since all results valid for a scalar local feature are also valid for a multivariate local feature, in this section we take the dimensionality of the local features to be 1 (d = 1) for simplicity. That is, dim(x) = n.

Let \mu = E[x], P = E[(x - \mu)(x - \mu)^\top], and J = P^{-1}. A random vector x is called a Gaussian Markov Random Field (GMRF) with respect to G = (V, E) if and only if its density has the form p(x) = (2\pi)^{-n/2} |J|^{1/2} \exp(-\frac{1}{2}(x - \mu)^\top J (x - \mu)) and J_{ij} = 0 for all \{i, j\} \notin E. Because the Gaussian distribution can be represented as an exponential family, we consider an exponential family of the form

p(x) = \exp(\theta^\top \phi(x) - \Phi(\theta)),   (1)

where \theta are the natural parameters, \phi(x) is the sufficient statistic, and \Phi(\theta) = \log \int \exp(\theta^\top \phi(x)) dx is the log-normalizer. \theta and \phi(x) of the GMRF are obtained as [15]:

\theta_i = h_i, \quad \theta_{ii} = -\frac{1}{2} J_{ii}, \quad \theta_{jk} = -J_{jk}, \quad (i \in V, \{j, k\} \in E),   (2)
\phi_i(x) = x_i, \quad \phi_{ii}(x) = x_i^2, \quad \phi_{jk}(x) = x_j x_k, \quad (i \in V, \{j, k\} \in E),   (3)

where h = J\mu. The expectation parameter \eta = E[\phi(x)] is an implicit parameterization belonging to the exponential family. The expectation parameters are obtained as [15]:

\eta_i = \mu_i, \quad \eta_{ii} = P_{ii} + \mu_i^2, \quad \eta_{jk} = P_{jk} + \mu_j \mu_k, \quad (i \in V, \{j, k\} \in E).   (4)

The natural and expectation parameters can be transformed into each other [1]. They are called mutually dual, as each is the dual coordinate system of the other. The two coordinate systems are closely related through the Fisher information matrices (FIMs) G_{ij}(\theta) and G^{*ij}(\eta): G_{ij}(\theta) = \partial \eta_i / \partial \theta_j, G^{*ij}(\eta) = \partial \theta_i / \partial \eta_j, and G^*(\eta) = G^{-1}(\theta). If we take the natural parameters or the expectation parameters as a coordinate system for an exponential family, a flat structure can be realized [1]. In particular, \theta is called a 1-affine coordinate system, and the space spanned by \theta is called 1-flat. Similarly, \eta is called a (-1)-affine coordinate system, and the space spanned by \eta is called (-1)-flat. Those spaces resemble a Euclidean space, but we need to be careful that the spaces spanned by the natural or expectation parameters differ from a Euclidean space, as the metrics vary for different parameters. We discuss how to determine the metrics in those spaces in Sections 2.4 and 2.5.

To summarize this section, the natural and expectation parameters are similar and interchangeable through the FIMs. By using these parameters, we obtain flatness similar to that of a Euclidean space. Although it does not matter whether we choose natural or expectation parameters, we use the expectation parameters (Eq. (4)) as feature vectors because they can be calculated directly from the mean and covariance of local features. We present a multivariate extension of the GMRF and its calculation in the next section.

2.3 Calculation of Expectation Parameters

In this section, we describe the calculation of the expectation parameters of the multivariate GMRF. First, we define the graph structure of the GMRF. We use the star graphs shown in Fig. 2, where four neighbors (Fig. 2(a)) or eight neighbors (Fig. 2(b)) are usually used. While a graph with more neighbors is obviously able to represent richer spatial information, a compact structure is preferable for efficiency. Therefore, we employ the approximated graph structures shown in Fig. 2(c), which represents the vertical and horizontal relationships among local features, and Fig. 2(d), which represents vertical, horizontal, and diagonal relationships.

Figure 2: Structures of the GMRF: (a) four-neighbor star graph; (b) eight-neighbor star graph; (c) approximated structure with reference node r_k and displacements a_1, a_2; (d) approximated structure with displacements a_1, ..., a_4; (e) image region J.

Next, we show a method for estimating the expectation parameters of each image. In practice, Eq. (4) in the multivariate case can be determined by calculating the local auto-correlations of the local features. Here we present the detailed calculations of Eq. (4) using Fig. 2(c) as an example. Let x(r_k) \in R^d be the local feature at a reference point r_k and let a_i and a_j be the displacement vectors, which are defined by the structure of the GMRF. Then, the local auto-correlation matrices are obtained as C_{i,j} = \frac{1}{N_J} \sum_{k \in J} x(r_k + a_i) x(r_k + a_j)^\top, where N_J is the number of local features in the image region J. In particular, if we define a_0 = 0, then C_{0,i} = \frac{1}{N_J} \sum_{k \in J} x(r_k) x(r_k + a_i)^\top. Letting the vector concatenating the local features on the vertices at the reference point r_k be x_k^\top = (x(r_k)^\top x(r_k + a_1)^\top x(r_k + a_2)^\top), P + \mu\mu^\top is calculated to be:

P + \mu\mu^\top = \frac{1}{N_J} \sum_{k \in J} x_k x_k^\top = \begin{pmatrix} C_{0,0} & C_{0,1} & C_{0,2} \\ C_{1,0} & C_{1,1} & C_{1,2} \\ C_{2,0} & C_{2,1} & C_{2,2} \end{pmatrix}.   (5)

The expectation parameters of the GMRF depicted in Fig. 2(c) can be obtained as \eta = (\mu_0^\top \mu_1^\top \mu_2^\top f^\top(C_{0,0}) f^\top(C_{1,1}) f^\top(C_{2,2}) g^\top(C_{0,1}) g^\top(C_{0,2}))^\top, where f(\cdot) returns a column vector consisting of the elements of the upper triangular portion of the input matrix, g(\cdot) returns a column vector containing all the elements of the input matrix, and \mu_i = \frac{1}{N_J} \sum_{k \in J} x(r_k + a_i). Note that C_{1,2} is omitted, because there is no edge between the vertices at r_k + a_1 and r_k + a_2. In general, the expectation parameters (Eq. (4)) on the star graph can be calculated by:

\eta = (\mu_0^\top \cdots \mu_{n-1}^\top \; f^\top(C_{0,0}) \cdots f^\top(C_{n-1,n-1}) \; g^\top(C_{0,1}) \cdots g^\top(C_{0,n-1}))^\top,   (6)

where n = |V| is the number of vertices. The dimensionality of \eta is nd + n(d+1)d/2 + (n-1)d^2, where d is the dimensionality of the local feature. Also note that \{C_{i,j}\}_{i \neq j, \; i,j \neq 0} can be omitted. By scanning the image region J (Fig. 2(e)), if we have enough local features, the means \{\mu_i\}_{i=0}^{n-1} and covariance matrices \{C_{i,i}\}_{i=0}^{n-1} of the local features in the region J approach the common vector \mu_0 and matrix C_{0,0}, respectively. The expectation parameters (Eq. (4)) can then be approximated by:

\eta = (\mu_0^\top \cdots \mu_0^\top \; f^\top(C_{0,0}) \cdots f^\top(C_{0,0}) \; g^\top(C_{0,1}) \cdots g^\top(C_{0,n-1}))^\top.   (7)

Equation (7) is calculated more efficiently than Eq. (6) and converges to the same vector as Eq. (6). However, in a preliminary experiment, Eq. (6) was better than Eq. (7) in terms of classification accuracy. In the following sections, we use Eq. (6) for the expectation parameters.

2.4 Metric

In Section 2.2, we mentioned that the metric varies depending on the parameters. We now derive a metric between the expectation parameters [1]. 
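Before doing so, the estimation step of Eqs. (5)-(6) can be sketched in code. The following is a minimal NumPy sketch, assuming an illustrative 12x12 grid of d = 3 features and the graph of Fig. 2(c) (n = 3); all sizes and helper names are our own, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, d = 12, 12, 3
feats = rng.standard_normal((H, W, d))  # densely sampled local features

# Displacements of the star graph in Fig. 2(c): a_0 = 0, a_1 (right), a_2 (down).
offs = [(0, 0), (0, 1), (1, 0)]

# Local auto-correlations C_{i,j} = (1/N_J) sum_k x(r_k+a_i) x(r_k+a_j)^T,
# averaged over reference points r_k whose neighbors stay inside the region J.
refs = [(r, c) for r in range(H - 1) for c in range(W - 1)]
N = len(refs)
mu = [sum(feats[r + dr, c + dc] for r, c in refs) / N for dr, dc in offs]
C = {(i, j): sum(np.outer(feats[r + offs[i][0], c + offs[i][1]],
                          feats[r + offs[j][0], c + offs[j][1]])
                 for r, c in refs) / N
     for i in range(3) for j in range(3)}

def f(M):  # upper-triangular part (including the diagonal) as a vector
    return M[np.triu_indices_from(M)]

def g(M):  # all elements of the matrix as a vector
    return M.ravel()

# Expectation parameters of Eq. (6) for the graph of Fig. 2(c);
# C_{1,2} is omitted because {a_1, a_2} is not an edge.
eta = np.concatenate([mu[0], mu[1], mu[2],
                      f(C[0, 0]), f(C[1, 1]), f(C[2, 2]),
                      g(C[0, 1]), g(C[0, 2])])
print(eta.shape)  # nd + n(d+1)d/2 + (n-1)d^2 = 9 + 18 + 18 = (45,)
```

The dimensionality matches the count given after Eq. (6), which is a useful sanity check when implementing the feature.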
Let ds represent the length of the small line-element connecting \eta and \eta + d\eta. d\eta is represented using basis vectors e^{*i}: d\eta = \sum_i d\eta_i e^{*i}. The squared distance can be calculated as ds^2 = \langle d\eta, d\eta \rangle = \sum_{i,j} \langle e^{*i}, e^{*j} \rangle d\eta_i d\eta_j, where \langle \cdot, \cdot \rangle is the inner product of two vectors. By applying the Taylor expansion to the KL divergence between p(x; \eta) and p(x; \eta + d\eta), ds^2 can be represented as ds^2 = KL[p(x; \eta) : p(x; \eta + d\eta)] = \frac{1}{2} d\eta^\top G^* d\eta = \frac{1}{2} \sum_{i,j} G^{*ij} d\eta_i d\eta_j, where G^* is the FIM. By comparing these equations, it is clear that the metric matrix consisting of the inner products of the basis vectors corresponds to the FIM:

G^{*ij} = \langle e^{*i}, e^{*j} \rangle.   (8)

Thus, the FIM is a proper metric for the feature vectors (the expectation parameters) obtained from the GMRF.

The Cramér-Rao inequality gives us a better understanding of the FIM. Assuming that \hat{\eta} is an unbiased estimator, the variance-covariance matrix of \hat{\eta} satisfies Var[\hat{\eta}] \geq \frac{1}{N} (G^*)^{-1}. Consequently, the FIM can be considered the inverse of the variance of an estimator, making it natural to use this matrix as a distance metric between the parameters.

2.5 Implementation of Graphical Gaussian Vector

First, we build the concatenated vector x = (x_1^\top \cdots x_n^\top)^\top, where each x_i corresponds to the local feature of vertex i. Using all the training data, the mean \mu = E[x] and the precision matrix J = P^{-1}, with P = E[(x - \mu)(x - \mu)^\top], are obtained. Since the FIM G^*(\eta) of the GMRF is derived from the FIM of the full Gaussian family F^*(\eta), we first calculate F^*(\eta) from \mu and J. Let e^{*i} and e^{*ij} denote the basis vectors corresponding to \mu_i and P_{ij} + \mu_i \mu_j in Eq. (4), respectively. The elements of F^*(\eta) are obtained by [20]:

F^{*i,j}(\eta) = \langle e^{*i}, e^{*j} \rangle = J_{ij}(1 + \mu^\top J \mu) + \left(\sum_k \mu_k J_{ki}\right)\left(\sum_k \mu_k J_{kj}\right),
F^{*i,pq}(\eta) = \langle e^{*i}, e^{*pq} \rangle = -J_{pi} \sum_k \mu_k J_{kq} - J_{qi} \sum_k \mu_k J_{kp},
F^{*i,pp}(\eta) = \langle e^{*i}, e^{*pp} \rangle = -J_{pi} \sum_k \mu_k J_{kp},
F^{*pq,rs}(\eta) = \langle e^{*pq}, e^{*rs} \rangle = J_{ps} J_{qr} + J_{qs} J_{pr},
F^{*pq,rr}(\eta) = \langle e^{*pq}, e^{*rr} \rangle = J_{pr} J_{rq},
F^{*pp,rr}(\eta) = \langle e^{*pp}, e^{*rr} \rangle = \frac{1}{2} J_{pr}^2.

F^*(\eta) can be partitioned according to the graph G and its complement \G:

F^*(\eta) = \begin{pmatrix} F^*_{G,G}(\eta) & F^*_{G,\G}(\eta) \\ F^*_{\G,G}(\eta) & F^*_{\G,\G}(\eta) \end{pmatrix}.   (9)

The FIM of the GMRF, G^*(\eta), is obtained as the Schur complement of F^*(\eta) with respect to the submatrix F^*_{\G,\G}(\eta) [15]:

G^*(\eta) = F^*_{G,G}(\eta) - F^*_{G,\G}(\eta) \left(F^*_{\G,\G}(\eta)\right)^{-1} F^*_{\G,G}(\eta).   (10)

As these calculations may be complicated, we present a simple example using a GMRF with n = 3 vertices, shown in Fig. 3.

Figure 3: Here V = \{x, y, z\} and E = \{\{x, y\}, \{x, z\}\}. The dimensionality of the local features is 2 (x = (x_1, x_2)^\top, y = (y_1, y_2)^\top, z = (z_1, z_2)^\top). A vector concatenating the local features in V is v = (x_1, x_2, y_1, y_2, z_1, z_2)^\top. Using the training data, we calculate a mean \mu and a precision matrix J of v. Using \mu and J, the Fisher information matrix of the full Gaussian family can be calculated as in (b), whose rows and columns correspond to the elements of the expectation parameters. In (b), F^*(\eta) can be partitioned into the submatrices F^*_{G,G}(\eta), F^*_{G,\G}(\eta), F^*_{\G,G}(\eta), and F^*_{\G,\G}(\eta). The Fisher information matrix of the GMRF is obtained as shown in (c) using these submatrices of F^*(\eta).

However, G^*(\eta) is difficult to deal with, as it depends on the expectation parameters. 
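Before addressing that, the decomposition of Eq. (10) and the subsequent metric embedding can be sketched numerically. In the following minimal NumPy sketch, a random symmetric positive definite matrix stands in for F^*(\eta) and the partition index is arbitrary; all values are illustrative choices of ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the Fisher information matrix F*(eta) of the full Gaussian
# family: any symmetric positive definite matrix (values are illustrative).
A = rng.standard_normal((6, 6))
F = A @ A.T + 6 * np.eye(6)

# Partition F according to the graph G and its complement \G, as in Eq. (9).
k = 4  # first k rows/columns correspond to parameters kept by G
F_GG, F_GnG = F[:k, :k], F[:k, k:]
F_nGG, F_nGnG = F[k:, :k], F[k:, k:]

# Eq. (10): the FIM of the GMRF is the Schur complement of F_nGnG in F.
G_star = F_GG - F_GnG @ np.linalg.inv(F_nGnG) @ F_nGG

# Embed the metric into stand-in expectation parameters via the symmetric
# matrix square root G*^{1/2}, computed from an eigendecomposition.
w, V = np.linalg.eigh(G_star)
G_half = V @ np.diag(np.sqrt(w)) @ V.T
eta = rng.standard_normal(k)
zeta = G_half @ eta  # metric-embedded feature vector

# The squared norm of zeta equals the quadratic form eta^T G* eta.
print(np.allclose(zeta @ zeta, eta @ G_star @ eta))  # True
```

Because the Schur complement of a positive definite matrix is again positive definite, the eigenvalues in `w` are strictly positive and the square root is well defined.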
Thus, we approximate the model space using the tangent space at the center point of all training data [20]: G^*(\eta) \approx G^*(\eta_c), where \eta_c = \frac{1}{N} \sum_{i=1}^N \eta_i and N is the number of training images. In order to embed the proper metric into the expectation parameters, we multiply \eta by G^*(\eta_c)^{1/2}:

\zeta = \left(F^*_{G,G}(\eta_c) - F^*_{G,\G}(\eta_c) \left(F^*_{\G,\G}(\eta_c)\right)^{-1} F^*_{\G,G}(\eta_c)\right)^{1/2} \eta.   (11)

We call \zeta the Graphical Gaussian Vector (GGV). This vector is used directly to build sophisticated linear classifiers.

We now have a derivation of the GGV, and the algorithm for it is very simple, consisting of the following three steps: 1) calculation of local auto-correlations of local features; 2) estimation of the expectation parameters of the GMRF; and 3) embedding of the distance metric (the Fisher information metric) into the expectation parameters. The calculation of the GGV is given in Algorithm 1. Before calculating GGVs, we have to estimate the FIM of the GMRF by decomposing the FIM of the full Gaussian. As a consequence, we obtain one common FIM for all expectation parameters. In practice, since using all the training data to estimate the FIM is infeasible, we use a subset of local features randomly sampled from the training data. Note that since the calculation of the FIM is done in the preprocessing stage, it is not necessary to calculate the FIM when extracting GGVs.

Algorithm 1 Calculation of GGV.
Input: An image region J, and the Fisher information matrix of the GMRF G^*(\eta_c)
Output: GGV \zeta
1. Calculate local auto-correlations of local features:
   \mu_i = \frac{1}{N_J} \sum_{k \in J} x(r_k + a_i), \quad C_{i,j} = \frac{1}{N_J} \sum_{k \in J} x(r_k + a_i) x(r_k + a_j)^\top
2. Estimate the expectation parameters:
   \eta = (\mu_0^\top \cdots \mu_{n-1}^\top \; f^\top(C_{0,0}) \cdots f^\top(C_{n-1,n-1}) \; g^\top(C_{0,1}) \cdots g^\top(C_{0,n-1}))^\top
3. Embed the Fisher information metric into the expectation parameters:
   \zeta = (G^*(\eta_c))^{1/2} \eta

3 Experiment

We tested our method on standard object and scene datasets (Caltech101, Caltech256, and 15-Scenes). In the first experiment, we evaluated the effects of the graph structure (i.e., spatial information) and of the FIM. As baseline methods, we used Generalized Local Correlation (GLC) [19], \eta_{glc} = (\mu_0^\top f^\top(C_{0,0}))^\top without the FIM; Local Auto-Correlation features (LAC) [21], [14], \eta_{lac} = (\mu_0^\top f^\top(C_{0,0}) g^\top(C_{0,1}) \cdots g^\top(C_{0,n-1}))^\top without the FIM; and the Global Gaussian with a center linear kernel (GG) [20], \eta_{glc} with F^*(\eta_c). The comparison among these methods is shown in Table 1. Two types of graph structures were utilized for the GGVs. The first is shown in Fig. 2(c) (GGV, n = 3), which models the horizontal and vertical spatial layout of the local features. The second is shown in Fig. 2(d) (GGV, n = 5), which adds the diagonal spatial layouts of the features to Fig. 2(c). We also compared L2 normalized GGVs (i.e., \hat{\zeta} = \zeta / ||\zeta||). To embed global spatial information, we used the spatial pyramid representation with a 1x1 + 2x2 + 3x3 pyramid structure.

Table 1: The relationships between GLC, LAC, GG, and GGV in terms of spatial information and the Fisher information metric.

Method | Spatial information | Fisher information metric
GLC | - | -
LAC | yes | -
GG | - | yes
GGV (proposed) | yes | yes

In the second experiment, we compared GGVs with the Improved Fisher Kernel (IFK) [24], [25], which is the best image representation available at the time of writing. In this experiment, we used the spatial pyramid representation with a 1x1 + 2x2 + 3x1 structure. The number of components c in the GMM is an important parameter for IFK. We tested GMMs with c = 32, 64, 128, and 256 Gaussians to compute IFKs and compared them with GGVs.

For all datasets, SIFT features were densely sampled and described for 16x16 patches. We downsized images if their longest side was more than 300 pixels. As the aforementioned features depend on the dimensionality of the local feature, we reduced its dimensionality using PCA and compared performance as a function of the new dimensionality. As a linear classifier, we used the multi-class Passive-Aggressive algorithm (PA) [6].

3.1 Caltech101

Caltech101 is the de facto standard object recognition dataset [10]. To evaluate classification performance, we followed the most commonly used methodology: fifteen images were randomly selected from all 102 categories for training purposes and the remaining images were used for testing. The classification score was averaged over 10 trials.

Before comparing GGVs with the baselines, we evaluate the sensitivity to the sampling step of the local features. 
The sampling step is one of the important parameters of GGV, because GGV calculates auto-correlations of neighboring local features. In this preliminary experiment, we fix the number of vertices at 5 (n = 5) and the dimensionality of the local feature at 32, and we do not use the spatial pyramid. The results are as follows: 56.7 % (step = 4 pixels), 57.7 % (step = 6 pixels), 57.7 % (step = 8 pixels), 57.2 % (step = 10 pixels), and 56.5 % (step = 12 pixels). There is no clear difference between step sizes of 6 and 8 pixels. Therefore, in the following experiments we use a 6-pixel sampling step for local feature extraction.

Figure 4: A comparison of classification accuracies of: (left) GGV, GLC, LAC and GG; (center) GGV and IFK with respect to the dimensionality of "local features"; (right) GGV and IFK with respect to the dimensionality of "image features" in the Caltech101 dataset.

Figure 4 (left) shows the classification accuracies as a function of the dimensionality of the local features. A larger dimensionality yielded better performance, and the proposed method (GGV) outperformed the other methods (GLC, LAC, and GG). By comparing GGV with LAC, and GG with GLC, it is clear that embedding the Fisher information metric improved the classification accuracy significantly. By comparing GGV with GG, as well as LAC with GLC, it can also be seen that embedding the spatial layout of local features improved the accuracy. In a comparison between the graph structures, the four-neighbor structure (Fig. 2(d)) performed slightly better than the two-neighbor structure (Fig. 2(c)). If we compare the regular GGVs with the L2 normalized GGVs, we find that L2 normalization improved the accuracy by almost 2 %.

In the second experiment, we compared the L2 normalized GGVs with IFKs. The results are shown in Fig. 4 (center). For all dimensionalities and numbers of components, GGVs performed better than IFKs. Fig. 4 (right) shows the classification accuracy as a function of the dimensionality of the image features, converted from the results shown in Fig. 4 (center). We see that GGVs achieved higher accuracy at a lower dimensionality of the image features. The results were also compared against those of leading methods that use a linear classifier, with performance scores referenced from the original papers. LLC [27] scored 65.4 % and ScSPM [28] scored 67.0 %, whereas our method achieved 71.3 % when the dimensionality of the local feature is 32 and the number of vertices is 5. 
Therefore, our method is better than the best available methods on this dataset, despite using a linear classifier and requiring neither a codebook nor descriptor matching.

3.2 Caltech256

Caltech256 consists of images from 256 object categories [13]. This database is significant for its large inter-class variability, as well as an intra-class variability greater than that found in Caltech101. To evaluate performance, we followed a commonly used methodology: fifteen images were randomly selected from all categories for training purposes and the remaining images were used for testing. The classification score was averaged over 10 trials.

Figure 5 (left) shows a comparison of the classification accuracies of GGV, GLC, LAC and GG. Fig. 5 (center) and (right) show comparisons of the L2 normalized GGVs and IFKs on the Caltech256 dataset with respect to the dimensionality of local features and image features, respectively. The results show the same trends as for Caltech101. Our method is better than all baseline methods and IFKs. [24] reported that IFK achieved 34.7 % and [27] reported that LLC scored 34.4 %, while GGV obtained 33.4 %. However, a fair comparison is difficult because our method used only single-scale SIFT, whereas [24] and [27] used 5-scale SIFT and 3-scale HOG, respectively. It is known that using multi-scale local features improves classification accuracy (e.g., [3]). For a fair comparison, we used 3-scale SIFT (patch sizes 16x16, 24x24, and 32x32) for GGV with n = 5 and L2 normalization. 
GGV with 3-scale SIFT achieved 36.2%, which is better than these leading methods.

Figure 5: A comparison of classification accuracies of: (left) GGV, GLC, LAC and GG; (center) GGV and IFK with respect to the dimensionality of "local features"; (right) GGV and IFK with respect to the dimensionality of "image features" in the Caltech256 dataset.

3.3 15-Scenes

We experimented with 15-Scenes, a commonly used scene classification dataset [18]. We randomly selected 100 training images for each class and used the remaining samples as test data. We calculated the mean of the classification rate for each class. This score was averaged over 10 trials, with the training and test sets randomly re-selected for each trial.
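This evaluation protocol (a random per-class split, the mean of the per-class classification rates, averaged over repeated trials) can be sketched as follows. This is a minimal illustration under our own helper names, not the authors' evaluation code.

```python
import random
import numpy as np

def mean_per_class_accuracy(y_true, y_pred, n_classes):
    """Average the per-class classification rates (the 15-Scenes score)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rates = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            rates.append(np.mean(y_pred[mask] == c))
    return float(np.mean(rates))

def random_split(labels, n_train_per_class, rng):
    """Pick n_train_per_class indices per class for training; rest is test."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train = []
    for idx in by_class.values():
        train += rng.sample(idx, n_train_per_class)
    test = [i for i in range(len(labels)) if i not in set(train)]
    return train, test

# Toy run: 6 images in 2 classes, 2 training images per class.
rng = random.Random(0)
train_idx, test_idx = random_split([0, 0, 0, 1, 1, 1], 2, rng)
score = mean_per_class_accuracy([0, 0, 1, 1], [0, 1, 1, 1], 2)  # 0.75
```

Averaging per-class rates (rather than overall accuracy) keeps classes with few test images from being swamped by larger ones; repeating over random splits reduces the variance of the reported score.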
This is the same methodology as that used in previous studies.

Figure 6: A comparison of classification accuracies of: (left) GGV, GLC, LAC and GG; (center) GGV and IFK with respect to the dimensionality of "local features"; (right) GGV and IFK with respect to the dimensionality of "image features" in the 15-Scenes dataset.

Figure 6 (left) shows a comparison of the classification accuracies of GGV, GLC, LAC, and GG on the 15-Scenes dataset. The results show trends similar to those for Caltech101 and Caltech256, except that there is no difference between the scores of the two graph structures. In the second experiment, the results with respect to the dimensionality of local features and image features are shown in Figs. 6 (center) and (right), respectively.
In contrast to the results for Caltech101 and Caltech256, IFKs scored slightly higher than GGVs (IFK (c = 256, d = 32): 84.0%; GGV (n = 5, d = 32, L2 normalized): 83.5%). Among leading methods, the spatial Fisher kernel [17] reported the highest score (88.1%). However, since [17] used 8-scale SIFT descriptors, which provide richer information than the single-scale SIFT descriptors we used, a direct comparison is difficult.

4 Conclusion

In this paper, we proposed an efficient image feature called the Graphical Gaussian Vector, which uses neither a codebook nor local feature matching. In the proposed method, spatial information about local features and the Fisher information metric are embedded into a feature by modeling the image as a Gaussian Markov Random Field (GMRF). Experimental results on three standard datasets demonstrated that the proposed method offers performance superior or comparable to other state-of-the-art methods. The proposed image feature computes the expectation parameters of the GMRF simply and efficiently while maintaining a high classification rate.

References
[1] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, 2001.
[2] A. C. Berg, T. L. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In CVPR, 2005.
[3] L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. In NIPS, 2010.
[4] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
[5] Y. Cao, C. Wang, Z. Li, L. Zhang, and L. Zhang. Spatial-bag-of-features. In CVPR, 2010.
[6] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7:551–585, 2006.
[7] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray.
Visual categorization with bags of keypoints. In ECCV International Workshop on SLCV, 2004.
[8] O. Duchenne, A. Joulin, and J. Ponce. A graph-matching kernel for object categorization. In ICCV, 2011.
[9] J. D. R. Farquhar, S. Szedmak, H. Meng, and J. Shawe-Taylor. Improving "bag-of-keypoints" image categorisation: Generative models and pdf-kernels. Technical report, University of Southampton, 2005.
[10] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR, Workshop on GMBV, 2004.
[11] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003.
[12] R. Fergus, P. Perona, and A. Zisserman. Weakly supervised scale-invariant learning of models for visual recognition. IJCV, 71(3):273–303, 2007.
[13] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[14] T. Harada, H. Nakayama, and Y. Kuniyoshi. Improving local descriptors by embedding global and local spatial information. In ECCV, 2010.
[15] J. K. Johnson. Convex Relaxation Methods for Graphical Models: Lagrangian and Maximum Entropy Approaches. PhD thesis, MIT, 2008.
[16] J. Kim and K. Grauman. Asymmetric region-to-image matching for comparing images with generic object categories. In CVPR, 2010.
[17] J. Krapac, J. Verbeek, and F. Jurie. Modeling spatial layout with Fisher vectors for image categorization. In ICCV, 2011.
[18] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[19] H. Nakayama, T. Harada, and Y. Kuniyoshi. Dense sampling low-level statistics of local features. In CIVR, 2009.
[20] H. Nakayama, T. Harada, and Y. Kuniyoshi.
Global Gaussian approach for scene categorization using information geometry. In CVPR, 2010.
[21] N. Otsu and T. Kurita. A new scheme for practical, flexible and intelligent vision systems. In Proc. IAPR Workshop on Computer Vision, 1988.
[22] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[23] F. Perronnin, C. Dance, G. Csurka, and M. Bressan. Adapted vocabularies for generic visual categorization. In ECCV, 2006.
[24] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[25] J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.
[26] C. Wallraven, B. Caputo, and A. Graf. Recognition with local features: The kernel recipe. In ICCV, 2003.
[27] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[28] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[29] X. Zhou, K. Yu, T. Zhang, and T. S. Huang. Image classification using super-vector coding of local image descriptors. In ECCV, 2010.
", "award": [], "sourceid": 732, "authors": [{"given_name": "Tatsuya", "family_name": "Harada", "institution": null}, {"given_name": "Yasuo", "family_name": "Kuniyoshi", "institution": null}]}