{"title": "Self-Adaptable Templates for Feature Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 864, "page_last": 872, "abstract": "Hierarchical feed-forward networks have been successfully applied in object recognition. At each level of the hierarchy, features are extracted and encoded, followed by a pooling step. Within this processing pipeline, the common trend is to learn the feature coding templates, often referred as codebook entries, filters, or over-complete basis. Recently, an approach that apparently does not use templates has been shown to obtain very promising results. This is the second-order pooling (O2P). In this paper, we analyze O2P as a coding-pooling scheme. We find that at testing phase, O2P automatically adapts the feature coding templates to the input features, rather than using templates learned during the training phase. From this finding, we are able to bring common concepts of coding-pooling schemes to O2P, such as feature quantization. This allows for significant accuracy improvements of O2P in standard benchmarks of image classification, namely Caltech101 and VOC07.", "full_text": "Self-Adaptable Templates for Feature Coding\n\nXavier Boix1,2\u2217 Gemma Roig1,2\u2217\n\nSalomon Diether1\n\nLuc Van Gool1\n\n1Computer Vision Laboratory, ETH Zurich, Switzerland\n\n2LCSL, Massachusetts Institute of Technology & Istituto Italiano di Tecnologia, Cambridge, MA\n\n{boxavier,gemmar,sdiether,vangool}@vision.ee.ethz.ch\n\n{xboix,gemmar}@mit.edu\n\nAbstract\n\nHierarchical feed-forward networks have been successfully applied in object recognition. At each level of the hierarchy, features are extracted and encoded, followed by a pooling step. Within this processing pipeline, the common trend is to learn the feature coding templates, often referred to as codebook entries, filters, or over-complete basis. 
Recently, an approach that apparently does not use templates has been shown to obtain very promising results. This is the second-order pooling (O2P) [1]. In this paper, we analyze O2P as a coding-pooling scheme. We find that at testing phase, O2P automatically adapts the feature coding templates to the input features, rather than using templates learned during the training phase. From this finding, we are able to bring common concepts of coding-pooling schemes to O2P, such as feature quantization. This allows for significant accuracy improvements of O2P in standard benchmarks of image classification, namely Caltech101 and VOC07.\n\n1 Introduction\n\nMany object recognition schemes, inspired by biological vision, are based on feed-forward hierarchical architectures, e.g. [2, 3, 4]. At each level of the hierarchy, the algorithms can usually be divided into the steps of feature coding and spatial pooling. The feature coding extracts similarities between the set of input features and a set of templates (the so-called filters, over-complete basis or codebook), and then the similarity responses are transformed using some non-linearities. Finally, the spatial pooling extracts one single vector from the set of transformed responses. The specific architecture of the network (e.g. how many layers), and the specific algorithms for the coding-pooling at each layer are usually set for a recognition task and dataset, cf. [5].\n\nSecond-order Pooling (O2P) is an alternative algorithm to the aforementioned coding-pooling scheme. O2P is based on tensor representations that were introduced in medical imaging to analyze magnetic resonance images [6, 7]. Lately, tensor representations achieved state-of-the-art results in some computer vision tasks [8, 9], and remarkable results for semantic segmentation, in which tensor representations were adapted and named O2P [1, 10]. 
A surprising fact about O2P is that it is formulated without feature coding templates [1]. This is in contrast to the common coding-pooling schemes, in which the templates are learned during a training phase, and at testing phase, the templates remain fixed to the learned values.\n\nMotivated by the intriguing properties of O2P, in this paper we try to re-formulate O2P as a coding-pooling scheme. In doing so, we find that O2P actually computes similarities to feature coding templates, as the rest of the coding-pooling schemes do. Yet, what remains uncommon about O2P is that the templates are \u201crecomputed\u201d for each specific input, rather than being fixed to learned values. In O2P, the templates are self-adapted to the input, and hence, they do not require learning.\n\n\u2217Both first authors contributed equally.\n\nFrom our formulation, we are able to bring common concepts of coding-pooling schemes to O2P, such as feature quantization. This allows us to achieve significant improvements of the accuracy of O2P for image classification. We report experiments on two challenging benchmarks for image classification, namely Caltech101 [11], and VOC07 [12].\n\n2 Preliminaries\n\nIn this section, we revisit several coding-pooling schemes as well as O2P, and identify some common terminology in the literature. This will serve as a basis for the new formulation of O2P that we introduce in the following section.\n\nThe algorithms that we analyze in this section are usually part of a layer of a hierarchical network for object recognition. The input to these algorithms is a set of feature vectors that come from the output of the previous layer, or from the raw image. Let $\\{x_i\\}_N$ be the set of input feature vectors to the algorithm, which is the set of $N$ feature vectors, $x_i \\in \\mathbb{R}^M$, indexed by i \u2208 {1, . . . 
, N}. The output of the algorithm is a single vector, which we denote as $y$, and it may have a different dimensionality than the input vectors.\n\nIn Subsection 2.1, we present the algorithms and terminology of previous template-based methods. Then, in Subsection 2.2, we review the formulation of O2P that appears in the literature, which apparently does not use templates.\n\n2.1 Coding-Pooling based on Evaluating Similarities to Templates\n\nTemplate-based methods are built upon similarities between the input vectors and a set of templates. Depending on the terminology of each algorithm, the templates may be denoted as filters, codebook, or over-complete basis. From now on, we will refer to all of them as templates. We denote the set of templates as $\\{b_k \\in \\mathbb{R}^M\\}_P$. In this paper, $b_k$ and the input feature vectors $x_i$ have the same dimensionality, $M$. The set of templates is fixed to learned values during the training phase. There are many possible learning algorithms, but analyzing them is not necessary here.\n\nThe algorithms that are interesting for our purposes start by computing a similarity measure between the input feature vectors $\\{x_i\\}_N$ and the templates $\\{b_k\\}_P$. Let $\\Gamma(x_i, b_k)$ be the similarity function, which depends on each algorithm. We define $\\gamma_i$ as the vector that contains the similarities of $x_i$ to the set of templates $\\{b_k\\}$, and $\\gamma \\in \\mathbb{R}^{P \\times N}$ the matrix whose columns are the vectors $\\gamma_i$, i.e.\n\n$$\\gamma_{ki} = \\Gamma(x_i, b_k). \\tag{1}$$\n\nOnce $\\gamma$ is computed, the algorithms that we analyze apply some non-linear transformation to $\\gamma$, and then the resulting responses are merged together with the so-called pooling operation. The pooling consists of generating one single response value for each template. We denote as $g_k(\\gamma)$ the function that includes both the non-linear transformation and the pooling operation, where $g_k : \\mathbb{R}^{P \\times N} \\to \\mathbb{R}$. We include both operations in the same function, but in the literature they are usually presented as two separate steps. Finally, the output vector $y$ is built using $\\{g_k(\\gamma)\\}_P$, $\\{b_k\\}_P$ and $\\{x_i\\}_N$, depending on the algorithm. It is also quite common to concatenate the outputs of neighboring regions to generate the final output of the layer.\n\nWe now show how the presented terminology is applied to some methods based on evaluating similarities to templates, namely assignment-based methods and Fisher Vectors. In the sequel, these algorithms will be a basis to reformulate O2P.\n\nAssignment-based Methods The popular Bag-of-Words and some of its variants fall into this category, e.g. [13, 14, 15]. These methods consist of assigning each input vector $x_i$ to a set of templates (the so-called vector quantization), and then building a histogram of the assignments, which corresponds to the average pooling operation.\n\nWe now present them using our terminology. After computing the similarities to the templates, $\\gamma$ (usually based on $\\ell_2$ distance), $g_k(\\gamma)$ computes both the vector quantization and the pooling. Let $s$ be the number of templates to which each input vector is assigned, and let $\\gamma'_i$ be the resulting assignment vector of $x_i$ (i.e. $\\gamma'_i$ is the result of applying vector quantization on $x_i$). $\\gamma'_i$ has $s$ entries set to 1 and the rest to 0, which indicate the assignment. Finally, $g_k(\\gamma)$ also computes the pooling for the assignments corresponding to the template $k$, i.e. $g_k(\\gamma) = \\frac{1}{N} \\sum_{i<N} \\gamma'_{ki}$. The final output vector is the concatenation of the resulting pooling of the different templates, $y = (g_1(\\gamma), \\ldots, g_P(\\gamma))$.\n\nFisher Vectors It uses the first and second order statistics of the similarities between the features and the templates [16]. Fisher Vector builds two vectors for each template $b_k$, which are\n\n$$\\Phi^{(1)}_k = \\frac{1}{A_k} \\sum_{i<N} \\gamma_{ki} (b_k - x_i), \\qquad \\Phi^{(2)}_k = \\frac{1}{B_k} \\sum_{i<N} \\gamma_{ki} \\left((b_k - x_i)^2 - C_k\\right), \\tag{2}$$\n\n$$\\text{where } \\gamma_{ki} = \\frac{1}{Z_k} \\exp\\left[-\\frac{1}{2} (x_i - b_k)^t D_k (x_i - b_k)\\right]. \\tag{3}$$\n\n$A_k$, $B_k$, $C_k$ are learned constants, $Z_k$ is a normalization factor and $D_k$ is a learned constant matrix of the model. Note that in Eq. (3), $\\gamma_{ki}$ is a similarity between the feature vector $x_i$ and the template $b_k$. The final output vector is $y = (\\Phi^{(1)}_1, \\Phi^{(2)}_1, \\ldots, \\Phi^{(1)}_P, \\Phi^{(2)}_P)$. For further details we refer the reader to [16].\n\nWe use our terminology to do a very simple re-write of the terms. We define $g_k(\\gamma)$ and $b^F_k$ (we use the super-index $F$ to indicate that they are from Fisher Vectors, and different from $b_k$) as\n\n$$g_k(\\gamma) = \\|(\\Phi^{(1)}_k, \\Phi^{(2)}_k)\\|_2, \\qquad b^F_k = \\frac{1}{g_k(\\gamma)} (\\Phi^{(1)}_k, \\Phi^{(2)}_k). \\tag{4}$$\n\nWe can see that the templates of Fisher Vectors, $b^F_k$, are obtained by applying some transformations to the original learned template $b_k$, which involve the input set of features $\\{x_i\\}$. $g_k(\\gamma)$ is the norm of $(\\Phi^{(1)}_k, \\Phi^{(2)}_k)$, which gives an idea of the importance of each template in $\\{x_i\\}$, similarly to $g_k(\\gamma)$ in assignment-based methods. Note that $b^F_k$ and $g_k(\\gamma)$ are related to only one fixed template, $b_k$. The final output vector becomes $y = (g_1(\\gamma) b^F_1, \\ldots, g_P(\\gamma) b^F_P)$.\n\n2.2 Second-Order Pooling\n\nO2P starts by building a correlation matrix from the set of feature (column) vectors $\\{x_i \\in \\mathbb{R}^M\\}_N$, i.e.\n\n$$K = \\frac{1}{N} \\sum_{i<N} x_i x_i^t, \\tag{5}$$\n\nwhere $x_i^t$ is the transpose vector of $x_i$, and $K \\in \\mathbb{R}^{M \\times M}$ is a square matrix. $K$ is a symmetric positive definite (SPD) matrix, and contains second-order statistics of $\\{x_i\\}$. The set of SPD matrices forms a Riemannian manifold, and hence the conventional operations in the Euclidean space cannot be used. Several metrics have been proposed for SPD matrices, and the most celebrated is the Log-Euclidean metric [17]. Such a metric consists of mapping the SPD matrices to the tangent space by using the logarithm of the matrix, $\\log(K)$. In the tangent space, the standard Euclidean metrics can be used.\n\nThe logarithm of an SPD matrix can be computed in practice by applying the logarithm individually to each of the eigenvalues of $K$ [18]. Thus, the final output vector for O2P can be written as\n\n$$y = \\mathrm{vec}(\\log(K)) = \\mathrm{vec}\\left(\\sum_{k<M} \\log(\\lambda_k) e_k e_k^t\\right), \\tag{6}$$\n\nwhere $e_k$ are the eigenvectors of $K$, and $\\lambda_k$ the corresponding eigenvalues. The $\\mathrm{vec}(\\cdot)$ operator vectorizes $\\log(K)$.\n\nIn Eq. (6), apparently, there are no similarities to a set of templates. The absence of templates makes O2P look quite different from template-based methods. Recently, O2P achieved state-of-the-art results in semantic segmentation [1, 10]. Both reasons motivate us to further analyze O2P in relation to template-based methods.\n\n3 Self-Adaptability of the Templates\n\nIn this section, we introduce a formulation that relates O2P and template-based methods. The new formulation is based on comparing two final representation vectors, rather than defining how the final vector $y$ is built. 
We denote $\\langle y^r, y^s \\rangle$ as the inner product between $y^r$ and $y^s$, which are the final representation vectors from two sets of input feature vectors, $\\{x^r_i\\}_N$ and $\\{x^s_i\\}_N$, respectively, where we use the superscripts $r$ and $s$ to indicate the respective representation for each set. It will become clear during this section why we analyze $\\langle y^r, y^s \\rangle$ instead of $y$.\n\nWe divide the analysis into three subsections. In Subsection 3.1, we re-write the formulation of the template-based methods of Section 2 with the inner product $\\langle y^r, y^s \\rangle$. In Subsection 3.2, we do the same for O2P, and this unveils that O2P is also based on evaluating similarities to templates. In Subsection 3.3, we analyze the characteristics of the templates in O2P, which have the particularity that they are self-adapted to the input.\n\n3.1 Re-Formulation of Template-Based Methods\n\nWe re-write a generic formulation for the template-based methods described in Section 2 with the inner product between two final output vectors. The algorithms of Section 2 can be expressed as\n\n$$\\langle y^r, y^s \\rangle = \\sum_{k<P} \\sum_{q<P} g_k(\\gamma^r) g_q(\\gamma^s) S(b^r_k, b^s_q), \\tag{7}$$\n\n$$\\text{where } \\gamma_{ki} = \\Gamma(x_i, b_k),$$\n\nand $S(u, v)$ is a similarity function between the templates that depends on each algorithm. Recall that $g_k(\\gamma)$ is a function that includes the non-linearities and the pooling of the similarities between the input feature vectors and the templates. To see how Eq. (7) arises naturally from the algorithms of Section 2, we now analyze them in terms of this formulation.\n\nAssignment-Based Methods The inner product between two final output vectors can be written as\n\n$$\\langle y^r, y^s \\rangle = (g_1(\\gamma^r), \\ldots, g_P(\\gamma^r))^t (g_1(\\gamma^s), \\ldots, g_P(\\gamma^s)) = \\sum_{k<P} g_k(\\gamma^r) g_k(\\gamma^s) = \\sum_{k<P} \\sum_{q<P} g_k(\\gamma^r) g_q(\\gamma^s) I(b^r_k = b^s_q), \\tag{8}$$\n\nwhere the last step introduces an outer summation, and the indicator function $I(\\cdot)$ eliminates the unnecessary cross terms. Comparing this last equation to Eq. (7), we can identify that $S(b^r_k, b^s_q)$ is the indicator function (it returns 1 when $b^r_k = b^s_q$, and 0 otherwise).\n\nFisher Vectors The inner product between two final Fisher Vectors is\n\n$$\\langle y^r, y^s \\rangle = (g_1(\\gamma^r) b^{rF}_1, \\ldots, g_P(\\gamma^r) b^{rF}_P)^t (g_1(\\gamma^s) b^{sF}_1, \\ldots, g_P(\\gamma^s) b^{sF}_P) = \\sum_{k<P} \\sum_{q<P} g_k(\\gamma^r) g_q(\\gamma^s) I(b^r_k = b^s_q) \\langle b^{rF}_k, b^{sF}_q \\rangle. \\tag{9}$$\n\nThe indicator function appears for the same reason as in Assignment-Based Methods. The final templates for each set of input vectors, $b^{rF}_k$ and $b^{sF}_k$, respectively, are compared with each other with the similarity $(b^{rF}_k)^t b^{sF}_q$. Thus, $S(b^{rF}_k, b^{sF}_q)$ in Eq. (7) is equal to $I(b^r_k = b^s_q) (b^{rF}_k)^t b^{sF}_q$.\n\n3.2 O2P as Coding-Pooling based on Pattern Similarities\n\nWe now re-formulate O2P, in the same way as we did for template-based methods in the previous subsection. This will allow relating O2P to template-based methods, and show that O2P also uses similarities to templates.\n\nWe re-write the definition of O2P in Eq. (6) with $\\langle y^r, y^s \\rangle$. Using the property $\\mathrm{vec}(A)^t \\mathrm{vec}(B) = \\mathrm{tr}(A^t B)$, where $\\mathrm{tr}(\\cdot)$ is the trace function of a matrix, $\\langle y^r, y^s \\rangle$ becomes (in the supplementary material we do the full derivation)\n\n$$\\langle y^r, y^s \\rangle = \\langle \\mathrm{vec}(\\log(K^r)), \\mathrm{vec}(\\log(K^s)) \\rangle = \\sum_{k<M} \\sum_{q<M} \\log(\\lambda^r_k) \\log(\\lambda^s_q) \\langle e^r_k, e^s_q \\rangle^2, \\tag{10}$$\n\nwhere $e_k e_k^t$ is a square matrix, and the eigenvectors, $\\{e^r_k\\}_M$ and $\\{e^s_k\\}_M$, are compared all against each other with $\\langle e^r_k, e^s_q \\rangle^2$. Going back to the generic formulation of template-based methods in Eq. (7), we can see that the similarity function between the templates, $S(e^r_k, e^s_q)$, can be identified in O2P as $\\langle e^r_k, e^s_q \\rangle^2$. Also, note that in O2P the sums go over $M$, which is the number of eigenvectors, and in Eq. (7) they go over $P$, which is the number of templates. Finally, $g_k(\\gamma)$ in Eq. (7) corresponds to $\\log(\\lambda_k)$ in O2P.\n\nAt this point, we have expressed O2P in a similar way as template-based methods. Yet, we still have to find the similarity between the input feature vectors and the templates. For that purpose, we use the definition of eigenvalues and eigenvectors, i.e. $\\lambda_k e_k = K e_k$, and also that $\\mathrm{tr}(e_k e_k^t) = 1$ (the eigenvectors are orthonormal). Then, we can derive the following equivalence: $\\lambda_k = \\lambda_k \\mathrm{tr}(e_k e_k^t) = \\mathrm{tr}(K e_k e_k^t)$. Replacing $K$ by $\\frac{1}{N} \\sum_i x_i x_i^t$, we find that the eigenvalues, $\\lambda_k$, can be written using the similarity between the input vectors, $x_i$, and the eigenvectors, $e_k$:\n\n$$\\lambda_k = \\frac{1}{N} \\sum_i \\mathrm{tr}((x_i x_i^t)(e_k e_k^t)) = \\frac{1}{N} \\sum_i \\langle x_i, e_k \\rangle^2. \\tag{11}$$\n\nFinally, we can integrate all the above derivations in Eq. (10), and we obtain that\n\n$$\\langle y^r, y^s \\rangle = \\sum_{k<M} \\sum_{q<M} g_k(\\gamma^r) g_q(\\gamma^s) \\langle e^r_k, e^s_q \\rangle^2, \\tag{12}$$\n\n$$\\text{where } g_k(\\gamma) = \\log(\\lambda_k) = \\log\\left(\\frac{1}{N} \\sum_{i<N} \\gamma_{ki}\\right), \\tag{13}$$\n\n$$\\text{and } \\gamma_{ki} = \\Gamma(x_i, e_k) = \\langle x_i, e_k \\rangle^2. \\tag{14}$$\n\nWe can see by analyzing Eq. (12) that this equation takes the same form as the general equation of template-based methods in Eq. (7). Note that the eigenvectors take the same role as the set of templates, i.e. $b_k = e_k$ and $P = M$. Also, observe that $S(b^r_k, b^s_q)$ is the square of the inner product between eigenvectors, $\\Gamma(x_i, b_k)$ is the square of the inner product between the input vectors and the eigenvectors, and the pooling operation is the logarithm of the average of the similarities. In Table 1 we summarize the corresponding elements of all the described methods.\n\n| Method | $S(b^r_k, b^s_q)$ | $\\gamma_{ki} = \\Gamma(x_i, b_k)$ | templates | $g_k(\\gamma)$ |\n|---|---|---|---|---|\n| Assignment-based | $I(b^r_k = b^s_q)$ | $\\langle x_i, b_k \\rangle$ | fixed | $\\frac{1}{N} \\sum_i \\gamma'_{ki}$ |\n| Fisher Vectors | $I(b^r_k = b^s_q) \\langle b^{rF}_k, b^{sF}_q \\rangle$ | Eq. (3) | fixed/adapted | $\\|(\\Phi^{(1)}_k, \\Phi^{(2)}_k)\\|_2$ |\n| O2P | $\\langle b^r_k, b^s_q \\rangle^2$ | $\\langle x_i, b_k \\rangle^2$ | self-adapted | $\\log\\left(\\frac{1}{N} \\sum_i \\gamma_{ki}\\right)$ |\n\nTable 1: Summary Table of the elements of our formulation for Assignment-based methods, Fisher Vectors and O2P.\n\n3.3 Self-Adaptative Templates\n\nWe define self-adaptative templates as templates that only depend on the input set of feature vectors, and are not fixed to predefined values. This is the case in O2P, because the templates in O2P correspond to the eigenvectors computed from the set of input feature vectors. The templates in O2P are not fixed to values learned during the training phase. Interestingly, the final templates in Fisher Vectors, $b^F_k$, are also partially self-adapted to the input vectors. Note that $b^F_k$ are obtained by modifying the fixed learned templates, $b_k$, with the input feature vectors.\n\nFinally, note that in O2P the number of templates is equal to the dimensionality of the input feature vectors. Thus, in O2P the number of templates cannot be increased without changing the input vectors\u2019 length, $M$. This raises the following question: do $M$ templates allow for sufficient generalization for object recognition for any set of input vectors? We analyze this question in the next section.\n\n4 Application: Quantization for O2P\n\nWe observe in the experiments section that the performance of O2P degrades when the number of vectors in the set of input features increases. It is reasonable that $M$ templates are not sufficient when the number of different vectors in $\\{x_i\\}_N$ increases, especially when they are very different from each other. We now introduce an algorithm to increase the robustness of O2P to the variability of the input vectors.\n\nWe quantize the input feature vectors, $\\{x_i\\}$, before computing O2P. Quantization may discard details, and hence, reduce the variability among vectors. In the experiments section it is reported that this prevents the degradation of performance in object recognition when the number of input feature vectors increases. The quantization algorithm that we use is sparse quantization (SQ) [15, 19], because SQ does not change the dimensionality of the feature vector. Also, SQ is fast to compute, and does not increase the computational cost of O2P.\n\nSparse Quantization for O2P For the quantization of $\\{x_i\\}$ we use SQ, which is a quantization to the set of $k$-sparse vectors. Let $\\mathbb{R}^q_k$ be the set of $k$-sparse vectors, i.e. $\\{s \\in \\mathbb{R}^q : \\|s\\|_0 \\leq k\\}$. Also, we define $\\mathbb{B}^q_k = \\{0, 1\\}^q_k = \\{s \\in \\{0, 1\\}^q : \\|s\\|_0 = k\\}$, which is the set of binary vectors with $k$ elements set to one and $(q - k)$ set to zero. The cardinality $|\\mathbb{B}^q_k|$ is equal to $\\binom{q}{k}$. The quantization of a vector $v \\in \\mathbb{R}^q$ into a codebook $\\{c_i\\}$ is a mapping of $v$ to the closest element in $\\{c_i\\}$, i.e. $\\hat{v}^\\star = \\arg\\min_{\\hat{v} \\in \\{c_i\\}} \\|\\hat{v} - v\\|^2$, where $\\hat{v}^\\star$ is the quantized vector $v$. In the case of SQ, the codebook $\\{c_i\\}$ contains the set of $k$-sparse vectors. These may be any of the previously introduced types: $\\mathbb{R}^q_k$, $\\mathbb{B}^q_k$. An important advantage of SQ over a general quantization is that it can be computed much more efficiently. The naive way to compute a general quantization is to evaluate the nearest neighbor of $v$ in $\\{c_i\\}$, which may be costly to compute for large codebooks and high-dimensional $v$. In contrast, SQ can be computed by selecting the $k$ highest values of the set $\\{v_i\\}$, i.e. for SQ into $\\mathbb{R}^q_k$, $\\hat{v}_i = v_i$ if $i$ is one of the $k$-highest entries of vector $v$, and 0 otherwise. For SQ into $\\mathbb{B}^q_k$, the dimensions indexed by the $k$-highest entries are set to 1 instead of $v_i$, and to 0 otherwise. (We refer the reader to [15, 19] for a more detailed explanation of SQ.)\n\nIn Algorithm 1 we depict the implementation of SQ in O2P, which highlights its simplicity. The computational cost of SQ is negligible compared to the cost of computing O2P. We use the set of $k$-sparse vectors in $\\mathbb{R}^M_k$ for SQ, which worked best in practice, as shown in the following.\n\nAlgorithm 1: Sparse Quantization in O2P\nInput: $\\{x_i\\}_N$, $k$\nOutput: $y$\nforeach $i \\in \\{1, \\ldots, N\\}$ do\n    $\\hat{x}_i \\leftarrow$ copy of $x_i$ that keeps its $k$ highest entries and sets the rest to 0\nend\n$K = \\frac{1}{N} \\sum_i \\hat{x}_i \\hat{x}_i^t$\n$y = \\mathrm{vec}(\\log(K))$\n\n5 Experiments\n\nIn this section, we analyze O2P in image classification from dense sampled SIFT descriptors. 
This setup is common in image classification, and it allows direct comparison to previous works on O2P. We report results on the Caltech101 [11] and VOC07 [12] datasets, using the standard evaluation benchmarks, which are the mean average precision accuracy across all classes.\n\n5.1 Implementation Details\n\nWe use the standard pipeline for image classification. We never use flipped or blurred images to extend the training set.\n\nPipeline. For Caltech101, the image is resized to a maximum height and width of 300 pixels, which is the standard resizing protocol for this dataset. For VOC07 the size of the images remains the same as the original. We extract SIFT [4] from patches on a regular grid, at different scales. In Caltech101, we extract them every 8 pixels and at the scales of 16, 32 and 48 pixels diameter. In VOC07, SIFT is sampled every 4 pixels and at the scales of 12, 24 and 36 pixels diameter. O2P is computed using the SIFT descriptors as input, and using spatial pyramids. In Caltech101, we generate the pooling regions by dividing the image into $4 \\times 4$, $2 \\times 2$ and $1 \\times 1$ regions, and in VOC07 into $3 \\times 1$, $2 \\times 2$ and $1 \\times 1$ regions. To generate the final descriptor for the whole image, we concatenate the descriptors for each pooled region. We apply the power normalization to the final feature dimensions, $\\mathrm{sign}(x)|x|^{3/4}$, which was shown to work well in practice [1]. Finally, we use a linear one-versus-rest SVM classifier for each class with the parameter $C$ of the SVM set to 1000. We use the LIBLINEAR library for the SVM [20].\n\nOther Feature Codings. As a sanity check of our results, we replace O2P with the Bag-of-Words [13] baseline, without changing any of the parameters. In Caltech101, we replace the average pooling of Bag-of-Words by max-pooling (without normalization) as it performs better. 
The codebook is learned by randomly picking a set of patches as codebook entries, which was shown to work well for the encodings we are evaluating [14]. We use a codebook of 8192 entries, since with more entries the performance does not increase significantly, but the computational cost does.\n\n5.2 Results on Caltech101\n\nWe use 3 random splits of 30 images per class for training and the rest for testing. In Fig. 1a, results are shown for different spatial pyramid configurations, as well as different levels of quantization. Note that SQ with $k = 128$ is not introducing any quantization, as SIFT features are 128-dimensional vectors. Note that using SQ increases the performance by more than 5% compared to not using SQ ($k = 128$), when using only the first level of the pyramid. For the other levels of the pyramid, there is less improvement with SQ. This is in accordance with the observation that in smaller regions there are fewer SIFT vectors, the variability is smaller, and the limited number of templates is able to better capture the meaningful information than in bigger regions. We can also see that for small $k$ of SQ, the performance degrades due to the introduction of too much quantization.\n\nWe also run experiments with Bag-of-Words with max-pooling (74.8%), and O2P without SQ (76.52%), and both of them are surpassed by O2P with SQ (78.63%). In [1], O2P accuracy is reported to be 79.2% with the SIFT descriptor (we do not compare to their version of enriched SIFT, since all our experiments are with normal SIFT). We inspected the code of [1], and we found that the difference in accuracy mainly comes from using a more drastic resizing of the image, to a maximum of 100 pixels in width and height (usually in the literature it is 300 pixels). Note that resizing is another way of discarding information, and hence, O2P may benefit from that. 
We confirm this by resizing the image back to 300 pixels in [1]\u2019s code, and the accuracy is 77.1%, similar to the one that we report without SQ in our code. The accuracy is not exactly the same due to differences in the SIFT parameters in [1]. Also, we tested SQ in [1]\u2019s code with the resizing to a maximum of 100 pixels, and the accuracy increased to 79.45%, which is higher than reported in [1], and close to state-of-the-art results using SIFT descriptors (80.3%) [21].\n\n5.3 Results on VOC07\n\nIn Fig. 1b, we run the same experiment as in Caltech101. Note that the impact of SQ is even more evident than in Caltech101. In Table 2 we report the per-class accuracy, in addition to the mean average precision reported in Fig. 1b. We follow the evaluation procedure as described in [12].\n\nWith the full pyramid, when we use SQ the accuracy increases from 18.81% to 50.97%. In contrast to Caltech101, the performance of O2P with SQ is similar to our implementation of Bag-of-Words (51.14%). Thus, under adverse conditions for O2P, i.e. images with high variability such as in VOC07 and with a high number of input vectors, we can use SQ and obtain huge improvements in the accuracy of O2P. The best reported results [22] in VOC07 are around 10% better than O2P with SQ, yet we obtain more than 30% improvement from the baseline.\n\nFigure 1: Results for different numbers of non-zero entries of SQ. Note that SQ at $k = 128$ is not introducing any quantization, since SIFT features are 128-dimensional vectors. (a) Caltech 101 (using 30 images per class for training), (b) VOC07.\n\n| Method | Aeroplane | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Dining Table | Dog | Horse | Motorbike | Person | Potted Plant | Sheep | Sofa | Train | TV/Monitor | Average |\n|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n| 3 Pyr. O2P + SQ | 72 | 53 | 45 | 63 | 23 | 51 | 69 | 52 | 50 | 35 | 44 | 41 | 74 | 56 | 78 | 19 | 35 | 50 | 67 | 45 | 50.97 |\n| 3 Pyr. O2P w/o SQ | 34 | 9 | 12 | 18 | 6 | 19 | 40 | 14 | 26 | 14 | 9 | 21 | 28 | 17 | 55 | 7 | 7 | 10 | 16 | 12 | 18.81 |\n| 2 Pyr. O2P + SQ | 71 | 50 | 41 | 62 | 20 | 50 | 68 | 47 | 47 | 33 | 41 | 37 | 69 | 56 | 74 | 18 | 36 | 51 | 66 | 44 | 49.09 |\n| 1 Pyr. O2P + SQ | 66 | 41 | 32 | 58 | 15 | 37 | 58 | 38 | 40 | 27 | 28 | 30 | 61 | 43 | 66 | 20 | 33 | 37 | 56 | 36 | 41.20 |\n| 1 Pyr. O2P w/o SQ | 21 | 7 | 11 | 9 | 6 | 8 | 29 | 10 | 22 | 4 | 7 | 12 | 12 | 8 | 49 | 6 | 5 | 7 | 9 | 9 | 12.53 |\n\nTable 2: PASCAL VOC 2007 classification results. The average score provides the per-class average. We report results for O2P, with and without SQ, with the first plus second plus third levels of pyramids (3 Pyr.), O2P with SQ with the first plus second levels of pyramids (2 Pyr.), and O2P with and without SQ only with the first level of pyramids (1 Pyr.).\n\n6 Conclusions\n\nWe found that O2P can be posed as a coding-pooling scheme based on evaluating similarities to templates. The templates of O2P self-adapt to the input, while the rest of the analyzed methods do not. In practice, our formulation was used to improve the performance of O2P in image classification. We are currently analyzing self-adaptative templates in deep hierarchical networks.\n\nAcknowledgments: We thank the ERC for support from AdG VarCity.\n\nReferences\n\n[1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, \u201cSemantic segmentation with second-order pooling,\u201d in ECCV, 2012.\n\n[2] K. Fukushima, \u201cNeocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,\u201d Biological Cybernetics, 1980.\n\n[3] M. Riesenhuber and T. 
Poggio, \u201cHierarchical models of object recognition in cortex,\u201d Nature Neuroscience, 1999.\n\n[4] D. G. Lowe, \u201cDistinctive image features from scale-invariant keypoints,\u201d IJCV, 2004.\n\n[5] J. Bergstra, D. Yamins, and D. Cox, \u201cMaking a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,\u201d in ICML, 2013.\n\n[6] D. Le Bihan, J.-F. Mangin, C. Poupon, C. A. Clark, S. Pappata, N. Molko, and H. Chabriat, \u201cDiffusion tensor imaging: concepts and applications,\u201d Journal of Magnetic Resonance Imaging, 2001.\n\n[7] J. Weickert and H. Hagen, Visualization and Processing of Tensor Fields. Springer, 2006.\n\n[8] O. Tuzel, F. Porikli, and P. Meer, \u201cRegion covariance: A fast descriptor for detection and classification,\u201d in ECCV, 2006.\n\n[9] P. Li and Q. Wang, \u201cLocal log-euclidean covariance matrix (L2ECM) for image representation and its applications,\u201d in ECCV, 2012.\n\n[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, \u201cRich feature hierarchies for accurate object detection and semantic segmentation,\u201d in CVPR, 2014.\n\n[11] L. Fei-Fei, R. Fergus, and P. Perona, \u201cOne-shot learning of object categories,\u201d TPAMI, 2006.\n\n[12] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, \u201cThe PASCAL visual object classes (VOC) challenge,\u201d IJCV, 2010.\n\n[13] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, \u201cVisual categorization with bags of keypoints,\u201d in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.\n\n[14] A. Coates and A. 
Ng, \u201cThe importance of encoding versus training with sparse coding and vector quantization,\u201d in ICML, 2011.\n\n[15] X. Boix, G. Roig, and L. Van Gool, \u201cNested sparse quantization for efficient feature coding,\u201d in ECCV, 2012.\n\n[16] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek, \u201cImage classification with the fisher vector: Theory and practice,\u201d IJCV, 2013.\n\n[17] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, \u201cGeometric means in a novel vector space structure on symmetric positive-definite matrices,\u201d Journal on Matrix Analysis and Applications, 2007.\n\n[18] R. Bhatia, Positive Definite Matrices. Princeton University Press, 2009.\n\n[19] X. Boix, M. Gygli, G. Roig, and L. Van Gool, \u201cSparse quantization for patch description,\u201d in CVPR, 2013.\n\n[20] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin, \u201cLIBLINEAR: A library for large linear classification,\u201d JMLR, 2008.\n\n[21] O. Duchenne, A. Joulin, and J. Ponce, \u201cA graph-matching kernel for object categorization,\u201d in ICCV, 2011.\n\n[22] X. Zhou, K. Yu, T. Zhang, and T. S. Huang, \u201cImage classification using super-vector coding of local image descriptors,\u201d in ECCV, 2010.\n", "award": [], "sourceid": 561, "authors": [{"given_name": "Xavier", "family_name": "Boix", "institution": "ETH Zurich"}, {"given_name": "Gemma", "family_name": "Roig", "institution": "ETH Zurich"}, {"given_name": "Salomon", "family_name": "Diether", "institution": "ETHZ"}, {"given_name": "Luc", "family_name": "Gool", "institution": "Computer Vision Lab, ETH Zurich"}]}