{"title": "Global Gated Mixture of Second-order Pooling for Improving Deep Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1277, "page_last": 1286, "abstract": "In most of existing deep convolutional neural networks (CNNs) for classification, global average (first-order) pooling (GAP) has become a standard module to summarize activations of the last convolution layer as final representation for prediction. Recent researches show integration of higher-order pooling (HOP) methods clearly improves performance of deep CNNs. However, both GAP and existing HOP methods assume unimodal distributions, which cannot fully capture statistics of convolutional activations, limiting representation ability of deep CNNs, especially for samples with complex contents. To overcome the above limitation, this paper proposes a global Gated Mixture of Second-order Pooling (GM-SOP) method to further improve representation ability of deep CNNs. To this end, we introduce a sparsity-constrained gating mechanism and propose a novel parametric SOP as component of mixture model. Given a bank of SOP candidates, our method can adaptively choose Top-K (K > 1) candidates for each input sample through the sparsity-constrained gating module, and performs weighted sum of outputs of K selected candidates as representation of the sample. The proposed GM-SOP can flexibly accommodate a large number of personalized SOP candidates in an efficient way, leading to richer representations. The deep networks with our GM-SOP can be end-to-end trained, having potential to characterize complex, multi-modal distributions. The proposed method is evaluated on two large scale image benchmarks (i.e., downsampled ImageNet-1K and Places365), and experimental results show our GM-SOP is superior to its counterparts and achieves very competitive performance. The source code will be available at http://www.peihuali.org/GM-SOP.", "full_text": "Global Gated Mixture of Second-order Pooling for\nImproving Deep Convolutional Neural Networks\n\nQilong Wang1,2,\u2217,\u2020, Zilin Gao2,\u2217, Jiangtao Xie2, Wangmeng Zuo3, Peihua Li2,\u2021\n\n1Tianjin University, 2Dalian University of Technology, 3 Harbin Institute of Technology\n\nqlwang@tju.edu.cn, gzl@mail.dlut.edu.cn, jiangtaoxie@mail.dlut.edu.cn\n\nwmzuo@hit.edu.cn, peihuali@dlut.edu.cn\n\nAbstract\n\nIn most of existing deep convolutional neural networks (CNNs) for classi\ufb01cation,\nglobal average (\ufb01rst-order) pooling (GAP) has become a standard module to sum-\nmarize activations of the last convolution layer as \ufb01nal representation for prediction.\nRecent researches show integration of higher-order pooling (HOP) methods clearly\nimproves performance of deep CNNs. However, both GAP and existing HOP\nmethods assume unimodal distributions, which cannot fully capture statistics of\nconvolutional activations, limiting representation ability of deep CNNs, especially\nfor samples with complex contents. To overcome the above limitation, this paper\nproposes a global Gated Mixture of Second-order Pooling (GM-SOP) method to\nfurther improve representation ability of deep CNNs. To this end, we introduce\na sparsity-constrained gating mechanism and propose a novel parametric SOP as\ncomponent of mixture model. Given a bank of SOP candidates, our method can\nadaptively choose Top-K(K > 1) candidates for each input sample through the\nsparsity-constrained gating module, and performs weighted sum of outputs of K\nselected candidates as representation of the sample. The proposed GM-SOP can\n\ufb02exibly accommodate a large number of personalized SOP candidates in an ef\ufb01cient\nway, leading to richer representations. The deep networks with our GM-SOP can be\nend-to-end trained, having potential to characterize complex, multi-modal distribu-\ntions. The proposed method is evaluated on two large scale image benchmarks (i.e.,\ndownsampled ImageNet-1K and Places365), and experimental results show our\nGM-SOP is superior to its counterparts and achieves very competitive performance.\nThe source code will be available at http://www.peihuali.org/GM-SOP.\n\n1\n\nIntroduction\n\nDeep convolutional neural networks (CNNs) have achieved great success in a variety of computer\nvision tasks, especially image classi\ufb01cation [25]. During the past years, deep CNN architectures have\nbeen widely studied and achieved remarkable progress [34, 36, 13, 17]. As one standard module in\ndeep CNN architectures [36, 13, 17, 5, 16], global average pooling (GAP) summarizes activations of\nthe last convolution layer for \ufb01nal prediction. However, GAP only collects \ufb01rst-order statistics while\nneglecting richer higher-order ones, suffering from limited representation ability [7]. Recently, some\nresearchers propose to integrate trainable higher-order pooling (HOP) methods (e.g., second-order\nand third-order pooling) into deep CNNs [19, 29, 39, 27, 8], which distinctly improve representation\nability of deep CNNs. However, both GAP and existing HOP methods adopt the unimodal distribution\n\n\u2217The \ufb01rst two authors contribute equally to this work.\n\u2020This work was mainly done when he was with Dalian University of Technology.\n\u2021Peihua Li is the corresponding author.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Overview of deep CNNs with the proposed global Gated Mixture of Second-order Pooling\n(GM-SOP). The sparsity-constrained gating module adaptively selects Top-K parametric SR-SOP\n(indicated by solid rectangles) from a bank of N candidate component models given a sample X, and\nthe \ufb01nal representation is generated by weighted sum of outputs of K selected CMs. For brevity here\nwe take N = 4, K = 2 as an example.\n\nassumption to collect statistics of convolutional activations. As illustrated in Figure 1, input images\noften contain multiple objects or parts, leading their distributions of convolutional activations usually\nare very complex (e.g., mixture of multiple unimodal models). As such, unimodal distributions\ncannot fully capture statistics of convolutional activations, which will limit performance of deep\nCNNs.\nOne natural idea to overcome the above limitation is ensemble of multiple models for summarizing\nconvolutional activations. However, direct ensemble of all component models (CMs) in mixture\nmodel will suffer from very high computational cost as number of CMs gets large, while a small\nnumber of CMs may be insuf\ufb01cient for characterizing complex distributions. Moreover, simple direct\nensemble will make all CMs tend to learn similar characteristics, since they receive identical training\nsamples. These factors heavily limit the representation ability of mixture model (refer to the results\nin Section 3.2). Inspired by recent work [33], we propose an idea of gated mixture model to solve the\nabove issues. Our gated mixture model is composed of a sparsity-constrained gating module and a\nbank of candidate CMs. Given an input sample, the sparsity-constrained gating module adaptively\nselects Top-K CMs from N (N (cid:29) K) candidates according to assigned weights, and then weighted\nsum of outputs of K selected CMs is used to generate \ufb01nal representation. In this way, our gated\nmixture model can accommodate ef\ufb01ciently a large number of CMs because only K ones are trained\nand used to generate representation given a sample. Furthermore, different CMs will receive different\ntraining samples so that they can capture personalized characteristics of convolutional activations\nduring training. As suggested in [33], we employ an extra balance loss to eliminate self-reinforcing\nphenomenon, guaranteeing as many candidate CMs as possible be adequately trained.\nThe CM plays a key role in gated mixture model. Compared with \ufb01rst-order GAP, HOP can\ncapture more statistical information, achieving remarkable improvement in either shallow models\n[4, 23] or deep architectures [19, 29, 8, 27]. As shown in [26], the comparisons on both large-\nscale image classi\ufb01cation [9] and \ufb01ne-grained visual recognition demonstrate matrix square-root\nnormalized second-order pooling (SR-SOP) outperforms other HOP methods and achieves promising\nperformance in deep architectures. In view of effectiveness of SR-SOP, it seems to be a good\nchoice for CM. However, there exist two problems lying in usage of SR-SOP. Firstly, SR-SOP is\na parameter-free model, which cannot be individually learned. Meanwhile, SR-SOP assumes data\ndistribution obeys a Gaussian, which may not always hold true. To address these problems, this paper\nproposes a parametric SR-SOP method4, which enables candidate CMs to be individually trained with\nnegligible additional cost. Besides, underlying the parametric SR-SOP is estimation of covariance\nin the generalized Gaussian setting which has better modeling capability than SR-SOP. Based on\nthe parametric SR-SOP, we propose a global Gated Mixture of Second-order Pooling (GM-SOP),\n\n4We reasonably introduce a group of trainable parameters into SR-SOP, and use the recently proposed fast\n\niterative algorithm [26] to speed up matrix square-root normalization on GPU.\n\n2\n\n()Hxsoftmax. . .\u2211Classification LossFinal Representation. . .Sparsity-constrained Gating ModuleGated Mixture of Second-order PoolingSR-SOPBalance LossxImageConvolutionLayersPrediction Layers4th-CM\u2299Top-K...Weights11\u00d7conv.Bank of Parametric Component Models (CMs)3rd-CMSR-SOP11\u00d7conv.2nd-CMSR-SOP11\u00d7conv.1st-CMSR-SOP11\u00d7conv. :Dot Product :Sum\u2299\u2211\fwhich is illustrated in Figure 1. Our GM-SOP can effectively exploit a large bank of personalized\nSOP models to generate more discriminative representations.\nWe evaluate the proposed GM-SOP method on two large scale image benchmarks, i.e., downsampled\nImageNet-1K [9] and Places365 [44] that are introduced by [6] and this paper, respectively. As\ndescribed in [6], downsampled ImageNet is a promising alternative to CIFAR10/100 datasets, as it\nis large-scale and more challenging, which postpones the saturation risk on CIFAR (as observed in\nhttp://karpathy.github.io/2011/04/27/manually-classifying-cifar10/). Compared\nwith standard-size ImageNet, downsampled ImageNet allows much faster experiments and lower com-\nputation requirement while maintaining similar characteristics with respect to analysis of networks[6].\nThe contributions of our paper are three-fold. (1) We, for the \ufb01rst time, introduce a global gated\nmixture of pooling model into prevalent deep CNN architectures. This goes beyond the existing global\naverage/covariance (second-order) pooling, possessing the potential to capture complex, multi-modal\ndistributions of convolutional activations. (2) We propose parametric second-order models as essential\ncomponents of mixture model. These components can be trained individually for modeling richer\nfeature characteristics than simple second-order pooling. (3) We perform extensive experiments on\ntwo large-scale benchmarks for evaluating and validating the proposed methods, which have proven\nto achieve much better results than the counterparts.\n\n2 Gated Mixture of Second-order Pooling (GM-SOP)\n\nIn this section, we introduce the proposed Gated Mixture of Second-order Pooling (GM-SOP) method.\nWe \ufb01rst describe a general idea of gated mixture model, and then propose a parametric matrix square-\nroot normalized second-order pooling (SR-SOP) method as our component model. Finally, the\nGM-SOP is integrated into deep CNNs in an end-to-end learning manner.\n\n2.1 Gated Mixture Model\n\nMixture Model The mixture model (e.g., \ufb01nite mixture distributions [12, 30] or mixture of experts\n[21, 22]) is widely used to characterize complex data distribution or improve discrimination ability\nof supervised system through ensemble of multiple component models (CMs). In general, mixture\nmodel can be formulated as the weighted sum of all CMs, i.e.,\n\ny =\n\n\u03c9i(X)Mi(X), s.t.,\n\n\u03c9i(X) = 1,\n\n(1)\n\ni=1\n\ni=1\n\nwhere N is the number of CMs. \u03c9i(X) and Mi(X) indicate weight and output of i-th CM given\ninput X, respectively. The mixture model in Eq. (1) consists of a weight (probability) function and a\nset of parametric CMs. Given speci\ufb01c forms of weight function and parametric CMs, mixture model\nin Eq. (1) can be learned by using gradient learning algorithm [21] or Expectation-Maximization\n(EM) algorithm [22, 30].\n\nSparsity-constrained Gating Module Given an input sample, contribution of each CM in mixture\nmodel is decided by the corresponding weight through either computation of posterior probability in\nmixture distributions [12, 30] or a gating network in mixture of experts [21, 22]. However, they both\nacquiesce to allow every input sample to participate in training of all CMs. It will suffer from high\ncomputational cost when number of CMs is large. Meanwhile, CMs with small weights may bring\nnoise into \ufb01nal representation [41]. Inspired by [33], we exploit a sparsity-constrained gating module\nas the weight function to overcome the above issues, where weights are learned by explicitly imposing\na sparse constraint. As illustrated in Figure 1, we \ufb01rst pass X throughout a group of prediction layers\nwith parameters \u03b8g, i.e., f (\u03b8g; X). Then, weights are outputted by using a fully-connected layer with\nadditional noise perturbations, i.e.,\n\nHi(f (\u03b8g; X)) = Wg\n\ni f (\u03b8g; X) + \u03b3 \u00b7 log(1 + exp(Wn\n\ni f (\u03b8g; X))).\n\n(2)\n\ni and Wn\n\nHere, Wg\ni are i-th row of parameters of fully-connected layer and additional noise, respec-\ntively. \u03b3 is a random variable sampled from a normal distribution. To make the learned weights\nH(f (\u03b8g; X)) be sparse, only the K largest weights are kept and remaining ones are set to be negative\nin\ufb01nity, denoted as Top-K(Hi(f (\u03b8g; X))). Finally, a softmax function is used to normalize the\n\n3\n\nN(cid:88)\n\nN(cid:88)\n\n\fweights. To sum up, the weight function can be written as\n\n(cid:80)N\n\n\u03c9i(X) =\n\nexp(Top-K(Hi(f (\u03b8g; X))))\ni=1 exp(Top-K(Hi(f (\u03b8g; X))))\n\n.\n\n(3)\n\nBalance Loss The sparsity-constrained gating module makes each sample participate in training\nof K CMs. However, as shown in [33] and the results of Section 3.2, such gating module has a\nself-reinforcing phenomenon (i.e., only the same few CMs receive almost all samples while remaining\nones have rarely been trained), decreasing representation ability of mixture model. As suggested in\n[33], we introduce an extra balance loss which is a function of weights de\ufb01ned as follows:\n\n(cid:18) std((cid:80)S\n\u00b5((cid:80)S\n\ns=1 \u03c9(Xs))\ns=1 \u03c9(Xs))\n\n(cid:19)2\n\nLB = \u03b1\n\n,\n\n(4)\n\nwhere Xs is s-th training sample in a mini-batch of S samples and \u03c9(Xj) = [\u03c91(Xj), . . . , \u03c9N (Xj)]\nis the weight function in Eq. (3); std(v) and \u00b5(v) denote standard deviation and mean of vector\nv, respectively; \u03b1 is a tunable parameter. The loss LB is to constrain that all CMs are adequately\ntrained.\n\n2.2 Component Model of GM-SOP\n\nBesides the weight function, component model (CM) plays an indispensable role in gated mixture\nmodel. Motivated by success of matrix square-root normalized second-order pooling (SR-SOP) in\ndeep CNN architectures [39, 27], we propose a parametric SR-SOP as CM of our GM-SOP.\nParametric SR-SOP Given an input X \u2208 RL\u00d7d containing L features of d-dimension, the SR-SOP\nof X is computed as\n\nZ = (XT \u02c6JX)\n\n1\n2 = \u03a3\n\n1\n2 = U\u039b\n\n1\n\n2 UT , \u02c6J =\n\n1\nL\n\n(I \u2212 1\nL\n\n11T ),\n\n(5)\n\nwhere \u03a3 = U\u039bUT is eigenvalue decomposition (EIG) of \u03a3. I and 1 are identity matrix and\nL-dimension vector with all elements being one, respectively. \u03a3 is sample covariance (second-order\nstatistics) of X estimated by the classical maximum likelihood estimation (MLE). Since X is a\nset of convolutional activations in deep CNNs, dimension of X is usually very high (128 in our\ncase) while number of features is very small (\u223c 100). It is well known that the classical MLE\nis not robust in the above scenario [10]. As explained in [40], performing matrix square root on\ncovariance amounts to robust covariance estimation, very suitable for the scenario of high dimension\nand small sample. In addition, matrix square-root normalization can be regarded as a special case of\nPower-Euclidean metric [11] between covariance matrices, i.e., (cid:107)\u03a3\u03b2\nj (cid:107)2 with \u03b2 = 0.5, which is\nan approximation of Log-Euclidean metric [2] (hence making use of Riemannian geometry lying in\ncovariance matrices5) while overcoming some downsides of Log-Euclidean metric [11].\nAlthough SR-SOP in Eq. (5) bene\ufb01ts from some merits and achieves promising performance, it is a\nparameter-free model, which can not be trained as personalized CMs. Meanwhile, covariance \u03a3 in\nEq. (5) is calculated based on assumption that X is sampled from a Gaussian distribution, which may\nnot always hold true. To handle above two problems, we propose a parametric second-order pooling\n(SOP), i.e.,\n\ni \u2212 \u03a3\u03b2\n\n\u03a3(Qj) = XT QjX = (PjX)T (PjX).\n\n(6)\n\nDifferent from the original sample covariance \u03a3 with constant matrix \u02c6J, Qj in Eq. (6) is a learnable\nparameter, and Qj is a symmetric positive semi-de\ufb01nite matrix with Qj = PT\nj Pj. Note that our\nparametric SOP in Eq. (6) shares similar philosophy with estimating covariance by assuming features\nfollow a multivariate generalized Gaussian distribution with zero mean [32], i.e.,\n\u22121\n\n(cid:18)\n\n(cid:19)\n\n\u0393(d/2)\n\np(xl;(cid:98)\u03a3; \u03b4; \u03b5) =\n\nxT\nl )\u03b4\n\n,\n\n(7)\n\n\u03b4\n\n\u03b5d/2|(cid:98)\u03a3|1/2\n\nexp\n\n2\u03b5\u03b4 (xl(cid:98)\u03a3\n\n\u2212 1\n\n\u03c0d/2\u0393(d/2\u03b4)2d/2\u03b4\n\n5Covariance matrices are symmetric positive de\ufb01nite matrices, whose space forms a non-linear Riemannian\n\nmanifold [2].\n\n4\n\n\fspeci\ufb01cally, for the j-th iteration\n\nL(cid:88)\n\n1\nL\n\n(cid:98)\u03a3j =\nl = xl(cid:98)\u03a3j\u22121xT\n\nl=1\n\nqj\nl + (qj\n\nLd\n\nl )1\u2212\u03b4(cid:80)\nand (cid:98)Gj\n\n\u00b7 xT\n\nl xl =\n\n1\nL\n\nk(cid:54)=l(qj\n\nk)\u03b4\n\nL(cid:88)\n\nl=1\n\nl xl = XT(cid:98)GjX,\n\nfj(xl) \u00b7 xT\n\n(8)\n\nwhere \u03b5 and \u03b4 are parameters of scale and shape, respectively; (cid:98)\u03a3 is covariance matrix, and \u0393 is\nGiven \u03b4 and \u03b5, covariance matrix (cid:98)\u03a3 can be estimated using iterative reweighed methods [3, 43], and\n\na Gamma function. Compared with assumption of data distribution being a Gaussian in Eq. (5),\ngeneralized Gaussian distribution in Eq. (7) is more general and captures more complex characteristics.\n\nl\n\nwhere qj\nis a diagonal matrix with diagonal elements being\n{fj(x1)/L, . . . , fj(xL)/L}. It is worth mentioning at this point that our parametric SOP in Eq. (6)\nlearns a more informative, full matrix, instead of only the diagonal one in traditional iterative\nreweighted methods [3, 43].\nObviously our parametric SOP in Eq. (6) can be regarded as single step of iterative estimation. To\naccomplish multi-step iterative estimation, we can learn a sequence of parameters Qj, j = 1, . . . , J.\nAfter that, we perform matrix square-root normalization to obtain better performance. In practice\nwe adopt two-step estimation (i.e., J = 2) to balance ef\ufb01ciency and effectiveness. We mention that\nimplementation of each one of the two step estimation (i.e., PjX) in Eq. (6) can be conveniently\nimplemented using 1 \u00d7 1 convolution. As a result, our parametric SR-SOP can be transformed into\nlearning multiple sequential 1 \u00d7 1 convolution operations following by computation of SR-SOP.\n\nFast Iterative Algorithm The computation of matrix square root in our parametric SR-SOP de-\npends on EIG, which is limited supported on GPU, slowing down training speed of the whole network.\nTherefore, we employ the recently proposed iterative method [26] to speed up computing matrix\nsquare root. This method is based on Newton-Schulz iteration [14], which computes approximate\nmatrix square root through iterative matrix multiplications as\n\n1\n\n2 \u2248 A \u02dcJ : {A\u02dcj =\n\n\u03a3\n\n1\n2\n\nA\u02dcj\u22121(3I \u2212 B\u02dcj\u22121A\u02dcj\u22121); B\u02dcj =\n\n1\n2\n\n(3I \u2212 B\u02dcj\u22121A\u02dcj\u22121)B\u02dcj\u22121}\u02dcj:=1,\u00b7\u00b7\u00b7 , \u02dcJ ,\n\n(9)\n\nwhere A0 = \u03a3 and B0 = I. Clearly, Eq. (9) involves only matrix multiplications, more suitable\nfor GPU implementation, and its back-propagation algorithm can be derived based on matrix back-\npropagation method [20]. Readers can refer to [26] for more details.\n\n2.3 Deep CNN with GM-SOP\n\nThe overview of deep CNNs with our GM-SOP is illustrated in Figure 1. Notably, the proposed GM-\nSOP, rather than global average pooling or second-order pooling, is inserted after the last convolution\nlayer. In our GM-SOP, the outputs of the last convolution layer are simultaneously fed into sparsity-\nconstrained gating module and the bank of parametric CMs. In terms of the Top-K results, the\ngating module allocates individual training samples to different CMs, and for each sample the \ufb01nal\nrepresentation is a weighted sum of the outputs of K selected CMs. We add a batch normalization\n[18] layer and a dropout [35] layer with drop rate of 0.2 after \ufb01nal representation. Finally, we use a\nfully-connected layer and a softmax layer for classi\ufb01cation. The sparsity-constrained gating module\nis composed of prediction layers, Top-K and softmax operations, where the prediction layers share\nthe same architecture with CMs to keep pace with representation. The parametric SR-SOP contains a\nset of convolution operations and iterative matrix multiplications. Clearly back-propagation of all\ninvolved layers can be accomplished according to traditional chain rule and matrix back-propagation\nmethod [20], and thus the deep CNNs with GM-SOP can be trained in an end-to-end manner.\n\n3 Experiments\n\nTo evaluate the proposed method, we conduct experiments on two large-scale image benchmarks,\ni.e., downsampled ImageNet-1K [6] and Places365 [44]. We \ufb01rst describe implementation details\nof different competing methods, and then assess the effect of key parameters on our method using\ndownsampled ImageNet-1K. Finally, we report the comparison results on two benchmarks.\n\n5\n\n\fTable 1: Modi\ufb01ed ResNet-18 and ResNet-50 for downsampled ImageNet-1K and Places365.\n\nResNet-18\n\nResNet-50\n\nOutput Size\n\nconv1\n3 \u00d7 3, 16\n(stride=1)\n3 \u00d7 3, 16\n(stride=1)\n64 \u00d7 64\n96 \u00d7 96\n\nconv2_x\n\n(cid:20)3 \u00d7 3, 16\n(cid:21)\n(cid:20)3 \u00d7 3, 16\n(cid:21)\n\n3 \u00d7 3, 16\n\n3 \u00d7 3, 16\n64 \u00d7 64\n96 \u00d7 96\n\n\u00d7 2\n\n\u00d7 6\n\nconv3_x\n\n(cid:20)3 \u00d7 3, 32\n(cid:21)\n(cid:20)3 \u00d7 3, 32\n(cid:21)\n\n3 \u00d7 3, 32\n\n3 \u00d7 3, 32\n32 \u00d7 32\n48 \u00d7 48\n\n\u00d7 2\n\n\u00d7 6\n\nconv4_x\n\n(cid:20)3 \u00d7 3, 64\n(cid:21)\n(cid:20)3 \u00d7 3, 64\n(cid:21)\n\n3 \u00d7 3, 64\n\n3 \u00d7 3, 64\n16 \u00d7 16\n24 \u00d7 24\n\n\u00d7 2\n\n\u00d7 6\n\nconv5_x\n\n(cid:20)3 \u00d7 3, 128\n(cid:21)\n(cid:20)3 \u00d7 3, 128\n(cid:21)\n\n3 \u00d7 3, 128\n\n3 \u00d7 3, 128\n8 \u00d7 8\n12 \u00d7 12\n\n\u00d7 2\n\n\u00d7 6\n\nGAP\n\nGAP\n\nImageNet-1K\n\nPlaces365\n\n3.1\n\nImplementation Details\n\nIn this work, we implement several methods for comparison, and consider two basic CNN models\nincluding ResNet [13] of 18 and 50 layers. All competing methods are described as follows.\n\n(1) ResNet-18/ResNet-50 indicate original ResNets with \ufb01rst-order GAP.\n(2) ResNet-18-Xd/ResNet-50-Xd denote ResNets with a parametric GAP, which is achieved by\ninserting a 1 \u00d7 1 \u00d7 X convolution layer before GAP. Such method can be regarded as a special\ncase of gated mixture of \ufb01rst-order GAP with single CM.\n\n(3) Ave-GAP-K performs simple average of K parametric GAPs without gating module.\n(4) GM-GAP-N-K selects K parametric GAPs from N GAP candidates through sparsity-constrained\n(5) Parametric SR-SOP is achieved by adding two convolution layers of {1 \u00d7 1 \u00d7 128 \u00d7 256} and\n\ngating module, and performs weighted sum of K selected parametric GAPs.\n{1 \u00d7 1 \u00d7 256 \u00d7 128} before SR-SOP.\n\n(6) Ave-SOP-K performs simple average of K parametric SR-SOPs without gating module.\n(7) GM-SOP-N-K selects K parametric SR-SOP models from N candidates with sparsity-constrained\n\ngating module, and performs weighted sum of K selected candidates.\n\nIn our experiments, image sizes of downsampled ImageNet-1K and Places365 respectively are\n64 \u00d7 64 and 100 \u00d7 100, so we modify ResNet architectures in [13] to \ufb01t image sizes in our case. The\narchitectures of modi\ufb01ed ResNet-18 and ResNet-50 are given in Table 1. As suggested in [26], we\ncompute approximate matrix square root in Eq. (9) within \ufb01ve iterations to balance the effectiveness\nand ef\ufb01ciency. For training the whole network, we employ mini-batch stochastic gradient descent\nwith batchsize of 256 and momentum of 0.9. The parameter of weight decay is set to 5e-4. The\nprogram is implemented using MatConvNet toolkit [37], and runs on a PC equipped with an Intel\ni7-4790K@4.00GHz CPU, a single NVIDIA GeForce GTX1080 GPU and 64G RAM.\n\nFigure 2: Left: Numbers of receiving samples in each CM by setting \u03b1 = 0 and \u03b1 = 100. \u03b1 = 0\nindicates balance loss is not employed. Right: Results of GM-GAP-16-4 with various \u03b1.\n\n3.2 Ablation Studies on Downsampled ImageNet-1K\n\nOur gated mixture model has three key parameters, i.e., weight parameter \u03b1 of balance loss in Eq. (4),\nnumber of CMs (N ) and number of selected CMs (K). We evaluate them using gated mixture of\n\ufb01rst-order GAP with ResNet-18 on downsampled ImageNet-1K, and the decided optimal parameters\nare directly adopted to GM-SOP. Such strategy is not only faster but also avoids over-\ufb01tting on\nparameters of GM-SOP. We train the networks on downsampled ImageNet-1K dataset [6] within\n\n6\n\n12345678910111213141516No. of CM05Num. of Samples105=012345678910111213141516No. of CM0510Num. of Samples104=1000110100100042.54343.54444.5Top-1 error (%)\fMethods\n\nDim. of Reps.\n\nTop-1 error (%)\n\nResNet-18\nResNet-18-512d\nAve-GAP-16\nGM-GAP-16-8 (Ours)\n\nResNet-18-8256d\nSR-SOP\nAve-SOP-16\nParametric SR-SOP\nGM-SOP-16-8 (Ours)\n\n128\n512\n512\n512\n\n8256\n8256\n8256\n8256\n8256\n\n52.00\n49.08\n47.44\n42.37\n\n47.29\n40.32\n40.28\n40.01\n38.21\n\nFigure 3: GM-GAP with various N and K.\n\nTable 2: Comparison with counterparts using\nResNet-18 on downsampled ImageNet-1K.\n\nTable 3: Comparison with state-of-the-arts on downsampled ImageNet-1K. The methods marked by\n(cid:63) double number of training images using original images and their horizontal \ufb02ipping ones, and\nperform 4 pixels padding and random crop in both training and prediction stages.\n\nMethods\n\nNumber of Parameters\n\nDimension of Representations\n\nTop-1 error/Top-5 error (%)\n\nWRN-36-1(cid:63) [6]\nWRN-36-2(cid:63) [6]\nWRN-36-5(cid:63) [6]\nResNet-18-512d [13]\nResNet-18+One-layer-SG-MOE [33]\nResNet-18+NetVLAD [1]\nResNet-50 [13]\nResNet-50-512d [13]\nResNet-50-8256d [13]\n\nGM-GAP-16-8 + ResNet-18 (Ours)\nGM-GAP-16-8 + ResNet-18(cid:63) (Ours)\nGM-GAP-16-8 + WRN-36-2 (Ours)\nGM-SOP-16-8 + ResNet-18 (Ours)\nGM-SOP-16-8 + ResNet-50 (Ours)\nGM-SOP-16-8 + WRN-36-2 (Ours)\n\n1.6M\n6.2M\n37.6M\n1.3M\n2.3M\n8.9M\n2.4M\n2.8M\n11.6M\n\n2.3M\n2.3M\n8.7M\n10.3M\n11.9M\n15.7M\n\n128\n256\n640\n512\n512\n8192\n128\n512\n8256\n\n512\n512\n512\n8256\n8256\n8256\n\n49.79/24.17\n39.55/16.57\n32.34/12.64\n49.08/24.25\n46.80/22.63\n45.16/21.73\n43.28/19.39\n41.79/18.30\n41.42/18.14\n\n42.37/18.82\n40.03/17.91\n35.97/14.41\n38.21/17.01\n35.73/14.96\n32.33/12.35\n\n{50, 15, 10} epochs, while the initial learning rate is set to 0.075 with decay rate of 0.1. Only random\n\ufb02ipping is used for data augmentation, and prediction is performed on whole images. Following\nthe common settings in [13, 6], we run experiments three trials and report Top-1 error of different\nmethods on validation set for comparison.\nEffect of Parameter \u03b1 The goal of balance loss is to make as many CMs as possible be adequately\ntrained. Here we evaluate its effect using GM-GAP-16-4 with various \u03b1. Figure 2 (Left) compares\nnumbers of receiving samples in each CM by setting \u03b1 = 0 and \u03b1 = 100 within the last training\nepoch, where \u03b1 = 0 indicates balance loss is discarded. Clearly, only four CMs receive almost all\ntraining samples when balance loss is not employed (i.e., \u03b1 = 0). Differently, the balance loss with\n\u03b1 = 100 makes most of CMs receive similar amount of training samples. Figure 2 (Right) shows the\nresults of GM-GAP-16-4 with various \u03b1, from it we can see that adequately training as many CMs\nas possible achieves lower classi\ufb01cation error. The balance loss with \u03b1 = 100, 1000 obtain similar\nresults. Without loss of generality, we set parameter \u03b1 to 100 in following experiments.\nNumbers of N and K Then we assess the effect of numbers of N and K by setting \u03b1 = 100. Top-1\nerror and training speed (Frames Per Second, FPS) of GM-GAP with various N and K are illustrated\nin Figure 3. Fixing number of K, increase of N leads lower error while bringing more computational\ncosts. When number of N is \ufb01xed, better results are obtained by appropriately enlarging K. Taking\nN = 16 as an example, K = 8 gets the best result, and results of K = 12 and K = 16 are slightly\ninferior to the one of K = 8. It maybe owe to the fact that sparsity constraint eliminates noisy CMs\nhaving small weights. In addition, we can see that GM-GAP with N = 16, K = 8 (42.37%, 800Hz)\nemploying 16 CMs is only about 1.8 times slower than baseline (49.08%, 1470Hz) with one CM,\nbut achieves about 6.71% gains. We also experiment with more CMs. GM-GAP with N=128 and\nK=32 obtains 42.52%, achieving no gain over the result of N=16 and K=8 (42.37%). We observe\nlarger number (128) of CMs leads to a bit over-\ufb01tting in our case and more computation cost. To\nbalance ef\ufb01ciency and effectiveness, we set N = 16 and K = 8 for both GM-GAP and GM-SOP\nthroughout all remaining experiments.\n\n7\n\n(1,1)(2,4)(2,8)(4,8)(4,16)(8,16)(12,16)(16,16)(4,32)(8,32)(K,N)4042.54547.550Top-1 error (%)Top-1 error (%)050010001500FPS (Hz)FPS (Hz)\fTable 4: Comparison with counterparts on Places365 dataset with image size of 100 \u00d7 100.\n\nResNet-18-512d\n\nGM-GAP-16-8\n\nResNet-18-8256d\n\nSR-SOP\n\nParametric SR-SOP\n\nGM-SOP-16-8\n\nDim.\nTop-1 error (%)\nTop-5 error (%)\n\n512\n49.96\n19.19\n\n512\n\n48.07\n17.84\n\n8256\n49.99\n19.32\n\n8256\n48.11\n18.01\n\n8256\n47.48\n17.52\n\n8256\n47.18\n17.02\n\nComparison with Counterparts We compare our method with several counterparts, and results of\ndifferent methods are listed in Table 2. We train SR-SOP (or parametric SR-SOP) and Ave-SOP-16\n(or GM-SOP-16-8) within {20, 5, 5, 5} and {40, 10, 5, 5} epochs. The initial learning rates are set\nto 0.15 and 0.1 with decay rate of 0.1. When employing GAP as CM, our GM-GAP is superior to\nResNet-18-512d (single CM) and Ave-GAP-16 (direct ensemble) by a large margin. Meanwhile,\nSR-SOP performs better than GM-GAP, and improves ResNet-18-8256d by 6.97% with the same\ndimensional representation, demonstrating superiority of SOP. Note that our parametric SR-SOP\noutperforms original SR-SOP with negligible additional costs (680Hz vs. 670Hz), and they are\nmoderately slower than GM-GAP-16-8 (800Hz). The GM-SOP-16-8 outperforms Ave-SOP-16 by\n2.07% with more than 2 times faster, and improves SR-SOP by about 2.11% with about 2 times\nslower. The above results verify the effectiveness of our GM-SOP and idea of gated mixture model.\n\n3.3 Results on Downsampled ImageNet-1K\n\nHere we compare our method with state-of-the-art (SOTA) methods on downsampled ImageNet-1K\n[6]. Since this dataset is recently proposed and has few reported results, we implement several SOTA\nmethods based on the modi\ufb01ed ResNet-18 and ResNet-50 by ourselves and report their results with\ntrying our best to tune their hyper-parameters. NetVLAD [1] is implemented using public available\nsource code with setting dictionary size to 64. By using the same settings with GM-GAP, we replace\nGAP with One-layer-SG-MoE [33], where each expert is a 128 \u00d7 512 fully-connected layer. All\nResNet-50 based methods are trained within {50, 15, 10} epochs, and initial learning rates with decay\nrate of 0.1 are set to 0.1 and 0.075 for our GM-SOP and remaining ones, respectively. We also\ncompare with wide residual network (WRN) [42], whose results are duplicated from [6]. As shown\nin Table 3, our GM-SOP and GM-GAP signi\ufb01cantly outperform NetVLAD, One-layer-SG-MoE and\noriginal network, when ResNet-18 is employed. Meanwhile, our GM-SOP with ResNet-50 improves\noriginal network and its variants by a large margin. These results verify our methods effectively\nimprove existing deep CNNs. Our GM-SOP with ResNet-18 clearly outperforms WRN-36-1 and\nWRN-36-2, although the latter ones adopt more sophisticated data augmentation. By using the same\naugmentation strategy in [6], GM-GAP-16-8 achieves over 2% gains in Top-1 error, which uses much\nless parameters to get simliar results with WRN-36-2. To further evaluate our methods, we integrate\nthe proposed methods with the stronger WRN-36-2, our GM-GAP and GM-GOP improve WRN-36-2\nover 3.58% and 7.22% in Top-1 error, respectively. Note that GM-SOP with WRN-36-2 obtains the\nsimilar result with WRN-36-5 [6] using one half parameters.\n\n3.4 Results on Places365\n\nFinally, we evaluate our method on Places365 [44], which contains about 1.8 million training images\nand 36,500 validation images collected from 365 scene categories. In our experiments, we resize\nall images to 100 \u00d7 100, developing a downsampled Places365 dataset. It is much larger and more\nchallenging than existing low-resolution image datasets [24, 6]. We implement several counterparts\nand compare with our method based on ResNet-18. For training these networks, we randomly crop\na 96 \u00d7 96 image patch or its \ufb02ip as input. ResNet-18-8256d and remaining ones are trained within\n{35, 10, 10, 5} and {25, 5, 5, 5} epochs, and the initial learning rates are set to 0.05 and 0.1 with\ndecay rate of 0.1, respectively. The inference is performed on single center crop, and we report results\non validation set for comparison. The results of different methods are given in Table 4, from it we can\nsee that our GM-SOP-16-8 achieves the best result and signi\ufb01cantly outperforms ResNet-18-512d and\nResNet-18-8256d, further demonstrating the effectiveness of GM-SOP. Meanwhile, GM-GAP-16-8\nand GM-SOP-16-8 are superior to ResNet-18-512d and SR-SOP by a large margin, respectively. It\nindicates the idea of gated mixture model is helpful for improving representation ability of deep\nCNNs. Note that parametric SR-SOP for non-trivial gains over original SR-SOP, showing a more\ngeneral approach for image modeling is meaningful and useful for improving performance.\n\n8\n\n\f4 Related Work\n\nOur GM-SOP method shares similarity with sparsely-gated mixture-of-experts (SG-MoE) layer\n[33]. The SG-MoE motivates the gating module of our GM-SOP, but quite differently, our GM-SOP\nproposes a parametric SR-SOP as CM while SG-MoE employs a linear transformation (fully-\nconnected layer) as expert. Meanwhile, our methods signi\ufb01cantly outperform one SG-MoE layer.\nAdditionally, the SG-MoE is proposed as a general purpose component in a recurrent model [15],\nwhile our GM-SOP is proposed as a global modeling step to improve representation ability of\ndeep CNNs. This work also is related to those methods integrating single HOP into deep CNNs\n[19, 29, 39, 27, 8]. Beyond them, our GM-SOP is a mixture model, which can capture richer\ninformation and achieve better performance. NetVLAD [1] and MFAFVNet [28] extend deep CNNs\nwith popular feature encoding methods, which also can be seen as mixture models. However, different\nfrom their concatenation scheme for all CMs, our GM-SOP performs sum of selected CMs, leading\nmore compact representations. Meanwhile, our GM-SOP is clearly superior to feature encoding based\nNetVLAD [1]. Recently, some researchers propose to learn deep mixture probability models for\nsemi-supervised learning [31] and unsupervised clustering [38]. These methods formulate mixture\nprobability models as multi-layer networks, and infer the corresponding networks with deriving\nvariants of EM algorithm. In contrary to deep mixture probability models [31, 38], we aim at plugging\na trainable gated mixture model into deep CNNs as representation for supervised classi\ufb01cation.\n\n5 Conclusion\n\nThis paper proposes a novel GM-SOP method for improving deep CNNs, whose core is a trainable\ngated mixture of parametric second-order pooling model for summarizing the outputs of the last\nconvolution layer as image representation. The GM-SOP can be \ufb02exibly integrated into deep CNNs\nin an end-to-end manner. Compared with popular GAP and existing HOP methods only considering\nunimodal distributions, our GM-SOP can make better use of statistical information inherent in\nconvolutional activations, leading better representation ability and higher accuracy. The experimental\nresults on two large-scale image benchmarks demonstrate the gated mixture model is helpful to\nimprove classi\ufb01cation performance of deep CNNs, and our GM-SOP method clearly outperforms its\ncounterparts with affordable costs. Note that the proposed GM-SOP is an architecture-independent\nmodel, so we can \ufb02exibly adopt it to other advanced CNN architectures [17, 5, 16]. In future, we will\nexperiment with standard-size ImageNet dataset, and extend GM-SOP to other tasks, such as video\nclassi\ufb01cation and semantic segmentation.\n\nAcknowledgments\n\nThe work was supported by the National Natural Science Foundation of China (Grant No. 61471082,\n61671182, 61806140) and the State Key Program of National Natural Science Foundation of China\n(Grant No. 61732011). Qilong Wang was supported by China Post-doctoral Programme Foundation\nfor Innovative Talent. We thank NVIDIA corporation for donating GPU.\n\nReferences\n[1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly\n\nsupervised place recognition. In CVPR, 2016.\n\n[2] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Fast and simple calculus on tensors in the Log-Euclidean\n\nframework. In MICCAI, 2005.\n\n[3] O. Arslan. Convergence behavior of an iterative reweighting algorithm to compute multivariate M-estimates\n\nfor location and scatter. Journal of Statistical Planning and Inference, 118:115\u2013128, 2004.\n\n[4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Free-form region description with second-order\n\npooling. IEEE TPAMI, 37(6):1177\u20131189, 2015.\n\n[5] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In NIPS, 2017.\n[6] P. Chrabaszcz, I. Loshchilov, and F. Hutter. A downsampled variant of ImageNet as an alternative to the\n\nCIFAR datasets. arXiv, abs/1707.08819, 2017.\n\n[7] M. Cimpoi, S. Maji, and A. Vedaldi. Deep \ufb01lter banks for texture recognition and segmentation. In CVPR,\n\n2015.\n\n[8] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks.\n\nIn CVPR, 2017.\n\n9\n\n\f[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image\n\ndatabase. In CVPR, 2009.\n\nmodel. arXiv, 1311.0851, 2014.\n\n[10] D. L. Donoho, M. Gavish, and I. M. Johnstone. Optimal shrinkage of eigenvalues in the spiked covariance\n\n[11] L. Dryden, A. Koloydenko, and D. Zhou. Non-Euclidean statistics for covariance matrices, with applica-\n\ntions to diffusion tensor imaging. The Annals of Applied Statistics, 2009.\n\n[12] B. S. Everitt and D. J. Hand. Finite mixture distributions. Springer Press, 1981.\n[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.\n[14] N. J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied\n\nMathematics, Philadelphia, PA, USA, 2008.\n\n[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n[16] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.\n[17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks.\n\nIn CVPR, 2017.\n\ncovariate shift. In ICML, 2015.\n\nlayers. In ICCV, 2015.\n\nComputation, 3(1):79\u201387, 1991.\n\n6(2):181\u2013214, 1994.\n\n[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal\n\n[19] C. Ionescu, O. Vantzos, and C. Sminchisescu. Matrix backpropagation for deep networks with structured\n\n[20] C. Ionescu, O. Vantzos, and C. Sminchisescu. Training deep networks with structured layers by matrix\n\nbackpropagation. arXiv, abs/1509.07838, 2015.\n\n[21] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural\n\n[22] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation,\n\n[23] P. Koniusz, F. Yan, P. Gosselin, and K. Mikolajczyk. Higher-order occurrence pooling for bags-of-words:\n\nVisual concept detection. IEEE TPAMI, 39(2):313\u2013326, 2017.\n\n[24] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.\n[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classi\ufb01cation with deep convolutional neural\n\nnetworks. In NIPS, 2012.\n\n[26] P. Li, J. Xie, Q. Wang, and Z. Gao. Towards faster training of global covariance pooling networks by\n\niterative matrix square root normalization. In CVPR, 2018.\n\n[27] P. Li, J. Xie, Q. Wang, and W. Zuo. Is second-order information helpful for large-scale visual recognition?\n\n[28] Y. Li, M. Dixit, and N. Vasconcelos. Deep scene image classi\ufb01cation with the MFAFVNet. In ICCV, 2017.\n[29] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for \ufb01ne-grained visual recognition. In\n\nIn ICCV, 2017.\n\nICCV, 2015.\n\n[30] G. McLachlan and D. Peel. Finite Mixture Models. Wiley Press, 2005.\n[31] M. T. Nguyen, W. Liu, E. Perez, R. G. Baraniuk, and A. B. Patel. Semi-supervised learning with the deep\n\nrendering mixture model. arXiv, abs/1612.01942, 2016.\n\n[32] F. Pascal, L. Bombrun, J. Tourneret, and Y. Berthoumieu. Parameter estimation for multivariate generalized\n\nGaussian distributions. IEEE TSP, 61(23):5960\u20135971, 2013.\n\n[33] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean. Outrageously large\n\nneural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.\n\n[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In\n\nICLR, 2015.\n\n2015.\n\n[35] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to\n\nprevent neural networks from over\ufb01tting. JMLR, 15(1):1929\u20131958, 2014.\n\n[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.\n\nGoing deeper with convolutions. In CVPR, 2015.\n\n[37] A. Vedaldi and K. Lenc. MatConvNet \u2013 Convolutional Neural Networks for MATLAB. In ACM MM,\n\n[38] C. Viroli and G. J. McLachlan. Deep Gaussian mixture models. arXiv, abs/1711.06929, 2017.\n[39] Q. Wang, P. Li, and L. Zhang. G2DeNet: Global Gaussian distribution embedding network and its\n\napplication to visual recognition. In CVPR, 2017.\n\n[40] Q. Wang, P. Li, W. Zuo, and L. Zhang. RAID-G: Robust estimation of approximate in\ufb01nite dimensional\n\nGaussian with application to material recognition. In CVPR, 2016.\n\n[41] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation.\n\nIEEE TPAMI, 31(2):210\u2013227, 2009.\n\n[42] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.\n[43] T. Zhang, A. Wiesel, and M. S. Greco. Multivariate generalized Gaussian distribution: Convexity and\n\ngraphical models. IEEE TSP, 61(16):4141\u20134148, 2013.\n\n[44] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for\n\nscene recognition. IEEE TPAMI, 40(6):1452\u20131464, 2017.\n\n10\n\n\f", "award": [], "sourceid": 670, "authors": [{"given_name": "Qilong", "family_name": "Wang", "institution": "Tianjin University"}, {"given_name": "Zilin", "family_name": "Gao", "institution": "Dalian University of Technology"}, {"given_name": "Jiangtao", "family_name": "Xie", "institution": "Dalian University of Technology"}, {"given_name": "Wangmeng", "family_name": "Zuo", "institution": "Harbin Institute of Technology"}, {"given_name": "Peihua", "family_name": "Li", "institution": "Dalian University of Technology"}]}