{"title": "Learning Deep Bilinear Transformation for Fine-grained Image Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 4277, "page_last": 4286, "abstract": "Bilinear feature transformation has shown the state-of-the-art performance in learning fine-grained image representations. However, the computational cost to learn pairwise interactions between deep feature channels is prohibitively expensive, which restricts this powerful transformation to be used in deep neural networks. In this paper, we propose a deep bilinear transformation (DBT) block, which can be deeply stacked in convolutional neural networks to learn fine-grained image representations. The DBT block can uniformly divide input channels into several semantic groups. As bilinear transformation can be represented by calculating pairwise interactions within each group, the computational cost can be heavily relieved. The output of each block is further obtained by aggregating intra-group bilinear features, with residuals from the entire input features. We found that the proposed network achieves new state-of-the-art in several fine-grained image recognition benchmarks, including CUB-Bird, Stanford-Car, and FGVC-Aircraft.", "full_text": "Learning Deep Bilinear Transformation for\n\nFine-grained Image Representation\n\nHeliang Zheng1\u2217, Jianlong Fu2, Zheng-Jun Zha1, Jiebo Luo3\n\n1University of Science and Technology of China, Hefei, China\n\n2Microsoft Research, Beijing, China\n\n3University of Rochester, Rochester, NY\n\n1zhenghl@mail.ustc.edu.cn, 2jianf@microsoft.com, 1zhazj@ustc.edu.cn, 3jluo@cs.rochester.edu\n\nAbstract\n\nBilinear feature transformation has shown the state-of-the-art performance in\nlearning \ufb01ne-grained image representations. 
However, the computational cost of learning pairwise interactions between deep feature channels is prohibitively high, which prevents this powerful transformation from being used throughout deep neural networks. In this paper, we propose a deep bilinear transformation (DBT) block, which can be deeply stacked in convolutional neural networks to learn fine-grained image representations. The DBT block uniformly divides input channels into several semantic groups. Since the bilinear transformation can be represented by pairwise interactions within each group, the computational cost is greatly reduced. The output of each block is obtained by aggregating intra-group bilinear features, with residuals from the entire input features. The proposed network achieves new state-of-the-art results on several fine-grained image recognition benchmarks, including CUB-Bird, Stanford-Car, and FGVC-Aircraft.\n\n1 Introduction\n\nFine-grained image recognition aims to distinguish subtle visual differences within a subcategory (e.g., various bird species [1, 2] and car models [3, 4]). Predominant approaches fall into two streams, reflecting two distinct characteristics of fine-grained recognition datasets: 1) well-structured objects, e.g., different birds share similar semantic parts like head, wings, and tail [5–9]; 2) rich details, e.g., sophisticated textures are useful for distinguishing two similar species [10–13]. Part-based approaches were first proposed to focus on semantic attention; they decompose a fine-grained object into significant parts and train several subsequent discriminative part-nets for classification. 
Such architectures usually result in suboptimal performance, as the power of end-to-end optimization has not been fully exploited.\n\nThe state-of-the-art results have been achieved by bilinear feature transformation, which learns fine-grained details over a global image by calculating pairwise interactions between feature channels in fully-connected layers. However, the quadratic growth of the feature dimension (i.e., an N-fold increase for N input channels) prevents this powerful transformation from being used throughout deep neural networks. To solve the high-dimensionality issue, compact bilinear [14] and low-rank bilinear [15, 16] pooling have been proposed. However, their performance is far below that of the best part-based models [17], which prevents these light-weight approximations from being adopted in challenging recognition tasks.\n\nIn this paper, we propose a deep bilinear transformation (DBT) block, which can be integrated into a deep convolutional neural network (e.g., ResNet-50/101 [18]), so that pairwise interactions can be learned in multiple layers to enhance feature discrimination ability. We empirically show that the proposed network improves classification accuracy by calculating the bilinear transformation over the most discriminative feature channels for a region while keeping the computational complexity unchanged.\n\n*This work was performed when Heliang Zheng was visiting Microsoft Research as a research intern.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSpecifically, DBT consists of a semantic grouping layer and a group bilinear layer. First, the semantic grouping layer maps input channels into uniformly divided groups according to their semantic information (e.g., head, wings, and feet for a bird), so that the most discriminative representations for a specific semantic part are concentrated within one group. 
Such representations can be further enhanced with pairwise interactions by the group bilinear layer, which conducts a bilinear transformation within each semantic group. Since the bilinear transformation increases the feature dimensions in each group, we obtain the output of the group bilinear layer by inter-group aggregation, which significantly improves the efficiency of the output channels. A group index encoding preserves the group-order information during the aggregation process, and a shortcut connection enables residual learning over the original and bilinear features. Compared to the traditional bilinear transformation, DBT greatly reduces the computational cost via grouping and aggregation, which ensures that it can be deeply integrated into CNNs. We summarize our contributions as follows:\n\n• We propose the deep bilinear transformation (DBT), which can be integrated into CNN blocks to obtain a deep bilinear transformation network (DBTNet), so that pairwise interactions can be learned in multiple layers to enhance feature discrimination ability.\n\n• We propose to obtain pairwise interactions only within the most discriminative feature channels for an image position/region, by learning semantic groups and calculating intra-group bilinear transformations.\n\n• We conduct extensive experiments to demonstrate the effectiveness of DBTNet, which achieves new state-of-the-art results on three challenging fine-grained datasets, i.e., CUB-Bird, Stanford-Car, and FGVC-Aircraft.\n\nThe rest of the paper is organized as follows. We discuss the relation of our approach to recent work in Section 2 and present our approach in Section 3. We then evaluate and report results in Section 4 and conclude in Section 5.\n\n2 Related Work\n\n2.1 Fine-Grained Image Recognition\n\nBilinear pooling. 
Bilinear pooling [10] was proposed to obtain a rich and orderless global representation from the last convolutional feature, and achieved state-of-the-art results on many fine-grained datasets. However, calculating pairwise interactions between channels causes a high-dimensionality issue, and dimension reduction methods have therefore been proposed. Specifically, low-rank bilinear pooling [15] reduces the feature dimensions before conducting the bilinear transformation, and compact bilinear pooling [14] proposes a sampling-based approximation, which can reduce the feature dimensions by two orders of magnitude without a performance drop. Different from them, we reduce the feature dimension by intra-group bilinear transformation and inter-group aggregation; a detailed discussion can be found in Section 3.4. Moreover, feature matrix normalization [11–13] (e.g., matrix square-root normalization) has proved important for bilinear features, while we do not use such techniques in our deep bilinear transformation since calculating such a root is expensive and not practical to stack deeply in CNNs. Second-order pooling convolutional networks [19] also integrate bilinear interactions into convolutional blocks, but they only use such bilinear features to weight convolutional channels.\n\nWeakly-supervised part learning. Semantic parts play an important role in fine-grained image recognition, which adopts a divide-and-conquer strategy: learning fine-grained details at the part level and conducting part alignment for recognition. Recent part learning methods use attention mechanisms to learn semantic parts in a weakly-supervised manner. 
Specifically, TLAN [20] obtains part templates for both objects and parts by clustering CNN filters, DVAN [21] explicitly pursues the diversity of attention and is able to gather discriminative information to the maximal extent, and MA-CNN [9] learns multiple attentions for each image by grouping convolutional channels semantically, further optimized by a diversity loss and a distance loss. Inspired by such methods, we propose a simple yet effective semantic grouping constraint to integrate semantic information into bilinear features.\n\n2.2 Group Convolution\n\nGroup convolution was first proposed in AlexNet [22], where it distributes the model over two GPUs to cope with limited GPU memory. Such a grouping operation was rethought and further studied in ResNeXt [23], where it serves as an effective and efficient way to reduce convolutional parameters with even better performance.\n\nFigure 1: An overview of the proposed deep bilinear transformation. Given an input image in (a), the Semantic Grouping module learns to group relevant feature channels according to their corresponding regions. For example, in (b) and (c), the pink and green channels correspond to head and wings, respectively. Group Bilinear further calculates and aggregates intra-group pairwise interactions in (d), (e), and (f). We can observe that the bilinear feature of a part is obtained from the most discriminative feature channels for this part. [Best viewed in color]\n\nSuch a method is widely used in efficient network design, e.g., MobileNet [24] and ShuffleNet [25]. CondenseNet [26] goes a step further with a learned grouping strategy, instead of simply grouping convolutional channels by their index order. 
Compared with them, our novelty is twofold: 1) we integrate semantic information into the grouping process, and 2) we are the first to adopt channel grouping for bilinear transformation.\n\n3 Deep Bilinear Transformation\n\nBilinear pooling is proposed to capture pairwise interactions among feature elements via the outer product. We denote an input convolutional feature as X ∈ R^(N×HW), where H, W, and N are the height, width, and number of channels, respectively. Thus bilinear pooling with a fully connected layer can be denoted as:\n\nf = W (1/(HW)) vec(XX^T) + b = W (1/(HW)) Σ_{i=1}^{HW} vec(x_i x_i^T) + b,   (1)\n\nwhere W ∈ R^(K×N²), b ∈ R^K, and f ∈ R^K are the weight, bias, and output of the fully connected layer, and x_i is the i-th column of X. Such a bilinear representation has proved powerful for many tasks [10, 27, 28]. However, the second-order information is only obtained in the last convolutional layer, and the feature dimensionality is N times larger than that of the global average pooled [29] feature.\n\nWe integrate semantic information into bilinear features, and the proposed deep bilinear transformation (DBT) can be stacked with convolutional layers. 
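As a concrete illustration of Equation 1 and of the N² blow-up it causes, here is a minimal NumPy sketch; this is our own reading, not the authors' implementation, and the variable names and toy sizes are ours (X is stored as channels × positions):

```python
import numpy as np

def bilinear_pooling(X, W, b):
    # Equation 1: average the outer products x_i x_i^T over all HW spatial
    # positions, vectorize the resulting N x N matrix, then apply a fully
    # connected layer.  Note that sum_i vec(x_i x_i^T) equals vec(X X^T).
    N, HW = X.shape
    pooled = (X @ X.T) / HW            # N x N second-order statistic
    return W @ pooled.reshape(-1) + b  # K-dimensional output

# Toy sizes: N input channels lead to an N*N bilinear dimension,
# so the fully connected weight needs K * N^2 entries.
rng = np.random.default_rng(0)
N, HW, K = 32, 49, 10
X = rng.standard_normal((N, HW))
W = rng.standard_normal((K, N * N))
b = np.zeros(K)
f = bilinear_pooling(X, W, b)
```

For N = 2048 channels in the last ResNet stage, the bilinear dimension would be 2048², roughly 4.2M, which is exactly the high-dimensionality issue that DBT addresses.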
For example, the concrete formulation with a 1 × 1 convolutional layer is given by:\n\nf = W [y_1, y_2, ..., y_{HW}] + b,  where  y_i = T_B(A x_i) = vec( Σ_{j=1}^{G} (I_j A x_i + p_j)(I_j A x_i + p_j)^T ).   (2)\n\nIn the above equation, T_B(·) is the group bilinear function, A x_i is the semantically grouped feature, G is the number of groups, p_j is the group index encoding vector, which indicates the group order, I_j ∈ R^((N/G)×N) is a block matrix with G blocks, whose j-th block is an identity matrix I while the others are zero matrices, and A is the semantic mapping matrix, which groups channels representing the same semantic together. Note that both the dimension of the parameters W ∈ R^(K×(N/G)²) and that of the features [y_1, y_2, ..., y_{HW}] ∈ R^((N/G)²×HW) are reduced by G² times. We introduce each item in detail in this section.\n\n3.1 Semantic Grouping Layer\n\nIt has been observed in previous work [7, 9, 20, 30, 31] that convolutional channels in high-level layers tend to respond to specific semantic patterns. Thus we can divide convolutional channels into several groups by their semantic information. Specifically, given a convolutional feature X ∈ R^(N×HW), we denote each channel as a feature map m_i ∈ R^(HW), where i ∈ [1, N]. We divide the semantic space into G groups, and S(m_i) ∈ [1, G] is a mapping function that maps a channel into a semantic group. The convolutional channels are uniformly grouped, i.e., S(m_i) = S(m_j) if ⌊i/G⌋ = ⌊j/G⌋. To obtain bilinear features for semantic parts, we first arrange the channels in the order of semantic groups by:\n\n[m̃_1, m̃_2, ..., m̃_N] = [m_1, m_2, ..., m_N] A^T,   s.t. 
S(m̃_i) = ⌊i/G⌋,   (3)\n\nwhere A^T ∈ R^(N×N) is a semantic mapping matrix, which needs to be optimized.\n\nSince different semantic parts are located in different regions of a given image, which correspond to different positions in a convolutional feature, we can use such spatial information for semantic grouping. Specifically, channels in the same/different semantic groups are optimized to share large/small spatial overlaps of their responses, which is formulated as the semantic grouping loss L_g:\n\nL_g = L_intra + L_inter = Σ_{0≤i,j<N, ⌊i/G⌋=⌊j/G⌋} (−d_ij²) + Σ_{0≤i,j<N, ⌊i/G⌋≠⌊j/G⌋} d_ij²,   (4)\n\nwhere the pairwise correlation is d_ij = m̃_i^T m̃_j / (‖m̃_i‖_2 · ‖m̃_j‖_2).\n\nNote that such a semantic grouping can be implemented by a 1 × 1 convolutional layer. Specifically, a convolutional layer can be denoted as x = Wz, where z ∈ R^M is the input feature, x ∈ R^N is the output feature, and W ∈ R^(N×M) is the weight matrix. Let U = AW; the mapped feature can be obtained by Ax = AWz = Uz, thus U is the weight matrix of a semantic grouping layer, whose outputs are semantically grouped. Note that U is used not only for semantic grouping but also for convolutional feature learning, thus we can uniformly divide the output channels into a preset number of groups and obtain the grouped features during CNN training.\n\n3.2 Group Bilinear Layer\n\nGiven a semantically grouped convolutional feature, we propose a group bilinear layer to efficiently generate bilinear features without increasing feature dimensions. Specifically, the group bilinear layer calculates a bilinear transformation over the channels within each group, which can enhance the representation of the corresponding semantic through pairwise interactions. 
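Stepping back to Equation 4, the semantic grouping loss can be sketched in a few lines of NumPy. This is our illustrative reading, not the released code: rows of M are the rearranged maps m̃_i, and we assign channel i to group i // (N/G) so that N channels fall into exactly G uniform groups (an interpretation of the paper's ⌊i/G⌋ index):

```python
import numpy as np

def semantic_grouping_loss(M, G):
    # Equation 4: pairwise squared cosine similarities d_ij^2 between channel
    # maps; intra-group pairs are rewarded (-d_ij^2) and inter-group pairs
    # penalized (+d_ij^2), so grouped channels learn to overlap spatially.
    N = M.shape[0]
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)  # unit-norm rows
    D2 = (Mn @ Mn.T) ** 2                              # d_ij squared
    group = np.arange(N) // (N // G)                   # uniform grouping
    same = group[:, None] == group[None, :]
    return -D2[same].sum() + D2[~same].sum()
```

Feature maps that overlap only within their own group minimize the loss; for example, duplicating G mutually orthogonal one-hot maps inside each group attains the minimum, while scattering identical maps across different groups raises it.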
Note that the intra-group bilinear transformation increases the feature dimensions within each group; to improve the efficiency of the output channels, we further aggregate such intra-group bilinear features. Thus the proposed group bilinear transform is obtained as:\n\ny = T_B(Ax) = vec( Σ_{j=1}^{G} (I_j A x)(I_j A x)^T ),   (5)\n\nwhere A is the mapping matrix learned in Equation 3, and I_j ∈ R^((N/G)×N) is a block matrix with G blocks, whose j-th block is an identity matrix I while the others are zero matrices. I_j A x is the j-th group of the semantically grouped feature. The dimension of the input feature x ∈ R^N is N, and that of the output feature y ∈ R^((N/G)²) is (N/G)². We adopt G = √N (conducting bilinear interpolation over channels for non-integer cases) to keep the feature dimensions unchanged, thus DBT can be conveniently integrated into CNNs without changing the original network architecture.\n\nAggregating over groups reduces the feature dimensionality by G times, but the group-order information would be lost in the process. Thus we introduce position encoding [32] into the convolutional channel representations, and add a group index encoding term to preserve the group order and improve the discrimination of features in different groups:\n\nP_j(2i) = sin( j / t^(2i/(N/G)) ),   P_j(2i+1) = cos( j / t^(2i/(N/G)) ),   (6)\n\nwhere t indicates the frequency of the sin(·) function. Such a group index encoding is element-wise added to the j-th group feature before the bilinear transformation: (I_j A x + P_j)(I_j A x + P_j)^T; the term P_j (I_j A x)^T thus preserves the group index information through the aggregation process. To this end, the proposed semantic group bilinear module is obtained, as shown in Equation 2.\n\n3.3 Deep Bilinear Transformation Network\n\nActivation and shortcut connection. Non-linear activation functions can increase the representative capacity of a model. Instead of ReLU, the tanh function is widely used to activate bilinear features, because it is able to suppress large second-order values. Moreover, inspired by residual learning [18], we add shortcut connections for the semantic group bilinear feature to assist optimization:\n\nf = W (BN(tanh(T_B(AX))) + X) + b.   (7)\n\nSuch a shortcut connection 1) fuses the original and bilinear features, and 2) builds a gateway for the original feature during backward propagation. Note that we make the network start learning from the original feature by initializing the “scale” parameter of the batch normalization layer to zero, which is an effective way to optimize the parameters.\n\nDeep bilinear transformation block. Figure 2 shows the network structure of the deep bilinear transformation block. The semantic grouping layer is a 1 × 1 convolution with the constraint introduced in Equation 4. 3 × 3 convolutional layers are the key components for feature extraction in CNNs, which integrate local context information into convolutional channels. The group bilinear layer is followed by a 3 × 3 convolution, thus the rich pairwise interactions can be further integrated to obtain fine-grained representations.\n\nNetwork architecture. The proposed semantic group bilinear module can be integrated into deep convolutional neural networks. Table 1 illustrates integrating DBT into each ResNet block, and the effectiveness of DBT in different stages is discussed in Section 4.2. The overall loss L for such a model is:\n\nL = L_c + λ Σ_{b=1}^{B} L_g^(b),   (8)\n\nwhere L_c is the softmax cross-entropy loss for classification, L_g^(b) is the semantic grouping loss over the b-th block, B is the number of residual blocks, and λ is the weight of the semantic grouping loss.\n\nSince our DBT does not change feature dimensions (as introduced in Section 3.2), it can be conveniently integrated into convolutional blocks. We apply DBT before 3 × 3 convolutional layers, and the value of G indicates the number of semantic groups. Note that the proposed DBT is efficient since it introduces no additional parameters, and its computational cost is also very low, i.e., 5M FLOPs.\n\nFigure 2: An illustration of the structure of the deep bilinear transformation block, where “SG” indicates the semantic grouping layer, “GB” indicates the group bilinear layer, and 1 × 1 and 3 × 3 indicate the kernel sizes of convolutional layers.\n\nFigure 3: An illustration of intra-group (i.e., diagonal blocks in the figure) and inter-group pairwise interactions. Note that yellow indicates large values and purple indicates small values.\n\n3.4 Discussion\n\nThe proposed deep bilinear transformation can generate fine-grained representations without increasing the feature dimension, by calculating intra-group bilinear transformations and conducting inter-group aggregation. In this subsection, we analyze the intra-group and inter-group pairwise interactions and show the difference between our work and previous low-dimensional bilinear variants.\n\nTable 1: An illustration of integrating DBT into deep CNNs. “SG” indicates the semantic grouping layer, and “GB” indicates the group bilinear layer. “G = 16” means the channels are divided into 16 semantic groups. 
A detailed discussion on the integrating stages can be found in Section 4.2.\n\nStage I (output 112 × 112), all models: 7 × 7, 64, stride 2\nStage II (output 56 × 56), all models: 3 × 3 max pool, stride 2; ResNet-50: [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3; DBTNet-50/101: [SG, 1 × 1, 64; GB, 64, G = 8; 3 × 3, 64; 1 × 1, 256] × 3\nStage III (output 28 × 28), ResNet-50: [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4; DBTNet-50/101: [SG, 1 × 1, 128; GB, 128, G = 8; 3 × 3, 128; 1 × 1, 512] × 4\nStage IV (output 14 × 14), ResNet-50: [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6; DBTNet-50: [SG, 1 × 1, 256; GB, 256, G = 16; 3 × 3, 256; 1 × 1, 1024] × 6; DBTNet-101: the same DBT block × 23\nStage V (output 7 × 7), ResNet-50: [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3; DBTNet-50/101: [SG, 1 × 1, 512; GB, 512, G = 16; 3 × 3, 512; 1 × 1, 2048] × 3\nOutput (1 × 1), all models: global average pool, 1000-d fc, softmax\n(#Params, FLOPs): ResNet-50 (25.5 M, 3.8 G); DBTNet-50 (25.5 M, 3.8 G); DBTNet-101 (44.4 M, 7.6 G)\n\nIntra-group and inter-group pairwise interaction. We empirically show the intra-group and inter-group pairwise interactions in Figure 3. 
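Putting the pieces of Sections 3.2 and 3.3 together, one DBT step can be sketched as follows. This is a simplified NumPy reading, not the released MXNet code: U stands for the (square) semantic-grouping 1 × 1 convolution, gamma stands in for the zero-initialized batch-norm scale, the per-group width N/G is assumed even, and G = √N so that (N/G)² = N and the shortcut shapes match:

```python
import numpy as np

def group_index_encoding(G, d, t=1.5):
    # Equation 6: sinusoidal encoding of the group index j over d = N/G
    # channels per group: P_j(2i) = sin(j / t^(2i/d)), P_j(2i+1) = cos(...).
    P = np.zeros((G, d))
    i = np.arange(d // 2)
    for j in range(G):
        P[j, 0::2] = np.sin(j / t ** (2 * i / d))
        P[j, 1::2] = np.cos(j / t ** (2 * i / d))
    return P

def group_bilinear(z, G, P):
    # Equation 5 with index encoding: split the grouped feature z into G
    # groups, add P_j, sum the per-group outer products, then vectorize.
    g = z.reshape(G, -1) + P                         # rows are I_j A x + p_j
    return np.einsum('gj,gk->jk', g, g).reshape(-1)  # (N/G)^2 dimensions

def dbt_forward(X, U, G, gamma=0.0, t=1.5):
    # Equations 2 and 7, simplified: semantic grouping (1x1 conv), group
    # bilinear at every spatial position, tanh activation scaled by the
    # zero-initialized batch-norm stand-in gamma, plus the identity shortcut.
    Z = U @ X                                        # semantic grouping layer
    N, HW = Z.shape
    P = group_index_encoding(G, N // G, t)
    Y = np.stack([group_bilinear(Z[:, i], G, P) for i in range(HW)], axis=1)
    return gamma * np.tanh(Y) + X                    # identity when gamma == 0
```

With gamma = 0 the block reduces to the identity mapping, matching the zero-initialized scale trick described above; training can then gradually mix the bilinear term into the residual path.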
Specifically, we extract the semantically grouped convolutional features in stage 3 of DBTNet-50 for all the testing samples in CUB-200-2011 and visualize the average pairwise interactions in Figure 3. There are 256 channels in 16 groups; yellow indicates large values while purple indicates small values. It can be observed that intra-group interactions play a dominant role. The small intra-group interactions (e.g., group #2) indicate that such semantic parts appear less often than others. Consider two convolutional channels from different semantic groups, i.e., m_i, m_j ∈ R^(HW) with S(m_i) ≠ S(m_j). Since different semantic parts are located at different positions, we obtain m_i ◦ m_j = 0, where ◦ indicates the element-wise product and 0 ∈ R^(HW). In short, the bilinear feature between channels in different semantic groups is a zero vector, which enlarges the bilinear feature dimension without providing discriminative information. Our proposed DBT solves this problem by conducting the outer product only over channels within a group, whereas previous low-dimensional bilinear variants do not.\n\nComparison with low-dimensional bilinear variants. Low-rank bilinear [15] proposes to reduce the feature channels by a 1 × 1 convolutional layer before conducting bilinear pooling; however, the reduced channels still span different semantic groups. Compact bilinear [14] proposes to use sampling methods to approximate the second-order kernel. The random Maclaurin projection version of compact bilinear pooling can be denoted as f = W_1 x ◦ W_2 x, where ◦ indicates the Hadamard product, and W_1, W_2 are random and fixed, whose entries are either +1 or −1 with equal probability. Hadamard low-rank bilinear [27] goes one step further and uses learnable weights: f = P^T (Ux ◦ Vx) + b, where P, U, V, and b are learnable parameters. We take one element of y = Ux ◦ Vx for analysis, which is denoted as y_i = (u^T x)(v^T x) = Σ_{j,k} u_j v_k (x_j x_k), where u and v are the corresponding rows of U and V. Compared to the original bilinear form y_i = x^T W x = Σ_{j,k} W_{j,k} (x_j x_k), Hadamard low-rank bilinear decomposes the parameter matrix W into two vectors u and v. Such an approximation also contains uninformative zeros, since x_j x_k = 0 if the j-th and k-th channels are not in the same semantic group.\n\n4 Experiments\n\n4.1 Experiment setup\n\nDatasets: We conduct experiments on three widely used fine-grained datasets (i.e., CUB-200-2011 [2] with 6k training images for 200 categories, Stanford-Car [3] with 8k training images for 196 categories, and FGVC-Aircraft [33] with 6k training images for 100 categories) and a large-scale fine-grained dataset, iNaturalist-2017 [34], with 600k training images for 5,089 categories. Note that the traditional fine-grained task focuses on distinguishing categories within one super-category, while there are 13 super-categories in iNaturalist.\n\nImplementation: We use MXNet [35] as our code base, and all models are trained on 8 Tesla V-100 GPUs. We follow the most common setting in fine-grained tasks: pre-train the models on ImageNet [36] with an input size of 224 × 224, and fine-tune on the fine-grained datasets with an input size of 448 × 448 (unless stated otherwise). We adopt a cosine learning rate schedule and the SGD optimizer with a batch size of 48 per GPU. The weight of the semantic constraint in Equation 4 is set to 3e-4 in the pre-training stage and 1e-5 in the fine-tuning stage. We defer other implementation details to our code: https://github.com/researchmm/DBTNet.\n\n4.2 Ablation studies\n\nWe conduct ablation studies for DBTNet-50 on CUB-200-2011, with an input image size of 224 × 224.\n\nSemantic grouping. The proposed semantic grouping constraint in Equation 4 encourages channels with similar semantics to gather together, which is a vital pre-processing step for group bilinear. 
The impact of the loss weight (i.e., λ in Equation 8) is shown in Table 2 for both the pre-training stage and the fine-tuning stage. It can be observed that such a constraint has a significant impact in the pre-training stage. Specifically, without the semantic grouping constraint, i.e., conducting group bilinear over randomly grouped channels, accuracy drops by 4.8% compared to a suitable constraint, while a too-large constraint also hurts classification accuracy because the classification loss is supposed to dominate the optimization of the network. A similar phenomenon can be observed in the fine-tuning stage, although the impact is much smaller, which indicates that semantic grouping matters most in the early stage of network optimization.\n\nTable 2: Ablation study on the semantic grouping constraint. Loss weights are given as (pre-train, fine-tune).\n\nloss weight | accuracy (%)\n(0, 0) | 79.8\n(3e-4, 0) | 84.6\n(3e-3, 0) | 83.1\n(3e-4, 1e-5) | 85.1\n(3e-4, 1e-4) | 84.8\n\ngroup index | t | accuracy (%)\nw/o | - | 84.5\nw/ | 1.1 | 85.0\nw/ | 1.5 | 85.1\nw/ | 2 | 84.8\nw/ | 10 | 84.5\n\nTable 3: Ablation study on group index encoding. 
t indicates the frequency in Equation 6.\n\nTable 4: Ablation study on the shortcut connection. Note that all settings include the “last layer” mentioned in Table 5.\n\nstage | shortcut | accuracy (%)\nV | w/o | 84.6\nV | w/ | 84.8\nV + IV | w/o | 83.1\nV + IV | w/ | 85.1\n\nTable 5: Ablation study on integrated stages.\n\napproach | accuracy (%)\nbaseline | 83.3\nlast layer | 84.4\nStage V | 84.5\nlast layer + Stage V | 84.8\nlast layer + Stage V + IV | 85.1\nlast layer + Stage V + IV + III | 85.0\n\nTable 6: Comparison of channel grouping constraints.\n\ngrouping mechanism | accuracy (%)\nw/o constraints | 79.8\nconstraints in MA-CNN [9] | 83.2\nours (Equation 4) | 85.1\n\nGroup index encoding. The group index encoding is added before aggregating over groups, which preserves the group index information in the aggregated feature. Table 3 shows the impact of group index encoding with different frequencies (i.e., t in Equation 6). It can be observed that group index encoding improves classification accuracy by 0.6%, and the frequency t should be small since the encoding dimension, i.e., N/G, is small (typically 16 or 32).\n\nShortcut connection. The shortcut connection facilitates the network in two aspects: 1) fusing the original and bilinear features, and 2) enabling straight backward propagation. Table 4 shows the impact of the shortcut connection for the semantic group bilinear block in different stages. It can be observed that the shortcut connection brings 0.2% accuracy gains in Stage V, and 2.0% accuracy gains in Stage V + IV. Note that without the shortcut connection, additionally applying semantic group bilinear in Stage IV brings an accuracy drop. Such a drop is caused by an optimization problem, which can be solved by the shortcut connection together with batch normalization, where the “scale” parameter is supposed to be initialized with zero.\n\nTable 7: Comparison in terms of classification accuracy on the CUB-200-2011, Stanford-Car, and FGVC-Aircraft datasets. The “dimension” column indicates the dimension of the last-layer bilinear feature.\n\napproach | backbone | dimension | CUB-200-2011 | Stanford-Car | Aircraft\nCompact Bilinear [14] | ResNet-50 | 14k | 81.6 | 88.6 | 81.6\nKernel Pooling [37] | ResNet-50 | 14k | 84.7 | 91.1 | 85.7\niSQRT-COV [13] | ResNet-50 | 8k | 87.3 | 91.7 | 89.5\niSQRT-COV [13] | ResNet-101 | 32k | 88.1 | 92.8 | 90.0\nDBTNet-50 (ours) | - | 2k | 87.5 | 94.1 | 91.2\nDBTNet-101 (ours) | - | 2k | 88.1 | 94.5 | 91.6\n\nIntegrated stages. Table 5 shows the ablation study on integrated stages. It can be observed that deeply integrating DBT into Stage V and Stage IV brings 0.7% accuracy gains compared to only applying it over the last layer. Since the proposed DBT takes advantage of semantic information, integrating DBT into Stage III with its low-level features cannot further improve the performance. Thus we integrate DBT into Stage IV, Stage V, and the last layer in our DBTNet.\n\nGrouping constraints. The proposed group bilinear requires the intra-group channels to be highly correlated, and the proposed semantic grouping can better satisfy such requirements than MA-CNN [9]. Specifically, MA-CNN [9] adopts the idea of k-means, which optimizes each channel toward its cluster center, while the proposed grouping method optimizes the correlations of intra-group and inter-group channels in a pairwise manner (as shown in Equation 4). 
Moreover, we conducted experiments replacing our grouping loss with that of MA-CNN [9], and the results in Table 6 also show the effectiveness of our proposed grouping module.\n\n4.3 Comparison with the state-of-the-art\n\nFine-grained image recognition benchmarks: Table 7 shows the comparison with other bilinear-based fine-grained recognition methods on three competitive datasets, i.e., CUB-200-2011 [2], Stanford-Car [3], and FGVC-Aircraft [33], with an input image size of 448 × 448. Since our DBT block is built on the ResNet block, we compare with other methods that also use ResNet as the backbone. It can be observed that the proposed DBT significantly outperforms Compact Bilinear [14] and Kernel Pooling [37] on all three datasets. Compared to the state-of-the-art iSQRT-COV [13], which conducts matrix normalization over bilinear features, we also obtain a large margin of accuracy gains on two of the three datasets, i.e., Stanford-Car and FGVC-Aircraft. Moreover, we see further gains by integrating DBT into the deeper ResNet-101. For DBTNet-101, we integrated DBT into the last layer, Stage V, and the last 6 layers of Stage IV.\n\nLarge-scale image recognition benchmarks: To further evaluate the proposed DBTNet on large-scale image datasets, we conduct experiments on iNaturalist-2017 [34]. We compare the performance of ResNet-50 and DBTNet-50 with 224 × 224 input images, and observe that DBTNet-50 (62.0%) outperforms ResNet-50 (59.9%) by 2.1%. Moreover, our DBTNet-50 obtains 1.6% accuracy gains over ResNet-50 on the ImageNet [36] dataset for the general image recognition task.\n\n5 Conclusion\n\nIn this paper, we propose a novel deep bilinear transformation (DBT) block, which can be integrated into deep convolutional neural networks. DBT takes advantage of semantic information and can obtain bilinear features efficiently by calculating pairwise interactions within a semantic group. 
A highly-modularized network, DBTNet, can be obtained by stacking DBT blocks with convolutional layers, and the deeply integrated bilinear representations enable DBTNet to achieve new state-of-the-art results on several fine-grained image recognition tasks. Since semantic information can only be obtained from high-level features, we will study conducting deep bilinear transformation over low-level features in the future. Moreover, we will explore integrating matrix normalization into DBT in an efficient way, to further leverage the bilinear representations.
Acknowledgement: This work was supported by the National Key R&D Program of China under Grant 2017YFB1300201, the National Natural Science Foundation of China (NSFC) under Grants 61622211 and 61620106009, as well as the Fundamental Research Funds for the Central Universities under Grant WK2100100030.
References
[1] Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, & Peter N Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2011–2018, 2014.
[2] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, & P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[3] Jonathan Krause, Michael Stark, Jia Deng, & Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV Workshops, pages 554–561, 2013.
[4] Linjie Yang, Ping Luo, Chen Change Loy, & Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In CVPR, pages 3973–3981, 2015.
[5] Jonathan Krause, Hailin Jin, Jianchao Yang, & Li Fei-Fei. Fine-grained recognition without part annotations. In CVPR, pages 5546–5555, 2015.
[6] Steve Branson, Grant Van Horn, Serge J. Belongie, & Pietro Perona. Bird species categorization using pose normalized deep convolutional nets.
In BMVC, 2014.
[7] Jianlong Fu, Heliang Zheng, & Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4438–4446, 2017.
[8] Xiu-Shen Wei, Chen-Wei Xie, Jianxin Wu, & Chunhua Shen. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
[9] Heliang Zheng, Jianlong Fu, Tao Mei, & Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, pages 5209–5217, 2017.
[10] Tsung-Yu Lin, Aruni RoyChowdhury, & Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, pages 1449–1457, 2015.
[11] Tsung-Yu Lin & Subhransu Maji. Improved bilinear pooling with CNNs. arXiv preprint arXiv:1707.06772, 2017.
[12] Peihua Li, Jiangtao Xie, Qilong Wang, & Wangmeng Zuo. Is second-order information helpful for large-scale visual recognition? In ICCV, pages 2070–2078, 2017.
[13] Peihua Li, Jiangtao Xie, Qilong Wang, & Zilin Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, June 2018.
[14] Yang Gao, Oscar Beijbom, Ning Zhang, & Trevor Darrell. Compact bilinear pooling. In CVPR, pages 317–326, 2016.
[15] Shu Kong & Charless Fowlkes. Low-rank bilinear pooling for fine-grained classification. In CVPR, pages 7025–7034, 2017.
[16] Yanghao Li, Naiyan Wang, Jiaying Liu, & Xiaodi Hou. Factorized bilinear models for image recognition. In ICCV, pages 2079–2087, 2017.
[17] Michael Lam, Behrooz Mahasseni, & Sinisa Todorovic. Fine-grained recognition as HSnet search for informative image parts. In CVPR, July 2017.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep residual learning for image recognition.
In CVPR, pages 770–778, 2016.
[19] Zilin Gao, Jiangtao Xie, Qilong Wang, & Peihua Li. Global second-order pooling convolutional networks. In CVPR, pages 3024–3033, 2019.
[20] Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, & Zheng Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In CVPR, pages 842–850, 2015.
[21] Bo Zhao, Xiao Wu, Jiashi Feng, Qiang Peng, & Shuicheng Yan. Diversified visual attention networks for fine-grained object classification. CoRR, abs/1606.08572, 2016.
[22] Alex Krizhevsky, Ilya Sutskever, & Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[23] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, & Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017.
[24] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, & Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[25] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, & Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, pages 6848–6856, 2018.
[26] Gao Huang, Shichen Liu, Laurens Van der Maaten, & Kilian Q Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions. In CVPR, pages 2752–2761, 2018.
[27] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, & Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. ICLR, 2017.
[28] Jin-Hwa Kim, Jaehyun Jun, & Byoung-Tak Zhang. Bilinear attention networks. In NIPS, pages 1571–1581, 2018.
[29] Min Lin, Qiang Chen, & Shuicheng Yan. Network in network.
ICLR, 2014.
[30] Xiaopeng Zhang, Hongkai Xiong, Wengang Zhou, Weiyao Lin, & Qi Tian. Picking deep filter responses for fine-grained image recognition. In CVPR, pages 1134–1142, 2016.
[31] Marcel Simon & Erik Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, pages 1143–1151, 2015.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, & Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[33] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, & A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[34] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, & Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
[35] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, & Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[36] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, & Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
[37] Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, & Serge J Belongie. Kernel pooling for convolutional neural networks.
In CVPR, 2017.