{"title": "Virtual Class Enhanced Discriminative Embedding Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1942, "page_last": 1952, "abstract": "Recently, learning discriminative features to improve the recognition performances gradually becomes the primary goal of deep learning, and numerous remarkable works have emerged. In this paper, we propose a novel yet extremely simple method Virtual Softmax to enhance the discriminative property of learned features by injecting a dynamic virtual negative class into the original softmax. Injecting virtual class aims to enlarge inter-class margin and compress intra-class distribution by strengthening the decision boundary constraint. Although it seems weird to optimize with this additional virtual class, we show that our method derives from an intuitive and clear motivation, and it indeed encourages the features to be more compact and separable. This paper empirically and experimentally demonstrates the superiority of Virtual Softmax, improving the performances on a variety of object classification and face verification tasks.", "full_text": "Virtual Class Enhanced Discriminative Embedding\n\nLearning\n\nBinghui Chen1, Weihong Deng1, Haifeng Shen2\n1Beijing University of Posts and Telecommunications\n\n2AI Labs, Didi Chuxing, Beijing 100193, China\n\nchenbinghui@bupt.edu.cn, whdeng@bupt.edu.cn, shenhaifeng@didiglobal.com\n\nAbstract\n\nRecently, learning discriminative features to improve the recognition performances\ngradually becomes the primary goal of deep learning, and numerous remarkable\nworks have emerged. In this paper, we propose a novel yet extremely simple\nmethod Virtual Softmax to enhance the discriminative property of learned features\nby injecting a dynamic virtual negative class into the original softmax. Injecting\nvirtual class aims to enlarge inter-class margin and compress intra-class distribution\nby strengthening the decision boundary constraint. 
Although it seems weird to optimize with this additional virtual class, we show that our method derives from an intuitive and clear motivation, and it indeed encourages the features to be more compact and separable. This paper empirically and experimentally demonstrates the superiority of Virtual Softmax, improving the performances on a variety of object classification and face verification tasks.

1 Introduction

In the community of deep learning, the Softmax layer is widely adopted as a supervisor at the top of the model due to its simplicity and differentiability, for example in object classification [9-11, 39] and face recognition [29, 30, 27, 36]. However, many research works take it for granted and overlook the fact that the features learned by Softmax are only separable, not discriminative; thus the performances of deep models in many recognition tasks are limited. Moreover, a few research works concentrate on learning discriminative features by refining this commonly used softmax layer, e.g. L-Softmax [22] and A-Softmax [21]. However, they require an annealing-like training procedure that is controlled by hand, and thus are difficult to transfer to new tasks. To this end, we propose an automatic counterpart dedicated to learning discriminative features.

In standard softmax, the input pattern x_i (with label y_i) is classified by evaluating the inner product between its feature vector X_i and the class anchor vector W_j (if not specified, the bias b is removed, which does not affect the final performances of DCNNs, as verified in [22, 21]). Since the inner product $W_j^T X_i$ can be rewritten as $\|W_j\|\|X_i\|\cos\theta_j$, where $\theta_j$ is the angle between the vectors X_i and W_j, the linear softmax classifier is mainly determined by the angle¹ and thus can be regarded as an angle classifier.
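The inner-product/angle decomposition above can be checked numerically; a minimal sketch with toy anchors and a toy feature (illustrative values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D setup: 4 class anchors W_j (rows) and one feature X_i,
# illustrating W_j^T X_i = ||W_j|| ||X_i|| cos(theta_j).
W = rng.normal(size=(4, 2))
X = rng.normal(size=2)

logits = W @ X
cos_theta = logits / (np.linalg.norm(W, axis=1) * np.linalg.norm(X))
assert np.allclose(logits, np.linalg.norm(W, axis=1) * np.linalg.norm(X) * cos_theta)

# Under the assumption ||W_j|| = l (equal norms), the predicted class
# depends only on the angles: argmax of the logits == argmin of theta_j.
W_eq = W / np.linalg.norm(W, axis=1, keepdims=True)
theta = np.arccos(np.clip((W_eq @ X) / np.linalg.norm(X), -1.0, 1.0))
assert np.argmax(W_eq @ X) == np.argmin(theta)
```

With equal-norm anchors the linear classifier and the pure angle classifier agree, which is the sense in which softmax is "mainly determined by the angle".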
Moreover, as shown in [33], the feature distribution learned by softmax is 'radial', as in Fig.1. Thus, learning angularly discriminative features is the bedrock of a softmax-based classifier. Here, we first succinctly elucidate the direct motivation of our Virtual Softmax. For a 4-class classification problem, as illustrated in Fig.1.(a)², one can observe that when optimizing with the original 4 classes, the

Figure 1: Illustration of angularly distributed features on 2-D space. (a) shows features learned by the original 4-way softmax. (b) shows features learned by 8-way softmax, where there are 4 additionally hand-injected negative classes. W denotes the class anchor vector.

¹Elucidated in [22]. Here for simplicity, assume that all $\|W_j\| = l$.
²An N-dimensional space complicates the analysis but has a similar mechanism to the 2-D space.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

decision boundary for class 1 overlaps with that for class 2, i.e.
there is no inter-class angular margin. Obviously, this constraint of overlapped boundaries is not enough to ensure intra-class compactness and inter-class separability, and thus contributes little to recognition performance improvement. In contrast, in the expanded 8-class case exhibited in Fig.1.(b), the decision boundaries for classes 1 and 2 have a large inter-class angular margin with each other due to the additionally injected class 5 (which does not take part in the final recognition evaluation), and the constrained region of class 1 is much tighter than before (the other classes show the same phenomenon), yielding more compact and separable features which are highly beneficial to the original 4-class recognition. Thus, artificially injected virtual classes (e.g. W_5 ~ W_8) can be dedicated to encouraging a larger decision margin among the original classes (e.g. classes 1 ~ 4) and thus produce angularly discriminative features.

Naturally motivated by this, we propose Virtual Softmax, a novel yet extremely simple technique to improve recognition performances by training with an extra dynamic virtual class. Injecting this virtual class imposes a stronger and continuous constraint on the decision boundaries so as to produce angularly discriminative features. Concretely, generalizing Softmax to Virtual Softmax offers the chance of enlarging the inter-class angular margin and tightening the intra-class distribution, and allows recognition performance to improve via the introduced large angular margin. Different from L-Softmax [22] and A-Softmax [21], our work aims at extending Softmax to an automatic counterpart, which can be easily applied with little or no human interaction. Fig.2 verifies our technique: the features learned by Virtual Softmax turn out to be much more compact and well separated.
And the main contributions of this paper are summarized as follows:
• We propose Virtual Softmax to automatically enhance the representation power of learned features by employing a virtual negative class. The learned features are much more compact and separable, as illustrated in Fig.2. To the best of our knowledge, this is the first work to employ an additional virtual class in Softmax to optimize feature learning and improve recognition performance.
• The injected virtual class derives from a natural and intuitive motivation, and intends to force $\theta_{y_i}$ towards zero, a much stronger constraint than that of softmax. Moreover, it is dynamic and adaptive, pursuing an automatic training procedure without incurring much computational complexity or memory consumption.
• Extensive experiments have been conducted on several datasets, including MNIST [17], SVHN [23], CIFAR10/100 [16], CUB200 [35], ImageNet32 [5], LFW [12] and SLLFW [6]. Our Virtual Softmax achieves competitive and appealing performances, validating its effectiveness.

2 Related Work

Since the training of a deep model is dominated and guided by the top objective loss, and this objective function can be equipped with more semantic information, many research works pursue stronger representations by refining the supervisor at the top of the deep model. In [29], a contrastive loss is employed to enlarge the inter-class Euclidean distance and to reduce the intra-class Euclidean distance. It was later generalized to the triplet loss, whose major idea is to simultaneously maximize the inter-class distance and minimize the intra-class distance; it is widely applied in numerous computer vision tasks, such as face verification [27], person re-identification [4], and fine-grained classification [34].
Moreover, many tuple-based methods have been developed and perform better on their corresponding tasks, such as Lifted loss [24] and N-pair loss [28, 2]. However, these tuple-based methods constrain the feature learning with multiple instances, and thus require elaborate manipulation of the tuple mining procedure, which is computationally expensive and performance-sensitive. In addition, inspired by linear discriminant analysis, center loss [36] breaks away from the tuple-based idea and shows better performances. However, our Virtual Softmax differs from these methods in that (1) the model is optimized with only the single Virtual Softmax loss, not a joint loss function (e.g. softmax + contrastive loss), and (2) our method can also be applied to improve classification performances, while the above methods are not as good on classification tasks. Compared to L-Softmax [22] and A-Softmax [21], this paper starts from a totally different idea: it encourages a large margin among classes via an additionally injected virtual class, and dedicates itself to an automatic method, i.e. Virtual Softmax.

3 Intuition and Motivation

In this section, we give a toy example to introduce the immediate motivation of our Virtual Softmax. Define the i-th input data x_i with its corresponding label y_i, where y_i ∈ [1 ... C] and C is the number of classes. Define the j-th class anchor vector in the softmax classifier as W_j, where j ∈ [1 ... C]. The output feature of the deep model is defined as X_i.
From the visualization, it can be observed that our Virtual Softmax possesses a stronger power of forcing compact and separable features.

In standard softmax, the optimization objective is to minimize the following cross-entropy loss:

$L_i = -\log \frac{e^{W_{y_i}^T X_i}}{\sum_{j=1}^{C} e^{W_j^T X_i}}$   (1)

Figure 3: Toy example for the non-uniform distribution case. If the class number increases, the constrained regions become more compact than before.

Here we omit the bias b; it does not affect the performance. Obviously, the original softmax optimization forces $W_{y_i}^T X_i > W_j^T X_i, \forall j \neq y_i$, in order to correctly classify the input x_i. Thus we obtain the following property of softmax, which is the immediate incentive of our Virtual Softmax.

Property 1. As the number of class anchor vectors increases, the constrained region of each class (e.g. the area of each class in Fig.1, 3) becomes more and more compact, i.e. θ gradually decreases (e.g. in Fig.1, 3).

Proof. For simplicity, we consider the 2-D c-class case (the n-D c-class case generalizes as well but is more complicated). First, assume $\|W_j\| = l, \forall j$, and that the anchors are evenly distributed in the feature space; in other words, every two adjacent vectors have the same angle $\Phi = \frac{2\pi}{c}$. As stated above, softmax forces $W_{y_i}^T X_i > \max_{j \in c, j \neq y_i}(W_j^T X_i)$ in order to correctly classify x_i, i.e. $l\|X_i\|\cos\theta_{y_i} > \max_{j \in c, j \neq y_i}(l\|X_i\|\cos\theta_j) \Rightarrow \cos\theta_{y_i} > \max_{j \in c, j \neq y_i}(\cos\theta_j)$, where $\theta_j$ denotes the angle between the feature vector X_i and the class vector W_j. Hence, the decision boundary for class y_i is $W_{y_i}^T X = \max_{j \in c, j \neq y_i} W_j^T X$, i.e.
$\cos\theta_{y_i} = \max_{j \in c, j \neq y_i}(\cos\theta_j)$. We define θ as the angle between $W_{y_i}$ and the decision boundary of class y_i, and the solution is $\theta = \theta_{y_i} = \theta_{\arg\max_{j, j \neq y_i}(W_j^T X)} = \frac{\Phi}{2}$ (see Fig.1.(a) for a toy example). Considering the distribution symmetry of W, the angular range of the constrained region for class y_i is $2\theta = \frac{2\pi}{c}$, which is inversely proportional to c. Thus, if the class number increases, the constrained region of each class becomes more and more compact. Moreover, for the case of $\|W_{y_i}\| \neq \|W_j\|$ and a non-uniform distribution, the analysis is a little more complicated. Because the lengths of the W are different, the feasible angles of class y_i and class j are also different (see the decision boundaries before injecting new classes in Fig.3.(a)). Normally, the larger $\|W_{y_i}\|$ is, the larger the feasible region of class y_i is [22]. Still, as illustrated in Fig.3.(b), if the class number increases, i.e. more new classes are injected into the same space, the constrained region of each class becomes more compact as well.

4 Virtual Softmax

Naturally motivated by Property 1, a simple way to introduce a large margin between classes, achieving intra-class compactness and inter-class separability, is to constrain the features with more and more additional classes. In other words, injecting additional classes spaces out the original classes, resulting in a margin between these original classes.
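The shrinking-region effect of Property 1 can be verified numerically; a small sketch under the proof's assumptions (evenly spaced unit anchors on the 2-D plane), measuring the angular width of the region won by class 0:

```python
import numpy as np

def class0_region_width(C, n_angles=100_000):
    # C evenly spaced unit anchors; a direction belongs to class 0 when
    # W_0^T x > W_j^T x for all j != 0 (the softmax decision rule).
    phis = 2 * np.pi * np.arange(C) / C
    W = np.stack([np.cos(phis), np.sin(phis)], axis=1)
    angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    X = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    frac = np.mean(np.argmax(X @ W.T, axis=1) == 0)  # fraction of directions won
    return frac * 2 * np.pi                          # angular width of the region

for C in (4, 8, 16):
    # Property 1: the region width is 2*pi/C, shrinking as classes are added.
    assert abs(class0_region_width(C) - 2 * np.pi / C) < 1e-2
```

Doubling the number of anchors halves each class's angular region, which is exactly why the injected classes in Fig.1.(b) tighten the regions of the original four classes.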
Concretely, for a given classification task (namely, the number of categories to be classified is fixed), the additionally injected classes (if inserted at the exact positions) will introduce new and more rigorous decision boundaries for the original classes, compressing their intra-class distributions; furthermore, the originally overlapped decision boundaries of adjacent categories will be forced to separate from each other, producing a margin between these decision boundaries and thus encouraging inter-class separability (see Fig.1 for a toy example). Therefore, injecting extra classes for feature supervision is a bold yet reasonable idea to enhance the discriminative property of features, and the objective function can be formulated as:

$L_i = -\log \frac{e^{W_{y_i}^T X_i}}{\sum_{j=1}^{C+K} e^{W_j^T X_i}} = -\log \frac{e^{\|W_{y_i}\|\|X_i\|\cos\theta_{y_i}}}{\sum_{j=1}^{C+K} e^{\|W_j\|\|X_i\|\cos\theta_j}}$   (2)

where K is the number of extra injected classes. From Eq. 2, it can be observed that it is more likely to acquire a compact region for class y_i, quantified by the angle $\theta_{y_i}$, since the angle $\theta_{y_i}$ is optimized over a much larger set (C+K classes)³.

Theoretically, the larger K is, the better the features are. Nevertheless, in practical cases where the available training samples are limited and the total number of classes is unchangeable, it is intractable to enlarge the angular margin with real, existing extra classes. Moreover, a more troublesome problem is that we cannot insert the extra classes exactly between the originally adjacent categories, due to the random initialization of the class anchor vectors W before training and the dynamic nature of the parameter updates during optimization.

To address the aforementioned issues, we introduce a single, dynamic negative class into the original softmax. This negative class is constructed on the basis of the current training instance x_i. Since there is no real training data belonging to this class and it is employed only as a negative category (i.e. it is utilized only to assist the training of the original C classes and need not be treated as a positive class), we denote it a virtual negative class and obtain the following formulation of our proposed Virtual Softmax:

$L = \frac{1}{N}\sum_{i=1}^{N} L_i = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^T X_i}}{\sum_{j=1}^{C} e^{W_j^T X_i} + e^{W_{virt}^T X_i}}, \quad \text{where } W_{virt} = \frac{\|W_{y_i}\| X_i}{\|X_i\|}$   (3)

and N is the training batch size. In the above equation, we instead require $W_{y_i}^T X_i \geq \max(W_1^T X_i, \ldots, W_C^T X_i, W_{virt}^T X_i)$; it is a special case of Eq. 2 where K = 1. Replacing a large and fixed set of virtual classes (like the K classes in Eq. 2) with a single dynamic virtual class $W_{virt}$ incurs nearly zero extra computational cost and memory consumption compared to the original softmax.

From Eq. 3, one can observe that this virtual class is tactfully inserted around class y_i; in particular, it is inserted at the same position as X_i, as illustrated in Fig.4.(a). This well matches our motivation of inserting a negative class between the originally adjacent classes $W_{y_i}$ and $W_j$, and this negative class never stops pushing X_i towards $W_{y_i}$ until they overlap with each other, due to the dynamic nature of X_i during the training procedure.
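Eq. 3 can be sketched in a few lines of numpy; this is an illustrative sketch of the loss (not the authors' released code), noting that the virtual logit reduces to $W_{virt}^T X_i = \|W_{y_i}\|\|X_i\|$:

```python
import numpy as np

def softmax_ce(logits, y):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def virtual_softmax_loss(X, W, y):
    """Sketch of Eq. 3: append one virtual logit per sample,
    W_virt^T X_i = ||W_{y_i}|| * ||X_i||, then cross-entropy over C+1 classes.
    X: (N, d) features, W: (d, C) class anchors, y: (N,) integer labels."""
    logits = X @ W                                          # W_j^T X_i, (N, C)
    virt = np.linalg.norm(W[:, y], axis=0) * np.linalg.norm(X, axis=1)
    return softmax_ce(np.concatenate([logits, virt[:, None]], axis=1), y)

rng = np.random.default_rng(0)
X, W = rng.normal(size=(8, 16)), rng.normal(size=(16, 10))
y = rng.integers(0, 10, size=8)

# W_virt^T X_i >= W_{y_i}^T X_i (equality only at theta_{y_i} = 0), so the
# extra denominator term makes the loss at least the plain softmax loss.
assert virtual_softmax_loss(X, W, y) > softmax_ce(X @ W, y)
```

The comparison at the end reflects the text's point: the virtual class keeps penalizing the sample until $X_i$ aligns with $W_{y_i}$, whereas plain softmax is satisfied much earlier.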
Moreover, from another optimization perspective, in order to correctly classify x_i, Virtual Softmax forces $W_{y_i}^T X_i$ to be the largest among the (C+1) inner product values. Since $W_{virt}^T X_i = \|W_{y_i}\|\|X_i\|$, the only way to achieve $W_{y_i}^T X_i \geq \max_{j \in C+1}(W_j^T X_i) = W_{virt}^T X_i$, i.e. $\|W_{y_i}\|\|X_i\|\cos\theta_{y_i} \geq \|W_{y_i}\|\|X_i\|$, is to optimize $\theta_{y_i} = 0$. In contrast, the original softmax optimizes $\|W_{y_i}\|\cos\theta_{y_i} \geq \max_{j \in C}(\|W_j\|\cos\theta_j)$, i.e. $\theta_{y_i} \leq \min_{j \in C}(\arccos(\frac{\|W_j\|}{\|W_{y_i}\|}\cos\theta_j))$ (briefly, if all $\|W_j\|$ have the same magnitude, the original softmax optimizes $\theta_{y_i} \leq \min_{j \in C}(\theta_j)$). Obviously, the original softmax only makes $\theta_{y_i}$ smaller than a certain value, while our Virtual Softmax optimizes a much more rigorous objective, i.e. a zero limit on $\theta_{y_i}$, aiming to produce more compact and separable features. Under this optimization goal, the new decision boundary of class y_i overlaps with the class anchor $W_{y_i}$, which is stricter than softmax.

Optimization: Virtual Softmax can be optimized with standard SGD and back-propagation. In backward propagation, the gradients are computed as follows:

$\frac{\partial L_i}{\partial X_i} = \frac{\sum_{j=1}^{C} e^{W_j^T X_i} W_j + e^{W_{virt}^T X_i} W_{virt}}{\sum_{j=1}^{C} e^{W_j^T X_i} + e^{W_{virt}^T X_i}} - W_{y_i}, \quad \frac{\partial L_i}{\partial W_{y_i}} = \frac{e^{W_{y_i}^T X_i} X_i + e^{W_{virt}^T X_i} \|X_i\| \frac{W_{y_i}}{\|W_{y_i}\|}}{\sum_{j=1}^{C} e^{W_j^T X_i} + e^{W_{virt}^T X_i}} - X_i$   (4)

³Assume that all the K classes are injected at the exact positions among the original C classes.

Figure 4: (a) shows the virtual class by the green arrow. (b) exhibits the feature layer in a CNN.
(c) illustrates the feature update: the blue arrow represents the X′ obtained by the original softmax and the red arrow represents the X′ obtained by Virtual Softmax.

$\frac{\partial L_i}{\partial W_j} = \frac{e^{W_j^T X_i}}{\sum_{j=1}^{C} e^{W_j^T X_i} + e^{W_{virt}^T X_i}} X_i, \quad \text{where } j \neq y_i$   (5)

5 Discussion

Beyond the above analysis that Virtual Softmax optimizes a much more rigorous objective than the original Softmax for learning discriminative features, in this section we give interpretations from several other perspectives, i.e. coupling decay (Sec. 5.1) and feature update (Sec. 5.2). We also provide a visualization of the learned features for intuitive understanding (Sec. 5.3).

5.1 Interpretation from Coupling Decay

Here we give a macroscopic analysis from the viewpoint of coupling decay, which can be regarded as a regularization strategy. Observing Eq. 3, $L_i$ can be reformulated as $L_i = -W_{y_i}^T X_i + \log(\sum_{j=1}^{C} e^{W_j^T X_i} + e^{\|W_{y_i}\|\|X_i\|})$. Performing a first-order Taylor expansion of the second term, a term $\|W_{y_i}\|\|X_i\|$ shows up. Therefore, minimizing Eq. 3 is, to some extent, minimizing $\|W_{y_i}\|\|X_i\|$, which can be viewed as a coupling decay term, i.e. data-dependent weight decay and weight-dependent data decay. It regularizes the norms of both the feature representation and the parameters of the classifier layer, and thus improves the generalization ability of deep models by reducing over-fitting. Moreover, this is supported by experimental results: the feature norm in Fig.2 is decreased by Virtual Softmax (e.g. from 100 to 50 in the 2-D space), and the performance improvement over the original Softmax increases when using a much wider network, as in Sec.6.1 (e.g.
in CIFAR100/100+, when increasing the model width from t=4 to t=7 the performance improvement keeps rising, since increasing the dimensionality of the parameters while keeping the dataset size constant calls for stronger regularization).

However, the reason for calling the above analysis macroscopic is that it coarsely throws away many other related terms and considers only the single effect of $\|W_{y_i}\|\|X_i\|$, without taking into account the collaborative effect of the other terms. Thus some other phenomena cannot be explained well, e.g. why the inter-class angular margin is increased as shown in Fig.2, and why the confusion matrix of Virtual Softmax in Fig.5 shows a more compact intra-class distribution and a more separable inter-class distribution than the original Softmax. To this end, we provide another discussion below, from a relatively microscopic view, of how the individual data representations of Virtual Softmax and the original Softmax are constrained and formed in the feature space.

5.2 Interpretation from Feature Update

Here, considering the collaborative effect of all the terms in Eq. 3, we give another microscopic interpretation from the perspective of the feature update, which is also a strong justification of our method. In part, it reveals why the features learned by Virtual Softmax are much more compact and separable than those of the original softmax.

To simplify the analysis, we only consider the update in a linear feature layer, i.e. the input to this linear feature layer is fixed. As illustrated in Fig.4.(b), the feature vector X is computed as $X = w^T Z$⁴, where Z is the input vector and w is the weight matrix of this linear layer. We denote the k-th element of the vector X as $X_k$ and the connected weight vector (i.e.
the k-th column of w) as $w_k$. Thus $X_k = w_k^T Z$. After computing the partial derivative $\frac{\partial L}{\partial w_k}$, the parameters are updated by SGD as $w_k' = w_k - \alpha \frac{\partial L}{\partial w_k}$, where α is the learning rate. Since Z is fixed, in the next training iteration the new feature output $X_k'$ can be computed from the updated $w_k'$ as:

$X_k' = (w_k')^T Z = \left(w_k - \alpha \frac{\partial L}{\partial w_k}\right)^T Z = w_k^T Z - \alpha \left(\frac{\partial L}{\partial X_k} \frac{\partial X_k}{\partial w_k}\right)^T Z = X_k - \alpha \frac{\partial L}{\partial X_k} Z^T Z = X_k - \alpha \|Z\|^2 \frac{\partial L}{\partial X_k}$   (6)

⁴Although the following discussion is based on this assumption, the actual uses of activation functions (like the piecewise linear functions ReLU and PReLU) and the bias b do not affect the final performances.

Thus, from Eq. 6 it can be inferred that the holistic feature vector X′ is obtained as $X' = X - \beta \frac{\partial L}{\partial X}$, where $\beta = \alpha \|Z\|^2$, implying that updating the weight parameters w with SGD implicitly updates the output feature in a similar way. Based on this observation, putting the partial derivatives $\frac{\partial L}{\partial X}$ of Softmax and Virtual Softmax into Eq. 6 respectively, we obtain the corresponding updated features for Softmax (Eq. 7) and Virtual Softmax (Eq.
8) respectively:

$X' = X + \beta\left(W_{y_i} - \frac{\sum_{j=1}^{C} e^{W_j^T X} W_j}{\sum_{j=1}^{C} e^{W_j^T X}}\right)$   (7)

$X' = X + \beta\left(W_{y_i} - \frac{\sum_{j=1}^{C} e^{W_j^T X} W_j + e^{W_{virt}^T X} W_{virt}}{\sum_{j=1}^{C} e^{W_j^T X} + e^{W_{virt}^T X}}\right)$   (8)

Since the above equations are complicated to discuss directly, here for simplicity we give an approximate analysis. Consider a well-trained feature (i.e. $e^{W_{y_i}^T X} \gg e^{W_j^T X}$ and $e^{W_{virt}^T X} \gg e^{W_j^T X}, \forall j \neq y_i$) and assume the parameters in Softmax and Virtual Softmax are the same. Then, omitting the relatively smaller terms $\sum_{j=1, j \neq y_i}^{C} e^{W_j^T X} W_j$ in the numerators, Eqs. 7 and 8 can be separately approximated as:

$X' = X + \beta\Big(W_{y_i} - \underbrace{\frac{e^{W_{y_i}^T X} W_{y_i}}{\sum_{j=1}^{C} e^{W_j^T X}}}_{vector2}\Big)$   (9)

$X' = X + \beta\Big(W_{y_i} - \underbrace{\frac{e^{W_{y_i}^T X} W_{y_i}}{\sum_{j=1}^{C} e^{W_j^T X} + e^{W_{virt}^T X}}}_{vector1}\Big) - \beta\underbrace{\frac{e^{W_{virt}^T X} W_{virt}}{\sum_{j=1}^{C} e^{W_j^T X} + e^{W_{virt}^T X}}}_{vector3}$   (10)

From Eqs. 9 and 10, we consider the magnitudes and directions of 'vector1', 'vector2' and 'vector3'; the updated X′ of softmax and Virtual Softmax can then be easily illustrated as in Fig.4.(c). It can first be observed that the norm of the red arrow is smaller than that of the blue arrow, showing a result similar to the analysis in Sec.5.1, i.e. Virtual Softmax regularizes the feature norm.
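The implicit feature update of Eq. 6, $X' = X - \alpha\|Z\|^2 \frac{\partial L}{\partial X}$, can be checked directly; a small numpy sketch with a fixed input Z and an arbitrary stand-in gradient (the `tanh` gradient is an assumption for illustration, not a loss from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=5)          # fixed input to the linear feature layer
w = rng.normal(size=(5, 3))     # weights; feature X = w^T Z
alpha = 0.1

X = w.T @ Z
g = np.tanh(X)                  # stand-in for dL/dX from some loss

# One SGD step on the weights: dL/dw = outer(Z, dL/dX), since X = w^T Z.
w_new = w - alpha * np.outer(Z, g)
X_new = w_new.T @ Z

# Eq. 6: the implied feature update is X' = X - alpha * ||Z||^2 * dL/dX.
assert np.allclose(X_new, X - alpha * np.linalg.norm(Z) ** 2 * g)
```

Whatever the loss, one SGD step on w moves the feature opposite to its own gradient, scaled by $\|Z\|^2$, which is the step used to derive Eqs. 7-10.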
Meanwhile, one can also observe another important phenomenon: the feature vector X′ optimized by Virtual Softmax has a much smaller angle θ to the class anchor $W_{y_i}$ than that of the original Softmax, well explaining why the features learned by Virtual Softmax are much more compact and separable. In summary, by considering the collaborative effect of many terms, we can better understand the working principle of our Virtual Softmax: it not only provides regularization but, more importantly, intensifies the discriminative property of the learned features.

Although this is an approximate analysis, it gives, to some extent, a heuristic interpretation from a novel feature-update perspective of why Virtual Softmax is capable of encouraging feature discrimination. And practically, even without these assumptions, Virtual Softmax does produce discriminative features, as validated by the visualization in Sec. 5.3 and the experimental results in Sec. 6.

5.3 Visualization of Compactness and Separability

In order to highlight that Virtual Softmax indeed encourages discriminative feature learning, we provide a clear visualization of the learned features on the MNIST dataset [17] in 2-D and 3-D space, as shown in Fig.2. One can observe that the features learned by our Virtual Softmax are much more compact and well separated, with a larger inter-class angular margin and a tighter intra-class distribution than softmax, and Virtual Softmax consistently improves the performances in both the 2-D (99.2% vs. 98.91%) and 3-D (99.38% vs. 99.13%) cases. Furthermore, we also visualize the learned features in k-D space (where k > 3) with the confusion matrix. The confusion matrix comparison between the original softmax and our Virtual Softmax is shown in Fig.5. Specifically, we compute the included angle cosine (i.e.
cos θ) between any two feature vectors extracted from the testing splits of the CIFAR10/100/100+ [16] datasets.

Layer           | CIFAR10        | CIFAR100/100+  | MNIST (for Fig.2)       | MNIST          | SVHN
Block1          | [3x3,32×t]x5   | [3x3,32×t]x5   | [5x5,32×t]x2, padding 2 | [3x3,32×t]x5   | [3x3,32×t]x4
Pool1           | Max-Pooling    | Max-Pooling    | Max-Pooling             | Max-Pooling    | Max-Pooling
Block2          | [3x3,64×t]x4   | [3x3,64×t]x4   | [5x5,64×t]x2, padding 2 | [3x3,64×t]x4   | [3x3,32×t]x4
Pool2           | Max-Pooling    | Max-Pooling    | Max-Pooling             | Max-Pooling    | Max-Pooling
Block3          | [3x3,128×t]x4  | [3x3,128×t]x4  | [5x5,128×t]x2, padding 2| [3x3,128×t]x4  | [3x3,32×t]x4
Pool3           | Max-Pooling    | Ave-Pooling    | Max-Pooling             | Max-Pooling    | Max-Pooling
Fully Connected | 512            | 64             | 2/3                     | 64             | 64

Table 1: Model architectures for different benchmarks. [3x3,32×t]x4 denotes 4 cascaded convolutional layers with 32×t filters of size 3x3. The toy models have t = 1. Max-Pooling uses a 3×3 kernel with stride 2.

Figure 5: Cosine confusion matrices on the test splits of CIFAR10/100/100+. On each dataset, both Softmax and Virtual Softmax use the same CNN architecture.

From Fig.5, it can be observed that, on all three datasets, the intra-class similarities are enhanced and the inter-class similarities are reduced when training with our Virtual Softmax. Moreover, our Virtual Softmax improves the classification accuracies over the original softmax by nearly 1%, 2% and 1.5% on CIFAR10/100/100+, respectively. To summarize, all the aforementioned experiments demonstrate that our Virtual Softmax indeed serves as an efficient algorithm with a much stronger power of enlarging the inter-class angular margin and compressing the intra-class distribution than softmax, and thus can significantly improve recognition performances.

5.4 Relationship to Other Methods

There are a few research works concentrating on refining the Softmax objective, e.g. L-Softmax [22], A-Softmax [21] and Noisy Softmax [3].
The works [22, 21] require manually selecting a tractable constraint and need a careful, annealing-like training procedure under human control. In contrast, our Virtual Softmax can be easily trained end-to-end with little or no human intervention, since the margin constraint is introduced automatically by the virtual class. Moreover, our method differs from L-Softmax and A-Softmax in the heuristic idea of training deep models with an additional virtual class rather than only the given ones; it is the first work to extend the original Softmax to employ a virtual class for discriminative feature learning, and may inspire other researchers. It can also provide implicit regularization for deep models. The method of [3] aims to improve the generalization ability of DCNNs by injecting noise into Softmax, which starts from a different perspective. In summary, our Virtual Softmax comes from a clear and unusual motivation, injecting a dynamic virtual class to enhance the features, which differs from the other listed methods [22, 21, 3], even though all of these methods intend to learn better features.

6 Experiments and Results

We evaluate our Virtual Softmax on object classification tasks and on face verification tasks. For fair comparison, we train the same network for both Virtual Softmax and the softmax baseline.

Small-Set Object Classification: We follow [39, 22, 3] to devise our CNN models. Denoting by t the widening factor, which multiplies the basic number of filters in each convolutional layer, our architecture configurations are listed in Table 1.
For training, the initial learning rate is 0.1 and is divided by 10 at (20k, 27k) iterations on CIFAR100 and at (12k, 18k) on the other datasets; the corresponding total iterations are 30k and 20k.
Fine-grained Object Classification: We fine-tune some popular models pre-trained on ImageNet [26], including GoogLeNetV1 [31] and GoogLeNetV2 [32], by replacing the last softmax layer with our Virtual Softmax. The learning rates are fixed to 0.0001 for the pre-trained layers and 0.001 for the randomly initialized layer, and training stops at 30k iterations.
Large-Set Object Classification: Here, we use the CIFAR100 network with t = 7; the learning rate starts from 0.1 and is divided by 10 at 20k, 40k and 60k iterations. The maximal iteration is 70k.
Face Verification: We employ the published ResNet model of [21] to evaluate our Virtual Softmax. We start with a learning rate of 0.1, divide it by 10 at (30k, 50k) iterations and stop training at 70k.

Figure 6: Testing Acc (%) on CIFAR10/100/100+ datasets.

Compared Methods: We take L-Softmax (LS) [22], Noisy-Softmax (NS) [3] and A-Softmax (AS) [21] as the compared methods and re-implement them with the same experimental configurations as ours. All of our experiments are implemented in Caffe [14]. The models are trained on one TitanX, with different batch sizes for different networks. For data preprocessing, we follow NS [3].
For testing, we use the original softmax to classify the test data in classification tasks, and cosine distance to evaluate performance in face verification tasks.

Table 2: Recognition error rates (%) on CIFAR10/100 datasets. t is the widening factor and + denotes data augmentation.

CIFAR10          t = 1   t = 2   t = 3   t = 4
Softmax          9.84    8.2     7.76    7.15
Virtual Softmax  8.95    7.27    7.04    6.68
Improvement      0.77    0.93    0.72    0.47

CIFAR100         t = 2   t = 3   t = 4   t = 7
Softmax          32.72   30.74   28.76   27.7
Virtual Softmax  29.84   28.4    27.81   26.02
Improvement      2.88    2.34    0.95    1.68

CIFAR100+        t = 2   t = 3   t = 4   t = 7
Softmax          30.44   28.59   27.21   25.52
Virtual Softmax  28.62   26.81   26.17   24.01
Improvement      1.82    1.78    1.04    1.51

6.1 Ablation Study on Network Width
As listed in Table 1, our toy network on each dataset has a widening factor of t = 1. We expand the toy models by setting t = 1, 2, 3, 4 on CIFAR10 and t = 2, 3, 4, 7 on CIFAR100, and the experimental results are listed in Table 2. From these results, for example on CIFAR100+, it can be observed that as the model width increases (i.e. from t = 2 to t = 7) the recognition error rate of Virtual Softmax decreases, and our method achieves a consistent performance gain over the original softmax across different networks, verifying its robustness. Furthermore, when training with Virtual Softmax, the recognition error rates of wider models are consistently lower than those of thinner models across all these datasets, indicating that Virtual Softmax does not easily suffer from over-fitting and supporting the analysis in Sec. 5.1.
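The face-verification protocol mentioned at the start of this section scores a pair of embeddings by their cosine distance and thresholds it. A minimal sketch follows; the function names and the placeholder threshold are illustrative (in practice the threshold is tuned on held-out pairs):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_identity(a, b, threshold=0.5):
    """Declare a match when the embeddings' cosine similarity exceeds the
    threshold; 0.5 is only a placeholder, not a value from the paper."""
    return cosine_similarity(a, b) > threshold
```

Since the decision depends only on the angle between embeddings, it matches the angular-margin view of the learned features: compressing intra-class angles and enlarging inter-class angles directly improves this verification score.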
We also plot the testing-accuracy curves on CIFAR10/100/100+ in Fig. 6; one can observe that our Virtual Softmax is easy to optimize, with a convergence speed similar to that of the original softmax.
6.2 Evaluation on object datasets
MNIST [17], SVHN [23] and CIFAR [16] are popular small-scale object classification datasets with different numbers of classes. From Table 3, it can be observed that on both MNIST and SVHN the Virtual Softmax not only surpasses the original softmax with the same network (i.e. 0.28% vs. 0.35% on MNIST, 1.93% vs. 2.11% on SVHN) but also outperforms LS [22], NS [3] and AS [21], showing the effectiveness of our method. We also report experimental results on the CIFAR datasets in Table 4. Specifically, our Virtual Softmax improves accuracy by nearly 0.6%, 2% and 1.5% over the baseline softmax on CIFAR10/100/100+, respectively. Meanwhile, it outperforms all of the other methods on both CIFAR10 and CIFAR100; e.g. on CIFAR100+ it surpasses both ResNet-110 [9] and DenseNet-40 [11], which are much deeper and more complex than our architecture, as well as the listed compared methods.
CUB200 [35] is a popular fine-grained object classification dataset. The comparison between other state-of-the-art works and our Virtual Softmax is shown in Table 5, where V1 and V2 denote the corresponding GoogLeNet models. One can observe that Virtual Softmax outperforms the baseline softmax with both pre-trained models, and surpasses all the compared methods LS [22], NS [3] and AS [21].
Additionally, training with only the Virtual Softmax, our final result is comparable to those of other remarkable works which exploit auxiliary attention and alignment models, showing the superiority of our method.

Table 7: Acc (%) on ImageNet32.

Method            Top1    Top5
Softmax           47.63   73.14
NS*[3]            47.96   73.25
LS*[22]           48.59   73.82
AS*[21]           48.66   73.57
Virtual Softmax   48.84   74.06

Method              MNIST(%)   SVHN(%)
Maxout [7]          0.45       2.47
DSN [19]            0.39       1.92
R-CNN [20]          0.31       1.77
WRN [39]            -          1.85
DisturbLabel [38]   0.33       2.19
Noisy Softmax [3]   0.33       -
L-Softmax [22]      0.31       -
Softmax             0.35       2.11
NS*[3]              0.32       2.04
LS*[22]             0.30       2.01
AS*[21]             0.31       2.04
Virtual Softmax     0.28       1.93

Table 3: Recognition error rates on MNIST and SVHN.
* denotes our reproducing.

Table 4: Recognition error rates on CIFAR datasets. + denotes data augmentation. * denotes our reproducing.

Method              CIFAR10(%)   CIFAR100(%)   CIFAR100+(%)
GenPool [18]        7.62         32.37         -
DisturbLabel [38]   9.45         32.99         -
Noisy Softmax [3]   7.39         28.48         -
L-Softmax [22]      7.58         29.53         -
ACU [13]            7.12         27.47         26.63
ResNet-110 [9]      -            -             25.16
Densenet-40 [11]    7.00         27.55         24.42
Softmax             7.15         27.7          25.52
NS*[3]              6.91         26.33         25.20
LS*[22]             6.77         26.18         24.32
AS*[21]             6.83         26.09         24.11
Virtual Softmax     6.68         26.02         24.01

Table 5: Accuracy results on CUB200. * denotes our reproducing.

Method                   CUB(%)
Pose Normalization [1]   75.7
Part-based RCNN [40]     76.4
VGG-BGLm [41]            80.4
PG Alignment [15]        82.8
Softmax (V1)             73.5
NS*[3] (V1)              74.8
LS*[22] (V1)             76.5
AS*[21] (V1)             75.2
Virtual Softmax (V1)     77.1
Softmax (V2)             77.2
NS*[3] (V2)              77.9
LS*[22] (V2)             80.5
AS*[21] (V2)             80.2
Virtual Softmax (V2)     81.1

Table 6: Verification results (%) on LFW/SLLFW. * denotes our reproducing.

Method                  Models   LFW     SLLFW
DeepID2+ [30]           25       99.47   -
VGG [25]                1        97.27   88.13
Lightened CNN [37]      1        98.13   91.22
L-Softmax [22]          1        98.71   -
Center loss [36]        1        99.05   -
Noisy Softmax [3]       1        99.18   94.50
Normface [33]           1        99.19   -
A-Softmax [21]          1        99.42   -
Softmax                 1        99.10   94.59
NS*[3]                  1        99.16   94.75
LS*[22]                 1        99.37   95.58
AS*[21]+Normface*[33]   1        99.57   96.45
Virtual Softmax         1        99.46   95.85

ImageNet32 [5] is a downsampled version of the large-scale ImageNet dataset [26]: it contains exactly the same images as the original dataset, but at 32x32 resolution. The results are in Table 7; one can observe that Virtual Softmax performs the best.
6.3 Evaluation on face datasets
LFW [12] is a popular face verification benchmark. SLLFW [6] generalizes the original LFW protocol to a more difficult verification task: the images all come from LFW, but the test pairs are harder to verify. For data preprocessing, we follow the method in [3], and then augment the training data by random mirroring. We adopt the publicly available ResNet model from A-Softmax [21] and train it on the cleaned subset of [8]. The final results are listed in Table 6. One can observe that our Virtual Softmax markedly improves the performance over the baseline softmax on both LFW and SLLFW, e.g.
from 99.10% to 99.46% and from 94.59% to 95.85%, respectively. It also outperforms both NS [3] and LS [22], showing the effectiveness of our method. Moreover, since the number of training classes is very large, i.e. 67k, the optimization procedure is more difficult than in object classification tasks; therefore, if the training phase is eased by human guidance and control, performance improves further, e.g. AS* [21]+Normface* [33] achieve the best results when the optimal margin constraints and feature scale are chosen artificially and specifically for the current training set.
7 Conclusion
In this paper, we propose a novel but extremely simple method, Virtual Softmax, to enhance the discriminative property of learned features by encouraging a larger angular margin between classes. It derives from a clear motivation and generalizes the optimization goal of the original softmax to a more rigorous one, i.e. the zero limit of θ_{y_i}. Moreover, it also has heuristic interpretations from the feature-update and coupling-decay perspectives. Extensive experiments on both object classification and face verification tasks validate that our Virtual Softmax significantly outperforms the original softmax and indeed serves as an efficient feature-enhancing method.
Acknowledgments: This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61573068 and 61871052, Beijing Nova Program under Grant No. Z161100004916088, and sponsored by DiDi GAIA Research Collaboration Initiative.

References

[1] S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.

[2] B. Chen and W. Deng.
Almn: Deep embedding learning with geometrical virtual point generating. arXiv\n\npreprint arXiv:1806.00974, 2018.\n\n[3] B. Chen, W. Deng, and J. Du. Noisy softmax: Improving the generalization ability of dcnn via postponing\nthe early softmax saturation. In The IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), July 2017.\n\n[4] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identi\ufb01cation by multi-channel parts-based\ncnn with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), pages 1335\u20131344, 2016.\n\n[5] P. Chrabaszcz, I. Loshchilov, and F. Hutter. A downsampled variant of imagenet as an alternative to the\n\ncifar datasets. arXiv preprint arXiv:1707.08819, 2017.\n\n[6] W. Deng, J. Hu, N. Zhang, B. Chen, and J. Guo. Fine-grained face veri\ufb01cation: Fglfw database, baselines,\n\nand human-dcmn partnership. Pattern Recognition, 66:63\u201373, 2017.\n\n[7] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv\n\npreprint arXiv:1302.4389, 2013.\n\n[8] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face\n\nrecognition. In European Conference on Computer Vision (ECCV), pages 87\u2013102. Springer, 2016.\n\n[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the\n\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770\u2013778, 2016.\n\n[10] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.\n[11] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks.\n\narXiv preprint arXiv:1608.06993, 2016.\n\n[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for\nstudying face recognition in unconstrained environments. 
Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[13] Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. arXiv preprint arXiv:1703.09076, 2017.

[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.

[15] J. Krause, H. Jin, J. Yang, and L. Fei-Fei. Fine-grained recognition without part annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5546–5555, 2015.

[16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[17] Y. LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[18] C.-Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In International conference on artificial intelligence and statistics, 2016.

[19] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.

[20] M. Liang and X. Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3367–3375, 2015.

[21] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. arXiv preprint arXiv:1704.08063, 2017.

[22] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 507–516, 2016.

[23] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng.
Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.

[24] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4004–4012, 2016.

[25] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, volume 1, page 6, 2015.

[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[27] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.

[28] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems (NIPS), pages 1849–1857, 2016.

[29] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems (NIPS), pages 1988–1996, 2014.

[30] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2892–2900, 2015.

[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

[32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna.
Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.

[33] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: l_2 hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017.

[34] Y. Wang, J. Choi, V. Morariu, and L. S. Davis. Mining discriminative triplets of patches for fine-grained classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1163–1172, 2016.

[35] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-ucsd birds 200. 2010.

[36] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision (ECCV), pages 499–515. Springer, 2016.

[37] X. Wu, R. He, and Z. Sun. A lightened cnn for deep face representation. arXiv preprint arXiv:1511.02683, 2015.

[38] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. Disturblabel: Regularizing cnn on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4753–4762, 2016.

[39] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[40] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In European conference on computer vision (ECCV), pages 834–849. Springer, 2014.

[41] F. Zhou and Y. Lin. Fine-grained image classification by exploring bipartite-graph labels.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1124–1133, 2016.