{"title": "Dual Variational Generation for Low Shot Heterogeneous Face Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 2674, "page_last": 2683, "abstract": "Heterogeneous Face Recognition (HFR) is a challenging issue because of the large domain discrepancy and a lack of heterogeneous data. This paper considers HFR as a dual generation problem, and proposes a novel Dual Variational Generation (DVG) framework. It generates large-scale new paired heterogeneous images with the same identity from noise, for the sake of reducing the domain gap of HFR. Specifically, we first introduce a dual variational autoencoder to represent a joint distribution of paired heterogeneous images. Then, in order to ensure the identity consistency of the generated paired heterogeneous images, we impose a distribution alignment in the latent space and a pairwise identity preserving in the image space. Moreover, the HFR network reduces the domain discrepancy by constraining the pairwise feature distances between the generated paired heterogeneous images. Extensive experiments on four HFR databases show that our method can significantly improve state-of-the-art results. When using the generated paired images for training, our method gains more than 18\\% True Positive Rate improvements over the baseline model when False Positive Rate is at $10^{-5}$.", "full_text": "Dual Variational Generation for Low Shot\n\nHeterogeneous Face Recognition\n\nChaoyou Fu1,2\u2217, Xiang Wu1\u2217, Yibo Hu1, Huaibo Huang1, Ran He1,2,3\u2020\n\n1NLPR & CRIPAC, CASIA\n\n2University of Chinese Academy of Sciences\n\n3Center for Excellence in Brain Science and Intelligence Technology, CAS\n{chaoyou.fu, rhe}@nlpr.ia.ac.cn, alfredxiangwu@gmail.com\n\n{yibo.hu, huaibo.huang}@cripac.ia.ac.cn\n\nAbstract\n\nHeterogeneous Face Recognition (HFR) is a challenging issue because of the large\ndomain discrepancy and a lack of heterogeneous data. This paper considers HFR\nas a dual generation problem, and proposes a novel Dual Variational Generation\n(DVG) framework. It generates large-scale new paired heterogeneous images with\nthe same identity from noise, for the sake of reducing the domain gap of HFR.\nSpeci\ufb01cally, we \ufb01rst introduce a dual variational autoencoder to represent a joint\ndistribution of paired heterogeneous images. Then, in order to ensure the identity\nconsistency of the generated paired heterogeneous images, we impose a distribution\nalignment in the latent space and a pairwise identity preserving in the image space.\nMoreover, the HFR network reduces the domain discrepancy by constraining the\npairwise feature distances between the generated paired heterogeneous images. Ex-\ntensive experiments on four HFR databases show that our method can signi\ufb01cantly\nimprove state-of-the-art results.\n\n1\n\nIntroduction\n\nWith the development of deep learning, face recognition has made signi\ufb01cant progress [34, 2] in recent\nyears. However, in many real-world applications, such as video surveillance, facial authentication on\nmobile devices and computer forensics, it is still a great challenge to match heterogeneous face images\nin different modalities, including sketch images [37], near infrared images [24] and polarimetric\nthermal images [36]. Heterogeneous face recognition (HFR) has attracted much attention in the face\nrecognition community. 
Due to the large domain gap, one challenge is that a face recognition model trained on VIS data often degrades significantly for HFR. Therefore, many cross-domain feature matching methods [10] have been introduced to reduce the large domain gap between heterogeneous face images. However, since it is expensive and time-consuming to collect a large number of heterogeneous face images, there is no public large-scale heterogeneous face database. With such limited training data, CNNs trained for HFR tend to overfit.\n\nRecently, the great progress of high-quality face synthesis [38, 5, 33, 39] has made \u201crecognition via generation\u201d possible. TP-GAN [16] and CAPG-GAN [13] introduce face synthesis to improve the quantitative performance of large pose face recognition. For HFR, [32] proposes a two-path model to synthesize VIS images from NIR images. [36] utilizes a GAN based multi-stream feature fusion technique to generate VIS images from polarimetric thermal faces. However, all these methods are based on a conditional image-to-image translation framework, leading to two potential challenges.\n\n\u2217Equal Contribution\n\u2020Corresponding Author\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: The diversity comparisons between the conditional image-to-image translation [32] (left part: the top is the input NIR image and the bottom is the corresponding translated VIS image) and our unconditional DVG (right part: all paired heterogeneous images are generated from noise). For the conditional image-to-image translation methods, given one NIR image, a generator synthesizes only one new VIS image with the same attributes (e.g., the pose and the expression) except for the spectral information. In contrast, DVG generates massive new paired images with rich intra-class diversity from noise.\n\n1) Diversity: Given one image, a generator only synthesizes one new image of the target domain [32]. This means such conditional image-to-image translation methods can only generate a limited number of images. In addition, as shown in the left part of Fig. 1, the two images before and after translation share the same attributes (e.g., the pose and the expression) except for the spectral information, which makes it difficult for such methods to promote intra-class diversity. These problems are particularly prominent in low-shot heterogeneous face recognition, i.e., learning from few heterogeneous data. 2) Consistency: When generating large-scale samples, it is challenging to guarantee that the synthesized face images belong to the same identity as the input images. Although the identity preserving loss [13] constrains the distances between features of the input and synthesized images, it does not constrain the intra-class and inter-class distances of the embedding space.\n\nTo tackle the above challenges, we propose a novel unconditional Dual Variational Generation (DVG) framework (shown in Fig. 3) that generates large-scale paired heterogeneous images with the same identity from noise. Unconditional generative models can generate new images from noise (one image at a time) [21], but since these images do not have identity labels, it is difficult to use them to train recognition networks. DVG exploits this ability of unconditional generative models [21] to synthesize new images, and adopts a dual generation manner that yields a pair of heterogeneous images with the same identity at each sampling. This enables DVG to generate large-scale images and makes the generated images usable for optimizing recognition networks. Meanwhile, DVG also absorbs the various intra-class variations of the training database, so that the generated pairs exhibit abundant intra-class diversity. For instance, as presented in the right part of Fig. 1, the first four pairs have different poses, and the fifth pair has different expressions. Furthermore, DVG only attends to the identity consistency of the paired heterogeneous images rather than to which specific identity a pair belongs, which avoids the consistency problem of previous methods. Specifically, we introduce a dual variational autoencoder to learn a joint distribution of paired heterogeneous images. In order to constrain the generated paired images to belong to the same identity, we impose both a distribution alignment in the latent space and a pairwise identity preserving in the image space. New paired images are generated by sampling a noise vector from a standard Gaussian distribution and copying it to both latent codes, as displayed in the left part of Fig. 3 and sketched below. These generated paired images are used to optimize the HFR network by a pairwise distance constraint, aiming at reducing the domain discrepancy.
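To make the dual sampling idea concrete, the following minimal PyTorch-style sketch decodes a single noise vector into an aligned NIR-VIS pair; the DualDecoder module, its linear layers and all tensor shapes are hypothetical placeholders for illustration, not the authors' released implementation:
import torch
import torch.nn as nn

class DualDecoder(nn.Module):
    # Toy stand-in for the decoder p_theta(x_N, x_V | z_I); illustrative only.
    def __init__(self, z_dim=256, img_dim=128 * 128):
        super().__init__()
        self.net_nir = nn.Linear(z_dim, img_dim)  # NIR branch
        self.net_vis = nn.Linear(z_dim, img_dim)  # VIS branch
    def forward(self, z_joint):
        return self.net_nir(z_joint), self.net_vis(z_joint)

z_dim, batch = 128, 4
decoder = DualDecoder(z_dim=2 * z_dim)
# Sample one noise vector per pair and copy it into both modality slots, so the
# NIR and VIS latent codes are identical and the decoded pair shares an identity.
eps = torch.randn(batch, z_dim)
z_joint = torch.cat([eps, eps], dim=1)
x_nir, x_vis = decoder(z_joint)  # one aligned NIR-VIS pair per noise sample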
In summary, the main contributions are as follows:\n\n\u2022 We provide a new insight into the problems of HFR. That is, we consider HFR as a dual generation problem, and propose a novel dual variational generation framework. This framework generates new paired heterogeneous images with abundant intra-class diversity to reduce the domain gap of HFR.\n\n\u2022 In order to guarantee that the generated paired images belong to the same identity, we constrain the consistency of the paired images in both the latent space and the image space. This allows new images sampled from noise to be used to train recognition networks.\n\n\u2022 We can sample large-scale diverse paired heterogeneous images from noise. By constraining the pairwise feature distances of the generated paired images in the HFR network, the domain discrepancy is effectively reduced.\n\n\u2022 Experiments on the CASIA NIR-VIS 2.0, the Oulu-CASIA NIR-VIS, the BUAA-VisNir and the IIIT-D Viewed Sketch databases demonstrate that our method can generate photo-realistic images and significantly improve the performance of recognition.\n\nFigure 2: The dual generation results via noise (256 \u00d7 256 resolution). For each pair, the left is the NIR image and the right is the paired VIS image.\n\n2 Background and Related Work\n\n2.1 Heterogeneous Face Recognition\n\nMany researchers have paid attention to Heterogeneous Face Recognition (HFR). For feature-level learning, [22] employs HOG features with sparse representation for HFR. [7] utilizes LBP histograms with Linear Discriminant Analysis to obtain domain-invariant features. [10] proposes Invariant Deep Representation (IDR) to disentangle representations into two orthogonal subspaces for NIR-VIS HFR. Further, [11] extends IDR by introducing the Wasserstein distance to obtain domain-invariant features for HFR. For image-level learning, the common idea is to transform heterogeneous face images from one modality into another via image synthesis. [19] utilizes joint dictionary learning to reconstruct face images for boosting the performance of face matching. [23] proposes a cross-spectral hallucination and low-rank embedding to synthesize heterogeneous images in a patch-wise manner.\n\n2.2 Generative Models\n\nVariational autoencoders (VAEs) [21] and generative adversarial networks (GANs) [6] are the most prominent generative models. VAEs consist of an encoder network q\u03c6(z|x) and a decoder network p\u03b8(x|z). q\u03c6(z|x) maps input images x to latent variables z that match a prior p(z), and p\u03b8(x|z) samples images x from the latent variables z. The evidence lower bound objective (ELBO) of VAEs is:\n\nlog p\u03b8(x) \u2265 Eq\u03c6(z|x) log p\u03b8(x|z) \u2212 DKL(q\u03c6(z|x)||p(z)). (1)\n\nThe two terms in the ELBO are a reconstruction error and a Kullback-Leibler divergence, respectively. Differently, GANs adopt a generator G and a discriminator D to play a min-max game. G generates images from a prior p(z) to confuse D, and D is trained to distinguish between generated data and real data. This adversarial rule takes the form:\n\nmin_G max_D Ex\u223cpdata(x)[log D(x)] + Ez\u223cpz(z)[log(1 \u2212 D(G(z)))]. (2)\n\nThey have achieved remarkable success in various applications, such as unconditional image generation that generates images from noise [20, 15], and conditional image generation that synthesizes images according to a given condition [32, 16]. According to [15], VAEs have nice manifold representations, while GANs are better at generating sharper images.
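As a concrete reference for Eq. (1), here is a minimal sketch of a VAE training loss in PyTorch, assuming a Gaussian posterior with diagonal covariance and a Bernoulli decoder; the tiny one-layer encoder and decoder are illustrative placeholders, not the architecture used in this paper:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    # Illustrative VAE: q_phi(z|x) = N(mu, diag(sigma^2)), Bernoulli p_theta(x|z).
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # outputs [mu, log-variance]
        self.dec = nn.Linear(z_dim, x_dim)
    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def elbo_loss(x, logits, mu, logvar):
    # Negative ELBO of Eq. (1): reconstruction error plus KL(q(z|x) || N(0, I)).
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

vae = TinyVAE()
x = torch.rand(8, 784)  # dummy batch of images scaled to [0, 1]
logits, mu, logvar = vae(x)
loss = elbo_loss(x, logits, mu, logvar)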
Another work that addresses a problem similar to ours is CoGAN [25], which uses a weight-sharing scheme to generate paired images in two different modalities. However, CoGAN explicitly constrains the identity consistency of paired images neither in the latent space nor in the image space. It is therefore challenging for the weight-sharing scheme of CoGAN to generate paired images with the same identity, as shown in Fig. 4.\n\nFigure 3: The purpose (left part) and training model (right part) of our unconditional DVG framework. DVG generates large-scale new paired heterogeneous images with the same identity from standard Gaussian noise, aiming at reducing the domain discrepancy for HFR. To achieve this purpose, we elaborately design a dual variational autoencoder. Given a pair of heterogeneous images from the same identity, the dual variational autoencoder learns a joint distribution in the latent space. In order to guarantee the identity consistency of the generated paired images, we impose a distribution alignment in the latent space and a pairwise identity preserving in the image space.\n\n3 Proposed Method\n\nIn this section, we introduce our method in detail, including the dual variational generation and the heterogeneous face recognition. Note that for clarity of presentation we discuss the NIR-VIS case; other heterogeneous modalities are handled analogously.\n\n3.1 Dual Variational Generation\n\nAs shown in the right part of Fig. 3, DVG consists of a feature extractor Fip and a dual variational autoencoder: two encoder networks and a decoder network, all of which play the same roles as in VAEs [21]. Specifically, Fip extracts the semantic information of the generated images to preserve the identity information. The encoder network EN maps NIR images xN to a latent distribution zN = q\u03c6N(zN|xN) via the reparameterization trick: zN = uN + \u03c3N \u2299 \u03b5, where uN and \u03c3N denote the mean and standard deviation of the NIR latent distribution, respectively. In addition, \u03b5 is sampled from a multivariate standard Gaussian and \u2299 denotes the Hadamard product. The encoder network EV works in the same manner as EN: zV = q\u03c6V(zV|xV) for VIS images xV. After obtaining the two independent distributions, we concatenate zN and zV to get the joint latent code zI.
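The joint encoding step can be sketched as follows, reusing the reparameterization trick above; the two small encoders are hypothetical stand-ins for EN and EV, and the tensor shapes are chosen only for illustration:
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    # Stand-in for E_N or E_V: predicts (mu, log-variance) of q(z|x).
    def __init__(self, x_dim=3 * 128 * 128, z_dim=128):
        super().__init__()
        self.fc = nn.Linear(x_dim, 2 * z_dim)
    def forward(self, x):
        mu, logvar = self.fc(x.flatten(1)).chunk(2, dim=1)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps  # z = u + sigma (.) eps
        return z, mu, logvar

enc_nir, enc_vis = GaussianEncoder(), GaussianEncoder()
x_nir = torch.randn(4, 3, 128, 128)  # dummy paired batch
x_vis = torch.randn(4, 3, 128, 128)
z_n, mu_n, logvar_n = enc_nir(x_nir)
z_v, mu_v, logvar_v = enc_vis(x_vis)
z_joint = torch.cat([z_n, z_v], dim=1)  # joint latent code z_I fed to the decoder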
Distribution Learning. We utilize VAEs to learn the joint distribution of the paired NIR-VIS images. Given a pair of NIR-VIS images {xN, xV}, we constrain the posterior distributions q\u03c6N(zN|xN) and q\u03c6V(zV|xV) by the Kullback-Leibler divergence:\n\nLkl = DKL(q\u03c6N(zN|xN)||p(zN)) + DKL(q\u03c6V(zV|xV)||p(zV)), (3)\n\nwhere the prior distributions p(zN) and p(zV) are both multivariate standard Gaussian distributions. Like the original VAEs, we require the decoder network p\u03b8(xN, xV|zI) to be able to reconstruct the input images xN and xV from the learned distribution:\n\nLrec = \u2212Eq\u03c6N(zN|xN)\u222aq\u03c6V(zV|xV) log p\u03b8(xN, xV|zI). (4)\n\nDistribution Alignment. We expect a pair of NIR-VIS images {xN, xV} to be projected into a common latent space by the encoders EN and EV, i.e., the NIR distribution p(z(i)N) should be the same as the VIS distribution p(z(i)V), where i denotes the identity. In this way we maintain the identity consistency of the generated paired images in the latent space. Explicitly, we align the NIR and VIS distributions by minimizing the Wasserstein distance between them. Given two Gaussian distributions p(z(i)N) = N(u(i)N, \u03c3(i)N) and p(z(i)V) = N(u(i)V, \u03c3(i)V), the 2-Wasserstein distance between p(z(i)N) and p(z(i)V) is simplified [10] as:\n\nLdist = (1/2) (||u(i)N \u2212 u(i)V||2^2 + ||\u03c3(i)N \u2212 \u03c3(i)V||2^2). (5)\n\nNote that the general 2-Wasserstein distance between Gaussians involves a trace term over the covariance matrices; it reduces to the simple form of Eq. (5) because both posteriors have diagonal covariances.
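A sketch of these latent-space terms, continuing the (mu, log-variance) notation of the encoder sketch above; this is an illustrative rendering of Eqs. (3) and (5), not the authors' code:
import torch

def kl_to_standard_normal(mu, logvar):
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the batch: Eq. (3).
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def wasserstein_alignment(mu_n, logvar_n, mu_v, logvar_v):
    # Simplified 2-Wasserstein distance between the two diagonal Gaussian
    # posteriors of a pair: Eq. (5), averaged over the batch.
    sigma_n = torch.exp(0.5 * logvar_n)
    sigma_v = torch.exp(0.5 * logvar_v)
    return 0.5 * ((mu_n - mu_v).pow(2).sum(dim=1)
                  + (sigma_n - sigma_v).pow(2).sum(dim=1)).mean()

# Example with dummy posterior parameters for a batch of 4 pairs:
mu_n, logvar_n = torch.randn(4, 128), torch.randn(4, 128)
mu_v, logvar_v = torch.randn(4, 128), torch.randn(4, 128)
l_kl = kl_to_standard_normal(mu_n, logvar_n) + kl_to_standard_normal(mu_v, logvar_v)
l_dist = wasserstein_alignment(mu_n, logvar_n, mu_v, logvar_v)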
Pairwise Identity Preserving. In previous image-to-image translation works [16, 13], identity preserving is usually introduced to maintain identity information. The traditional approach uses a pre-trained feature extractor to enforce the features of the generated images to be close to those of the targets. However, owing to the lack of intra-class and inter-class constraints, it is challenging to guarantee that the synthesized images belong to the specific categories of the target images. Considering that DVG generates a pair of heterogeneous images at a time, we only need to consider the identity consistency of the paired images.\n\nSpecifically, we adopt Light CNN [34] as the feature extractor Fip to constrain the feature distance between the reconstructed paired images:\n\nLip-pair = ||Fip(\u02c6xN) \u2212 Fip(\u02c6xV)||2^2, (6)\n\nwhere Fip(\u00b7) denotes the normalized output of the last fully connected layer of Fip. In addition, we also use Fip to make the features of the reconstructed images and the original input images close enough, as in previous works [16, 13]:\n\nLip-rec = ||Fip(\u02c6xN) \u2212 Fip(xN)||2^2 + ||Fip(\u02c6xV) \u2212 Fip(xV)||2^2, (7)\n\nwhere \u02c6xN and \u02c6xV denote the reconstructions of the input paired images xN and xV, respectively.\n\nDiversity Constraint. In order to further increase the diversity of the generated images, we also introduce a diversity loss [27]. In the sampling stage, when two sampled noise vectors zI1 and zI2 are close, the generated images xI1 and xI2 will be similar. We maximize the following loss to encourage the decoder DI to generate more diverse images:\n\nLdiv = max_DI |Fip(xI1) \u2212 Fip(xI2)| / |zI1 \u2212 zI2|. (8)\n\nOverall Loss. Moreover, in order to increase the sharpness of the generated images, we also adopt an adversarial loss Ladv as in [31]. Hence, the overall loss to optimize the dual variational autoencoder can be formulated as\n\nLgen = Lrec + Lkl + Ladv + \u03bb1Ldist + \u03bb2Lip-pair + \u03bb3Lip-rec + \u03bb4Ldiv, (9)\n\nwhere \u03bb1, \u03bb2, \u03bb3 and \u03bb4 are trade-off parameters.
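The image-space terms of Eqs. (6)-(8) can be sketched as below; f_ip stands in for a frozen feature extractor (Light CNN in the paper, but any embedding network fits this sketch), and the dummy extractor and tensor shapes are assumptions made only so the snippet runs:
import torch
import torch.nn.functional as F

def identity_losses(f_ip, x_n, x_v, x_n_rec, x_v_rec):
    # L2-normalize the extractor outputs, as done for the last FC layer of F_ip.
    feats = {k: F.normalize(f_ip(v), dim=1)
             for k, v in dict(n=x_n, v=x_v, nr=x_n_rec, vr=x_v_rec).items()}
    l_ip_pair = (feats['nr'] - feats['vr']).pow(2).sum(dim=1).mean()    # Eq. (6)
    l_ip_rec = ((feats['nr'] - feats['n']).pow(2).sum(dim=1).mean()
                + (feats['vr'] - feats['v']).pow(2).sum(dim=1).mean())  # Eq. (7)
    return l_ip_pair, l_ip_rec

def diversity_loss(f_ip, x_1, x_2, z_1, z_2, eps=1e-8):
    # Eq. (8): ratio of feature distance to latent distance between two samples,
    # maximized with respect to the decoder to encourage diverse outputs.
    num = (f_ip(x_1) - f_ip(x_2)).abs().mean()
    den = (z_1 - z_2).abs().mean() + eps
    return num / den

f_ip = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 128 * 128, 256))  # dummy extractor
x = lambda: torch.randn(4, 3, 128, 128)
l_pair, l_rec = identity_losses(f_ip, x(), x(), x(), x())
l_div = diversity_loss(f_ip, x(), x(), torch.randn(4, 256), torch.randn(4, 256))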
3.2 Heterogeneous Face Recognition\n\nFor heterogeneous face recognition, our training data contain the original limited labeled data xi (i \u2208 {N, V}) and the large-scale generated unlabeled paired NIR-VIS data \u02dcxi (i \u2208 {N, V}). Here, we define a heterogeneous face recognition network F to extract features fi = F(xi; \u0398), where i \u2208 {N, V} and \u0398 denotes the parameters of F. For the original labeled NIR and VIS images, we utilize a softmax loss:\n\nLcls = \u2211i\u2208{N,V} softmax(F(xi; \u0398), y), (10)\n\nwhere y is the identity label.\n\nFor the generated paired heterogeneous images, since they are generated from noise, they carry no specific class labels. But as mentioned in Section 3.1, DVG ensures that the generated paired images belong to the same identity. Therefore, a pairwise distance loss between the paired heterogeneous samples is formulated as follows:\n\nLpair = ||F(\u02dcxN; \u0398) \u2212 F(\u02dcxV; \u0398)||2^2. (11)\n\nIn this way, we can efficiently minimize the domain discrepancy by generating large-scale unlabeled paired heterogeneous images. As stated above, the final loss to optimize the heterogeneous face recognition network can be written as\n\nLhfr = Lcls + \u03b11Lpair, (12)\n\nwhere \u03b11 is the trade-off parameter.
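A minimal sketch of the recognition objective of Eq. (12), assuming a generic embedding-plus-classifier network; the softmax term is written with the usual cross-entropy, and all module names, shapes and the 100-identity label space are illustrative assumptions:
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 256))  # dummy F(.; Theta)
classifier = nn.Linear(256, 100)  # dummy identity classifier
alpha1 = 0.001  # trade-off weight reported in Section 4.2

# Labeled real NIR/VIS images with identity labels.
x_n, x_v = torch.randn(8, 3, 128, 128), torch.randn(8, 3, 128, 128)
y = torch.randint(0, 100, (8,))
# Unlabeled generated pairs (these would come from the trained DVG decoder).
gx_n, gx_v = torch.randn(8, 3, 128, 128), torch.randn(8, 3, 128, 128)

l_cls = (F.cross_entropy(classifier(embed(x_n)), y)
         + F.cross_entropy(classifier(embed(x_v)), y))         # Eq. (10)
l_pair = (embed(gx_n) - embed(gx_v)).pow(2).sum(dim=1).mean()  # Eq. (11)
l_hfr = l_cls + alpha1 * l_pair                                # Eq. (12)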
Table 1: Experimental analyses on the CASIA NIR-VIS 2.0 database. The backbone is LightCNN-9. (a) Quantitative comparisons of different methods. MD (lower is better) is the mean feature distance between the generated paired NIR and VIS images. FID (lower is better) is measured on the features of LightCNN-9 instead of the traditional Inception model. (b) The ablation study.\n\n(a) Method | MD | FID | Rank-1\nCoGAN | 0.61 | 10.6 | 95.2\nVAE | 0.54 | 8.2 | 94.6\nDVG | 0.24 | 7.1 | 99.2\n\n(b) Method | Rank-1\nw/o Ldist | 94.3\nw/o Lip-pair | 96.1\nw/o Ldiv | 98.5\nDVG | 99.2\n\n4 Experiments\n\n4.1 Databases and Protocols\n\nThree NIR-VIS heterogeneous face databases and one Sketch-Photo heterogeneous face database are used to evaluate our proposed method. For NIR-VIS face recognition, following [35], we report Rank-1 accuracy and verification rate (VR)@false accept rate (FAR) for the CASIA NIR-VIS 2.0 [24], the Oulu-CASIA NIR-VIS [18] and the BUAA-VisNir [14] databases. Note that, for the Oulu-CASIA NIR-VIS database, only 20 subjects are selected as the training set. In addition, the IIIT-D Viewed Sketch database [1] is employed for Sketch-Photo face recognition. Due to the small number of images in the IIIT-D Viewed Sketch database, following the protocols of [3], we use the CUHK Face Sketch FERET (CUFSF) database [37] as the training set and report Rank-1 accuracy and VR@FAR=1% for comparisons.\n\n4.2 Experimental Details\n\nFor the dual variational generation, the architectures of the encoder and decoder networks are the same as [15], and the architecture of our discriminator is the same as [31]. These networks are trained with the Adam optimizer at a fixed learning rate of 0.0002. The parameters \u03bb1, \u03bb2, \u03bb3 and \u03bb4 in Eq. (9) are set to 50, 5, 1000 and 0.2, respectively. For the heterogeneous face recognition, we utilize both LightCNN-9 and LightCNN-29 [34] as backbones. The models are pre-trained on the MS-Celeb-1M database [9] and fine-tuned on the HFR training sets. All face images are aligned to 144 \u00d7 144 and randomly cropped to 128 \u00d7 128 as the input for training. Stochastic gradient descent (SGD) is used as the optimizer, where the momentum is set to 0.9 and the weight decay to 5e-4. The learning rate is set to 1e-3 initially and reduced to 5e-4 gradually. The batch size is 64 and the dropout ratio is 0.5. The trade-off parameter \u03b11 in Eq. (12) is set to 0.001 during training.\n\n4.3 Experimental Analyses\n\nIn this section, we analyze three aspects, including identity consistency, distribution consistency and visual quality, to demonstrate the effectiveness of DVG. The compared methods include CoGAN [25] and VAE [21]. For the VAE model, the input is the concatenated NIR-VIS image pair.\n\nIdentity Consistency. In order to analyze the identity consistency, we measure the feature distance between the generated paired images on the CASIA NIR-VIS 2.0 database. Specifically, we first use a pre-trained Light CNN-9 [34] to extract features and then measure the mean distance (MD) of the paired images. The results are reported in Table 1a. MD is computed from 50K generated image pairs, and the MD value of the original database is 0.26. We can clearly see that the MD value of DVG is even smaller than that of the original database, which means that our method can effectively guarantee the identity consistency of the generated paired images. The recognition performance of the different methods is also reported in Table 1a, where DVG correspondingly achieves the best results.\n\nDistribution Consistency. On the CASIA NIR-VIS 2.0 database, we use the Fr\u00e9chet Inception Distance (FID) [12] to measure the Fr\u00e9chet distance of two distributions in the feature space, reflecting the distribution consistency. We first measure the FID between the generated VIS images and the real VIS images, and the FID between the generated NIR images and the real NIR images, respectively. Then we calculate the mean FID as the final result, which is reported in Table 1a. Considering that a face recognition network better extracts features of face images, we use LightCNN-9 to extract features for calculating FID instead of the traditional Inception model. Similarly, FID results are computed from 50K generated image pairs. As shown in Table 1a, DVG achieves the best results, demonstrating that DVG has faithfully learned the distributions of the two modalities.
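The FID computation used here can be sketched with NumPy/SciPy as follows, assuming two feature matrices already extracted by the recognition network (LightCNN-9 in the paper; the extractor itself is not shown, and the dummy data is only for illustration):
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    # Frechet distance between Gaussians fitted to the two feature sets:
    # ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}).
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c_r = np.cov(feats_real, rowvar=False)
    c_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c_r.dot(c_g))
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(c_r + c_g - 2.0 * covmean))

# Dummy 256-dimensional features for illustration:
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(1000, 256)), rng.normal(0.1, 1.0, size=(1000, 256))))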
Figure 4: Visual comparisons of dual image generation results on the CASIA NIR-VIS 2.0 database. The generated paired images of DVG are more similar than those of CoGAN and VAE.\n\nFigure 5: The dual generation results on the Oulu-CASIA NIR-VIS, the BUAA-VisNir and the CUHK Face Sketch FERET (CUFSF) databases.\n\nVisual Quality. In Fig. 4, we compare the dual generation results (128 \u00d7 128 resolution) of different methods on the CASIA NIR-VIS 2.0 database. Our visual results are clearly better than those of CoGAN and VAE. Moreover, we can observe that the generated paired images of VAE and CoGAN are not similar to each other, which leads to worse Rank-1 accuracy when optimizing the HFR network (see Table 1a). More dual generation results of DVG are shown in Fig. 2 (256 \u00d7 256 resolution) and Fig. 5.\n\nAblation Study. Table 1b presents the comparison results of DVG and its three variants on the CASIA NIR-VIS 2.0 database. We observe that the recognition performance decreases whenever one component is removed. In particular, the accuracy drops significantly when the distribution alignment loss Ldist or the pairwise identity preserving loss Lip-pair is not used. These results suggest that every component is crucial in our model.\n\nMoreover, we analyze how the number of generated samples influences the HFR network on the Oulu-CASIA NIR-VIS database, which only contains 20 identities with about 1,000 images for training. We generate 1K, 5K, 10K and 50K pairs of heterogeneous images via DVG, and obtain 68.7%, 85.9%, 89.5% and 89.4% on VR@FAR=0.1% with LightCNN-9, respectively. The results improve significantly as the number of generated pairs increases, suggesting that DVG can boost the performance of low-shot heterogeneous face recognition.\n\n4.4 Comparisons with State-of-the-art Methods\n\nThe recognition performance of our proposed DVG on the four heterogeneous face recognition databases is reported in this section. The performance of state-of-the-art methods, such as IDNet [29], HFR-CNN [30], Hallucination [23], DLFace [28], TRIVET [26], IDR [10], W-CNN [11], RCN [4], MC-CNN [3] and DVR [35], is compared in Table 2. In addition, LightCNN-9 and LightCNN-29 are our baseline methods.\n\nTable 2: Comparisons with other state-of-the-art deep HFR methods on the CASIA NIR-VIS 2.0, the Oulu-CASIA NIR-VIS, the BUAA-VisNir and the IIIT-D Viewed Sketch databases. Entries are Rank-1 accuracy (%) and VR (%) at the indicated FAR; \u2018-\u2019 marks results not reported.\n\nCASIA NIR-VIS 2.0 (Rank-1 | VR@FAR=0.1%): IDNet [29] 87.1 \u00b1 0.9 | 74.5; HFR-CNN [30] 85.9 \u00b1 0.9 | 78.0; Hallucination [23] 89.6 \u00b1 0.9 | -; DLFace [28] 98.68 | -; TRIVET [26] 95.7 \u00b1 0.5 | 91.0 \u00b1 1.3; IDR [10] 97.3 \u00b1 0.4 | 95.7 \u00b1 0.7; W-CNN [11] 98.7 \u00b1 0.3 | 98.4 \u00b1 0.4; DVR [35] 99.7 \u00b1 0.1 | 99.6 \u00b1 0.3; RCN [4] 99.3 \u00b1 0.2 | 98.7 \u00b1 0.2; MC-CNN [3] 99.4 \u00b1 0.1 | 99.3 \u00b1 0.1; LightCNN-9 97.1 \u00b1 0.7 | 93.7 \u00b1 0.8; LightCNN-9 + DVG 99.2 \u00b1 0.3 | 98.8 \u00b1 0.3; LightCNN-29 98.1 \u00b1 0.4 | 97.4 \u00b1 0.5; LightCNN-29 + DVG 99.8 \u00b1 0.1 | 99.8 \u00b1 0.1.\n\nOulu-CASIA NIR-VIS (Rank-1 | VR@FAR=1% | VR@FAR=0.1%): TRIVET 92.2 | 67.9 | 33.6; IDR 94.3 | 73.4 | 46.2; W-CNN 98.0 | 81.5 | 54.6; DVR 100.0 | 97.2 | 84.9; LightCNN-9 93.8 | 80.4 | 43.8; LightCNN-9 + DVG 100.0 | 97.6 | 89.5; LightCNN-29 99.0 | 93.1 | 68.3; LightCNN-29 + DVG 100.0 | 98.5 | 92.9.\n\nBUAA-VisNir (Rank-1 | VR@FAR=1% | VR@FAR=0.1%): TRIVET 93.9 | 93.0 | 80.9; IDR 94.3 | 93.4 | 84.7; W-CNN 97.4 | 96.0 | 91.9; DVR 99.2 | 98.5 | 96.9; LightCNN-9 94.8 | 94.3 | 83.5; LightCNN-9 + DVG 98.0 | 97.1 | 93.1; LightCNN-29 96.8 | 97.0 | 89.4; LightCNN-29 + DVG 99.3 | 98.5 | 97.3.\n\nIIIT-D Viewed Sketch (Rank-1 | VR@FAR=1%): RCN 90.34 | -; MC-CNN 87.40 | -; LightCNN-9 84.07 | 75.30; LightCNN-9 + DVG 86.65 | 92.24; LightCNN-29 83.24 | 81.04; LightCNN-29 + DVG 96.99 | 97.86.\n\nFigure 6: The ROC curves on the CASIA NIR-VIS 2.0, the Oulu-CASIA NIR-VIS and the BUAA-VisNir databases, respectively.\n\nFor the most challenging CASIA NIR-VIS 2.0 database, DVG clearly outperforms the other state-of-the-art methods. We first employ LightCNN-9 as the backbone, with which DVG obtains 99.2% Rank-1 accuracy and 98.8% VR@FAR=0.1%. When the backbone is changed to the more powerful LightCNN-29, DVG obtains 99.8% Rank-1 accuracy and 99.8% VR@FAR=0.1%. Moreover, on the BUAA-VisNir database, DVG obtains 99.3% Rank-1 accuracy and 97.3% VR@FAR=0.1%, outperforming our baseline LightCNN-29 and the other state-of-the-art methods.\n\nTo further analyze the effectiveness of the proposed DVG for low-shot heterogeneous face recognition, we evaluate DVG on the Oulu-CASIA NIR-VIS and the IIIT-D Viewed Sketch databases. As mentioned in Section 4.1, these two databases contain few identities or images. Table 2 presents the performance of DVG on these two challenging low-shot HFR databases. For the Oulu-CASIA NIR-VIS database, DVG with LightCNN-29 significantly boosts VR@FAR=0.1% from 84.9% [35] to 92.9%. Besides, on the IIIT-D Viewed Sketch database, DVG obtains 96.99% Rank-1 accuracy and 97.86% VR@FAR=1%, which outperforms our baseline LightCNN-29 and state-of-the-art methods including RCN and MC-CNN by a large margin.\n\nFig. 6 presents the ROC curves of TRIVET, IDR, W-CNN, DVR, and the proposed DVG. For clarity, we only plot the ROC curves of DVG trained with LightCNN-29. 
It is obvious that DVG outperforms the other state-of-the-art methods, especially on low-shot heterogeneous databases such as the Oulu-CASIA NIR-VIS database.\n\nApart from the commonly used NIR-VIS and Sketch-Photo settings above, we further explore other potential applications, including face recognition under different resolutions on the NJU-ID database [17] and under different poses on the Multi-PIE database [8]. The NJU-ID database consists of 256 identities with one ID card image (102 \u00d7 126 resolution) and one camera image (640 \u00d7 480 resolution) per identity. Considering the small number of images in the NJU-ID database, we use our collected ID-Photo database (1,000 identities) as the training set and the NJU-ID database as the testing set. The Multi-PIE database contains 337 subjects with different poses. We use profiles (\u00b175\u00b0, \u00b190\u00b0) and frontal faces as different modalities. 200 persons are used as the training set and the remaining 137 persons as the testing set (Setting 2 of [13]). On the NJU-ID database, we improve Rank-1 by 5.5% (DVG 96.8% vs. baseline 91.3%) and VR@FAR=1% by 6.2% (DVG 96.7% vs. baseline 90.5%) over the baseline LightCNN-29. On the Multi-PIE database, the Rank-1 accuracy at \u00b190\u00b0 and \u00b175\u00b0 is increased by 18.5% (DVG 83.9% vs. baseline 65.4%) and 4.3% (DVG 97.3% vs. baseline 93.0%), respectively. We will continue to explore more applications in future work.\n\n5 Conclusion\n\nThis paper has developed a novel dual variational generation framework that generates large-scale new paired heterogeneous images with abundant intra-class diversity from noise, providing a new insight into the problems of HFR. A dual variational autoencoder is first proposed to learn a joint distribution of paired heterogeneous images. Then, both a distribution alignment in the latent space and a pairwise distance constraint in the image space are utilized to ensure the identity consistency of the generated image pairs. Finally, DVG generates diverse paired heterogeneous images with the same identity from noise to boost the HFR network. Extensive qualitative and quantitative experimental results on four databases have shown the superiority of our method.\n\nAcknowledgments\n\nThis work is funded by the National Natural Science Foundation of China (Grant No. 61622310) and the Beijing Natural Science Foundation (Grant No. JQ18017).\n\nReferences\n\n[1] Himanshu S. Bhatt, Samarth Bharadwaj, Richa Singh, and Mayank Vatsa. Memetic approach for matching sketches with digital face images. Technical report, 2012.\n\n[2] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 2019.\n\n[3] Zhongying Deng, Xiaojiang Peng, Zhifeng Li, and Yu Qiao. Mutual component convolutional neural networks for heterogeneous face recognition. TIP, 2019.\n\n[4] Zhongying Deng, Xiaojiang Peng, and Yu Qiao. Residual compensation networks for heterogeneous face recognition. In AAAI, 2019.\n\n[5] Chaoyou Fu, Yibo Hu, Xiang Wu, Guoli Wang, Qian Zhang, and Ran He. 
High fidelity face manipulation with extreme pose and expression. arXiv:1903.12003, 2019.\n\n[6] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014.\n\n[7] Debaditya Goswami, Chi-Ho Chan, David Windridge, and Josef Kittler. Evaluation of face recognition system in heterogeneous environments (visible vs NIR). In ICCV Workshops, 2011.\n\n[8] Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-PIE. Image and Vision Computing, 2010.\n\n[9] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.\n\n[10] Ran He, Xiang Wu, Zhenan Sun, and Tieniu Tan. Learning invariant deep representation for NIR-VIS face recognition. In AAAI, 2017.\n\n[11] Ran He, Xiang Wu, Zhenan Sun, and Tieniu Tan. Wasserstein CNN: Learning invariant features for NIR-VIS face recognition. TPAMI, 2018.\n\n[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, G\u00fcnter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.\n\n[13] Yibo Hu, Xiang Wu, Bing Yu, Ran He, and Zhenan Sun. Pose-guided photorealistic face rotation. In CVPR, 2018.\n\n[14] Di Huang, Jia Sun, and Yunhong Wang. The BUAA-VisNir face database instructions. Technical report, 2012.\n\n[15] Huaibo Huang, Zhihang Li, Ran He, Zhenan Sun, and Tieniu Tan. IntroVAE: Introspective variational autoencoders for photographic image synthesis. In NeurIPS, 2018.\n\n[16] Rui Huang, Shu Zhang, Tianyu Li, and Ran He. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In ICCV, 2017.\n\n[17] Jing Huo, Yang Gao, Yinghuan Shi, Wanqi Yang, and Hujun Yin. Heterogeneous face recognition by margin-based cross-modality metric learning. IEEE Transactions on Cybernetics, 2017.\n\n[18] Jie Chen, Dong Yi, Jimei Yang, Guoying Zhao, Stan Z. Li, and Matti Pietikainen. Learning mappings for face synthesis from near infrared to visual light images. In CVPR, 2009.\n\n[19] Felix Juefei-Xu, Dipan K. Pal, and Marios Savvides. NIR-VIS heterogeneous face recognition via cross-spectral joint dictionary learning and reconstruction. In CVPR Workshops, 2015.\n\n[20] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.\n\n[21] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.\n\n[22] Brendan Klare, Zhifeng Li, and Anil K. Jain. Matching forensic sketches to mug shot photos. TPAMI, 2011.\n\n[23] Jos\u00e9 Lezama, Qiang Qiu, and Guillermo Sapiro. Not afraid of the dark: NIR-VIS face recognition via cross-spectral hallucination and low-rank embedding. In CVPR, 2017.\n\n[24] Stan Z. Li, Dong Yi, Zhen Lei, and Shengcai Liao. The CASIA NIR-VIS 2.0 face database. In CVPR Workshops, 2013.\n\n[25] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NeurIPS, 2016.\n\n[26] Xiaoxiang Liu, Lingxiao Song, Xiang Wu, and Tieniu Tan. Transferring deep representation for NIR-VIS heterogeneous face recognition. In ICB, 2016.\n\n[27] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. 
In CVPR, 2019.\n\n[28] Chunlei Peng, Nannan Wang, Jie Li, and Xinbo Gao. DLFace: Deep local descriptor for cross-modality face recognition. PR, 2019.\n\n[29] Christopher Reale, Nasser M. Nasrabadi, Heesung Kwon, and Rama Chellappa. Seeing the forest from the trees: A holistic approach to near-infrared heterogeneous face recognition. In CVPR Workshops, 2016.\n\n[30] Shreyas Saxena and Jakob Verbeek. Heterogeneous face recognition with CNNs. In ECCV Workshops, 2016.\n\n[31] Zhixin Shu, Mihir Sahasrabudhe, R\u0131za Alp G\u00fcler, Dimitris Samaras, Nikos Paragios, and Iasonas Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In ECCV, 2018.\n\n[32] Lingxiao Song, Man Zhang, Xiang Wu, and Ran He. Adversarial discriminative heterogeneous face recognition. In AAAI, 2018.\n\n[33] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017.\n\n[34] Xiang Wu, Ran He, Zhenan Sun, and Tieniu Tan. A light CNN for deep face representation with noisy labels. TIFS, 2018.\n\n[35] Xiang Wu, Huaibo Huang, Vishal M. Patel, Ran He, and Zhenan Sun. Disentangled variational representation for heterogeneous face recognition. In AAAI, 2019.\n\n[36] He Zhang, Benjamin S. Riggan, Shuowen Hu, Nathaniel J. Short, and Vishal M. Patel. Synthesis of high-quality visible faces from polarimetric thermal faces using generative adversarial networks. IJCV, 2019.\n\n[37] Wei Zhang, Xiaogang Wang, and Xiaoou Tang. Coupled information-theoretic encoding for face photo-sketch recognition. In CVPR, 2011.\n\n[38] Jian Zhao, Yu Cheng, Yan Xu, Lin Xiong, Jianshu Li, Fang Zhao, Karlekar Jayashree, Sugiri Pranata, Shengmei Shen, Junliang Xing, Shuicheng Yan, and Jiashi Feng. Towards pose invariant face recognition in the wild. In CVPR, 2018.\n\n[39] Jian Zhao, Lin Xiong, Yu Cheng, Yi Cheng, Jianshu Li, Li Zhou, Yan Xu, Jayashree Karlekar, Sugiri Pranata, Shengmei Shen, Junliang Xing, Shuicheng Yan, and Jiashi Feng. 3D-aided deep pose-invariant face recognition. In IJCAI, 2018.", "award": [], "sourceid": 1536, "authors": [{"given_name": "Chaoyou", "family_name": "Fu", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Xiang", "family_name": "Wu", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Yibo", "family_name": "Hu", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Huaibo", "family_name": "Huang", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Ran", "family_name": "He", "institution": "NLPR, CASIA"}]}