{"title": "Deep Learning Face Representation by Joint Identification-Verification", "book": "Advances in Neural Information Processing Systems", "page_first": 1988, "page_last": 1996, "abstract": "The key challenge of face recognition is to develop effective feature representations for reducing intra-personal variations while enlarging inter-personal differences. In this paper, we show that it can be well solved with deep learning and using both face identification and verification signals as supervision. The Deep IDentification-verification features (DeepID2) are learned with carefully designed deep convolutional networks. The face identification task increases the inter-personal variations by drawing DeepID2 features extracted from different identities apart, while the face verification task reduces the intra-personal variations by pulling DeepID2 features extracted from the same identity together, both of which are essential to face recognition. The learned DeepID2 features can be well generalized to new identities unseen in the training data. On the challenging LFW dataset, 99.15% face verification accuracy is achieved. Compared with the best previous deep learning result on LFW, the error rate has been significantly reduced by 67%.", "full_text": "Deep Learning Face Representation by Joint\n\nIdenti\ufb01cation-Veri\ufb01cation\n\n1Department of Information Engineering, The Chinese University of Hong Kong\n\n2SenseTime Group\n\nYi Sun1\n\nYuheng Chen2\n\nXiaogang Wang3,4\n\nXiaoou Tang1,4\n\n3Department of Electronic Engineering, The Chinese University of Hong Kong\n4Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences\n\nsy011@ie.cuhk.edu.hk chyh1990@gmail.com\nxgwang@ee.cuhk.edu.hk xtang@ie.cuhk.edu.hk\n\nAbstract\n\nThe key challenge of face recognition is to develop effective feature repre-\nsentations for reducing intra-personal variations while enlarging inter-personal\ndifferences. 
In this paper, we show that it can be well solved with deep learning and using both face identification and verification signals as supervision. The Deep IDentification-verification features (DeepID2) are learned with carefully designed deep convolutional networks. The face identification task increases the inter-personal variations by drawing DeepID2 features extracted from different identities apart, while the face verification task reduces the intra-personal variations by pulling DeepID2 features extracted from the same identity together, both of which are essential to face recognition. The learned DeepID2 features can be well generalized to new identities unseen in the training data. On the challenging LFW dataset [11], 99.15% face verification accuracy is achieved. Compared with the best previous deep learning result [20] on LFW, the error rate has been significantly reduced by 67%.

1 Introduction

Faces of the same identity can look very different when presented in different poses, illuminations, expressions, ages, and occlusions. Such variations within the same identity can overwhelm the variations due to identity differences and make face recognition challenging, especially in unconstrained conditions. Therefore, reducing the intra-personal variations while enlarging the inter-personal differences is a central topic in face recognition. This goal can be traced back to early subspace face recognition methods such as LDA [1], Bayesian face [16], and unified subspace [22, 23]. For example, LDA approximates inter- and intra-personal face variations by using two scatter matrices and finds the projection directions that maximize the ratio between them. More recent studies have also targeted the same goal, either explicitly or implicitly.
For example, metric learning [6, 9, 14] maps faces to some feature representation such that faces of the same identity are close to each other while those of different identities stay apart. However, these models are limited by their linear nature or shallow structures, while inter- and intra-personal variations are complex, highly nonlinear, and observed in high-dimensional image space.

In this work, we show that deep learning provides much more powerful tools to handle the two types of variations. Thanks to its deep architecture and large learning capacity, effective features for face recognition can be learned through hierarchical nonlinear mappings. We argue that it is essential to learn such features by using two supervisory signals simultaneously, i.e., the face identification and verification signals, and the learned features are referred to as Deep IDentification-verification features (DeepID2). Identification is to classify an input image into a large number of identity classes, while verification is to classify a pair of images as belonging to the same identity or not (i.e., binary classification). In the training stage, given an input face image with the identification signal, its DeepID2 features are extracted in the top hidden layer of the learned hierarchical nonlinear feature representation, and then mapped to one of a large number of identities through another function g(DeepID2). In the testing stage, the learned DeepID2 features can be generalized to other tasks (such as face verification) and to new identities unseen in the training data. The identification supervisory signal tends to pull apart the DeepID2 features of different identities, since they have to be classified into different classes. Therefore, the learned features have rich identity-related, i.e., inter-personal, variations.
However, the identification signal places a relatively weak constraint on DeepID2 features extracted from the same identity, since dissimilar DeepID2 features could still be mapped to the same identity through the function g(·). This causes problems when DeepID2 features are generalized to new tasks and new identities at test time, where g is no longer applicable. We solve this by using an additional face verification signal, which requires that every two DeepID2 feature vectors extracted from the same identity are close to each other, while those extracted from different identities are kept far apart. The strong per-element constraint on DeepID2 features can effectively reduce the intra-personal variations. On the other hand, using the verification signal alone (i.e., only distinguishing a pair of DeepID2 feature vectors at a time) is not as effective in extracting identity-related features as using the identification signal (i.e., distinguishing thousands of identities at a time). Therefore, the two supervisory signals emphasize different aspects of feature learning and should be employed together.

To characterize faces from different aspects, complementary DeepID2 features are extracted from various face regions and resolutions, and are concatenated to form the final feature representation after PCA dimension reduction. Since the learned DeepID2 features are diverse among different identities while consistent within the same identity, they make the subsequent face recognition easier. Using the learned feature representation and a recently proposed face verification model [3], we achieved the highest 99.15% face verification accuracy on the challenging and extensively studied LFW dataset [11].
This is the first time that a machine provided with only the face region achieves an accuracy on par with the 99.20% accuracy of humans, to whom the entire LFW face image, including the face region and a large background area, is presented for verification.

In recent years, a great deal of effort has been made on face recognition with deep learning [5, 10, 18, 26, 8, 21, 20, 27]. Among the deep learning works, [5, 18, 8] learned features or deep metrics with the verification signal, while DeepFace [21] and our previous work DeepID [20] learned features with the identification signal and achieved accuracies around 97.45% on LFW. Our approach significantly improves the state of the art. The idea of jointly solving the classification and verification tasks was applied to general object recognition [15], with the focus on improving classification accuracy on fixed object classes instead of on hidden feature representations. Our work targets learning features that can be well generalized to new classes (identities) and to the verification task.

2 Identification-verification guided deep feature learning

We learn features with variations of deep convolutional neural networks (deep ConvNets) [12]. The convolution and pooling operations in deep ConvNets are specially designed to extract visual features hierarchically, from local low-level features to global high-level ones. Our deep ConvNets take a similar structure to that in [20], with four convolutional layers and local weight sharing [10] in the third and fourth convolutional layers. The ConvNet extracts a 160-dimensional DeepID2 feature vector at the last layer (DeepID2 layer) of its feature extraction cascade. The DeepID2 layer to be learned is fully connected to both the third and fourth convolutional layers.
We use rectified linear units (ReLU) [17] for neurons in the convolutional layers and the DeepID2 layer. An illustration of the ConvNet structure used to extract DeepID2 features is shown in Fig. 1, given an RGB input of size 55 × 47. When the size of the input region changes, the map sizes in the following layers change accordingly. The DeepID2 feature extraction process is denoted as f = Conv(x, θc), where Conv(·) is the feature extraction function defined by the ConvNet, x is the input face patch, f is the extracted DeepID2 feature vector, and θc denotes the ConvNet parameters to be learned.

Figure 1: The ConvNet structure for DeepID2 feature extraction.

DeepID2 features are learned with two supervisory signals. The first is the face identification signal, which classifies each face image into one of n (e.g., n = 8192) different identities. Identification is achieved by following the DeepID2 layer with an n-way softmax layer, which outputs a probability distribution over the n classes. The network is trained to minimize the cross-entropy loss, which we call the identification loss. It is denoted as

Ident(f, t, θid) = −∑_{i=1}^{n} p_i log p̂_i = −log p̂_t ,   (1)

where f is the DeepID2 feature vector, t is the target class, and θid denotes the softmax layer parameters. p_i is the target probability distribution, where p_i = 0 for all i except p_t = 1 for the target class t, and p̂_i is the predicted probability distribution. To correctly classify all the classes simultaneously, the DeepID2 layer must form discriminative identity-related features (i.e., features with large inter-personal variations). The second is the face verification signal, which encourages DeepID2 features extracted from faces of the same identity to be similar.
The verification signal directly regularizes DeepID2 features and can effectively reduce the intra-personal variations. Commonly used constraints include the L1/L2 norm and cosine similarity. We adopt the following loss function based on the L2 norm, which was originally proposed by Hadsell et al. [7] for dimensionality reduction,

Verif(fi, fj, yij, θve) = (1/2) ‖fi − fj‖_2^2                 if yij = 1,
                          (1/2) max(0, m − ‖fi − fj‖_2)^2     if yij = −1 ,   (2)

where fi and fj are the DeepID2 feature vectors extracted from the two face images in comparison. yij = 1 means that fi and fj are from the same identity; in this case, the loss minimizes the L2 distance between the two DeepID2 feature vectors. yij = −1 means different identities, and Eq. (2) requires their distance to be larger than a margin m. θve = {m} is the parameter to be learned in the verification loss function. Loss functions based on the L1 norm could have similar formulations [15]. The cosine similarity was used in [18] as

Verif(fi, fj, yij, θve) = (1/2) (yij − σ(wd + b))^2 ,   (3)

where d = fi·fj / (‖fi‖_2 ‖fj‖_2) is the cosine similarity between the DeepID2 feature vectors, θve = {w, b} are learnable scaling and shifting parameters, σ is the sigmoid function, and yij is the binary target of whether the two compared face images belong to the same identity. All three loss functions are evaluated and compared in our experiments.

Our goal is to learn the parameters θc of the feature extraction function Conv(·), while θid and θve are only parameters introduced to propagate the identification and verification signals during training. In the testing stage, only θc is used for feature extraction. The parameters are updated by stochastic gradient descent.
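For concreteness, the losses of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code: the softmax weight matrix W is a hypothetical stand-in for the θid parameters (biases omitted), and w, b play the role of θve in the cosine variant.

```python
import numpy as np

def ident_loss(f, t, W):
    # Eq. (1): cross-entropy of an n-way softmax over identities.
    # W (n x dim) is a hypothetical softmax weight matrix standing in
    # for the theta_id parameters; biases are omitted for brevity.
    logits = W @ f
    logits = logits - logits.max()              # numerical stability
    p_hat = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p_hat[t])

def verif_loss_l2(fi, fj, yij, m):
    # Eq. (2): contrastive loss. Same-identity pairs (yij = 1) are
    # pulled together; different-identity pairs (yij = -1) are pushed
    # apart until their distance exceeds the margin m.
    d = np.linalg.norm(fi - fj)
    return 0.5 * d**2 if yij == 1 else 0.5 * max(0.0, m - d)**2

def verif_loss_cosine(fi, fj, yij, w, b):
    # Eq. (3): squared error between the target and a sigmoid of the
    # scaled-and-shifted cosine similarity d.
    d = fi @ fj / (np.linalg.norm(fi) * np.linalg.norm(fj))
    return 0.5 * (yij - 1.0 / (1.0 + np.exp(-(w * d + b))))**2
```

During training, the total gradient on each feature vector is the identification gradient plus λ times the verification gradient, as summarized in Tab. 1.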
The identification and verification gradients are weighted by a hyperparameter λ. Our learning algorithm is summarized in Tab. 1. The margin m in Eq. (2) is a special case, which cannot be updated by gradient descent since that would collapse it to zero. Instead, m is held fixed during gradient descent and re-estimated every N training pairs (N ≈ 200,000 in our experiments) as the threshold on the feature distances ‖fi − fj‖ that minimizes the verification error over the previous N training pairs.

Table 1: The DeepID2 feature learning algorithm.

input: training set χ = {(xi, li)}, initialized parameters θc, θid, and θve, hyperparameter λ, learning rate η(t), t ← 0
while not converged do
    t ← t + 1
    sample two training samples (xi, li) and (xj, lj) from χ
    fi = Conv(xi, θc) and fj = Conv(xj, θc)
    ∇θid = ∂Ident(fi, li, θid)/∂θid + ∂Ident(fj, lj, θid)/∂θid
    ∇θve = λ · ∂Verif(fi, fj, yij, θve)/∂θve, where yij = 1 if li = lj, and yij = −1 otherwise
    ∇fi = ∂Ident(fi, li, θid)/∂fi + λ · ∂Verif(fi, fj, yij, θve)/∂fi
    ∇fj = ∂Ident(fj, lj, θid)/∂fj + λ · ∂Verif(fi, fj, yij, θve)/∂fj
    ∇θc = ∇fi · ∂Conv(xi, θc)/∂θc + ∇fj · ∂Conv(xj, θc)/∂θc
    update θid = θid − η(t) · ∇θid, θve = θve − η(t) · ∇θve, and θc = θc − η(t) · ∇θc
end while
output θc

Figure 2: Patches selected for feature extraction. The Joint Bayesian [3] face verification accuracy (%) using features extracted from each individual patch is shown below.

Updating m is not included in Tab.
1 for simplicity.

3 Face Verification

To evaluate the feature learning algorithm described in Sec. 2, DeepID2 features are embedded into the conventional face verification pipeline of face alignment, feature extraction, and face verification. We first use the recently proposed SDM algorithm [24] to detect 21 facial landmarks. Then the face images are globally aligned by a similarity transformation according to the detected landmarks. We crop 400 face patches, which vary in position, scale, color channel, and horizontal flipping, according to the globally aligned faces and the positions of the facial landmarks. Accordingly, 400 DeepID2 feature vectors are extracted by a total of 200 deep ConvNets, each of which is trained to extract two 160-dimensional DeepID2 feature vectors from one particular face patch and its horizontally flipped counterpart.

To reduce the redundancy among the large number of DeepID2 features and make our system practical, we use the forward-backward greedy algorithm [25] to select a small number of effective and complementary DeepID2 feature vectors (25 in our experiments), which saves most of the feature extraction time during test. Fig. 2 shows all 25 selected patches, from which 25 160-dimensional DeepID2 feature vectors are extracted and concatenated into a 4000-dimensional DeepID2 feature vector. The 4000-dimensional vector is further compressed to 180 dimensions by PCA for face verification. We learn the Joint Bayesian model [3] for face verification based on the extracted DeepID2 features. Joint Bayesian has been successfully used to model the joint probability of two faces being the same or different persons [3, 4].

4 Experiments

We report face verification results on the LFW dataset [11], which is the de facto standard test set for face verification in unconstrained conditions.
It contains 13,233 face images of 5749 identities collected from the Internet. For comparison purposes, algorithms typically report the mean face verification accuracy and the ROC curve on 6000 given face pairs in LFW. Though sound as a test set, it is inadequate for training, since the majority of identities in LFW have only one face image. Therefore, we rely on a larger outside dataset for training, as do all recent high-performance face verification algorithms [4, 2, 21, 20, 13]. In particular, we use the CelebFaces+ dataset [20] for training, which contains 202,599 face images of 10,177 identities (celebrities) collected from the Internet. People in CelebFaces+ and LFW are mutually exclusive. DeepID2 features are learned from the face images of 8192 identities randomly sampled from CelebFaces+ (referred to as CelebFaces+A), while the remaining face images of 1985 identities (referred to as CelebFaces+B) are used for the subsequent feature selection and for learning the face verification models (Joint Bayesian). When learning DeepID2 features on CelebFaces+A, CelebFaces+B is used as a validation set to decide the learning rate, training epochs, and hyperparameter λ. After that, CelebFaces+B is separated into a training set of 1485 identities and a validation set of 500 identities for feature selection. Finally, we train the Joint Bayesian model on the entire CelebFaces+B data and test on LFW using the selected DeepID2 features. We first evaluate various aspects of feature learning from Sec. 4.1 to Sec. 4.3 by using a single deep ConvNet to extract DeepID2 features from the entire face region. Then the final system is constructed and compared with the existing best-performing methods in Sec.
4.4.

4.1 Balancing the identification and verification signals

We investigate the interactions of the identification and verification signals on feature learning by varying λ from 0 to +∞. At λ = 0, the verification signal vanishes and only the identification signal takes effect. As λ increases, the verification signal gradually dominates the training process. At the other extreme of λ → +∞, only the verification signal remains. The L2 norm verification loss in Eq. (2) is used for training. Figure 3 shows the face verification accuracy on the test set, obtained by comparing the learned DeepID2 features with the L2 norm and with the Joint Bayesian model, respectively. It clearly shows that neither the identification nor the verification signal alone is optimal for learning features. Instead, effective features come from an appropriate combination of the two.

This phenomenon can be explained from the view of inter- and intra-personal variations, which could be approximated by LDA. According to LDA, the inter-personal scatter matrix is Sinter = ∑_{i=1}^{c} ni · (x̄i − x̄)(x̄i − x̄)^⊤, where x̄i is the mean feature of the i-th identity, x̄ is the mean of the entire dataset, and ni is the number of face images of the i-th identity. The intra-personal scatter matrix is Sintra = ∑_{i=1}^{c} ∑_{x∈Di} (x − x̄i)(x − x̄i)^⊤, where Di is the set of features of the i-th identity, x̄i is the corresponding mean, and c is the number of different identities. The inter- and intra-personal variances are the eigenvalues of the corresponding scatter matrices, and are shown in Fig. 5. The corresponding eigenvectors represent different variation patterns. Both the magnitude and diversity of feature variances matter in recognition. If all the feature variances concentrate on a
If all the feature variances concentrate on a\nsmall number of eigenvectors, it indicates the diversity of intra- or inter-personal variations is low.\nThe features are learned with \u03bb = 0, 0.05, and +\u221e, respectively. The feature variances of each\ngiven \u03bb are normalized by the corresponding mean feature variance.\nWhen only the identi\ufb01cation signal is used (\u03bb = 0), the learned features contain both diverse\ninter- and intra-personal variations, as shown by the long tails of the red curves in both \ufb01gures.\nWhile diverse inter-personal variations help to distinguish different identities, large and diverse\nintra-personal variations are disturbing factors and make face veri\ufb01cation dif\ufb01cult. When both the\nidenti\ufb01cation and veri\ufb01cation signals are used with appropriate weighting (\u03bb = 0.05), the diversity\nof the inter-personal variations keeps unchanged while the variations in a few main directions\nbecome even larger, as shown by the green curve in the left compared to the red one. At the\nsame time, the intra-personal variations decrease in both the diversity and magnitude, as shown\nby the green curve in the right. Therefore, both the inter- and intra-personal variations changes in\na direction that makes face veri\ufb01cation easier. When \u03bb further increases towards in\ufb01nity, both the\ninter- and intra-personal variations collapse to the variations in only a few main directions, since\nwithout the identi\ufb01cation signal, diverse features cannot be formed. With low diversity on inter-\n\n5\n\n\fFigure 3: Face veri\ufb01cation accuracy by varying\nthe weighting parameter \u03bb. \u03bb is plotted in log\nscale.\n\nFigure 4: Face veri\ufb01cation accuracy of DeepID2\nfeatures learned by both the the face identi\ufb01cation\nand veri\ufb01cation signals, where the number of\ntraining identities (shown in log scale) used for\nface identi\ufb01cation varies. 
The result may be\nfurther improved with more than 8192 identities.\n\nFigure 5: Spectrum of eigenvalues of the inter- and intra-personal scatter matrices. Best viewed in\ncolor.\n\npersonal variations, distinguishing different identities becomes dif\ufb01cult. Therefore the performance\ndegrades signi\ufb01cantly.\nFigure 6 shows the \ufb01rst two PCA dimensions of features learned with \u03bb = 0, 0.05, and +\u221e,\nrespectively. These features come from the six identities with the largest numbers of face images in\nLFW, and are marked by different colors. The \ufb01gure further veri\ufb01es our observations. When \u03bb = 0\n(left), different clusters are mixed together due to the large intra-personal variations, although the\ncluster centers are actually different. When \u03bb increases to 0.05 (middle), intra-personal variations\nare signi\ufb01cantly reduced and the clusters become distinguishable. When \u03bb further increases towards\nin\ufb01nity (right), although the intra-personal variations further decrease, the cluster centers also begin\nto collapse and some clusters become signi\ufb01cantly overlapped (as the red, blue, and cyan clusters in\nFig. 6 right), making it hard to distinguish again.\n\n4.2 Rich identity information improves feature learning\n\nWe investigate how would the identity information contained in the identi\ufb01cation supervisory signal\nin\ufb02uence the learned features. In particular, we experiment with an exponentially increasing number\nof identities used for identi\ufb01cation during training from 32 to 8192, while the veri\ufb01cation signal is\ngenerated from all the 8192 training identities all the time. Fig. 
4 shows how the verification accuracies of the learned DeepID2 features (measured with the L2 norm and with Joint Bayesian) vary on the test set with the number of identities used in the identification signal. It shows that identifying a large number (e.g., 8192) of identities is key to learning an effective DeepID2 feature representation. This observation is consistent with those in Sec. 4.1. The increasing number of identities provides richer identity information and helps to form DeepID2 features with diverse inter-personal variations, making the class centers of different identities more distinguishable.

Figure 6: The first two PCA dimensions of DeepID2 features extracted from six identities in LFW.

4.3 Investigating the verification signals

As shown in Sec. 4.1, a verification signal of moderate intensity mainly takes the effect of reducing the intra-personal variations. To further verify this, we compare our L2 norm verification signal on all the sample pairs with variants that only constrain either the positive or the negative sample pairs, denoted as L2+ and L2-, respectively. That is, L2+ only decreases the distances between DeepID2 features of the same identity, while L2- only increases the distances between DeepID2 features of different identities if they are smaller than the margin. The face verification accuracies of the learned DeepID2 features on the test set, measured by the L2 norm and Joint Bayesian respectively, are shown in Table 2. It also compares with the L1 norm and cosine verification signals, as well as with no verification signal (none).

Table 2: Comparison of different verification signals.

verification signal | L2    | L2+   | L2-   | L1    | cosine | none
L2 norm (%)         | 94.95 | 94.43 | 86.23 | 92.92 | 86.43  | 87.07
Joint Bayesian (%)  | 95.12 | 94.87 | 92.98 | 94.13 | 93.38  | 92.73
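The three L2-family signals compared in Sec. 4.3 differ only in which pairs they constrain. A minimal sketch, using the same notation as Eq. (2) (the helper function and mode names are illustrative, not from the paper):

```python
import numpy as np

def verif_loss_variant(fi, fj, yij, m, mode="L2"):
    # Ablated variants of the Eq. (2) contrastive loss:
    # "L2"  constrains all pairs,
    # "L2+" keeps only the positive-pair term (pulls same identities together),
    # "L2-" keeps only the negative-pair term (pushes different identities
    #       apart until their distance exceeds the margin m).
    d = np.linalg.norm(fi - fj)
    if yij == 1:
        return 0.0 if mode == "L2-" else 0.5 * d**2
    return 0.0 if mode == "L2+" else 0.5 * max(0.0, m - d)**2
```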
The identification signal is the same (classifying the 8192 identities) for all the comparisons.

DeepID2 features learned with the L2+ verification signal are only slightly worse than those learned with L2. In contrast, the L2- verification signal helps little in feature learning and gives almost the same result as when no verification signal is used. This is strong evidence that the effect of the verification signal is mainly to reduce the intra-personal variations. Another observation is that the face verification accuracy generally improves whenever a verification signal is added in addition to the identification signal. However, the L2 norm is better than the other compared verification metrics. This may be because the other constraints are weaker than L2 and less effective in reducing the intra-personal variations. For example, the cosine similarity only constrains the angle between feature vectors, but not their magnitude.

4.4 Final system and comparison with other methods

Before learning Joint Bayesian, DeepID2 features are first projected to 180 dimensions by PCA. After PCA, the Joint Bayesian model is trained on the entire CelebFaces+B data and tested on the 6000 given face pairs in LFW, where the log-likelihood ratio given by Joint Bayesian is compared to a threshold optimized on the training data for face verification. Tab. 3 shows the face verification accuracy with an increasing number of face patches used to extract DeepID2 features, as well as the time needed to extract those DeepID2 features from each face with a single Titan GPU. We achieve 98.97% accuracy with all 25 selected face patches. The feature extraction process is also efficient, taking only 35 ms per face image. The face verification accuracy of each individual face patch is provided in Fig. 2.
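The dimensionality bookkeeping of this representation (25 patches × 160 dims, concatenated to 4000 dims, then compressed to 180 dims) can be sketched with a plain SVD-based PCA. The patch features below are random stand-ins for real ConvNet outputs, and this generic PCA is an assumption, not necessarily the implementation the authors used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for ConvNet outputs: 300 hypothetical faces, each described
# by 25 selected patches with a 160-dim DeepID2 vector per patch.
patch_feats = rng.normal(size=(300, 25, 160))
X = patch_feats.reshape(300, -1)            # concatenate: (300, 4000)

# SVD-based PCA compressing 4000 dims to 180.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
components = Vt[:180]                       # (180, 4000) principal axes

def compress(x):
    # Project one 4000-dim concatenated feature onto the 180-dim basis.
    return components @ (x - mu)

z = compress(X[0])                          # compact 180-dim signature
```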
The short DeepID2 signature is extremely efficient for face identification and face image search when matching a query image against a large number of candidates.

Table 3: Face verification accuracy with DeepID2 features extracted from an increasing number of face patches.

# patches    | 1     | 2     | 4     | 8     | 16    | 25
accuracy (%) | 95.43 | 97.28 | 97.75 | 98.55 | 98.93 | 98.97
time (ms)    | 1.7   | 3.4   | 6.1   | 11    | 23    | 35

Table 4: Accuracy comparison with the previous best results on LFW.

method                | accuracy (%)
High-dim LBP [4]      | 95.17 ± 1.13
TL Joint Bayesian [2] | 96.33 ± 1.08
DeepFace [21]         | 97.35 ± 0.25
DeepID [20]           | 97.45 ± 0.26
GaussianFace [13]     | 98.52 ± 0.66
DeepID2               | 99.15 ± 0.13

Figure 7: ROC comparison with the previous best results on LFW. Best viewed in color.

To further exploit the rich pool of DeepID2 features extracted from the large number of patches, we repeat the feature selection algorithm another six times, each time choosing DeepID2 features from the patches that have not been selected in previous feature selection steps. Then we learn the Joint Bayesian model on each of the seven groups of selected features, respectively. We fuse the seven Joint Bayesian scores on each pair of compared faces by further learning an SVM. In this way, we achieve an even higher 99.15% face verification accuracy. The accuracy and ROC comparison with previous state-of-the-art methods on LFW are shown in Tab. 4 and Fig. 7, respectively.
We achieve the best results and improve on the previous results by a large margin.

5 Conclusion

This paper has shown that the effects of the face identification and verification supervisory signals on the deep feature representation coincide with the two aspects of constructing ideal features for face recognition, i.e., increasing inter-personal variations and reducing intra-personal variations, and that the combination of the two supervisory signals leads to significantly better features than either one alone. When embedding the learned features into the traditional face verification pipeline, we achieved an extremely effective system with 99.15% face verification accuracy on LFW. The arXiv report of this paper was published in June 2014 [19].

References

[1] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. PAMI, 19:711-720, 1997.
[2] X. Cao, D. Wipf, F. Wen, G. Duan, and J. Sun. A practical transfer learning algorithm for face verification. In Proc. ICCV, 2013.
[3] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Proc. ECCV, 2012.
[4] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Proc. CVPR, 2013.
[5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proc. CVPR, 2005.
[6] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In Proc. ICCV, 2009.
[7] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
[8] J. Hu, J. Lu, and Y.-P. Tan. Discriminative deep metric learning for face verification in the wild. In Proc. CVPR, 2014.
[9] C.
Huang, S. Zhu, and K. Yu. Large scale strongly supervised ensemble metric learning, with applications to face verification and retrieval. NEC Technical Report TR115, 2011.
[10] G. B. Huang, H. Lee, and E. Learned-Miller. Learning hierarchical representations for face verification with convolutional deep belief networks. In Proc. CVPR, 2012.
[11] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[13] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. Technical report, arXiv:1404.3840, 2014.
[14] A. Mignon and F. Jurie. PCCA: A new approach for distance learning from sparse pairwise constraints. In Proc. CVPR, 2012.
[15] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In Proc. ICML, 2009.
[16] B. Moghaddam, T. Jebara, and A. Pentland. Bayesian face recognition. PR, 33:1771-1782, 2000.
[17] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, 2010.
[18] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In Proc. ICCV, 2013.
[19] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. Technical report, arXiv:1406.4773, 2014.
[20] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proc. CVPR, 2014.
[21] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proc. CVPR, 2014.
[22] X. Wang and X. Tang.
Unified subspace analysis for face recognition. In Proc. ICCV, 2003.
[23] X. Wang and X. Tang. A unified framework for subspace face recognition. PAMI, 26:1222-1228, 2004.
[24] X. Xiong and F. De la Torre Frade. Supervised descent method and its applications to face alignment. In Proc. CVPR, 2013.
[25] T. Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Trans. Inf. Theor., 57:4689-4708, 2011.
[26] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity-preserving face space. In Proc. ICCV, 2013.
[27] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning and disentangling face representation by multi-view perceptron. In Proc. NIPS, 2014.
", "award": [], "sourceid": 1083, "authors": [{"given_name": "Yi", "family_name": "Sun", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Yuheng", "family_name": "Chen", "institution": "Tsinghua University"}, {"given_name": "Xiaogang", "family_name": "Wang", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Xiaoou", "family_name": "Tang", "institution": "Chinese University of Hong Kong"}]}