{"title": "Multi-View Perceptron: a Deep Model for Learning Face Identity and View Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 217, "page_last": 225, "abstract": "Various factors, such as identities, views (poses), and illuminations, are coupled in face images. Disentangling the identity and view representations is a major challenge in face recognition. Existing face recognition systems either use handcrafted features or learn features discriminatively to improve recognition accuracy. This is different from the behavior of human brain. Intriguingly, even without accessing 3D data, human not only can recognize face identity, but can also imagine face images of a person under different viewpoints given a single 2D image, making face perception in the brain robust to view changes. In this sense, human brain has learned and encoded 3D face models from 2D images. To take into account this instinct, this paper proposes a novel deep neural net, named multi-view perceptron (MVP), which can untangle the identity and view features, and infer a full spectrum of multi-view images in the meanwhile, given a single 2D face image. The identity features of MVP achieve superior performance on the MultiPIE dataset. 
MVP is also capable of interpolating and predicting images under viewpoints that are unobserved in the training data.", "full_text": "Multi-View Perceptron: a Deep Model for Learning Face Identity and View Representations

Zhenyao Zhu1,3  Ping Luo3,1  Xiaogang Wang2,3  Xiaoou Tang1,3
1Department of Information Engineering, The Chinese University of Hong Kong
2Department of Electronic Engineering, The Chinese University of Hong Kong
3Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
{zz012,lp011}@ie.cuhk.edu.hk  xgwang@ee.cuhk.edu.hk  xtang@ie.cuhk.edu.hk

Abstract

Various factors, such as identity, view, and illumination, are coupled in face images. Disentangling the identity and view representations is a major challenge in face recognition. Existing face recognition systems either use handcrafted features or learn features discriminatively to improve recognition accuracy. This is different from the behavior of the primate brain. Recent studies [5, 19] discovered that the primate brain has a face-processing network in which view and identity are processed by different neurons. Taking this instinct into account, this paper proposes a novel deep neural network, named the multi-view perceptron (MVP), which can untangle the identity and view features and meanwhile infer a full spectrum of multi-view images, given a single 2D face image. The identity features of MVP achieve superior performance on the MultiPIE dataset. MVP is also capable of interpolating and predicting images under viewpoints that are unobserved in the training data.

1 Introduction

The performance of face recognition systems depends heavily on facial representation, which is naturally coupled with many types of face variations, such as view, illumination, and expression. As face images are often observed in different views, a major challenge is to untangle the face identity and view representations.
Substantial efforts have been dedicated to extracting identity features by hand, such as LBP [1], Gabor [14], and SIFT [15]. The best practice of face recognition extracts the above features at the landmarks of face images with multiple scales and concatenates them into high-dimensional feature vectors [4, 21]. Deep learning methods, such as the Boltzmann machine [9], the sum-product network [17], and deep neural networks [16, 25, 22, 23, 24, 26], have been applied to face recognition. For instance, Sun et al. [25, 22] employed deep neural networks to learn identity features from raw pixels by predicting 10,000 identities.

Deep neural networks are inspired by the understanding of the hierarchical cortex in the primate brain and mimic some aspects of its activities. Recent studies [5, 19] discovered that macaque monkeys have a face-processing network made of six interconnected face-selective regions, where neurons in some of these regions are view-specific, while others are tuned to identity across views, making face recognition in the primate brain robust to view variation. This intriguing function of the primate brain inspires us to develop a novel deep neural network, called the multi-view perceptron (MVP), which can disentangle identity and view representations, and also reconstruct images under multiple views. Specifically, given a single face image of an identity under an arbitrary view, it can generate a sequence of output face images of the same identity, one at a time, under a full spectrum of viewpoints. Examples of the input images and the generated multi-view outputs of two identities are illustrated in Fig. 1. The images in the last two rows are from the same person. The extracted features of MVP with respect to identity and view are plotted correspondingly in blue and orange.

Figure 1: The inputs (first column) and the multi-view outputs (remaining columns) of two identities.
The first input is from one identity and the last two inputs are from the other. Each reconstructed multi-view image (left) has its ground truth (right) for comparison. The extracted identity features of the inputs (the second column), and the view features of both the inputs and outputs, are plotted in blue and orange, respectively. The identity features of the same identity are similar, even though the inputs are captured in diverse views, while the view features of the same viewpoint are similar, although they are from different identities. The two persons look similar in the frontal view, but can be better distinguished in other views.

We can observe that the identity features of the same identity are similar, even though the inputs are captured in very different views, whilst the view features of images in the same view are similar, although they are across different identities.

Unlike other deep networks that produce a deterministic output from an input, MVP employs deterministic hidden neurons to learn the identity features, whilst using random hidden neurons to capture the view representation. By sampling distinct values of the random neurons, output images in distinct views are generated. Moreover, to yield images of different viewpoints, we add a regularization that images under similar viewpoints should have similar view representations on the random neurons. The two types of neurons are modeled in a probabilistic way. In the training stage, the parameters of MVP are updated by back-propagation, where the gradient is calculated by maximizing a variational lower bound of the complete-data log-likelihood. With our proposed learning algorithm, the EM updates on the probabilistic model are converted to forward and backward propagation. In the testing stage, given an input image, MVP can extract its identity and view features.
In addition, if an order of viewpoints is also provided, MVP can sequentially reconstruct multiple views of the input image by following this order.

This paper has several key contributions. (i) We propose the multi-view perceptron (MVP) and its learning algorithm to factorize the identity and view representations with different sets of neurons, making the learned features more discriminative and robust. (ii) MVP can reconstruct a full spectrum of views given a single 2D image. The full spectrum of views can better distinguish identities, since different identities may look similar in a particular view but different in others, as illustrated in Fig. 1. (iii) MVP can interpolate and predict images under viewpoints that are unobserved in the training data, in some sense imitating the reasoning ability of humans.

Related Works. In the literature of computer vision, existing methods that deal with view (pose) variation can be divided into 2D- and 3D-based methods. For example, the 2D methods, such as [6], infer the deformation (e.g. thin plate splines) between 2D images across poses. The 3D methods, such as [2, 12], capture 3D face models in different parametric forms. The above methods have inherent shortcomings. Extra cost and resources are required to capture and process 3D data. Because one degree of freedom is missing, inferring 3D deformation from 2D transformation is often ill-posed. More importantly, none of the existing approaches simulates how the primate brain encodes view representations. In our approach, instead of employing any geometric models, view information is encoded with a small number of neurons, which can recover the full spectrum of views together with the identity neurons. This representation, which encodes identity and view information into different neurons, is closer to the face-processing system in the primate brain and new to the deep learning literature.
Our previous work [28] learned identity features by using a CNN to recover a single frontal-view face image, which is a special case of MVP after removing the random neurons. [28] did not learn the view representation as we do. Experimental results show that our approach not only provides a rich multi-view representation but also learns better identity features compared with [28]. Fig. 1 shows examples that different persons may look similar in the frontal view, but are better distinguished in other views. Thus it improves the performance of face recognition significantly. More recently, Reed et al. [20] untangled factors of image variation by using a high-order Boltzmann machine, where all the neurons are stochastic and the model is solved by Gibbs sampling. MVP contains both stochastic and deterministic neurons and thus can be efficiently solved by back-propagation.

2 Multi-View Perceptron

The training data is a set of image pairs, I = {x_ij, (y_ik, v_ik)}_{i=1,j=1,k=1}^{N,M,M}, where x_ij is the input image of the i-th identity under the j-th view, y_ik denotes the output image of the same identity in the k-th view, and v_ik is the view label of the output. v_ik is an M-dimensional binary vector, with the k-th element equal to 1 and the remaining elements zero. MVP is learned from the training data such that, given an input x, it can output images y of the same identity in different views and their view labels v. Then, the output v and y are generated as¹

v = F(y, h^v; Θ),  y = F(x, h^id, h^v, h^r; Θ) + ε,  (1)

Figure 2: Network structure of MVP, which has six layers, including three layers with only the deterministic neurons (i.e. the layers parameterized by the weights U0, U1, U4), and three layers with both the deterministic and random neurons (i.e.
the weights of U2, V2, W2, U3, V3, U5, W5). This structure is used throughout the experiments.

where F is a non-linear function and Θ is a set of weights and biases to be learned. There are three types of hidden neurons, h^id, h^v, and h^r, which respectively extract identity features, view features, and the features to reconstruct the output face image. ε signifies a noise variable.

Fig. 2 shows the architecture² of MVP, which is a directed graphical model with six layers, where the nodes with and without filling represent the observed and hidden variables, and the nodes in green and blue indicate the deterministic and random neurons, respectively. The generation process of y and v starts from x, flows through the neurons that extract the identity feature h^id, which combines with the hidden view representation h^v to yield the feature h^r for face recovery. Then, h^r generates y. Meanwhile, both h^v and y are united to generate v. h^id and h^r are the deterministic binary hidden neurons, while h^v are random binary hidden neurons sampled from a distribution q(h^v). Different sampled h^v generate different y, making the perception of multiple views possible. h^v usually has a low dimensionality, approximately ten, as ten binary neurons can ideally model 2^10 distinct views.

For clarity of derivation, we take an example of MVP that contains only one hidden layer of h^id and h^v. More layers can be added and derived in a similar fashion. We consider a joint distribution, which marginalizes out the random hidden neurons,

p(y, v | h^id; Θ) = Σ_{h^v} p(y, v, h^v | h^id; Θ) = Σ_{h^v} p(v | y, h^v; Θ) p(y | h^id, h^v; Θ) p(h^v),  (2)

where Θ = {U0, U1, V1, U2, V2}, the identity feature is extracted from the input image, h^id = f(U0 x), and f is the sigmoid activation function, f(x) = 1/(1 + exp(−x)).
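The single-hidden-layer generation process of Eq. (1)–(2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, random weight initializations, and dummy input are assumptions; only the names U0, U1, V1, U2, V2 and the mappings follow the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy sizes (assumptions): 32x32 input/output, K = 7 view labels,
# 512 identity neurons h_id, 10 random view neurons h_v.
D, H, R, K = 32 * 32, 512, 10, 7
U0 = rng.normal(0, 0.01, (H, D))   # input x -> identity features h_id
U1 = rng.normal(0, 0.01, (D, H))   # h_id    -> output image y
V1 = rng.normal(0, 0.01, (D, R))   # h_v     -> output image y
U2 = rng.normal(0, 0.01, (K, D))   # y       -> view-label logits
V2 = rng.normal(0, 0.01, (K, R))   # h_v     -> view-label logits

def generate(x, h_v):
    """One draw of the single-hidden-layer MVP (mean values only)."""
    h_id = sigmoid(U0 @ x)          # deterministic identity neurons
    y = U1 @ h_id + V1 @ h_v        # mean of the diagonal Gaussian over y
    logits = U2 @ y + V2 @ h_v      # softmax over the K view labels
    p_v = np.exp(logits - logits.max())
    return y, p_v / p_v.sum()

x = rng.random(D)                   # a dummy "input image"
h_v = rng.random(R)                 # random view neurons, h_v ~ U(0, 1)
y, p_v = generate(x, h_v)
```

Sampling a different h_v for the same x yields a different output y, which is how one input produces a spectrum of views.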
Other activation functions, such as the rectified linear function [18] and tangent [11], can be used as well. To model continuous values of the output, we assume y follows a conditional diagonal Gaussian distribution, p(y | h^id, h^v; Θ) = N(y | U1 h^id + V1 h^v, σ²_y). The probability of y belonging to the j-th view is modeled with the softmax function, p(v_j = 1 | y, h^v; Θ) = exp(U2_{j·} y + V2_{j·} h^v) / Σ_{k=1}^{K} exp(U2_{k·} y + V2_{k·} h^v), where U_{j·} indicates the j-th row of the matrix.

¹The subscripts i, j, k are omitted for clearness.
²For clarity, the biases are omitted.

2.1 Learning Procedure

The weights and biases of MVP are learned by maximizing the data log-likelihood. The lower bound of the log-likelihood can be written as

log p(y, v | h^id; Θ) = log Σ_{h^v} p(y, v, h^v | h^id; Θ) ≥ Σ_{h^v} q(h^v) log [ p(y, v, h^v | h^id; Θ) / q(h^v) ].  (3)

Eq. (3) is attained by decomposing the log-likelihood into two terms, log p(y, v | h^id; Θ) = −Σ_{h^v} q(h^v) log [ p(h^v | y, v; Θ) / q(h^v) ] + Σ_{h^v} q(h^v) log [ p(y, v, h^v | h^id; Θ) / q(h^v) ], which can be easily verified by substituting the product p(y, v, h^v | h^id) = p(y, v | h^id) p(h^v | y, v) into the right-hand side of the decomposition. In particular, the first term is the KL-divergence [10] between the true posterior and the distribution q(h^v). As the KL-divergence is non-negative, the second term is regarded as the variational lower bound on the log-likelihood.

The above lower bound can be maximized by using the Monte Carlo Expectation Maximization (MCEM) algorithm recently introduced by [27], which approximates the true posterior by using importance sampling with the conditional prior as the proposal distribution.
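The importance-sampling step just described can be sketched as follows: draw S proposals h^v_s from the uniform prior and weight each by the likelihood p(y, v | h^v_s; Θ_old). Everything here is a toy assumption except the sampling-and-weighting scheme itself; the single matrix W_y stands in for the full network mapping from h^v to the output image.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes (assumptions): D-dim image, R random view neurons,
# K view labels, S importance samples.
D, R, K, S = 16, 10, 7, 20
W_y = rng.normal(0, 0.1, (D, R))   # stand-in mapping h_v -> image
W_v = rng.normal(0, 0.1, (K, D))   # stand-in mapping image -> view logits

def log_lik(y_true, v_true, h_v, sigma=1.0):
    """log p(y, v | h_v): Gaussian image term + softmax view term."""
    y = W_y @ h_v
    log_py = -0.5 * np.sum((y_true - y) ** 2) / sigma ** 2
    logits = W_v @ y
    log_pv = logits[v_true] - np.logaddexp.reduce(logits)
    return log_py + log_pv

y_true = rng.random(D)                   # ground-truth output image
v_true = 3                               # ground-truth view label index
samples = rng.random((S, R))             # E-step: h_v_s ~ U(0, 1)
log_w = np.array([log_lik(y_true, v_true, h) for h in samples])
w = np.exp(log_w - log_w.max())          # importance weights (stable)
w /= w.sum()
best = samples[np.argmax(w)]             # the largest-weight sample
```

The paper's one-sample gradient (Eq. (6) below) corresponds to back-propagating through `best` only, rather than averaging over all S weighted samples.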
With Bayes' rule, the true posterior of MVP is p(h^v | y, v) = p(y, v | h^v) p(h^v) / p(y, v), where p(y, v | h^v) represents the multi-view perception error, p(h^v) is the prior distribution over h^v, and p(y, v) is a normalization constant. Since we do not assume any prior information on the view distribution, p(h^v) is chosen as a uniform distribution between zero and one. To estimate the true posterior, we let q(h^v) = p(h^v | y, v; Θ_old). It is approximated by sampling h^v from the uniform distribution, i.e. h^v ∼ U(0, 1), weighted by the importance weight p(y, v | h^v; Θ_old). With the EM algorithm, the lower bound of the log-likelihood turns into

L(Θ, Θ_old) = Σ_{h^v} p(h^v | y, v; Θ_old) log p(y, v, h^v | h^id; Θ) ≈ (1/S) Σ_{s=1}^{S} w_s log p(y, v, h^v_s | h^id; Θ),  (4)

where w_s = p(y, v | h^v_s; Θ_old) is the importance weight. The E-step samples the random hidden neurons, i.e. h^v_s ∼ U(0, 1), while the M-step calculates the gradient,

∂L(Θ, Θ_old)/∂Θ ≈ (1/S) Σ_{s=1}^{S} w_s ∂/∂Θ { log p(v | y, h^v_s) + log p(y | h^id, h^v_s) },  (5)

where the gradient is computed by averaging over all the gradients with respect to the importance samples.

The two steps have to be iterated. When more samples are needed to estimate the posterior, the space complexity will increase significantly, because we need to store a batch of data, the proposed samples, and their corresponding outputs at each layer of the deep network. When implementing the algorithm on a GPU, one needs to make a tradeoff between the size of the data and the accuracy of the approximation, if the GPU memory is not sufficient for large-scale training data. Our empirical study (Sec.
3.1) shows that the M-step of MVP can be computed by using only one sample, because the uniform prior typically leads to sparse weights during training. Therefore, the EM process develops into the conventional back-propagation.

In the forward pass, we sample a number of h^v_s based on the current parameters Θ, such that only the sample with the largest weight needs to be stored. We demonstrate in the experiment (Sec. 3.1) that a small number of samples (e.g. < 20) is sufficient to find a good proposal. In the backward pass, we seek to update the parameters by the gradient,

∂L(Θ)/∂Θ ≈ ∂/∂Θ { w_s ( log p(v | y, h^v_s) + log p(y | h^id, h^v_s) ) },  (6)

where h^v_s is the sample that has the largest weight w_s. We need to optimize the following two terms, log p(y | h^id, h^v_s) = −log σ_y − ‖ŷ − (U1 h^id + V1 h^v_s)‖²₂ / (2σ²_y) and log p(v | y, h^v_s) = Σ_j v̂_j log [ exp(U2_{j·} y + V2_{j·} h^v_s) / Σ_{k=1}^{K} exp(U2_{k·} y + V2_{k·} h^v_s) ], where ŷ and v̂ are the ground truth.

• Continuous View In the previous discussion, v is assumed to be a binary vector. Note that v can also be modeled as a continuous variable with a Gaussian distribution,

p(v | y, h^v) = N(v | U2 y + V2 h^v, σ_v),  (7)

where v is a scalar corresponding to different views from −90° to +90°. In this case, we can generate views not presented in the training data by interpolating v, as shown in Fig. 6.

• Difference with multi-task learning Our model, which has only a single task, is also different from multi-task learning (MTL), where reconstruction of each view could be treated as a different task, although MTL has not been used for multi-view reconstruction in the literature to the best of our knowledge.
In MTL, the number of views to be reconstructed is predefined, equivalent to the number of tasks, and MTL encounters problems when the training data of different views are unbalanced; in contrast, our approach can sample views continuously and generate views not presented in the training data by interpolating v, as described above. Moreover, the model complexity of MTL increases with the number of views, and its training is more difficult, since different tasks may have different convergence rates.

2.2 Testing Procedure

In the testing stage, given the view label v and the input x, we generate the face image y under the viewpoint of v. A set of h^v are first sampled, {h^v_s}_{s=1}^{S} ∼ U(0, 1), which corresponds to a set of outputs {y_s}_{s=1}^{S}. For example, in a simple network with only one hidden layer, y_s = U1 h^id + V1 h^v_s and h^id = f(U0 x). Then, the desired face image in view v is the output y_s that produces the largest probability p(v | y_s, h^v_s). A full spectrum of multi-view images is reconstructed for all the possible view labels v.

2.3 View Estimation

Our model can also be used to estimate the viewpoint of the input image x. First, given all possible values of the viewpoint v, we can generate a set of corresponding output images {y_z}, where z indicates the index of the values of view we generated (or interpolated). Then, to estimate the viewpoint, we assign the view label of the z-th output y_z to x, such that y_z is the most similar image to x. The above procedure is formulated as below. If v is discrete, the problem is

arg min_{j,z} ‖ p(v_j = 1 | x, h^v_z) − p(v_j = 1 | y_z, h^v_z) ‖²₂ = arg min_{j,z} ‖ exp(U2_{j·} x + V2_{j·} h^v_z) / Σ_{k=1}^{K} exp(U2_{k·} x + V2_{k·} h^v_z) − exp(U2_{j·} y_z + V2_{j·} h^v_z) / Σ_{k=1}^{K} exp(U2_{k·} y_z + V2_{k·} h^v_z) ‖²₂.

If v is continuous, the problem is defined as

arg min_z ‖ (U2 x + V2 h^v_z) − (U2 y_z + V2 h^v_z) ‖²₂ = arg min_z ‖ U2 (x − y_z) ‖²₂,

i.e. we choose the reconstruction y_z closest to the input.

3 Experiments

Several experiments are designed for evaluation and comparison³. In Sec. 3.1, MVP is evaluated on a large face recognition dataset to demonstrate the effectiveness of the identity representation. Sec. 3.2 presents a quantitative evaluation, showing that the reconstructed face images are of good quality and that the multi-view spectrum retains discriminative information for face recognition. Sec. 3.3 shows that MVP can be used for view estimation and achieves results comparable to discriminative methods specially designed for this task. An interesting experiment in Sec. 3.4 shows that, by modeling the view as a continuous variable, MVP can analyze and reconstruct views not seen in the training data.

3.1 Multi-View Face Recognition

MVP is evaluated on multi-view face recognition on the MultiPIE dataset [7], which contains 754,204 images of 337 identities. Each identity was captured under 15 viewpoints from −90° to +90° and 20 different illuminations. It is the largest and most challenging dataset for evaluating face recognition under view and lighting variations. We conduct the following three experiments to demonstrate the effectiveness of MVP.

³http://mmlab.ie.cuhk.edu.hk/projects/MVP.htm. For more technical details of this work, please contact the corresponding author Ping Luo (pluo.lhi@gmail.com).

• Face recognition across views This setting follows the existing methods, e.g.
[2, 12, 28], which employ the same subset of MultiPIE that covers images from −45° to +45° with neutral illumination. The first 200 identities are used for training and the remaining 137 identities for testing. In the testing stage, the gallery is constructed by choosing one canonical-view image (0°) from each testing identity. The remaining images of the testing identities from −45° to +45° are selected as probes. The numbers of neurons in MVP can be expressed as 32×32 − 512 − 512(10) − 512(10) − 1024 − 32×32[7], where the input and output images have the size of 32×32, [7] denotes the length of the view label vector v, and (10) indicates that the third and fourth layers each have ten random neurons.

We examine the performance of using the identity features, i.e. h^id_2 (denoted as MVP h^id_2), and compare it with seven state-of-the-art methods in Table 1. The first three methods are based on 3D face models and the remaining ones are 2D feature extraction methods, including deep models such as FIP [28] and RL [28], which employed the traditional convolutional network to recover the frontal-view face image. As the existing methods did, LDA is applied to all the 2D methods to reduce the features' dimension. The first and the second best results are highlighted for each viewpoint, as shown in Table 1. The two deep models (MVP and RL) outperform all the existing methods, including the 3D face models. RL achieves the best results on three viewpoints, whilst MVP is the best on four viewpoints. The extracted feature dimensions of MVP and RL are 512 and 9216, respectively. In summary, MVP obtains comparable averaged accuracy to RL under this setting, while the learned feature representation is more compact.

Table 1: Face recognition accuracies across views.
The first and the second best performances are in bold.

Method | Avg. | −15° | +15° | −30° | +30° | −45° | +45°
VAAM [2] | 86.9 | 95.7 | 95.7 | 89.5 | 91.0 | 74.1 | 74.8
FA-EGFC [12] | 92.7 | 99.3 | 99.0 | 92.9 | 95.0 | 84.7 | 85.2
SA-EGFC [12] | 97.2 | 99.7 | 99.7 | 98.3 | 98.7 | 93.0 | 93.6
LE [3]+LDA | 93.2 | 99.9 | 99.7 | 95.5 | 95.5 | 86.9 | 81.8
CRBM [9]+LDA | 87.6 | 94.9 | 96.4 | 88.3 | 90.5 | 80.3 | 75.2
FIP [28]+LDA | 95.6 | 100.0 | 98.5 | 96.4 | 95.6 | 93.4 | 89.8
RL [28]+LDA | 98.3 | 100.0 | 99.3 | 98.5 | 98.5 | 95.6 | 97.8
MVP h^id_2 +LDA | 98.1 | 100.0 | 100.0 | 100.0 | 99.3 | 93.4 | 95.6

Table 2: Face recognition accuracies across views and illuminations. The first and the second best performances are in bold.

Method | Avg. | 0° | −15° | +15° | −30° | +30° | −45° | +45° | −60° | +60°
Raw Pixels+LDA | 36.7 | 81.3 | 59.2 | 58.3 | 37.3 | 35.5 | 21.0 | 19.7 | 12.8 | 7.63
LBP [1]+LDA | 50.2 | 89.1 | 77.4 | 79.1 | 55.9 | 56.8 | 35.2 | 29.7 | 16.2 | 14.6
Landmark LBP [4]+LDA | 63.2 | 94.9 | 83.9 | 82.9 | 68.2 | 71.4 | 52.8 | 48.3 | 35.5 | 32.1
CNN+LDA | 58.1 | 64.6 | 66.2 | 62.8 | 63.6 | 60.7 | 56.4 | 57.9 | 46.4 | 44.2
FIP [28]+LDA | 72.9 | 94.3 | 91.4 | 90.0 | 82.5 | 78.9 | 66.1 | 62.0 | 49.3 | 42.5
RL [28]+LDA | 70.8 | 94.3 | 90.5 | 89.8 | 80.0 | 77.5 | 63.6 | 59.5 | 44.6 | 38.9
MTL+RL+LDA | 74.8 | 93.8 | 91.7 | 89.6 | 83.3 | 80.1 | 70.4 | 63.8 | 51.5 | 50.2
MVP h^id_1 +LDA | 61.5 | 92.5 | 85.4 | 84.9 | 67.0 | 64.3 | 51.6 | 45.4 | 35.1 | 28.3
MVP h^id_2 +LDA | 79.3 | 95.7 | 93.3 | 92.2 | 83.9 | 83.4 | 75.2 | 70.6 | 60.2 | 60.0
MVP h^r_3 +LDA | 72.6 | 91.0 | 86.7 | 84.1 | 74.2 | 74.6 | 68.5 | 63.8 | 55.7 | 56.0
MVP h^r_4 +LDA | 62.3 | 83.4 | 77.3 | 73.1 | 63.9 | 62.0 | 57.3 | 53.2 | 44.4 | 46.9

• Face recognition across views and illuminations To examine the robustness of different feature representations under more challenging conditions, we extend the first setting by employing a larger subset of MultiPIE, which contains images from −60° to +60° and 20
illuminations. Other experimental settings are the same as above. In Table 2, feature representations of different layers in MVP are compared with seven existing features, including raw pixels, LBP [1] on an image grid, LBP on facial landmarks [4], CNN features, FIP [28], RL [28], and MTL+RL. LDA is applied to all the feature representations. Note that the last four methods are built on convolutional neural networks. The only distinction is that they adopt different objective functions to learn features. Specifically, CNN uses the cross-entropy loss to classify face identity as in [26]. FIP and RL utilize a least-squares loss to recover the frontal-view image. MTL+RL is an extension of RL: it employs multiple tasks, each of which is formulated as a least-squares loss, to recover multi-view images, and all the tasks share the feature layers. To achieve fair comparisons, CNN, FIP, and MTL+RL adopt the same convolutional structure as RL [28], since RL achieves competitive results in our first experiment.

The first and second best results are emphasized in bold in Table 2. The identity feature h^id_2 of MVP outperforms all the other methods on all the views with large margins. MTL+RL achieves the second best results except on ±60°. These results demonstrate the superiority of modeling multi-view perception. For the features at different layers of MVP, the performance can be summarized as h^id_2 > h^r_3 > h^id_1 > h^r_4, which conforms to our expectation. h^id_2 performs the best because it is the highest level of identity features. h^id_2 performs better than h^id_1 because pose factors coupled in the input image x have been further removed after one more forward mapping from h^id_1 to h^id_2. h^id_2 also outperforms h^r_3 and h^r_4, because some randomly generated view factors (h^v_2 and h^v_3) have been incorporated into these two layers during the construction of the full view spectrum. Please refer to Fig.
2 for a better understanding.

• Effectiveness of the BP Procedure Fig. 3 (a) compares the convergence rates during training, when using different numbers of samples to estimate the true posterior. We observe that a small number of samples, such as twenty, can lead to reasonably good convergence. Fig. 3 (b) empirically shows that the uniform prior leads to sparse weights during training; in other words, we can calculate the gradient of BP using only one sample, as in Eq. (6). Fig. 3 (b) demonstrates that 20 samples are sufficient, since only 6 percent of the samples' weights approximate one (all the others are zeros). Furthermore, as shown in Fig. 3 (c), the convergence rates of the one-sample gradient and the weighted summation are comparable.

Figure 3: Analysis of MVP on the MultiPIE dataset. (a) Comparison of convergence, using different numbers of samples to estimate the true posterior. (b) Comparison of sparsity of the samples' weights. (c) Comparison of convergence, using the largest-weight sample and using the weighted average over all the samples to compute the gradient.

3.2 Reconstruction Quality

Another experiment is designed to quantitatively evaluate the multi-view reconstruction result. The setting is the same as the first experiment in Sec. 3.1. The gallery images are all in the frontal view (0°). Differently, LDA is applied to the raw pixels of the original images (OI) and the reconstructed images (RI) under the same view, respectively. Fig. 4 plots the accuracies of face recognition with respect to distinct viewpoints. Not surprisingly, under the viewpoints of +30° and −45° the accuracies of RI are decreased compared to OI. Nevertheless, this decrease is comparatively small (< 5%). It implies that the reconstructed images are of reasonably good quality. We notice that the reconstructed images in Fig.
1 lose some detailed textures, while well preserving the shapes of the profile and the facial components.

Figure 4: Face recognition accuracies. LDA is applied to the raw pixels of the original images and the reconstructed images.

3.3 Viewpoint Estimation

This experiment is conducted to evaluate the performance of viewpoint estimation. MVP is compared to Linear Regression (LR) and Support Vector Regression (SVR), both of which have been used in viewpoint estimation, e.g. [8, 13]. Similarly, we employ the first setting as introduced in Sec. 3.1, implying that we train the models using images of a set of identities, and then estimate the poses of the images of the remaining identities. For training LR and SVR, the features are obtained by applying PCA to the raw image pixels. Fig. 5 reports the view estimation errors, which are measured by the differences between the pose degrees of the ground truth and the predicted degrees. The averaged errors of MVP, LR, and SVR are 5.03°, 9.79°, and 5.45°, respectively. MVP achieves slightly better results compared to the discriminative model, i.e. SVR, demonstrating that it is also capable of view estimation, even though it is not designed for this task.

Figure 5: Errors of view estimation (in degree) for LR, SVR, and MVP.

Figure 6: We adopt the images in 0°, 30°, and 60° for training, and test whether MVP can analyze and reconstruct images under 15° and 45°. The reconstructed images (left) and the ground truths (right) are shown in (a). (b) visualizes the full spectrum of the reconstructed images, when the images in unobserved views are used as inputs (first column).

3.4 Viewpoint Interpolation

When the viewpoint is modeled as a continuous variable as described in Sec.
2.1, MVP implicitly captures a 3D face model, such that it can analyze and reconstruct images under viewpoints that have not been seen before, which cannot be achieved with MTL. To verify this capability, we conduct two tests. First, we adopt the images from MultiPIE in 0°, 30°, and 60° for training, and test whether MVP can generate images under 15° and 45°. For each testing identity, the result is obtained by using the image in 0° as input and reconstructing images in 15° and 45°. Several synthesized images (left) compared with the ground truth (right) are visualized in Fig. 6 (a). Although the interpolated images have noise and blurring effects, they have similar views as the ground truth and, more importantly, the identity information is preserved. Second, under the same training setting as above, we further examine whether, when the images of the testing identities in 15° and 45° are employed as inputs, MVP can still generate a full spectrum of multi-view images and meanwhile preserve identity information. The results are illustrated in Fig. 6 (b), where the first image is the input and the remaining are the reconstructed images in 0°, 30°, and 60°.

These two experiments show that MVP essentially models a continuous space of multi-view images such that, first, it can predict images in unobserved views, and second, given an image under an unseen viewpoint, it can correctly extract identity information and then produce a full spectrum of multi-view images. In some sense, it performs multi-view reasoning, which is an intriguing function of the human brain.

4 Conclusions

In this paper, we have presented a generative deep network, called the Multi-View Perceptron (MVP), to mimic the ability of multi-view perception in the primate brain.
MVP can disentangle the identity and view representations from an input image, and can also generate a full spectrum of views of the input image. Experiments demonstrated that the identity features of MVP achieve better performance on face recognition compared to state-of-the-art methods. We also showed that modeling the view factor as a continuous variable enables MVP to interpolate and predict images under viewpoints that are not observed in the training data, imitating the reasoning capacity of humans.
Acknowledgement This work is partly supported by the Natural Science Foundation of China (91320101, 61472410), the Shenzhen Basic Research Program (JCYJ20120903092050890, JCYJ20120617114614438, JCYJ20130402113127496), and the Guangdong Innovative Research Team Program (201001D0104648280).

References
[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. TPAMI, 28:2037–2041, 2006.
[2] A. Asthana, T. K. Marks, M. J. Jones, K. H. Tieu, and M. Rohith. Fully automatic pose-invariant face recognition via 3d pose normalization. In ICCV, 2011.
[3] Z. Cao, Q. Yin, X. Tang, and J. Sun. Face recognition with learning-based descriptor. In CVPR, 2010.
[4] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In CVPR, 2013.
[5] W. A. Freiwald and D. Y. Tsao. Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science, 330(6005):845–851, 2010.
[6] D. González-Jiménez and J. L. Alba-Castro. Toward pose-invariant 2-d face recognition through point distribution models and facial symmetry. IEEE Transactions on Information Forensics and Security, 2:413–429, 2007.
[7] R. Gross, I. Matthews, J. F. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 2010.
[8] Y. Hu, L.
Chen, Y. Zhou, and H. Zhang. Estimating face pose by facial asymmetry and geometry. In AFGR, 2004.
[9] G. B. Huang, H. Lee, and E. Learned-Miller. Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR, 2012.
[10] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 1951.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[12] S. Li, X. Liu, X. Chai, H. Zhang, S. Lao, and S. Shan. Morphable displacement field based image matching for face recognition across pose. In ECCV, 2012.
[13] Y. Li, S. Gong, and H. Liddell. Support vector regression and classification based multi-view face detection and recognition. In AFGR, 2000.
[14] C. Liu and H. Wechsler. Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. TIP, 11:467–476, 2002.
[15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004.
[16] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, 2012.
[17] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis. In ICCV, 2013.
[18] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
[19] S. Ohayon, W. A. Freiwald, and D. Y. Tsao. What makes a cell face selective? the importance of contrast. Neuron, 74:567–581, 2013.
[20] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.
[21] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In BMVC, 2013.
[22] Y. Sun, X. Wang, and X. Tang.
Hybrid deep learning for face verification. In ICCV, 2013.
[23] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, 2013.
[24] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[25] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
[26] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[27] Y. Tang and R. Salakhutdinov. Learning stochastic feedforward neural networks. In NIPS, 2013.
[28] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity preserving face space. In ICCV, 2013.