{"title": "Unsupervised Depth Estimation, 3D Face Rotation and Replacement", "book": "Advances in Neural Information Processing Systems", "page_first": 9736, "page_last": 9746, "abstract": "We present an unsupervised approach for learning to estimate three dimensional (3D) facial structure from a single image while also predicting 3D viewpoint transformations that match a desired pose and facial geometry.\nWe achieve this by inferring the depth of facial keypoints of an input image in an unsupervised manner, without using any form of ground-truth depth information. We show how it is possible to use these depths as intermediate computations within a new backpropable loss to predict the parameters of a 3D affine transformation matrix that maps inferred 3D keypoints of an input face to the corresponding 2D keypoints on a desired target facial geometry or pose.\nOur resulting approach, called DepthNets, can therefore be used to infer plausible 3D transformations from one face pose to another, allowing faces to be frontalized, transformed into 3D models or even warped to another pose and facial geometry.\nLastly, we identify certain shortcomings with our formulation, and explore adversarial image translation techniques as a post-processing step to re-synthesize complete head shots for faces re-targeted to different poses or identities.", "full_text": "Unsupervised Depth Estimation,\n3D Face Rotation and Replacement\n\nJoel Ruben Antony Moniz1\u21e4, Christopher Beckham2,3\u21e4, Simon Rajotte2,3,\n\nSina Honari2, Christopher Pal2,3,4\n\n1Carnegie Mellon University, 2Mila-University of Montreal, 3Polytechnique Montreal, 4Element AI\n\n1jrmoniz@andrew.cmu.edu, 2honaris@iro.umontreal.ca, 3firstname.lastname@polymtl.ca\n\nAbstract\n\nWe present an unsupervised approach for learning to estimate three dimensional\n(3D) facial structure from a single image while also predicting 3D viewpoint\ntransformations that match a desired pose and facial geometry. 
We achieve this by inferring the depth of facial keypoints of an input image in an unsupervised manner, without using any form of ground-truth depth information. We show how it is possible to use these depths as intermediate computations within a new backpropable loss to predict the parameters of a 3D affine transformation matrix that maps inferred 3D keypoints of an input face to the corresponding 2D keypoints on a desired target facial geometry or pose. Our resulting approach, called DepthNets, can therefore be used to infer plausible 3D transformations from one face pose to another, allowing faces to be frontalized, transformed into 3D models or even warped to another pose and facial geometry. Lastly, we identify certain shortcomings with our formulation, and explore adversarial image translation techniques as a post-processing step to re-synthesize complete head shots for faces re-targeted to different poses or identities. 1\n\n1 Introduction\n\nFace rotation is an important task in computer vision. It has been used to frontalize faces for verification [8; 19; 25; 28] or to generate faces of arbitrary poses [22; 18]. In this paper we present a novel unsupervised learning technique for face rotation and warping from a 2D source image, whose facial appearance will be used in the rotation, to a target face, onto which the facial pose and geometry inferred from the source image is mapped. A use case is when we have an image of someone in a particular target pose and we want to put a given source face into that pose, without knowing the exact target face pose. This can be leveraged, for example, in the advertising industry, where putting someone in a particular location can be costly or infeasible, or in the movie industry, where the main actor's limited time or high cost can necessitate using another actor, whose face is later replaced by the main actor's.
This is achieved by using neural networks to estimate the source face depth and the 3D affine parameters that warp the source onto the target face. These neural networks use a novel loss formulation for the structured prediction of keypoint depths. Once the 3D affine transformation matrix is estimated, it can be used to warp the source image onto the target face geometry using a textured triangular mesh. The use of a 3D affine transform means that we can capture both a 3D rotation of the face to a new viewpoint as well as a global non-Euclidean warping of the geometry to match a target face. We call these neural networks Depth Estimation-Pose Transformation Hybrid Networks, or DepthNets for short.\nOur first contribution is to propose a neural architecture that predicts both the depth of source keypoints as well as the parameters of a 3D geometric affine transformation, which constitute the\n\n*Indicates equal contribution.\n1Code will be released at: https://github.com/joelmoniz/DepthNets/\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nexplicit outputs of the DepthNet model. The predicted depth and affine transformation can then be used to map a source face to a target face for object orientation, distortion and viewpoint changes.\nOur second contribution consists of the observation that, given 3D source and 2D target keypoints, closed form least squares solutions exist for estimating geometric affine transformation models between these sets of keypoint correspondences, and we can therefore develop a model that captures the dependency between depth and the affine transformation parameters. More specifically, we express the affine transformation as a function of the pseudoinverse transformation of 2D keypoints in a source image, augmented by inferred depths, and the target keypoints.
Thus, the second and major contribution of this work is capturing the relationship between the estimated affine transformation and the inferred depth as a deterministic relationship. In this formulation, DepthNet only predicts depth values explicitly, and the affine parameters are inferred through a pseudoinverse transformation of source and target keypoints. Here, one can directly optimize through the solutions of what might otherwise be formulated as a secondary minimization step.\nOur proposed DepthNet maps the central region of the source face to the target geometry. This leads to background mismatch when warping one face to another. Finally, our third contribution is to use an adversarial unpaired image-to-image transformation approach to repair the appearance of 3D models inferred from DepthNet. Together, these contributions allow 3D models of faces that produce realistic images in the target pose. Our proposed method can be used for pose normalization or face swaps with no manually specified 3D face model. To the best of our knowledge, this is the first neural network based model that estimates a 3D affine transformation for face rotation while requiring neither ground-truth 3D images nor any ground-truth 3D face information such as depth.\n\n2 Our Approach\n\nAs we have outlined above, our approach uses neural networks for inferring depth and geometric transformation, referred to as DepthNets, and an adversarial image-to-image transformation network which improves the quality of the appearance of a 3D model inferred from a DepthNet.\n\nDepthNets\n\nWe propose three DepthNet formulations, described in Sections 2.1, 2.2, and 2.3.
For each of the three models we explore two architectural scenarios: (A) a Siamese-like architecture that uses the source and target images themselves as well as keypoints extracted from these images, and (B) a fully-connected neural network variant which uses only the facial keypoints in the source and target images. See Figure 1 (left) for details.\n\n[Figure 1, left: each Siamese branch maps a 3@80x80 input through conv/pool stages (conv 4x4, pool 2x2 -> 32@38x38; conv 3x3, pool 2x2 -> 48@18x18; conv 2x2, pool 2x2 -> 64@8x8), followed by fully-connected layers (2048, 512+2N, 256) and an output of size N+8: 8 affine parameters and N depth values.]\n\nFigure 1: (Left) DepthNet architecture. The blue region is only used in case (A) and the red part is used in both cases (A) and (B), described in Section 2. The orange output (the 8 affine transformation parameters) is predicted only by the model variations described in Sections 2.1 and 2.2, and not by the model described in Section 2.3. All three models predict the N depth values of the source keypoints. C, P, and FC correspond to valid conv, pool and fully-connected layers. The two paths of the Siamese network share parameters, and the black dots indicate concatenating keypoint values to FC units. (Right) Visualizing face rotation by re-projecting a frontal face (far left) to a range of other poses defined by the faces in the row above (in each pair of rows). In this experiment, we only use keypoints from the top row in the DepthNet model (Model 7 in Table 1).\n\nIt is interesting to note that if DepthNets are used to register a set of images of objects to the same common viewpoint, the same image and geometry can be used as the target. This is the case for the frontalization of faces, for example. While the DepthNet framework is sufficiently general to be applied to any object type where 2D keypoint detections have been made, our experiments here focus on faces.
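As a concrete, purely illustrative sketch of the keypoint-only case (B), the following toy Python shows the interface of the network in Figure 1: 2N source and 2N target keypoint coordinates in, N depth values and 8 affine parameters out. The layer size and random untrained weights below are made up; the real model uses the conv/FC stack of Figure 1.

```python
import random

random.seed(0)

def linear(x, n_out):
    # Dense layer with random, untrained weights -- illustration only.
    w = [[random.uniform(-0.1, 0.1) for _ in x] for _ in range(n_out)]
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(x):
    return [max(0.0, v) for v in x]

def depthnet_b(src_kpts, tgt_kpts, n_hidden=32):
    # src_kpts / tgt_kpts: lists of N (x, y) keypoints each.
    n = len(src_kpts)
    x = [c for p in src_kpts for c in p] + [c for p in tgt_kpts for c in p]
    h = relu(linear(x, n_hidden))
    out = linear(h, n + 8)          # N depth values + 8 affine parameters
    return out[:n], out[n:]

src = [(0.1 * i, 0.2 * i) for i in range(5)]   # 5 toy keypoints
tgt = [(0.2 * i, 0.1 * i) for i in range(5)]
depths, affine = depthnet_b(src, tgt)
print(len(depths), len(affine))                # 5 8
```

The only point of the sketch is the output split: the first N units are the depth proxies, the remaining 8 the affine parameters (models of Sections 2.1 and 2.2; the model of Section 2.3 keeps only the N depths).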
We describe the three variants of DepthNets below.\n\n2.1 Predicting Depth and Viewpoint Separately\n\nIn this variant of DepthNets, the model predicts both depths and viewpoint geometry, but as separate explicit outputs of a neural network. The input comprises only the geometry and pose of the source and target faces (encoded in the form of a 2D keypoint template), in case (B), or both keypoints and images of the source and target faces, in case (A). The key phases of this stage are described by the sequence of steps given below:\n1. Keypoint extraction: Raw (x, y) pixel coordinates corresponding to the keypoints of each image are extracted using a Recombinator Network (RCN) [10] architecture, and then concatenated before being passed into the keypoint processing step.\n2. (Optional) Image Feature Extraction: DepthNets can be conditioned on only keypoints, case (B), or on keypoints and the original images, case (A). We can therefore optionally subject the source and target images to alternating conv-maxpool layers. If this component of the architecture is used, the last spatial feature maps of the Siamese architecture are concatenated before being given to a set of densely connected hidden layers.\n3. Keypoint processing: In this step the keypoints are passed through a set of hidden layers. If the Image Feature Extraction stage is used, the keypoints are concatenated with the image features, and the result is in turn fed to densely connected layers. The output layer of this phase is of size N + 8, where N is the number of keypoints. The first N outputs represent the depth proxy, and the last 8 form a 4 x 2 matrix representing the learned parameters of the affine transform. See Figure 1.\n4. Geometric Affine Transformation Normalizer: This phase applies the predicted affine transform to each (depth augmented) source keypoint to estimate its target location.
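Concretely, this normalization step applies the 2 x 4 matrix [m1 m2 m3 tx; m4 m5 m6 ty] to the depth-augmented homogeneous keypoint [x_s, y_s, z_p, 1]. A minimal pure-Python sketch (the parameter and keypoint values below are made up for illustration):

```python
# Sketch of the Geometric Affine Transformation Normalizer: apply the
# predicted 3D-to-2D affine transform to one source keypoint augmented
# with its predicted depth proxy z_p. Values are made up for illustration.

def normalize_keypoint(kp, z_p, m):
    # kp = (x_s, y_s); m = (m1, m2, m3, tx, m4, m5, m6, ty)
    x, y = kp
    m1, m2, m3, tx, m4, m5, m6, ty = m
    x_n = m1 * x + m2 * y + m3 * z_p + tx
    y_n = m4 * x + m5 * y + m6 * z_p + ty
    return (x_n, y_n)  # estimate of the target keypoint location

# Identity rotation, zero depth contribution, translation by (1, 2):
m = (1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 2.0)
print(normalize_keypoint((3.0, 4.0), 0.5, m))  # (4.0, 6.0)
```

The same eight parameters are shared across all keypoints of an image pair; only z_p differs per keypoint.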
Let (x_s^i, y_s^i) represent the i-th source keypoint, (x_n^i, y_n^i) the corresponding normalized source keypoint estimated by applying the affine transformation matrix, (x_t^i, y_t^i) the i-th target keypoint (used as ground truth (GT)), and I_s and I_t the source and target images respectively. Depending on which underlying architectural variant we use, two cases arise: one that utilizes only the keypoints (B), and another utilizing both the keypoints and the images (A). Since the keypoints are generated using RCNs, they are technically functions of the input images: [x_s, y_s] = R(I_s), and [x_t, y_t] = R(I_t). Depending on the (A) or (B) variant, the i-th keypoint's predicted depth proxy z_p^i is inferred as a function of the input keypoints, or of both input keypoints and input images. In both cases the keypoints are derived from the images, so z_p^i = z_p^i(I_s, I_t). Similarly, the 3D-2D affine transform F is a function of the images, such that F = F(I_s, I_t), where the 8 predicted parameters are: m = {m_1, m_2, m_3, t_x, m_4, m_5, m_6, t_y}. These constitute the 3D-2D affine transform which is used for all keypoints. In other words, each of the i points is transformed using [x_n^i, y_n^i]^T = F(I_s, I_t) [x_s^i, y_s^i, z_p^i(I_s, I_t), 1]^T, or:\n\n[x_n^i; y_n^i] = [m_1, m_2, m_3, t_x; m_4, m_5, m_6, t_y] [x_s^i; y_s^i; z_p^i(I_s, I_t); 1]\n\nThe loss function of a DepthNet is obtained by transforming the source face to match the target face, using the simple squared error between the corresponding target keypoint vector x_t^i = [x_t^i, y_t^i]^T, as GT values, and the normalized source keypoint vector [x_n^i, y_n^i]^T. The loss for one example where we predict depth and affine viewpoint geometry can therefore be expressed as:\n\nL = \sum_{i=1}^{K} || x_t^i - F(I_s, I_t) [x_s^i, y_s^i, z_p^i(I_s, I_t), 1]^T ||^2    (1)\n\n5. 
Image Warper: This phase consists of using the depth proxy and the generated affine transform matrix to actually warp the face from its source pose so that it matches the target object geometry. The final projection to 2D is achieved by simply dropping the transformed z coordinate (which corresponds to an orthographic projection model). In the case of DepthNets, this orthographic projection is effectively embedded in the Geometric Affine Transformation Normalizer step, since the affine row corresponding to the z coordinate is not predicted, essentially dropping it.\nAs we operate on keypoints, the actual warping of pixels can be performed with a high-quality OpenGL pipeline that performs the warp separately from the rest of the architecture. The source image, the keypoints augmented with depth, and the affine matrix are passed to the OpenGL pipeline to warp the source image towards the target pose. This OpenGL warping is not needed during DepthNet training, which means we do not have to feed forward or backpropagate through OpenGL. In summary, for step 1 the RCN model [10] is used, for steps 2 to 4 the DepthNet model, shown in Figure 1 (left), is trained, and for step 5 an OpenGL pipeline is used. No data or parameters are needed to train the OpenGL pipeline; it warps images by directly using the provided data.\n\n2.2 Estimating Viewpoint Geometry as a Second Step\n\nIn this model variant, training is similar to Section 2.1 and the model outputs depth and 3D affine transformation parameters. However, at test time, rather than using the predicted 3D affine transformation for pairs of faces, we use only the predicted depths and estimate the affine geometry parameters in a second estimation step. More precisely, given 3D points for a scene and the corresponding 2D points for a target geometry, it is possible to formulate the estimation of a 3D affine transformation as a linear least squares estimation problem.
An overdetermined system of the form Am = x_t for this problem can be constructed as shown in (2):\n\nA m = [x_s^1, y_s^1, z_s^1, 0, 0, 0, 1, 0; 0, 0, 0, x_s^1, y_s^1, z_s^1, 0, 1; x_s^2, y_s^2, z_s^2, 0, 0, 0, 1, 0; 0, 0, 0, x_s^2, y_s^2, z_s^2, 0, 1; ...; x_s^K, y_s^K, z_s^K, 0, 0, 0, 1, 0; 0, 0, 0, x_s^K, y_s^K, z_s^K, 0, 1] [m_1; m_2; m_3; m_4; m_5; m_6; t_x; t_y] = [x_t^1; y_t^1; x_t^2; y_t^2; ...; x_t^K; y_t^K]    (2)\n\nThis corresponds to an affine camera model followed by an orthographic projection to 2D keypoints. This setup also leads to the following closed form solution for the affine transformation parameters:\n\nm = (A^T A)^{-1} A^T x_t,    (3)\n\nwhere this pseudoinverse based transformation is parameterized by the reference points and their predicted depths.\n\n2.3 Joint Viewpoint and Depth Prediction\n\nOur key observation is that one can alternatively use the closed form analytical solution of Eq. (3) for the least squares estimation problem as the underlying affine transformation matrix within the loss function. This leads to a special form of structured prediction problem for geometrically consistent depths and affine transformation matrices. For each image we have:\n\nL = \sum_{i=1}^{K} || x_t^i - reshape((A^T A)^{-1} A^T x_t) [x_s^i, y_s^i, z_p^i(I_s, I_t), 1]^T ||^2\n\nwhere the matrix A is parameterized as a function of x_s as shown in Eq. (2). In this variant, the model explicitly outputs only depth values during train and test time. The affine transformation matrix in the equation above is replaced by Eq. (3), which expresses the affine transformation as a pure function of the source and target keypoints plus the inferred depth.
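Eqs. (2) and (3) can be checked with a small pure-Python example (made-up keypoints; no external libraries): build A from 3D source keypoints, solve the normal equations m = (A^T A)^{-1} A^T x_t by Gaussian elimination, and verify that a known affine transform is recovered.

```python
# Numeric sketch of Eqs. (2)-(3): fit the 8 affine parameters from K
# depth-augmented source keypoints and their 2D targets.

def build_A(pts3d):
    A = []
    for (x, y, z) in pts3d:
        A.append([x, y, z, 0, 0, 0, 1, 0])   # row producing x_t^i
        A.append([0, 0, 0, x, y, z, 0, 1])   # row producing y_t^i
    return A

def solve(M, b):
    # Gauss-Jordan elimination with partial pivoting on [M | b].
    n = len(M)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * v for a, v in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_affine(pts3d, pts2d):
    # m = (A^T A)^{-1} A^T x_t, computed via the normal equations.
    A = build_A(pts3d)
    xt = [c for p in pts2d for c in p]
    AtA = [[sum(A[k][i] * A[k][j] for k in range(len(A))) for j in range(8)]
           for i in range(8)]
    Atx = [sum(A[k][i] * xt[k] for k in range(len(A))) for i in range(8)]
    return solve(AtA, Atx)  # (m1, m2, m3, m4, m5, m6, tx, ty)

# Source keypoints with depths, projected by a known affine as "targets":
src = [(0, 0, 1), (1, 0, 2), (0, 1, 3), (1, 1, 1), (2, 1, 0), (1, 2, 2)]
true_m = [1.0, 0.1, 0.5, -0.2, 1.0, 0.3, 2.0, 3.0]  # m1..m6, tx, ty
tgt = [(true_m[0] * x + true_m[1] * y + true_m[2] * z + true_m[6],
        true_m[3] * x + true_m[4] * y + true_m[5] * z + true_m[7])
       for (x, y, z) in src]
m_hat = fit_affine(src, tgt)
print([round(v, 6) for v in m_hat])  # [1.0, 0.1, 0.5, -0.2, 1.0, 0.3, 2.0, 3.0]
```

Because m is a smooth function of the depths through A, a network that predicts the depths can backpropagate through this computation, which is what the formulation of Section 2.3 exploits. Note also that scaling all depths by a constant while dividing m3 and m6 by that constant leaves the projections unchanged, so the recovered depths are only determined up to such rescalings.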
The key difference of this formulation compared to Sections 2.1 and 2.2 is that the geometric affine transformation parameters are no longer predicted by DepthNet; at both train and test time, the affine transform is obtained by solving the least squares problem through the pseudoinverse based transformation. Since z_s^i = z_p^i({x_s^j, y_s^j, x_t^j, y_t^j}_{j=1...N}) is predicted within the analytical formulation of the solution to the least squares minimization problem, we can backpropagate through the solution of a minimization problem that depends on the predicted depths.\nWhile we leverage keypoints for depth estimation, the proposed approach is novel in how the depth is estimated. Note that it is unsupervised with respect to depth labels. No depth supervision, either by using depth targets (as in [2; 13; 15]) or by using depth in an adversarial setting (as in [24]), is used to estimate depth values for the base DepthNet models described in Sections 2.1, 2.2, and 2.3.\nThe depths learned for keypoints by these approaches are not necessarily true depths, but are likely to correlate strongly with the actual depth of each keypoint. This is because, even though the method succeeds (as we shall see below) in aligning poses, the inferred depth and the affine transform may each be scaled by factors that cancel each other out (i.e., by factors which are multiplicative inverses of each other). Real world viewpoint geometry also involves perspective projection.\n\nAdversarial Image-to-Image Transformation\n\nDepthNet transforms the central region of the source face to the target pose.
Inevitably, the face background will be missing, which might make the proposed method unsuitable for applications where the full face is required. To address this issue, we utilize CycleGAN [30], an adversarial image-to-image translation technique. This serves to repair the background of faces that have undergone frontalization or face swap through the DepthNet pipeline. Importantly, the adversarial nature of CycleGAN allows one to perform image transformation between two domains without the requirement of paired data. In our work, we perform experiments translating between various domains of interest; one example is translating between the domain of images in the dataset (i.e. the ground truth) and the domain of images where the DepthNet output is pasted onto the face region (in the case of face swap). By doing so we clean up the face background in an unsupervised manner.\n\n3 Experiments\n\n3.1 DepthNet Evaluation on Paired Faces\n\nFor the experiments in this section, we use a subset of the VGG dataset [16], training and validating on all possible pairs of images belonging to the same identity for 2401 identities. This yields 322,227 train and 43,940 validation pairs. See the Supplementary Material for experimental setup details.\n\nModel | Color | MSE | MSE_norm\n1) A simple 2D affine registration | grey | 9.547 | 1.562\n2) A 3D affine registration model using an average 3D face template | purple | 7.486 | 0.724\n3) A DepthNet that separately estimates depth and geometry | brown | 6.292 | 0.568\n4) The model above, but with a Siamese CNN image model | violet | 6.115 | 0.539\n5) Secondary least squares estimation for visual geometry using the depths from 3) | red | 5.184 | 0.400\n6) Secondary least squares estimation for visual geometry using the depths from 4) | green | 5.175 | 0.399\n7) Backpropagation through the pseudoinverse based solution for visual geometry | orange | 4.932 | 0.357\n8) The model above, but with a Siamese CNN image model | blue | 4.891 | 0.349\n\nTable 1: (left) Comparing the Mean Squared Error (MSE) and MSE normalized by inter-ocular distance (MSE_norm) of different models. (right) Histogram of Mean Squared Errors. The second column in the Table (on left) corresponds to the color of the model in the figure (on right).\n\nWe explore the three variants of DepthNets described in Sections 2.1, 2.2, and 2.3, each with two architectural cases (A) and (B), depending on whether image features are used in addition to keypoints or not. We also compare with a number of baselines. We measure the mean squared error (MSE) between the estimated keypoints on the target face (the normalized source keypoints) and the ground truth target keypoints. Results for the following models are shown in Table 1:\n1) A baseline model using a simple 2D affine transformation for registration.\n2) We generate a 3D average face template from the 3DFAW dataset [14; 26; 6] by aligning the 3D keypoints of all faces in the dataset to a front-facing face using Procrustes superimposition. We report error by mapping the template face to each source face via Procrustes superimposition (to get a 3D face f) and then using an affine transformation from the 3D face f to the target face.\n3, 4) Our proposed approach predicting both depth and geometry (described in Section 2.1).\n5, 6) The models described in Section 2.2.
Note that during training, these two cases are similar to models 3 and 4 in Table 1.\n7, 8) The pseudoinverse formulation model described in Section 2.3.\nAs observed in Table 1, a simple 2D affine transform (model 1), which does not estimate depth, and a template 3D face (model 2) yield high errors when mapping to the target faces. The DepthNet models obtain lower errors, and the pseudoinverse formulation (models 7 and 8) further reduces the error by 10%. The CNN models slightly reduce errors compared to their equivalent models that rely only on keypoints.\n\n3.2 DepthNet Evaluation on Unpaired Faces and Comparison to other Models\n\nIn this section we train DepthNet on unpaired faces belonging to different identities and compare with other models that estimate depth. We use the 3DFAW dataset [14; 26; 6], which contains 66 3D keypoints, to facilitate comparison with ground truth (GT) depth. It provides 13,671 train and 4,500 valid images. We extract from the valid set 75 frontal, 75 left-looking and 75 right-looking faces, yielding a total of 225 test images, which provides a total of 50,400 source and target pairs. We train the pseudoinverse DepthNet model that relies only on keypoints (model 7 in Table 1). We also train a variant of DepthNet that applies an adversarial loss on the depth values (DepthNet+GAN). This model uses a conditional discriminator that is conditioned on 2D keypoints and discriminates GT from estimated depth values. The model is trained with both keypoint and adversarial losses.\n\nSource \ Target | DepthNet: Left, Front, Right, Avg | DepthNet+GAN: Left, Front, Right, Avg\nLeft | 24.67, 29.70, 27.71, 27.36 | 59.78, 59.63, 59.67, 59.69\nFront | 25.54, 26.19, 27.22, 26.32 | 58.77, 58.61, 58.67, 58.68\nRight | 21.66, 21.48, 23.87, 22.34 | 59.97, 59.70, 59.60, 59.76\n\nTable 2: Comparing DepthCorr for different DepthNet models when mapping variant source to target poses.
The Avg column reports the average over the three preceding columns.\n\nWe measure the correlation matrix between GT and estimated depths, where element k of the diagonal indicates the correlation between estimated and ground truth depth values for keypoint k, yielding a value between -1 and 1. We report the sum of the absolute values of the diagonal of this matrix, denoted DepthCorr. We compare DepthNet models on DepthCorr in Table 2. For this experiment we take every possible pair of source and target faces, where source and target are each one of the {left, front, right} looking faces. This yields a total of 5,550 pairs when the source and the target are from the same subset, and 5,625 pairs otherwise. This experiment measures the accuracy of depth estimation of the DepthNet models on different orientations of source-target faces. The baseline DepthNet model, which does not leverage depth labels, performs well in the different cases. DepthCorr more than doubles for the DepthNet+GAN model, indicating that a direct supervision loss using depth labels can enhance depth estimation.\n\nModel | Need Depth | Manual Init. | MSE (x10^5) | DepthCorr (Left pose) | DepthCorr (Front pose) | DepthCorr (Right pose)\nGT Depth | Yes | - | 8.86 ± 6.55 | 66 | 66 | 66\nAIGN [24] | Yes | No | 9.06 ± 6.61 | 44.08 | 50.81 | 49.04\nMOFA [20] | No | Yes | 8.75 ± 6.33 | 11.14 | 15.97 | 17.54\nDepthNet (Ours) | No | No | 7.65 ± 6.97 | 27.36 | 26.32 | 22.34\nDepthNet + GAN (Ours) | Yes | No | 8.74 ± 6.24 | 59.69 | 58.68 | 59.76\n\nTable 3: Comparing MSE and DepthCorr for different models. A lower MSE indicates the model maps better to the target faces. A higher DepthCorr indicates more correlation between estimated and GT depths.\n\nWe compare our two DepthNet models with three baselines: 1) AIGN [24], 2) MOFA [20] and 3) GT Depth (no model trained). AIGN estimates 3D keypoints conditioned on 2D heatmaps of the keypoints.
MOFA estimates a 3D mesh using only an image. We implemented the AIGN model and asked the authors of MOFA to run their model on our test set; they provided MOFA's results for 134 images in the test set. In Table 3 we compare these three models with our DepthNet models on DepthCorr. We also compare them on MSE, which is measured between GT and estimated target keypoints. Since the three baselines estimate depth on a single image, due to their different model formulations, we first estimate m using the closed form solution in Eq. (3) and then apply m to the estimated source keypoints to obtain the target keypoint estimates. We contrast the estimated values with the GT target keypoints. As shown in Table 3, GT depth has the highest DepthCorr (the maximum possible value). The depths estimated by DepthNet+GAN and AIGN have stronger correlation to GT depth compared to the baseline DepthNet and MOFA, while the baseline DepthNet performs better than MOFA. On MSE, the baseline DepthNet model obtains the smallest error when mapping to target faces, indicating it is better suited for this task.\nIn Figure 2 we plot heatmaps of the estimated depth of different models (on the Y axis) against the GT depth (on the X axis), aggregated over all 66 keypoints on all test data. As can be seen, the depths estimated by the DepthNet+GAN and AIGN models form 45-degree rotated ellipses, showing a stronger linear correspondence with the GT depth compared to the baseline DepthNet and MOFA.\n\nFigure 2: Predicted (Y axis) versus Ground Truth (X axis) depth heatmaps for different models.\n\nIn Figure 3 we show some estimated depth samples for different models (see more samples in Figure S1). AIGN and DepthNet+GAN generate more realistic results. MOFA generates very similar face templates for different poses.
Baseline DepthNet estimates reliable depth values in most cases; however, it has some failure modes, as shown in the last row.\n\nComparing the different models in Table 3: MOFA requires proper initialization to map face meshes to each image, and AIGN requires depth labels to train the model. Our baseline DepthNet model requires neither depth labels nor any manual tuning. The results also show that DepthNet can work well on unpaired data. We would also like to emphasize that MOFA and AIGN are designed to estimate a 3D model, while DepthNet is designed to estimate the parameters that facilitate warping one face pose to another without having depth values, so these models are designed to solve different problems.\n\nFigure 3: Depth visualization for different models (color coded by depth). From left to right: RGB image, Ground Truth, DepthNet, DepthNet+GAN, AIGN and MOFA estimated depth values.\n\nAn interesting observation is that GT depth obtains a higher MSE than DepthNet. This can be due to the absence of a perspective projection between source and target faces. However, since DepthNet is trained to map to the target faces, it learns the affine parameters so as to minimize this loss.\n\n3.3 Face Rotation, Replacement and Adversarial Repair\n\nIn this section we show how DepthNet can be used in different applications. In Figure 1 (right) we visualize face rotation by re-projecting a frontal face from Multi-PIE [6] (far left) to a range of other poses defined by the faces in the row above. Since DepthNet (case B) computes the transformation on keypoints rather than pixels, it is robust to illumination changes between source and target faces. See Figures S2 to S5 for further examples. Note that DepthNet preserves identity well. However, it carries the expression forward from source to target, since using a global affine transformation imparts a degree of robustness to dramatic expression changes.
The views in these figures are rendered from a 3D model in OpenGL. Note that the model aligns well to the target face poses.\n\nFigure 4: Background synthesis with CycleGAN. Left to right: source face; keypoints overlaid; DepthNet (DN); DN + background -> frontal.\n\nIn another experiment, we perform face frontalization with a synthesized background. Here we use CycleGAN to add background detail to a face that has been frontalized with DepthNet. Referring to Figure 4, we perform this by conditioning the CycleGAN on the DepthNet image (column 3) and the background of column 2 (masking the interior face region determined by the convex hull spanned by the keypoints). The second domain contains ground truth frontal faces. This experiment shows how to leverage DepthNet for full face generation. Note that we do not use identity information in this experiment; however, it could be used to better preserve identity.\n\nFigure 5: Left to right: source face; target face; warp to target; repaired result.\n\nFinally, we perform face swaps, where we warp the face of one identity onto the geometry and background of another identity using DepthNet. To do so, we paste the face rotated by DepthNet onto the background of the target image and train a CycleGAN to map from the domain of 'swapped in faces' to the ground truth faces in our dataset, effectively learning to clean up face swaps so that the face region matches the hair and background. Some examples of this procedure are shown in Figure 5.\n\n4 Related Work\n\n4.1 3D Transformation on Faces\n\nWhile there is a large body of literature on 3D facial analysis, many standard techniques are not applicable to our setting here.
As an example, morphable models [1] cover a wide variety of approaches which are capable of high quality 3D reconstructions, but such methods usually require 3D face scans or reconstructions from multi-view stereo to be assembled so as to learn complex parametric distributions over face shapes. A close approach to our own is that of [7] on viewing real world faces in 3D. Similar to our work, this approach does not require aligned 3D face scans, highly engineered models or manual interventions. They make the observation that if 2D keypoints can be obtained from a single input image of a face and these keypoints are matched to an arbitrary 3D target geometry, then standard camera calibration techniques can be used to estimate plausible intrinsics and extrinsics of the camera. This allows the estimated camera matrix, 3D rotation matrix and 3D translation vector to be used to transform the target 3D model to the pose of the query image, from which an approximate depth can be obtained. Hassner et al. [8] explore the use of a single unmodified 3D surface as an approximation to the shape of all input faces. In contrast, our approach only requires 2D keypoints from the source and target faces as input. It then estimates the depth of the source face keypoints, thereby inferring an image-specific 3D model of the face.\nDeepFace [19] uses face frontalization to improve the performance of a face verification system. It uses a 3D mask composed of facial keypoints, detects the corresponding locations of these keypoints in the image, and maps the 2D keypoints onto a 3D face model to frontalize it. DeepFace, however, maps to a template 3D face, therefore always mapping to a specific pose and geometry. DepthNet, on the other hand, can map to any pose and geometry, giving it more expressive flexibility.\n\n4.2 Generative Adversarial Networks on Face Rotation\n\nRecently, adversarial models [12; 22; 25; 27; 18; 28] have explored face rotation.
TP-GAN [12] performs face frontalization by introducing several losses to preserve the identity and symmetry of the frontalized faces. PIM [28] frontalizes faces with a composed adversarial loss and then extracts pose-invariant features for face recognition. These models are mainly aimed at face verification, and can only perform face frontalization. Another limitation of these models is that they require ground-truth frontal images of the same identity during training. DR-GAN [22] rotates faces to any target pose by using a discriminator that also performs identity classification in addition to pose prediction, in order to preserve identity and pose. While these models perform pure face rotation of a 2D face, our model can warp the input face to any other target face, allowing the input face to be warped to any other identity, with a different geometry and pose. Moreover, our model also estimates the 3D geometric affine transformation parameters explicitly, allowing these parameters to be used later, e.g., for face texture swaps.

FF-GAN [25], DA-GAN [27], and FaceID-GAN [18] estimate the parameters of either a 3D Morphable Model (3DMM), as in [25; 18], or a source-to-target pose transformation, as in [27]. FF-GAN uses 3DMM parameters to frontalize faces in an adversarial approach, while FaceID-GAN uses the 3DMM parameters to generate any target pose. These models, however, train the 3DMM on ground-truth labels such as identity, expression and pose. DepthNet, on the other hand, estimates depth and affine transformation parameters without requiring ground-truth affine or depth labels, or pre-training. Similar to DepthNet, DA-GAN [27] estimates the parameters of an affine transformation model that maps a 2D face to a 3D face. Unlike DepthNet, which estimates depth on the source face, DA-GAN uses depth in a template target face.
While their approach eliminates the need for depth estimation, it only allows the source face to be mapped to the target template geometry, whereas DepthNet can map the source face to any target geometry, provided by a target image or its keypoints. We demonstrate the application of this flexibility on the face replacement task.

The aforementioned adversarial models use an identity-preserving loss to maintain identity. The core DepthNet model does not need identity labels and preserves identity well (as shown in Figure 1 (right)). However, identity information can be used by the proposed adversarial components, as in background synthesis, to further improve the results. Unlike some of these models that take the target pose as input, DepthNet uses the target keypoints to estimate the target geometry and does not require the target pose. This has several advantages: 1) DepthNet can map to the geometry of the target face in addition to the pose, and 2) in the face replacement task, DepthNet can place the warped source face directly onto the target face location. Its application is shown in the face swap experiment in Section 3.3.

4.3 Depth Estimation

Thewlis et al. [21] propose a mapping technique to learn a proxy of 2D landmarks in an unsupervised way. A semi-supervised technique has also been proposed in [11] that improves landmark localization by using weaker class labels (e.g., emotion or pose) and also by making the model predict equivariant variations of landmarks when such transformations are applied to the image. Similar to these approaches, DepthNet also maps a source to a target to learn its parameters.
However, unlike these two approaches, which estimate 2D landmarks, DepthNet estimates the depth of the landmarks using 2D matching of keypoints, by formulating the affine parameters as a function of depth-augmented keypoints in a closed-form solution.

While several models [13; 2; 15] estimate depth with direct supervision, there have been recent models [29; 5; 3] that estimate depth with an unsupervised training procedure. These models rely on pixel reconstruction using frames that are captured from very similar scenes, e.g., nearby frames of a video [29] or left-right frames captured by stereo cameras [5; 3]. These models estimate depth on one frame and then, using the disparity map, measure how the pixel values of nearby frames compare to each other. To do this, they also require camera intrinsic parameters, e.g., the focal length or the distance between cameras. Unlike these models, our approach does not require a source-to-target pixel mapping. This allows mapping faces from different people with completely different skin colors, without knowing the camera parameters or how the cameras are positioned with respect to each other. Therefore, DepthNet is not susceptible to variations in illumination between the source and target faces.

Tung et al. [23] estimate 3D human pose in videos, where they use synthetic data to pre-train the internal parameters of the model and fine-tune them with keypoint, segmentation and motion losses. Adversarial Inverse Graphics Networks (AIGN) [24] estimate 3D human pose from 2D keypoint heatmaps in a semi-supervised manner with a formulation similar to that of CycleGAN. They apply an adversarial loss on the 3D pose to make it look realistic. These models leverage depth values either through synthetic data [23] or through adversarial usage of ground-truth depth values [24]. Unlike these models, DepthNet does not rely on any depth signal, either directly or indirectly.
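To make the closed-form formulation concrete, the affine fit from depth-augmented keypoints can be sketched as an ordinary least-squares solve: each source keypoint (x, y) is augmented with a depth z to form a homogeneous 4-vector, and a 2×4 affine matrix is solved for that best maps these vectors to the target 2D keypoints. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `fit_affine` and the toy data are hypothetical, and in DepthNet the depths are produced by a network with gradients backpropagated through this solve.

```python
import numpy as np

def fit_affine(src_xy, depth, tgt_xy):
    """Least-squares fit of a 2x4 affine matrix m such that
    m @ [x, y, z, 1]^T ~= [x', y'] for every keypoint pair.

    src_xy : (N, 2) source 2D keypoints
    depth  : (N,)   depths for the source keypoints (predicted by a
                    network in DepthNet; plain inputs in this sketch)
    tgt_xy : (N, 2) target 2D keypoints
    """
    n = src_xy.shape[0]
    # Depth-augmented homogeneous source keypoints: (N, 4)
    A = np.hstack([src_xy, depth.reshape(n, 1), np.ones((n, 1))])
    # Closed-form least-squares solution of A @ m.T = tgt_xy
    m_t, *_ = np.linalg.lstsq(A, tgt_xy, rcond=None)
    return m_t.T  # (2, 4) affine matrix

# Toy usage: recover a known affine transform from synthetic keypoints
rng = np.random.default_rng(0)
src = rng.uniform(-1, 1, size=(68, 2))
z = rng.uniform(0.5, 1.5, size=68)
m_true = np.array([[0.9, -0.1, 0.2, 0.05],
                   [0.1, 0.8, -0.3, -0.02]])
tgt = (m_true @ np.hstack([src, z[:, None], np.ones((68, 1))]).T).T
m_est = fit_affine(src, z, tgt)
assert np.allclose(m_est, m_true, atol=1e-6)
```

Because the solution is a differentiable function of the depths, implementing the same solve in an autodiff framework would allow the depth predictor to be trained by backpropagating through it, which is the mechanism the paper describes.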
MOFA [20] builds a 3D face mesh from a single image, where the 3D face parameters, such as 3D shape and skin reflectance, are estimated by an encoder and then rendered back to the image by a decoder using a differentiable model. This model requires manual initialization to map the input image to the 3D mesh, since otherwise it performs an unconstrained optimization by adapting both the face pose and the skin reflectance. Our model, however, does not require any manual initialization.

5 Conclusion

We have proposed a novel approach to 3D face model creation which enables pose normalization without using any ground-truth depth data. We achieve our best quantitative keypoint registration results using our novel formulation for predicting depth and 3D visual geometry simultaneously, learned by backpropagating through the analytic solution of the visual geometry estimation problem expressed as a function of predicted depths. We have illustrated the quality and utility of the depths and 3D transformations obtained using our method by transforming source faces to a wide variety of target poses and geometries. Our technique can be used for face rotation and replacement, and when combined with adversarial repair it can blend warped faces and also synthesize the background. The proposed model, however, carries forward the emotion from source to target, due to learning a shared set of affine parameters for all keypoints. Moreover, for extreme non-frontal faces, while DepthNet can extract the transformation parameters (since it only relies on keypoints), OpenGL cannot extract texture due to occlusion. We show an example of how to address this in the supplementary material.
An interesting extension to this paper would be replacing the OpenGL pipeline with a generative adversarial framework that synthesizes a face using the parameters estimated by DepthNet.

6 Acknowledgments

We would like to thank Samsung and Google for partially funding this project. We are also thankful to Compute Canada and Calcul Quebec for providing computational resources, and to Poonam Goyal for helpful discussions.

References

[1] Blanz, Volker and Vetter, Thomas. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–194. ACM Press/Addison-Wesley Publishing Co., 1999.

[2] Eigen, David, Puhrsch, Christian, and Fergus, Rob. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS), pp. 2366–2374, 2014.

[3] Garg, Ravi, BG, Vijay Kumar, Carneiro, Gustavo, and Reid, Ian. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–756. Springer, 2016.

[4] Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256, 2010.

[5] Godard, Clément, Mac Aodha, Oisin, and Brostow, Gabriel J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[6] Gross, Ralph, Matthews, Iain, Cohn, Jeffrey, Kanade, Takeo, and Baker, Simon. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.

[7] Hassner, Tal. Viewing real-world faces in 3D. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp.
3607–3614, 2013.

[8] Hassner, Tal, Harel, Shai, Paz, Eran, and Enbar, Roee. Effective face frontalization in unconstrained images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4295–4304, 2015.

[9] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, 2015.

[10] Honari, Sina, Yosinski, Jason, Vincent, Pascal, and Pal, Christopher. Recombinator networks: Learning coarse-to-fine feature aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5743–5752, 2016.

[11] Honari, Sina, Molchanov, Pavlo, Tyree, Stephen, Vincent, Pascal, Pal, Christopher, and Kautz, Jan. Improving landmark localization with semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[12] Huang, Rui, Zhang, Shu, Li, Tianyu, and He, Ran. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017.

[13] Jackson, Aaron S, Bulat, Adrian, Argyriou, Vasileios, and Tzimiropoulos, Georgios. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1031–1039. IEEE, 2017.

[14] Jeni, László A, Cohn, Jeffrey F, and Kanade, Takeo. Dense 3D face alignment from 2D videos in real-time. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 1, pp. 1–8. IEEE, 2015.

[15] Liu, Fayao, Shen, Chunhua, and Lin, Guosheng.
Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5162–5170, 2015.

[16] Parkhi, Omkar M, Vedaldi, Andrea, and Zisserman, Andrew. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015.

[17] Sagonas, Christos, Tzimiropoulos, Georgios, Zafeiriou, Stefanos, and Pantic, Maja. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 397–403, 2013.

[18] Shen, Yujun, Luo, Ping, Yan, Junjie, Wang, Xiaogang, and Tang, Xiaoou. FaceID-GAN: Learning a symmetry three-player GAN for identity-preserving face synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 821–830, 2018.

[19] Taigman, Yaniv, Yang, Ming, Ranzato, Marc'Aurelio, and Wolf, Lior. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1701–1708, 2014.

[20] Tewari, Ayush, Zollhöfer, Michael, Kim, Hyeongwoo, Garrido, Pablo, Bernard, Florian, Perez, Patrick, and Theobalt, Christian. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), volume 2, 2017.

[21] Thewlis, James, Bilen, Hakan, and Vedaldi, Andrea. Unsupervised learning of object landmarks by factorized spatial embeddings. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), volume 1, pp. 5, 2017.

[22] Tran, Luan, Yin, Xi, and Liu, Xiaoming. Disentangled representation learning GAN for pose-invariant face recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[23] Tung, Hsiao-Yu, Tung, Hsiao-Wei, Yumer, Ersin, and Fragkiadaki, Katerina. Self-supervised learning of motion capture. In Advances in Neural Information Processing Systems (NIPS), pp. 5242–5252, 2017.

[24] Tung, Hsiao-Yu Fish, Harley, Adam W, Seto, William, and Fragkiadaki, Katerina. Adversarial inverse graphics networks: Learning 2D-to-3D lifting and image-to-image translation from unpaired supervision. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), volume 2, 2017.

[25] Yin, Xi, Yu, Xiang, Sohn, Kihyuk, Liu, Xiaoming, and Chandraker, Manmohan. Towards large-pose face frontalization in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1–10, 2017.

[26] Zhang, Xing, Yin, Lijun, Cohn, Jeffrey F, Canavan, Shaun, Reale, Michael, Horowitz, Andy, Liu, Peng, and Girard, Jeffrey M. BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.

[27] Zhao, Jian, Xiong, Lin, Jayashree, Panasonic Karlekar, Li, Jianshu, Zhao, Fang, Wang, Zhecan, Pranata, Panasonic Sugiri, Shen, Panasonic Shengmei, Yan, Shuicheng, and Feng, Jiashi. Dual-agent GANs for photorealistic and identity preserving profile face synthesis. In Advances in Neural Information Processing Systems (NIPS), pp. 66–76, 2017.

[28] Zhao, Jian, Cheng, Yu, Xu, Yan, Xiong, Lin, Li, Jianshu, Zhao, Fang, Jayashree, Karlekar, Pranata, Sugiri, Shen, Shengmei, Xing, Junliang, et al. Towards pose invariant face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2207–2216, 2018.

[29] Zhou, Tinghui, Brown, Matthew, Snavely, Noah, and Lowe, David G. Unsupervised learning of depth and ego-motion from video.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[30] Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, and Efros, Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.