{"title": "First Order Motion Model for Image Animation", "book": "Advances in Neural Information Processing Systems", "page_first": 7137, "page_last": 7147, "abstract": "Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories.", "full_text": "First Order Motion Model for Image Animation\n\nAliaksandr Siarohin\n\nDISI, University of Trento\n\naliaksandr.siarohin@unitn.it\n\nSt\u00e9phane Lathuili\u00e8re\n\nDISI, University of Trento\n\nLTCI, T\u00e9l\u00e9com Paris, Institut polytechnique de Paris\n\nstephane.lathuilire@telecom-paris.fr\n\nSergey Tulyakov\n\nSnap Inc.\n\nstulyakov@snap.com\n\nElisa Ricci\n\nDISI, University of Trento\nFondazione Bruno Kessler\n\ne.ricci@unitn.it\n\nNicu Sebe\n\nDISI, University of Trento\nHuawei Technologies Ireland\nniculae.sebe@unitn.it\n\nAbstract\n\nImage animation consists of generating a video sequence so that an object in a\nsource image is animated according to the motion of a driving video. Our frame-\nwork addresses this problem without using any annotation or prior information\nabout the speci\ufb01c object to animate. 
Once trained on a set of videos depicting\nobjects of the same category (e.g. faces, human bodies), our method can be applied\nto any object of this class. To achieve this, we decouple appearance and motion\ninformation using a self-supervised formulation. To support complex motions,\nwe use a representation consisting of a set of learned keypoints along with their\nlocal af\ufb01ne transformations. A generator network models occlusions arising during\ntarget motions and combines the appearance extracted from the source image and\nthe motion derived from the driving video. Our framework scores best on diverse\nbenchmarks and on a variety of object categories. Our source code is publicly\navailable1.\n\nIntroduction\n\n1\nGenerating videos by animating objects in still images has countless applications across areas of\ninterest including movie production, photography, and e-commerce. More precisely, image animation\nrefers to the task of automatically synthesizing videos by combining the appearance extracted from\na source image with motion patterns derived from a driving video. For instance, a face image of a\ncertain person can be animated following the facial expressions of another individual (see Fig. 1). In\nthe literature, most methods tackle this problem by assuming strong priors on the object representation\n(e.g. 3D model) [4] and resorting to computer graphics techniques [6, 33]. These approaches can\nbe referred to as object-speci\ufb01c methods, as they assume knowledge about the model of the speci\ufb01c\nobject to animate.\nRecently, deep generative models have emerged as effective techniques for image animation and\nvideo retargeting [2, 41, 3, 42, 27, 28, 37, 40, 31, 21]. In particular, Generative Adversarial Networks\n(GANs) [14] and Variational Auto-Encoders (VAEs) [20] have been used to transfer facial expres-\nsions [37] or motion patterns [3] between human subjects in videos. 
Nevertheless, these approaches usually rely on pre-trained models in order to extract object-specific representations such as keypoint locations. Unfortunately, these pre-trained models are built using costly ground-truth data annotations [2, 27, 31] and are not available in general for an arbitrary object category. To address this issue, Siarohin et al. [28] recently introduced Monkey-Net, the first object-agnostic deep model for image\n\n1https://github.com/AliaksandrSiarohin/first-order-model\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Example animations produced by our method trained on different datasets: VoxCeleb [22] (top left), Tai-Chi-HD (top right), Fashion-Videos [41] (bottom left) and MGif [28] (bottom right). We use relative motion transfer for VoxCeleb and Fashion-Videos and absolute transfer for MGif and Tai-Chi-HD (see Sec. 3.4). Check our project page for more qualitative results2.\n\nanimation. Monkey-Net encodes motion information via keypoints learned in a self-supervised fashion. At test time, the source image is animated according to the corresponding keypoint trajectories estimated in the driving video. The major weakness of Monkey-Net is that it poorly models object appearance transformations in the keypoint neighborhoods, assuming a zeroth order model (as we show in Sec. 3.1). This leads to poor generation quality in the case of large object pose changes (see Fig. 4). To tackle this issue, we propose to use a set of self-learned keypoints together with local affine transformations to model complex motions. We therefore call our method a first-order motion model. Second, we introduce an occlusion-aware generator, which adopts an automatically estimated occlusion mask to indicate object parts that are not visible in the source image and that should be inferred from the context. 
This is especially needed when the driving video contains large motion patterns and occlusions are typical. Third, we extend the equivariance loss commonly used for keypoint detector training [18, 44] to improve the estimation of local affine transformations. Fourth, we experimentally show that our method significantly outperforms state-of-the-art image animation methods and can handle high-resolution datasets where other approaches generally fail. Finally, we release a new high-resolution dataset, Tai-Chi-HD, which we believe could become a reference benchmark for evaluating frameworks for image animation and video generation.\n\n2 Related work\nVideo Generation. Earlier works on deep video generation discussed how spatio-temporal neural networks could render video frames from noise vectors [36, 26]. More recently, several approaches tackled the problem of conditional video generation. For instance, Wang et al. [38] combine a recurrent neural network with a VAE in order to generate face videos. Considering a wider range of applications, Tulyakov et al. [34] introduced MoCoGAN, a recurrent architecture adversarially trained in order to synthesize videos from noise, categorical labels or static images. Another typical case of conditional generation is the problem of future frame prediction, in which the generated video is conditioned on the initial frame [12, 23, 30, 35, 44]. Note that in this task, realistic predictions can be obtained by simply warping the initial video frame [1, 12, 35]. Our approach is closely related\n\n2https://aliaksandrsiarohin.github.io/first-order-model-website/\n\n\fFigure 2: Overview of our approach. Our method assumes a source image S and a frame D of a driving video as inputs. 
The unsupervised keypoint detector extracts a first order motion representation consisting of sparse keypoints and local affine transformations with respect to the reference frame R. The dense motion network uses the motion representation to generate a dense optical flow \u02c6TS\u2190D from D to S and an occlusion map \u02c6OS\u2190D. The source image and the outputs of the dense motion network are used by the generator to render the target image.\n\nto these previous works since we use a warping formulation to generate video sequences. However, in the case of image animation, the applied spatial deformations are not predicted but given by the driving video.\nImage Animation. Traditional approaches for image animation and video re-targeting [6, 33, 13] were designed for specific domains such as faces [45, 42], human silhouettes [8, 37, 27] or gestures [31] and required a strong prior on the animated object. For example, in face animation, the method of Zollhofer et al. [45] produced realistic results at the expense of relying on a 3D morphable model of the face. In many applications, however, such models are not available. Image animation can also be treated as a translation problem from one visual domain to another. For instance, Wang et al. [37] transferred human motion using the image-to-image translation framework of Isola et al. [16]. Similarly, Bansal et al. [3] extended conditional GANs by incorporating spatio-temporal cues in order to improve video translation between two given domains. In order to animate a single person, such approaches require hours of videos of that person labelled with semantic information, and therefore have to be retrained for each individual. In contrast to these works, we rely neither on labels, nor on prior information about the animated objects, nor on specific training procedures for each object instance. 
Furthermore, our approach can be applied to any object within the same category (e.g., faces, human bodies, robot arms, etc.).\nSeveral approaches have been proposed that do not require priors about the object. X2Face [40] uses a dense motion field in order to generate the output video via image warping. Similarly to us, they employ a reference pose that is used to obtain a canonical representation of the object. In our formulation, we do not require an explicit reference pose, leading to significantly simpler optimization and improved image quality. Siarohin et al. [28] introduced Monkey-Net, a self-supervised framework for animating arbitrary objects by using sparse keypoint trajectories. In this work, we also employ sparse trajectories induced by self-supervised keypoints. However, we model object motion in the neighbourhood of each predicted keypoint by a local affine transformation. Additionally, we explicitly model occlusions in order to indicate to the generator network the image regions that can be generated by warping the source image and the occluded areas that need to be inpainted.\n\n3 Method\nWe are interested in animating an object depicted in a source image S based on the motion of a similar object in a driving video D. Since direct supervision is not available (pairs of videos in which objects move similarly), we follow a self-supervised strategy inspired by Monkey-Net [28]. For training, we employ a large collection of video sequences containing objects of the same category. Our model is trained to reconstruct the training videos by combining a single frame and a learned latent representation of the motion in the video. Observing frame pairs, each extracted from the same video, it learns to encode motion as a combination of motion-specific keypoint displacements and local affine transformations. 
At test time, we apply our model to pairs composed of the source image and of each frame of the driving video, and perform image animation of the source object.\n\n\fAn overview of our approach is presented in Fig. 2. Our framework is composed of two main modules: the motion estimation module and the image generation module. The purpose of the motion estimation module is to predict a dense motion field from a frame D \u2208 R^{3\u00d7H\u00d7W} of dimension H \u00d7 W of the driving video D to the source frame S \u2208 R^{3\u00d7H\u00d7W}. The dense motion field is later used to align the feature maps computed from S with the object pose in D. The motion field is modeled by a function TS\u2190D : R^2 \u2192 R^2 that maps each pixel location in D to its corresponding location in S. TS\u2190D is often referred to as backward optical flow. We employ backward rather than forward optical flow since back-warping can be implemented efficiently in a differentiable manner using bilinear sampling [17]. We assume there exists an abstract reference frame R. We independently estimate two transformations: from R to S (TS\u2190R) and from R to D (TD\u2190R). Note that, unlike X2Face [40], the reference frame is an abstract concept that cancels out in our derivations later. Therefore, it is never explicitly computed and cannot be visualized. This choice allows us to independently process D and S. This is desired since, at test time, the model receives pairs of the source image and driving frames sampled from a different video, which can be very different visually.\nInstead of directly predicting TD\u2190R and TS\u2190R, the motion estimator module proceeds in two steps.\nIn the first step, we approximate both transformations from sets of sparse trajectories, obtained by using keypoints learned in a self-supervised way. The locations of the keypoints in D and S are separately predicted by an encoder-decoder network. 
The keypoint representation acts as a bottleneck resulting in a compact motion representation. As shown by Siarohin et al. [28], such a sparse motion representation is well-suited for animation since, at test time, the keypoints of the source image can be moved using the keypoint trajectories in the driving video. We model motion in the neighbourhood of each keypoint using local affine transformations. Compared to using keypoint displacements only, the local affine transformations allow us to model a larger family of transformations. We use a Taylor expansion to represent TD\u2190R by a set of keypoint locations and affine transformations. To this end, the keypoint detector network outputs keypoint locations as well as the parameters of each affine transformation.\nDuring the second step, a dense motion network combines the local approximations to obtain the resulting dense motion field \u02c6TS\u2190D. Furthermore, in addition to the dense motion field, this network outputs an occlusion mask \u02c6OS\u2190D that indicates which image parts of D can be reconstructed by warping the source image and which parts should be inpainted, i.e. inferred from the context.\nFinally, the generation module renders an image of the source object moving as provided in the driving video. Here, we use a generator network G that warps the source image according to \u02c6TS\u2190D and inpaints the image parts that are occluded in the source image. In the following sections, we detail each of these steps and the training procedure.\n\n3.1 Local Affine Transformations for Approximate Motion Description\nThe motion estimation module estimates the backward optical flow TS\u2190D from a driving frame D to the source frame S. As discussed above, we propose to approximate TS\u2190D by its first order Taylor expansion in a neighborhood of the keypoint locations. 
In the rest of this section, we describe the motivation behind this choice and detail the proposed approximation of TS\u2190D.\nWe assume there exists an abstract reference frame R. Therefore, estimating TS\u2190D consists in estimating TS\u2190R and TR\u2190D. Furthermore, given a frame X, we estimate each transformation TX\u2190R in the neighbourhood of the learned keypoints. Formally, given a transformation TX\u2190R, we consider its first order Taylor expansions in K keypoints p1, . . . , pK. Here, p1, . . . , pK denote the coordinates of the keypoints in the reference frame R. Note that, for the sake of simplicity, in the following the point locations in the reference pose space are all denoted by p, while the point locations in the X, S or D pose spaces are denoted by z. We obtain:\n\nTX\u2190R(p) = TX\u2190R(pk) + (d/dp TX\u2190R(p)|p=pk)(p \u2212 pk) + o(||p \u2212 pk||).   (1)\n\nIn this formulation, the motion function TX\u2190R is represented by its values in each keypoint pk and its Jacobians computed in each pk location:\n\nTX\u2190R(p) \u2243 {{TX\u2190R(p1), d/dp TX\u2190R(p)|p=p1}, . . . , {TX\u2190R(pK), d/dp TX\u2190R(p)|p=pK}}.   (2)\n\n\fFurthermore, in order to estimate TR\u2190X = TX\u2190R^{-1}, we assume that TX\u2190R is locally bijective in the neighbourhood of each keypoint. We need to estimate TS\u2190D near the keypoint zk in D, given that zk is the pixel location corresponding to the keypoint location pk in R. To do so, we first estimate the transformation TR\u2190D near the point zk in the driving frame D, i.e. pk = TR\u2190D(zk). Then we estimate the transformation TS\u2190R near pk in the reference R. 
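To make the first order representation of Eq. (1) concrete, here is a minimal NumPy sketch (illustrative only: the paper's networks regress keypoint values and Jacobians directly, whereas here the Jacobian is estimated by finite differences, and all function names are ours):

```python
import numpy as np

def first_order_warp(T, p_k, eps=1e-5):
    """Return (value, Jacobian) of a 2D warp T at keypoint p_k.

    T maps (2,) arrays to (2,) arrays; the 2x2 Jacobian is estimated with
    central finite differences for illustration.
    """
    p_k = np.asarray(p_k, dtype=float)
    value = T(p_k)
    J = np.zeros((2, 2))
    for j in range(2):
        d = np.zeros(2)
        d[j] = eps
        J[:, j] = (T(p_k + d) - T(p_k - d)) / (2 * eps)
    return value, J

def approx(T, p_k, p):
    """First order Taylor model of Eq. (1): T(p) ~ T(p_k) + J_k (p - p_k)."""
    v, J = first_order_warp(T, p_k)
    return v + J @ (np.asarray(p, float) - np.asarray(p_k, float))
```

For an affine warp the first order model is exact, which is why local affine transformations capture exactly the first order behaviour of any smooth warp.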
Finally, TS\u2190D is obtained as follows:\n\nTS\u2190D = TS\u2190R \u25e6 TR\u2190D = TS\u2190R \u25e6 TD\u2190R^{-1}.   (3)\n\nAfter computing again the first order Taylor expansion of Eq. (3) (see Sup. Mat.), we obtain:\n\nTS\u2190D(z) \u2248 TS\u2190R(pk) + Jk(z \u2212 TD\u2190R(pk)),   (4)\n\nwith:\n\nJk = (d/dp TS\u2190R(p)|p=pk)(d/dp TD\u2190R(p)|p=pk)^{-1}.   (5)\n\nIn practice, TS\u2190R(pk) and TD\u2190R(pk) in Eq. (4) are predicted by the keypoint predictor. More precisely, we employ a standard U-Net architecture that estimates K heatmaps, one for each keypoint. The last layer of the decoder uses softmax activations in order to predict heatmaps that can be interpreted as keypoint detection confidence maps. Each expected keypoint location is estimated using the average operation as in [28, 24]. Note that if we set Jk = 1 (where 1 is the 2 \u00d7 2 identity matrix), we recover the motion model of Monkey-Net. Therefore, Monkey-Net uses a zeroth-order approximation of TS\u2190D(z) \u2212 z.\nFor both frames S and D, the keypoint predictor network also outputs four additional channels for each keypoint. From these channels, we obtain the coefficients of the matrices d/dp TS\u2190R(p)|p=pk and d/dp TD\u2190R(p)|p=pk in Eq. (5) by computing a spatial weighted average, using as weights the corresponding keypoint confidence map.\nCombining Local Motions. We employ a convolutional network P to estimate \u02c6TS\u2190D from the set of Taylor approximations of TS\u2190D(z) in the keypoints and the original source frame S. Importantly, since \u02c6TS\u2190D maps each pixel location in D to its corresponding location in S, the local patterns in \u02c6TS\u2190D, such as edges or texture, are pixel-to-pixel aligned with D but not with S. 
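The composition in Eqs. (4)-(5) reduces to 2 \u00d7 2 matrix algebra per keypoint. A minimal sketch (illustrative, with our own naming), which is exact whenever both transformations are affine:

```python
import numpy as np

def compose_first_order(kp_s, jac_s, kp_d, jac_d, z):
    """Evaluate Eq. (4): T_{S<-D}(z) ~ T_{S<-R}(p_k) + J_k (z - T_{D<-R}(p_k)).

    kp_s, kp_d: keypoint locations T_{S<-R}(p_k) and T_{D<-R}(p_k), shape (2,)
    jac_s, jac_d: 2x2 Jacobians of the two transformations at p_k
    J_k is formed as in Eq. (5): jac_s times the inverse of jac_d.
    """
    J_k = jac_s @ np.linalg.inv(jac_d)
    return kp_s + J_k @ (np.asarray(z, float) - kp_d)
```

Because the reference frame R only enters through the values and Jacobians at the keypoints, it indeed cancels out of this computation, as claimed above.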
This misalignment issue makes it harder for the network to predict \u02c6TS\u2190D from S. In order to provide inputs already roughly aligned with \u02c6TS\u2190D, we warp the source frame S according to the local transformations estimated in Eq. (4). Thus, we obtain K transformed images S1, . . . , SK that are each aligned with \u02c6TS\u2190D in the neighbourhood of a keypoint. Importantly, we also consider an additional image S0 = S for the background.\n\nFor each keypoint pk we additionally compute heatmaps Hk indicating to the dense motion network where each transformation happens. Each Hk(z) is implemented as the difference of two heatmaps centered in TD\u2190R(pk) and TS\u2190R(pk):\n\nHk(z) = exp(\u2212(TD\u2190R(pk) \u2212 z)^2/\u03c3) \u2212 exp(\u2212(TS\u2190R(pk) \u2212 z)^2/\u03c3).   (6)\n\nIn all our experiments, we employ \u03c3 = 0.01 following Jakab et al. [18].\n\nThe heatmaps Hk and the transformed images S0, . . . , SK are concatenated and processed by a U-Net [25]. \u02c6TS\u2190D is estimated using a part-based model inspired by Monkey-Net [28]. We assume that an object is composed of K rigid parts and that each part is moved according to Eq. (4). Therefore, we estimate K+1 masks Mk, k = 0, . . . , K that indicate where each local transformation holds. The final dense motion prediction \u02c6TS\u2190D(z) is given by:\n\n\u02c6TS\u2190D(z) = M0 z + \u03a3_{k=1}^{K} Mk (TS\u2190R(pk) + Jk(z \u2212 TD\u2190R(pk))).   (7)\n\nNote that the term M0 z is considered in order to model non-moving parts such as the background.\n\n\f3.2 Occlusion-aware Image Generation\n\nAs mentioned in Sec. 3, the source image S is not pixel-to-pixel aligned with the image to be generated \u02c6D. 
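The part-based combination of Eq. (7) can be sketched as follows (illustrative NumPy with our own naming; in the model, the masks Mk are predicted by the dense motion network rather than given):

```python
import numpy as np

def combine_local_motions(grid, masks, kp_s, kp_d, jacs):
    """Eq. (7): dense flow as a mask-weighted sum of local affine motions.

    grid:  (H, W, 2) pixel coordinates z
    masks: (K+1, H, W) soft assignment maps (masks[0] is the background M_0)
    kp_s, kp_d: (K, 2) keypoint locations T_{S<-R}(p_k) and T_{D<-R}(p_k)
    jacs:  (K, 2, 2) matrices J_k from Eq. (5)
    """
    flow = masks[0][..., None] * grid                   # background: identity map
    for k in range(len(kp_s)):
        local = kp_s[k] + (grid - kp_d[k]) @ jacs[k].T  # Eq. (4) at every pixel
        flow += masks[k + 1][..., None] * local
    return flow
```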
In order to handle this misalignment, we use a feature warping strategy similar to [29, 28, 15]. More precisely, after two down-sampling convolutional blocks, we obtain a feature map \u03be \u2208 R^{H\u2032\u00d7W\u2032} of dimension H\u2032 \u00d7 W\u2032. We then warp \u03be according to \u02c6TS\u2190D. In the presence of occlusions in S, optical flow may not be sufficient to generate \u02c6D. Indeed, the occluded parts in S cannot be recovered by image-warping and thus should be inpainted. Consequently, we introduce an occlusion map \u02c6OS\u2190D \u2208 [0, 1]^{H\u2032\u00d7W\u2032} to mask out the feature map regions that should be inpainted. Thus, the occlusion mask diminishes the impact of the features corresponding to the occluded parts. The transformed feature map is written as:\n\n\u03be\u2032 = \u02c6OS\u2190D \u2299 fw(\u03be, \u02c6TS\u2190D),   (8)\n\nwhere fw(\u00b7,\u00b7) denotes the back-warping operation and \u2299 denotes the Hadamard product. We estimate the occlusion mask from our sparse keypoint representation by adding a channel to the final layer of the dense motion network. Finally, the transformed feature map \u03be\u2032 is fed to the subsequent network layers of the generation module (see Sup. Mat.) to render the sought image.\n\n3.3 Training Losses\n\nWe train our system in an end-to-end fashion combining several losses. First, we use a reconstruction loss based on the perceptual loss of Johnson et al. [19], using the pre-trained VGG-19 network, as our main driving loss. The loss is based on the implementation of Wang et al. [37]. 
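The occlusion-aware warping of Eq. (8) can be sketched as follows (illustrative, our naming; we gather with nearest-neighbour indexing for brevity, whereas the model uses differentiable bilinear sampling [17]):

```python
import numpy as np

def warp_with_occlusion(feat, flow, occlusion):
    """Eq. (8): xi' = O_{S<-D} (Hadamard) f_w(xi, T_{S<-D}).

    feat:      (H, W, C) source feature map xi
    flow:      (H, W, 2) backward map: each output pixel -> (x, y) source location
    occlusion: (H, W) mask in [0, 1]; low values mark regions to be inpainted
    """
    H, W, _ = feat.shape
    x = np.clip(np.round(flow[..., 0]).astype(int), 0, W - 1)
    y = np.clip(np.round(flow[..., 1]).astype(int), 0, H - 1)
    warped = feat[y, x]                    # back-warp: gather from the source
    return occlusion[..., None] * warped   # Hadamard product with the mask
```

With an identity flow and an all-ones mask this reduces to a copy of the feature map; an all-zeros mask suppresses the warped features entirely, leaving them to be inpainted by the generator.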
With the input driving frame D and the corresponding reconstructed frame \u02c6D, the reconstruction loss is written as:\n\nLrec(\u02c6D, D) = \u03a3_{i=1}^{I} |Ni(\u02c6D) \u2212 Ni(D)|,   (9)\n\nwhere Ni(\u00b7) is the ith channel feature extracted from a specific VGG-19 layer and I is the number of feature channels in this layer. Additionally, we propose to use this loss at a number of resolutions, forming a pyramid obtained by down-sampling \u02c6D and D, similarly to MS-SSIM [39, 32]. The resolutions are 256 \u00d7 256, 128 \u00d7 128, 64 \u00d7 64 and 32 \u00d7 32. There are 20 loss terms in total.\nImposing Equivariance Constraint. Our keypoint predictor does not require any keypoint annotations during training. This may lead to unstable performance. The equivariance constraint is one of the most important factors driving the discovery of unsupervised keypoints [18, 43]. It forces the model to predict consistent keypoints with respect to known geometric transformations. We use thin plate spline deformations since they were previously used in unsupervised keypoint detection [18, 43] and are similar to natural image deformations. Since our motion estimator predicts not only the keypoints but also the Jacobians, we extend the well-known equivariance loss to additionally include constraints on the Jacobians.\nWe assume that an image X undergoes a known spatial deformation TX\u2190Y. In this case TX\u2190Y can be an affine transformation or a thin plate spline deformation. After this deformation we obtain a new image Y. Now, by applying our extended motion estimator to both images, we obtain a set of local approximations for TX\u2190R and TY\u2190R. The standard equivariance constraint writes as:\n\nTX\u2190R \u2261 TX\u2190Y \u25e6 TY\u2190R.   (10)\n\nAfter computing the first order Taylor expansions of both sides, we obtain the following constraints (see derivation details in Sup. Mat.):\n\nTX\u2190R(pk) \u2261 TX\u2190Y \u25e6 TY\u2190R(pk),   (11)\n\n(d/dp TX\u2190R(p)|p=pk) \u2261 (d/dp TX\u2190Y(p)|p=TY\u2190R(pk))(d/dp TY\u2190R(p)|p=pk).   (12)\n\nNote that the constraint Eq. (11) is strictly the same as the standard equivariance constraint for the keypoints [18, 43]. During training, we constrain every keypoint location using a simple L1 loss between the two sides of Eq. (11). However, implementing the second constraint from Eq. (12) with an L1 loss would force the magnitude of the Jacobians to zero and would lead to numerical problems. To this end, we reformulate this constraint in the following way:\n\n1 \u2261 (d/dp TX\u2190R(p)|p=pk)^{-1}(d/dp TX\u2190Y(p)|p=TY\u2190R(pk))(d/dp TY\u2190R(p)|p=pk),   (13)\n\nwhere 1 is the 2 \u00d7 2 identity matrix. Then, an L1 loss is employed similarly to the keypoint location constraint. Finally, in our preliminary experiments, we observed that our model shows low sensitivity to the relative weights of the reconstruction and the two equivariance losses. Therefore, we use equal loss weights in all our experiments.\n\n3.4 Testing Stage: Relative Motion Transfer\n\nAt this stage, our goal is to animate an object in a source frame S1 using the driving video D1, . . . , DT . Each frame Dt is independently processed to obtain St. Rather than transferring the motion encoded in TS1\u2190Dt(pk) to S1, we transfer the relative motion between D1 and Dt to S1. 
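The reformulated Jacobian constraint of Eq. (13), introduced above, translates directly into a loss term; a minimal sketch (illustrative, our naming):

```python
import numpy as np

def jacobian_equivariance_loss(jac_x, jac_xy, jac_y):
    """L1 penalty on Eq. (13): 1 = inv(J_{X<-R}) J_{X<-Y} J_{Y<-R}.

    jac_x:  (2, 2) predicted Jacobian d/dp T_{X<-R} at p_k
    jac_xy: (2, 2) Jacobian of the known deformation T_{X<-Y} at T_{Y<-R}(p_k)
    jac_y:  (2, 2) predicted Jacobian d/dp T_{Y<-R} at p_k
    Penalizing the deviation of the product from the identity, rather than the
    difference of Eq. (12) directly, does not shrink the Jacobians toward zero.
    """
    product = np.linalg.inv(jac_x) @ jac_xy @ jac_y
    return np.abs(np.eye(2) - product).sum()
```

Predictions that satisfy Eq. (12) exactly yield a zero loss; any mismatch between the two sides produces a positive penalty.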
In other words, we apply the transformation TDt\u2190D1(p) to the neighbourhood of each keypoint pk:\n\nTS1\u2190St(z) \u2248 TS1\u2190R(pk) + Jk(z \u2212 TS1\u2190R(pk) + TD1\u2190R(pk) \u2212 TDt\u2190R(pk)),   (14)\n\nwith\n\nJk = (d/dp TD1\u2190R(p)|p=pk)(d/dp TDt\u2190R(p)|p=pk)^{-1}.   (15)\n\nDetailed mathematical derivations are provided in Sup. Mat. Intuitively, we transform the neighbourhood of each keypoint pk in S1 according to its local deformation in the driving video. Indeed, transferring relative motion rather than absolute coordinates allows us to transfer only the relevant motion patterns, while preserving global object geometry. Conversely, when transferring absolute coordinates, as in X2Face [40], the generated frame inherits the object proportions of the driving video. It is important to note that one limitation of transferring relative motion is that we need to assume that the objects in S1 and D1 have similar poses (see [28]). Without initial rough alignment, Eq. (14) may lead to absolute keypoint locations that are physically impossible for the object of interest.\n\n4 Experiments\nDatasets. We train and test our method on four different datasets containing various objects. Our model is capable of rendering videos of much higher resolution than [28] in all our experiments.\n\u2022 The VoxCeleb dataset [22] is a face dataset of 22496 videos, extracted from YouTube videos. For pre-processing, we extract an initial bounding box in the first video frame. We track this face until it is too far away from the initial position. Then, we crop the video frames using the smallest crop containing all the bounding boxes. The process is repeated until the end of the sequence. 
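This cropping step can be sketched as follows (illustrative, our naming; boxes are (x0, y0, x1, y1) tuples for the tracked face in consecutive frames):

```python
def union_crop(boxes):
    """Smallest axis-aligned crop containing all tracked bounding boxes.

    boxes: list of (x0, y0, x1, y1) face boxes from consecutive frames.
    Returns one (x0, y0, x1, y1) crop applied to every frame of the chunk,
    so the face can move freely within a fixed crop window.
    """
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)
```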
We filter out sequences that have a resolution lower than 256 \u00d7 256; the remaining videos are resized to 256 \u00d7 256 preserving the aspect ratio. It is important to note that, compared to X2Face [40], we obtain more natural videos where faces move freely within the bounding box. Overall, we obtain 12331 training videos and 444 test videos, with lengths varying from 64 to 1024 frames.\n\u2022 The UvA-Nemo dataset [9] is a facial analysis dataset that consists of 1240 videos. We apply the exact same pre-processing as for VoxCeleb. Each video starts with a neutral expression. Similar to Wang et al. [38], we use 1116 videos for training and 124 for evaluation.\n\u2022 The BAIR robot pushing dataset [10] contains videos collected by a Sawyer robotic arm pushing diverse objects over a table. It consists of 42880 training and 128 test videos. Each video is 30 frames long and has a 256 \u00d7 256 resolution.\n\u2022 Following Tulyakov et al. [34], we collected 280 tai-chi videos from YouTube. We use 252 videos for training and 28 for testing. Each video is split into short clips as described in the pre-processing of the VoxCeleb dataset. We retain only high quality videos and resized all the clips to 256 \u00d7 256 pixels (instead of 64 \u00d7 64 pixels in [34]). Finally, we obtain 3049 and 285 video chunks for training and testing respectively, with video lengths varying from 128 to 1024 frames. This dataset is referred to as the Tai-Chi-HD dataset. The dataset will be made publicly available.\nEvaluation Protocol. Evaluating the quality of image animation is not obvious, since ground truth animations are not available. We follow the evaluation protocol of Monkey-Net [28]. First, we\n\n\fTable 1: Quantitative ablation study for video reconstruction on Tai-Chi-HD.\n\n                      L1     (AKD, MKR)      AED\nBaseline            0.073  (8.945, 0.099)  0.235\nPyr.                0.069  (9.407, 0.065)  0.213\nPyr.+OS\u2190D           0.069  (8.773, 0.050)  0.205\nJac. w/o Eq. (12)   0.073  (9.887, 0.052)  0.220\nFull                0.063  (6.862, 0.036)  0.179\n\nTable 2: Paired user study: user preferences in favour of our approach.\n\n             X2Face [40]  Monkey-Net [28]\nTai-Chi-HD      92.0%          80.6%\nVoxCeleb        95.8%          68.4%\nNemo            79.8%          60.6%\nBair            95.0%          67.0%\n\nFigure 3: Qualitative ablation on Tai-Chi-HD (columns: Input D, Baseline, Pyr., Pyr.+OS\u2190D, Jac. w/o Eq. (12), Full).\n\nquantitatively evaluate each method on the "proxy" task of video reconstruction. This task consists of reconstructing the input video from a representation in which appearance and motion are decoupled. In our case, we reconstruct the input video by combining the sparse motion representation in (2) of each frame and the first video frame. Second, we evaluate our model on image animation according to a user study. In all experiments we use K=10 as in [28]. Other implementation details are given in Sup. Mat.\nMetrics. To evaluate video reconstruction, we adopt the metrics proposed in Monkey-Net [28]:\n\u2022 L1. We report the average L1 distance between the generated and the ground-truth videos.\n\u2022 Average Keypoint Distance (AKD). For the Tai-Chi-HD, VoxCeleb and Nemo datasets, we use third-party pre-trained keypoint detectors in order to evaluate whether the motion of the input video is preserved. For the VoxCeleb and Nemo datasets, we use the facial landmark detector of Bulat et al. [5]. For the Tai-Chi-HD dataset, we employ the human-pose estimator of Cao et al. [7]. These keypoints are independently computed for each frame. AKD is obtained by computing the average distance between the detected keypoints of the ground truth and of the generated video.\n\u2022 Missing Keypoint Rate (MKR). 
In the case of Tai-Chi-HD, the human-pose estimator returns an additional binary label for each keypoint, indicating whether or not the keypoint was successfully detected. We therefore also report the MKR, defined as the percentage of keypoints that are detected in the ground-truth frame but not in the generated one. This metric assesses the appearance quality of each generated frame.
• Average Euclidean Distance (AED). Considering an externally trained image representation, we report the average Euclidean distance between the representations of the ground-truth and generated frames, similarly to Esser et al. [11]. We employ the feature embedding used in Monkey-Net [28].
Ablation Study. We compare the following variants of our model. Baseline: the simplest model, trained without the occlusion mask (OS←D = 1 in Eq. (8)) and without Jacobians (Jk = 1 in Eq. (4)), and supervised with Lrec at the highest resolution only; Pyr.: the pyramid loss is added to Baseline; Pyr.+OS←D: with respect to Pyr., we replace the generator network with the occlusion-aware network; Jac. w/o Eq. (12): our model with local affine transformations but without the equivariance constraint on the Jacobians in Eq. (12); Full: the full model, including the local affine transformations described in Sec. 3.1.
In Tab. 1, we report the quantitative ablation. First, the pyramid loss leads to better results according to all the metrics except AKD. Second, adding OS←D to the model consistently improves all the metrics with respect to Pyr. This illustrates the benefit of explicitly modeling occlusions. We found that, without the equivariance constraint on the Jacobians, Jk becomes unstable, which leads to poor motion estimation. Finally, our Full model further improves all the metrics.
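For concreteness, the four reconstruction metrics defined above can be sketched in a few lines. This is an illustrative reading only (the array shapes and function names are ours, not the authors'), and it assumes the frames, keypoints and feature embeddings have already been extracted by the third-party detectors mentioned above:

```python
import numpy as np

def l1_metric(gen, gt):
    # Mean absolute pixel difference between generated and ground-truth
    # videos; arrays of shape [T, H, W, C] with values in [0, 1].
    return float(np.abs(gen - gt).mean())

def akd(kp_gen, kp_gt):
    # Average Keypoint Distance: mean Euclidean distance between the
    # keypoints detected in corresponding frames; arrays [T, K, 2].
    return float(np.linalg.norm(kp_gen - kp_gt, axis=-1).mean())

def mkr(det_gen, det_gt):
    # Missing Keypoint Rate: share of keypoints detected in the ground
    # truth but missed in the generated video; boolean arrays [T, K].
    missed = det_gt & ~det_gen
    return float(missed.sum()) / max(int(det_gt.sum()), 1)

def aed(feat_gen, feat_gt):
    # Average Euclidean Distance between externally computed frame
    # embeddings; arrays [T, D].
    return float(np.linalg.norm(feat_gen - feat_gt, axis=-1).mean())
```

Each metric is averaged over all frames of all test videos; lower is better in every case.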
In particular, we note that, with respect to the Baseline model, the MKR of the Full model is smaller by a factor of 2.75. This shows that our rich motion representation helps generate more realistic images. These results are confirmed by our qualitative evaluation in Fig. 3, where we compare the Baseline and the Full models. In these experiments, each frame D of the input video is reconstructed from its first frame (first column) and the estimated keypoint trajectories. We note that the Baseline model does not locate any keypoints in the arms area. Consequently, when the pose difference with the initial pose increases, the model cannot reconstruct the video (columns 3, 4 and 5). In contrast, the Full model learns to detect a keypoint on each arm, and therefore to more accurately reconstruct the input video even in the case of complex motion.

Table 3: Video reconstruction: comparison with the state of the art on four different datasets.

                  Tai-Chi-HD                          VoxCeleb               Nemo                   Bair
                  L1     (AKD, MKR)       AED        L1     AKD    AED     L1     AKD    AED     L1
X2Face [40]       0.080  (17.654, 0.109)  0.272      0.078  7.687  0.405   0.031  3.539  0.221   0.065
Monkey-Net [28]   0.077  (10.798, 0.059)  0.228      0.049  1.878  0.199   0.018  1.285  0.077   0.034
Ours              0.063  (6.862, 0.036)   0.179      0.043  1.294  0.140   0.016  1.119  0.048   0.027

Figure 4: Qualitative comparison with the state of the art (rows: X2Face [40], Monkey-Net [28], ours) for the task of image animation on two sequences and two source images from the Tai-Chi-HD dataset.

Comparison with State of the Art. We now compare our method with the state of the art on the video reconstruction task, as in [28]. To the best of our knowledge, X2Face [40] and Monkey-Net [28] are the only previous approaches for model-free image animation. Quantitative results are reported in Tab. 3. We observe that our approach consistently improves every single metric on each of the four datasets.
Even on the two face datasets, VoxCeleb and Nemo, our approach clearly outperforms X2Face, which was originally proposed for face generation. The better performance of our approach compared to X2Face is especially impressive considering that X2Face exploits a larger motion embedding (128 floats) than our approach (60 = K×(2+4) floats). Compared to Monkey-Net, which uses a motion representation of similar dimension (50 = K×(2+3)), the advantages of our approach are clearly visible on the Tai-Chi-HD dataset, which contains highly non-rigid objects (i.e. human bodies).
We now report a qualitative comparison for image animation. Generated sequences are shown in Fig. 4. The results are well in line with the quantitative evaluation in Tab. 3. Indeed, in both examples, X2Face and Monkey-Net fail to correctly transfer the body motion of the driving video and instead warp the human body in the source image as a blob. Conversely, our approach generates significantly better-looking videos in which each body part is animated independently. This qualitative evaluation illustrates the potential of our rich motion description. We complete our evaluation with a user study. We ask users to select the most realistic image animation. Each question consists of the source image, the driving video, and the corresponding results of our method and of a competing method. We require each question to be answered by 10 AMT workers, and this evaluation is repeated on 50 different input pairs. Results are reported in Tab. 2. We observe that our method is clearly preferred over the competing methods. Interestingly, the largest difference with the state of the art is obtained on Tai-Chi-HD, the most challenging dataset in our evaluation due to its rich motions.

5 Conclusions
We presented a novel approach for image animation based on keypoints and local affine transformations.
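For reference, the first-order approximation at the core of this representation can be restated compactly. In our paraphrase below (notation follows Sec. 3.1: an abstract reference frame R, learned keypoints p_k, and backward mappings T_{X←R}), the motion from a point z in the driving frame D to the source frame S near keypoint p_k is:

```latex
\mathcal{T}_{S \leftarrow D}(z) \approx \mathcal{T}_{S \leftarrow R}(p_k)
    + J_k \bigl( z - \mathcal{T}_{D \leftarrow R}(p_k) \bigr),
\qquad
J_k = \left( \frac{d}{dp} \mathcal{T}_{S \leftarrow R}(p) \Big|_{p = p_k} \right)
      \left( \frac{d}{dp} \mathcal{T}_{D \leftarrow R}(p) \Big|_{p = p_k} \right)^{-1}.
```

In other words, each keypoint contributes a displacement (the zeroth-order term) plus a 2×2 Jacobian J_k capturing the local affine deformation, which is where the K×(2+4) floats of the motion embedding come from.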
Our novel mathematical formulation describes the motion field between two frames and is efficiently computed via a first-order Taylor expansion. In this way, motion is described as a set of keypoint displacements and local affine transformations. A generator network combines the appearance of the source image with the motion representation of the driving video. In addition, we proposed to explicitly model occlusions in order to indicate to the generator network which image parts should be inpainted. We evaluated the proposed method both quantitatively and qualitatively and showed that our approach clearly outperforms the state of the art on all benchmarks.

References
[1] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. In ICLR, 2017.
[2] Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In CVPR, 2018.
[3] Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. Recycle-gan: Unsupervised video retargeting. In ECCV, 2018.
[4] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In SIGGRAPH, 1999.
[5] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV, 2017.
[6] Chen Cao, Qiming Hou, and Kun Zhou. Displaced dynamic expression regression for real-time facial tracking and animation. TOG, 2014.
[7] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
[8] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In ECCV, 2018.
[9] Hamdi Dibeklioğlu, Albert Ali Salah, and Theo Gevers. Are you really smiling at me?
Spontaneous versus posed enjoyment smiles. In ECCV, 2012.
[10] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In CoRL, 2017.
[11] Patrick Esser, Ekaterina Sutter, and Björn Ommer. A variational u-net for conditional appearance and shape generation. In CVPR, 2018.
[12] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
[13] Zhenglin Geng, Chen Cao, and Sergey Tulyakov. 3d guided fine-grained face manipulation. In CVPR, 2019.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[15] Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, and Victor Lempitsky. Coordinate-based texture inpainting for pose-guided image generation. In CVPR, 2019.
[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[17] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[18] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks through conditional image generation. In NIPS, 2018.
[19] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[20] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[21] Yahui Liu, Marco De Nadai, Gloria Zen, Nicu Sebe, and Bruno Lepri. Gesture-to-gesture translation in the wild via category-independent conditional maps. ACM MM, 2019.
[22] A. Nagrani, J. S. Chung, and A. Zisserman. Voxceleb: a large-scale speaker identification dataset.
In INTERSPEECH, 2017.
[23] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, 2015.
[24] Joseph P Robinson, Yuncheng Li, Ning Zhang, Yun Fu, and Sergey Tulyakov. Laplace landmark localization. In ICCV, 2019.
[25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[26] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017.
[27] Aliaksandra Shysheya, Egor Zakharov, Kara-Ali Aliev, Renat Bashirov, Egor Burkov, Karim Iskakov, Aleksei Ivakhnenko, Yury Malkov, Igor Pasechnik, Dmitry Ulyanov, Alexander Vakhitov, and Victor Lempitsky. Textured neural avatars. In CVPR, June 2019.
[28] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In CVPR, 2019.
[29] Aliaksandr Siarohin, Enver Sangineto, Stéphane Lathuilière, and Nicu Sebe. Deformable gans for pose-based human image generation. In CVPR, 2018.
[30] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In ICML, 2015.
[31] Hao Tang, Wei Wang, Dan Xu, Yan Yan, and Nicu Sebe. Gesturegan for hand gesture-to-gesture translation in the wild. In ACM MM, 2018.
[32] Hao Tang, Dan Xu, Wei Wang, Yan Yan, and Nicu Sebe. Dual generator generative adversarial networks for multi-domain image-to-image translation. In ACCV, 2018.
[33] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In CVPR, 2016.
[34] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz.
Mocogan: Decomposing motion and\n\ncontent for video generation. In CVPR, 2018.\n\n[35] Joost Van Amersfoort, Anitha Kannan, Marc\u2019Aurelio Ranzato, Arthur Szlam, Du Tran, and Soumith\n\nChintala. Transformation-based models of video sequences. arXiv preprint arXiv:1701.08435, 2017.\n\n[36] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS,\n\n2016.\n\n[37] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro.\n\nVideo-to-video synthesis. In NIPS, 2018.\n\n[38] Wei Wang, Xavier Alameda-Pineda, Dan Xu, Pascal Fua, Elisa Ricci, and Nicu Sebe. Every smile is\n\nunique: Landmark-guided diverse smile generation. In CVPR, 2018.\n\n[39] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality\n\nassessment. In ACSSC, 2003.\n\n[40] Olivia Wiles, A Sophia Koepke, and Andrew Zisserman. X2face: A network for controlling face generation\n\nusing images, audio, and pose codes. In ECCV, 2018.\n\n[41] Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal. Dwnet: Dense warp-based network\n\nfor pose-guided human video generation. BMVC, 2019.\n\n[42] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning\n\nof realistic neural talking head models. In ICCV, 2019.\n\n[43] Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. Unsupervised discovery of\n\nobject landmarks as structural representations. In CVPR, 2018.\n\n[44] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris Metaxas. Learning to forecast and re\ufb01ne\n\nresidual motion for image-to-video generation. In ECCV, 2018.\n\n[45] Michael Zollh\u00f6fer, Justus Thies, Pablo Garrido, Derek Bradley, Thabo Beeler, Patrick P\u00e9rez, Marc\nStamminger, Matthias Nie\u00dfner, and Christian Theobalt. State of the art on monocular 3d face reconstruction,\ntracking, and applications. 
In Computer Graphics Forum, 2018.