{"title": "Unsupervised learning of object frames by dense equivariant image labelling", "book": "Advances in Neural Information Processing Systems", "page_first": 844, "page_last": 855, "abstract": "One of the key challenges of visual perception is to extract abstract models of 3D objects and object categories from visual measurements, which are affected by complex nuisance factors such as viewpoint, occlusion, motion, and deformations. Starting from the recent idea of viewpoint factorization, we propose a new approach that, given a large number of images of an object and no other supervision, can extract a dense object-centric coordinate frame. This coordinate frame is invariant to deformations of the images and comes with a dense equivariant labelling neural network that can map image pixels to their corresponding object coordinates. We demonstrate the applicability of this method to simple articulated objects and deformable objects such as human faces, learning embeddings from random synthetic transformations or optical flow correspondences, all without any manual supervision.", "full_text": "Unsupervised learning of object frames by dense\n\nequivariant image labelling\n\nJames Thewlis1\n\nHakan Bilen2\n\nAndrea Vedaldi1\n\n1 Visual Geometry Group\n\nUniversity of Oxford\n\n{jdt,vedaldi}@robots.ox.ac.uk\n\n2 School of Informatics\nUniversity of Edinburgh\n\nhbilen@ed.ac.uk\n\nAbstract\n\nOne of the key challenges of visual perception is to extract abstract models of 3D\nobjects and object categories from visual measurements, which are affected by\ncomplex nuisance factors such as viewpoint, occlusion, motion, and deformations.\nStarting from the recent idea of viewpoint factorization, we propose a new approach\nthat, given a large number of images of an object and no other supervision, can\nextract a dense object-centric coordinate frame. 
This coordinate frame is invariant\nto deformations of the images and comes with a dense equivariant labelling neural\nnetwork that can map image pixels to their corresponding object coordinates.\nWe demonstrate the applicability of this method to simple articulated objects\nand deformable objects such as human faces, learning embeddings from random\nsynthetic transformations or optical flow correspondences, all without any manual\nsupervision.\n\n1\n\nIntroduction\n\nHumans can easily construct mental models of complex 3D objects and object categories from\nvisual observations. This is remarkable because the dependency between an object\u2019s appearance\nand its structure is tangled in a complex manner with extrinsic nuisance factors such as viewpoint,\nillumination, and articulation. Therefore, learning the intrinsic structure of an object from images\nrequires removing these unwanted factors of variation from the data.\nThe recent work of [39] has proposed an unsupervised approach to do so, based on the concept\nof viewpoint factorization. The idea is to learn a deep Convolutional Neural Network (CNN) that\ncan, given an image of the object, detect a discrete set of object landmarks. Differently from\ntraditional approaches to landmark detection, however, landmarks are neither defined nor supervised\nmanually. Instead, the detectors are learned using only the requirement that the detected points must\nbe equivariant (consistent) with deformations of the input images. The authors of [39] show that this\nconstraint is sufficient to learn landmarks that are \u201cintrinsic\u201d to the objects and hence capture their\nstructure; remarkably, due to the generalization ability of CNNs, the landmark points are detected\nconsistently not only across deformations of a given object instance, which are observed during\ntraining, but also across different instances. 
This behaviour emerges automatically from training on\nthousands of single-instance correspondences.\nIn this paper, we take this idea further, moving beyond a sparse set of landmarks to a dense model\nof the object structure (section 3). Our method relates each point on an object to a point in a low\ndimensional vector space in a way that is consistent across variation in motion and in instance identity.\nThis gives rise to an object-centric coordinate system, which allows points on the surface of an object\nto be indexed semantically (\ufb01gure 1). As an illustrative example, take the object category of a face\nand the vector space R3. Our goal is to semantically map out the object such that any point on a\nface, such as the left eye, lives at a canonical position in this \u201clabel space\u201d. We train a CNN to learn\nthe function that projects any face image into this space, essentially \u201ccoloring\u201d each pixel with its\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Dense equivariant image labelling. Left: Given an image x of an object or object category\nand no other supervision, our goal is to \ufb01nd a common latent space Z, homeomorphic to a sphere,\nwhich attaches a semantically-consistent coordinate frame to the object points. This is done by\nlearning a dense labelling function that maps image pixels to their corresponding coordinate in the\nZ space. This mapping function is equivariant (compatible) with image warps or object instance\nvariations. Right: An equivariant dense mapping learned in an unsupervised manner from a large\ndataset of faces. (Results of SIMPLE network, Ldist, \u03b3 = 0.5)\n\ncorresponding label. As a result of our learning formulation, the label space has the property of being\nlocally smooth: points nearby in the image are nearby in the label space. 
In an ideal case, we could\nimagine the surface of an object to be mapped to a sphere.\nIn order to achieve these results, we contribute several technical innovations (section 3.2). First,\nwe show that, in order to learn a non-trivial object coordinate frame, the concept of equivariance\nmust be complemented with that of distinctiveness of the embedding. Then, we propose a CNN\nimplementation of this concept that can explicitly express uncertainty in the labelling of the object\npoints. The formulation is used in combination with a probabilistic loss, which is augmented with a\nrobust geometric distance to encourage better alignment of the object features.\nWe show that this framework can be used to learn meaningful object coordinate frames in a purely\nunsupervised manner, by analyzing thousands of deformations of visual objects. While [39] proposed\nto use Thin Plate Spline image warps for training, here we also consider simple synthetic articulated\nobjects having frames related by known optical flow (section 4).\nWe conclude the paper with a summary of our findings (section 5).\n\n2 Related Work\n\nLearning the structure of visual objects. Modeling the structure of visual objects is a widely-\nstudied (e.g. [6, 7, 11, 41, 12]) computer vision problem with important applications such as facial\nlandmark detection and human body pose estimation. Much of this work is supervised and aimed\nat learning detectors of objects or their parts, often using deep learning. A few approaches such as\nspatial transformer networks [20] can learn geometric transformations without explicit geometric\nsupervision, but do not build explicit geometric models of visual objects.\nMore related to our work, WarpNet [21] and geometric matching networks [35] learn a neural network\nthat predicts Thin Plate Spline [3] transformations between pairs of images of an object, including\nsynthetic warps. 
Deep Deformation Network [44] improves WarpNet by using a Point Transformer\nNetwork to refine the computed landmarks, but it requires manual supervision. None of these works\nlook at the problem of learning an invariant geometric embedding for the object.\nOur work builds on the idea of viewpoint factorization (section 3.1), recently introduced in [39, 32].\nHowever, we extend [39] in several significant ways. First, we construct a dense rather than discrete\nembedding, where all pixels of an object are mapped to an invariant object-centric coordinate instead\nof just a small set of selected landmarks. Second, we show that the equivariance constraint proposed\nin [39] is not quite enough to learn such an embedding; it must be complemented with the concept of\na distinctive embedding (section 3.1). Third, we introduce a new neural network architecture and\ncorresponding training objective that allow such an embedding to be learned in practice (section 3.2).\nOptical/semantic flow. A common technique to find correspondences between temporally related\nvideo frames is optical flow [18]. The state-of-the-art methods [14, 40, 19] typically employ convolutional neural networks to learn pairwise dense correspondences between the same object instances\nat subsequent frames. The SIFT Flow method [25] extends the between-instance correspondences\nto cross-instance mappings by matching SIFT features [27] between semantically similar object\ninstances. Learned-Miller [24] extends the pairwise correspondences to multiple images by posing\na problem of alignment among the images of a set. Collection Flow [22] and Mobahi et al. [29]\nproject objects onto a low-rank space that allows for joint alignment. FlowWeb [52] and Zhou et\nal. [51] construct fully connected graphs to maximise cycle consistency between each image pair,\nthe latter using synthetic data as an intermediary by training a CNN. 
In our experiments (section 4) flow is known\nfrom synthetic warps or motion, but our work could build on any unsupervised optical flow method.\n\nUnsupervised learning. Classical unsupervised learning methods such as autoencoders [4, 2, 17]\nand denoising autoencoders aim to learn useful feature representations from an input by simply\nreconstructing it after a bottleneck. Generative adversarial networks [16] target producing samples of\nrealistic images by training generative models. These models, when trained jointly with image encoders,\nare also shown to learn good feature representations [9, 10]. More recently several studies have\nemerged that train neural networks by learning auxiliary or pseudo tasks. These methods typically\nexploit existing information in the input as \u201cself-supervision\u201d without any manual labeling, by\nremoving or perturbing some information from an input and requiring a network to reconstruct it.\nFor instance, Doersch et al. [8], and Noroozi and Favaro [31] train a network to predict the relative\nlocations of shuffled image patches. Other self-supervised tasks include colorizing images [46],\ninpainting [34], and ranking frames of a video in temporally correct order [28, 13]. More related to our\napproach, Agrawal et al. [1] use egomotion as a supervisory signal to learn feature representations\nin a Siamese network by predicting camera transformations from image pairs, and [33] learn to group\npixels that move together in a video. 
[50, 15] use a warping-based loss to learn depth from video.\nRecent work [36] leverages RGB-D based reconstruction [30] and is similar to this work, showing\nqualitatively impressive results learning a consistent low-dimensional labelling on a human dataset.\n\n3 Method\n\nThis section discusses our method in detail, first introducing the general idea of dense equivariant\nlabelling (section 3.1), and then presenting a concrete implementation of the latter using a novel deep\nCNN architecture (section 3.2).\n\n3.1 Dense equivariant labelling\nConsider a 3D object S \u2282 R\u00b3 or a class of such objects S that are topologically isomorphic to\na sphere Z \u2282 R\u00b3 (i.e. the objects are simple closed surfaces without holes). We can construct a\nhomeomorphism p = \u03c0S(q) mapping points of the sphere q \u2208 Z to points p \u2208 S of the objects.\nFurthermore, if the objects belong to the same semantic category (e.g. faces), we can assume that\nthese isomorphisms are semantically consistent, in the sense that \u03c0S\u2032 \u25e6 \u03c0S\u207b\u00b9 : S \u2192 S\u2032 maps points of\nobject S to semantically-analogous points in object S\u2032 (e.g. for human faces the right eye in one face\nshould be mapped to the right eye in another [39]).\nWhile this construction is abstract, it shows that we can endow the object (or object category) with\na spherical reference system Z. The authors of [39] build on this construction to define a discrete\nsystem of object landmarks by considering a finite number of points zk \u2208 Z. Here, we take the\ngeometric embedding idea more literally and propose to explicitly learn a dense mapping from images\nof the object to the object-centric coordinate space Z. 
Formally, we wish to learn a labelling function\n\u03a6 : (x, u) \u21a6 z that takes an RGB image x : \u039b \u2192 R\u00b3, \u039b \u2282 R\u00b2 and a pixel u \u2208 \u039b to the object point\nz \u2208 Z which is imaged at u (figure 1).\nSimilarly to [39], this mapping must be compatible or equivariant with image deformations. Namely,\nlet g : \u039b \u2192 \u039b be a deformation of the image domain, either synthetic or due to a viewpoint change\nor other motion. Furthermore, let gx = x \u25e6 g\u207b\u00b9 be the action of g on the image (obtained by inverse\nwarp). Barring occlusions and boundary conditions, pixel u in image x must receive the same label\nas pixel gu in image gx, which results in the invariance constraint:\n\n\u2200x, u : \u03a6(x, u) = \u03a6(gx, gu).    (1)\n\nEquivalently, we can view the network as a functional x \u21a6 \u03a6(x, \u00b7) that maps the image to a\ncorresponding label map. Since the label map is an image too, g acts on it by inverse warp.\u00b9 Using\nthis, the constraint (1) can be rewritten as the equivariance relation g\u03a6(x, \u00b7) = \u03a6(gx, \u00b7). This can be\nvisualized by noting that the label image deforms in the same way as the input image, as shown for\nexample in figure 3.\nFor learning, constraint (1) can be incorporated in a loss function as follows:\n\nL(\u03a6|\u03b1) = (1/|\u039b|) \u222b_\u039b \u2016\u03a6(x, u) \u2212 \u03a6(gx, gu)\u2016\u00b2 du.\n\nHowever, minimizing this loss has the significant drawback that a global optimum is obtained by\nsimply setting \u03a6(x, u) = const. The reason for this issue is that (1) is not quite enough to learn a\nuseful object representation. In order to do so, we must require the labels not only to be equivariant,\nbut also distinctive, in the sense that\n\n\u03a6(x, u) = \u03a6(gx, v) \u21d4 v = gu.\n\nWe can encode this requirement as a loss in different ways. 
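To make the degenerate solution concrete, the discretised equivariance loss can be checked numerically. The following is a minimal illustrative sketch (our own, not the paper's implementation; the cyclic integer warp and array shapes are assumptions made only for the example):

```python
import numpy as np

def equivariance_loss(phi_x, phi_gx, warp):
    """Discretised version of the equivariance loss
    L = 1/|Lambda| * sum_u ||Phi(x, u) - Phi(gx, gu)||^2.

    phi_x, phi_gx: (H, W, L) label maps for image x and warped image gx.
    warp: (H, W, 2) integer array, warp[u] = g(u) as (row, col) in gx.
    """
    gu = phi_gx[warp[..., 0], warp[..., 1]]      # Phi(gx, gu) for every u
    return ((phi_x - gu) ** 2).sum(-1).mean()

# g: shift the image one pixel to the right (cyclically, for simplicity).
H, W = 6, 6
rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
warp = np.stack([rows, (cols + 1) % W], axis=-1)

rng = np.random.default_rng(0)
phi_x = rng.standard_normal((H, W, 3))
phi_gx = np.roll(phi_x, 1, axis=1)               # labels that move with the image
assert np.isclose(equivariance_loss(phi_x, phi_gx, warp), 0.0)

# ...but a constant labelling is also a global optimum, which is exactly
# why equivariance alone must be complemented with distinctiveness:
const = np.ones((H, W, 3))
assert np.isclose(equivariance_loss(const, const, warp), 0.0)
```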
For example, by using the fact that points\n\u03a6(x, u) are on the unit sphere, we can use the loss:\n\nL\u2032(\u03a6|x, g) = (1/|\u039b|) \u222b_\u039b \u2016gu \u2212 argmax_v \u27e8\u03a6(x, u), \u03a6(gx, v)\u27e9\u2016\u00b2 du.    (2)\n\nBy doing so, the labels \u03a6(x, u) must be able to discriminate between different object points, so that a\nconstant labelling would receive a high penalty.\nRelationship with learning invariant visual descriptors. As an alternative to loss (2), we could\nhave used a pairwise loss\u00b2 to encourage the similarity \u27e8\u03a6(x, u), \u03a6(x\u2032, gu)\u27e9 of the labels assigned\nto corresponding pixels u and gu to be larger than the similarity \u27e8\u03a6(x, u), \u03a6(x\u2032, v)\u27e9 of the labels\nassigned to pixels u and v that do not correspond. Formally, this would result in a pairwise loss\nsimilar to the ones often used to learn invariant visual descriptors for image matching. The reason\nwhy our method learns an object representation instead of a generic visual descriptor is that the\ndimensionality of the label space Z is just enough to represent a point on a surface. If we replace\nZ with a larger space such as R^d, d \u226b 2, we can expect \u03a6(x, u) to learn to extract generic visual\ndescriptors like SIFT instead. This establishes an interesting relationship between visual descriptors\nand object-specific coordinate vectors and suggests that it is possible to transition between the two by\ncontrolling their dimensionality.\n\n3.2 Concrete learning formulation\n\nIn this section we introduce a concrete implementation of our method (figure 2). For the mapping\n\u03a6, we use a CNN that receives as input an image tensor x \u2208 R^{H\u00d7W\u00d7C} and produces as output a\nlabel tensor z \u2208 R^{H\u00d7W\u00d7L}. 
We use the notation \u03a6u(x) to indicate the L-dimensional label vector\nextracted at pixel u from the label image computed by the network.\nThe dimension of the label vectors is set to L = 3 (instead of L = 2) in order to allow the network\nto express uncertainty about the label assigned to a pixel. The network can do so by modulating\nthe norm of \u03a6u(x). In fact, correspondences are expressed probabilistically by computing the inner\nproduct of label vectors followed by the softmax operator. Formally, the probability that pixel v in\nimage x\u2032 corresponds to pixel u in image x is expressed as:\n\np(v|u; x, x\u2032, \u03a6) = exp\u27e8\u03a6u(x), \u03a6v(x\u2032)\u27e9 / \u2211_z exp\u27e8\u03a6u(x), \u03a6z(x\u2032)\u27e9.    (3)\n\nIn this manner, a shorter vector \u03a6u results in a more diffuse probability distribution.\n\n\u00b9In the sense that g\u03a6(x, \u00b7) = \u03a6(x, \u00b7) \u25e6 g\u207b\u00b9.\n\u00b2Formally, this is achieved by the loss\n\nL\u2033(\u03a6|x, g) = (1/|\u039b|) \u222b_\u039b max{0, max_v [\u2206(u, v) + \u27e8\u03a6(x, u), \u03a6(gx, v)\u27e9] \u2212 \u27e8\u03a6(x, u), \u03a6(gx, gu)\u27e9} du,\n\nwhere \u2206(u, v) \u2265 0 is an error-dependent margin.\n\nFigure 2: Unsupervised dense correspondence network. From left to right: The network \u03a6 extracts\nlabel maps \u03a6u(x) and \u03a6v(x\u2032) from the image pair x and x\u2032. An optical flow module (or ground truth\nfor synthetic transformation) computes the warp (correspondence field) g such that x\u2032 = gx. Then\nthe label of each point u in the first image is correlated to each point v in the second, obtaining a\nnumber of score maps. The loss evaluates how well the score maps predict the warp g.\n\nNext, we wish to define a loss function for learning \u03a6 from data. 
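The matching probability (3) is easy to reproduce numerically. The sketch below is an illustrative re-implementation (not the authors' code; the random label map is an assumption for the example), and it also checks the claim that a shorter label vector yields a more diffuse distribution:

```python
import numpy as np

def correspondence_probs(phi_u, phi_map):
    """Eq. (3): p(v | u) = exp<phi_u, phi_v> / sum_z exp<phi_u, phi_z>,
    computed for all pixels v of the second image at once.

    phi_u: (L,) label vector at pixel u of image x.
    phi_map: (H, W, L) label map of image x'.
    """
    scores = phi_map @ phi_u        # inner products <phi_u, phi_v>
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

# Scaling the norm of the query label vector down flattens the
# distribution: a short vector expresses uncertainty about the match.
rng = np.random.default_rng(0)
phi_map = rng.standard_normal((8, 8, 3))
direction = np.array([1.0, 0.0, 0.0])
assert entropy(correspondence_probs(0.5 * direction, phi_map)) > \
       entropy(correspondence_probs(5.0 * direction, phi_map))
```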
To this end, we consider a triplet\n\u03b1 = (x, x\u2032, g), where x\u2032 = gx is an image that corresponds to x up to transformation g (the nature\nof the data is discussed below). We then assess the performance of the network \u03a6 on the triplet \u03b1\nusing two losses. The first loss is the negative log-likelihood of the ground-truth correspondences:\n\nLlog(\u03a6|x, x\u2032, g) = \u2212(1/HW) \u2211_u log p(gu|u; x, x\u2032, \u03a6).    (4)\n\nThis loss has the advantage that it explicitly learns (3) as the probability of a match. However, it is\nnot sensitive to the size of a correspondence error v \u2212 gu. In order to address this issue, we also\nconsider the loss\n\nLdist(\u03a6|x, x\u2032, g) = (1/HW) \u2211_u \u2211_v \u2016v \u2212 gu\u2016\u2082^\u03b3 p(v|u; x, x\u2032, \u03a6).    (5)\n\nHere \u03b3 > 0 is an exponent used to control the robustness of the distance measure, which we set to\n\u03b3 = 0.5, 1.\n\nNetwork details. We test two architectures. The first one, denoted SIMPLE, is the same as [49, 39]\nand is a chain (5, 20)+, (2, mp), \u21932, (5, 48)+, (3, 64)+, (3, 80)+, (3, 256)+, (1, 3) where (h, c) is\na bank of c filters of size h \u00d7 h, + denotes ReLU, (h, mp) is h \u00d7 h max-pooling, \u2193s is\ns\u00d7 downsampling. Better performance can be obtained by increasing the support of the filters\nin the network; for this, we consider a second network DILATIONS (5, 20)+, (2, mp), \u21932,\n(5, 48)+, (5, 64, 2)+, (3, 80, 4)+, (3, 256, 2)+, (1, 3) where (h, c, d) is a filter with \u00d7d dilation [43].\n\n3.3 Learning from synthetic and true deformations\nLosses (4) and (5) learn from triplets \u03b1 = (x, x\u2032, g). 
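Both losses can be written directly on top of the correspondence probabilities (3). The following numpy sketch on a small grid is our own illustrative rendering, not the trained CNN; the location-coding toy embedding and grid size are assumptions made for the sanity check:

```python
import numpy as np

def match_probs(phi_x, phi_xp):
    """All probabilities p(v|u) of eq. (3), as an (HW, HW) matrix."""
    L = phi_x.shape[-1]
    scores = phi_x.reshape(-1, L) @ phi_xp.reshape(-1, L).T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

def loss_log(phi_x, phi_xp, warp):
    """Eq. (4): negative log-likelihood of the true correspondences gu."""
    H, W, _ = phi_x.shape
    p = match_probs(phi_x, phi_xp)
    gu = (warp[..., 0] * W + warp[..., 1]).reshape(-1)   # flat index of g(u)
    return -np.log(p[np.arange(H * W), gu]).mean()

def loss_dist(phi_x, phi_xp, warp, gamma=0.5):
    """Eq. (5): expected error ||v - gu||_2^gamma under p(v|u)."""
    H, W, _ = phi_x.shape
    vs = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1)
    d = np.linalg.norm(warp.reshape(-1, 1, 2) - vs.reshape(1, -1, 2), axis=-1)
    return ((d ** gamma) * match_probs(phi_x, phi_xp)).sum(axis=1).mean()

# A distinctive, location-coding embedding (identity warp for simplicity)
# beats a constant one under both losses.
H, W = 6, 6
grid = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1)
phi = np.concatenate([grid - 2.5, np.full((H, W, 1), 5.0)], axis=-1)
phi = 10.0 * phi / np.linalg.norm(phi, axis=-1, keepdims=True)
const = np.ones((H, W, 3))

assert loss_log(phi, phi, grid) < loss_log(const, const, grid)
assert loss_dist(phi, phi, grid) < loss_dist(const, const, grid)
```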
Here x\u2032 can be either generated synthetically by\napplying a random transformation g to a natural image x [39, 21], or it can be obtained by observing\nimage pairs (x, x\u2032) containing true object deformations arising from a viewpoint change or an object\nmotion or deformation.\nThe use of synthetic transformations enables training even on static images and was considered\nin [39], who showed it to be sufficient to learn meaningful landmarks for a number of real-world\nobjects such as human and cat faces. Here, in addition to using synthetic deformations, we also\nconsider using animated image pairs x and x\u2032. In principle, the learning formulation can be modified\nso that knowledge of g is not required; instead, images and their warps can be compared and aligned\ndirectly based on the brightness constancy principle. In our toy video examples we obtain g from the\nrendering engine, but it can in theory be obtained using an off-the-shelf optical flow algorithm which\nwould produce a noisy version of g.\n\nFigure 3: Roboarm equivariant labelling. Top: Original video frames of a simple articulated object.\nMiddle and bottom: learned labels, which change equivariantly with the arm, learned using Llog and\nLdist, respectively. Different colors denote different points of the spherical object frame.\n\n4 Experiments\n\nThis section assesses our unsupervised method for dense object labelling on representative tasks:\ntwo toy problems (sections 4.1 and 4.2) and human and cat faces (section 4.3).\n\n4.1 Roboarm example\n\nIn order to illustrate our method we consider a toy problem consisting of a simple articulated object,\nnamely an animated robotic arm (figure 3) created using a 2D physics engine [38]. 
We do so for two\nreasons: to show that the approach is capable of correctly labelling deformable/articulated objects and\nto show that the spherical model Z is applicable also to thin objects, which have mainly a 1D structure.\nDataset details. The arm is anchored to the bottom left corner and is made up of colored capsules\nconnected with joints having reasonable angle limits to prevent unrealistic contortion and self-\nocclusion. Motion is achieved by varying the gravity vector, sampling each element from a Gaussian\nwith standard deviation 15 m s\u207b\u00b2 every 100 iterations. Frames x of size 90 \u00d7 90 pixels and the\ncorresponding flow fields g : x \u21a6 x\u2032 are saved every 20 iterations. We also save the positions of the\ncapsule centers. The final dataset has 23999 frames.\nLearning. Using the correspondences \u03b1 = (x, x\u2032, g) provided by the flow fields, we use our method\nto learn an object-centric coordinate frame Z and its corresponding labelling function \u03a6u(x). We\ntest learning \u03a6 using the probabilistic loss (4) and the distance-based loss (5). In the loss we ignore\nareas with zero flow, which automatically removes the background. We use the SIMPLE network\narchitecture (section 3.2).\n\nResults. Figure 3 provides some qualitative results, showing by means of colormaps the labels \u03a6u(x)\nassociated to different pixels of each input image. It is easy to see that the method attaches consistent\nlabels to the different arm elements. The distance-based loss produces a more uniform embedding, as\nmay be expected. The embeddings are further visualized in Figure 4 by projecting a number of video\nframes back to the learned coordinate spaces Z. It can be noted that the space is invariant, in the sense\nthat the resulting figure is approximately the same despite the fact that the object deforms significantly\nin image space. 
This is true for both embeddings, but the distance-based ones are geometrically more\nconsistent.\n\nPredicting capsule centers. We evaluate quantitatively the ability of our object frames to localise the\ncapsule centers. If our assumption is correct and a coordinate system intrinsic to the object has been\nlearned, then we should expect there to be a speci\ufb01c 3-vector in Z corresponding to each center, and\nour job is to \ufb01nd these vectors. Various strategies could be used, such as averaging the object-centric\ncoordinates given to the centers over the training set, but we choose to incorporate the problem into\nthe learning framework. This is done using the negative log-likelihood in much the same way as\n(4), limiting our vectors u to the centers. This is done as an auxiliary layer with no backpropagation\nto the rest of the network, so that the embedding remains unsupervised. The error reported is the\nEuclidean distance as a percentage of the image width.\nResults are given for the different loss functions used for unsupervised training in Table 1 and\nvisualized in Figure 5 right, showing that the object centers can be located to a high degree of\n\n6\n\n\fFigure 4: Invariance of the object-centric coordinate space for Roboarm. The plot projects\nframes 3,6,9 of \ufb01gure 3 on the object-centric coordinate space Z, using the embedding functions\nlearned by means of the probabilistic (top) and distance (bottom) based losses. The sphere is then\nunfolded, plotting latitude and longitude (in radians) along the vertical and horizontal axes.\n\nFigure 5: Left: Embedding spaces of different dimension. Spherical embedding (from the 3D\nembedding function \u03a6u(x) \u2208 R3) learned using the distance loss compared to a circular embedding\nwith one dimension less. Right: Capsule center prediction for different losses.\n\naccuracy. 
The negative log likelihood performs best while the two losses incorporating distance\nperform similarly.\nWe also perform experiments varying the dimensionality L of the label space Z (Table 2). Perhaps\nmost interesting, given the almost one-dimensional nature of the arm, is the case of L = 2, which\nwould correspond to an approximately circular space (since the length of vectors is used to code\nfor uncertainty). As seen in Figure 5 (left), the segments are represented almost perfectly\non the boundary of a circle, with the exception of the bifurcation, which it is unable to accurately\nrepresent. This is manifested by the light blue segment trying, and failing, to be in two places at once.\n\nTable 1: Predicting capsule centers. Error as percent of image width.\nLlog: 0.97 %; Ldist, \u03b3 = 1: 1.13 %; Ldist, \u03b3 = 0.5: 1.14 %\n\nTable 2: Descriptor dimension (Ldist, \u03b3 = 0.5). L > 3 shows no improvement, suggesting L = 3 is the natural manifold of the arm.\nL = 2: 1.29 %; L = 3: 1.14 %; L = 5: 1.16 %; L = 20: 1.28 %\n\n4.2 Textured sphere example\n\nThe experiment of Figure 6 tests the ability of the method to understand a complete rotation of a 3D\nobject, a simple textured sphere. Despite the fact that the method is trained on pairs of adjacent video\nframes (and corresponding optical flow), it still learns a globally-consistent embedding. However,\nthis required switching from the SIMPLE to the DILATIONS architecture (section 3.2).\n\n4.3 Faces\n\nAfter testing our method on a toy problem, we move to a much harder task and apply our method to\ngenerate an object-centric reference frame Z for the category of human faces. In order to generate\nan image pair and corresponding flow field for training we warp each face synthetically using Thin\nPlate Spline warps in a manner similar to [39]. 
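Since the training pairs for faces come from random Thin Plate Spline warps, a self-contained numpy sketch of a TPS warp and the resulting dense flow field may be useful. This is an illustrative implementation of the standard TPS formulation [3], not the paper's code; the control-point layout, jitter magnitude, and image size below are assumptions for the example:

```python
import numpy as np

def tps_warp(ctrl, ctrl_dst, query):
    """Evaluate the Thin Plate Spline interpolating warp that moves the
    control points `ctrl` (K, 2) to `ctrl_dst` (K, 2), at `query` (N, 2)
    points, by solving the standard TPS linear system for the radial
    weights and the affine part.
    """
    def U(r2):                      # U(r) = r^2 log r, with U(0) = 0
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.nan_to_num(0.5 * r2 * np.log(r2))

    K = len(ctrl)
    A = np.zeros((K + 3, K + 3))
    A[:K, :K] = U(((ctrl[:, None] - ctrl[None]) ** 2).sum(-1))
    A[:K, K] = 1.0
    A[K, :K] = 1.0
    A[:K, K + 1:] = ctrl
    A[K + 1:, :K] = ctrl.T
    b = np.zeros((K + 3, 2))
    b[:K] = ctrl_dst
    coef = np.linalg.solve(A, b)    # radial weights, then affine part

    q2 = ((query[:, None] - ctrl[None]) ** 2).sum(-1)     # (N, K)
    return U(q2) @ coef[:K] + coef[K] + query @ coef[K + 1:]

# A random training warp g: jitter a coarse control grid, then evaluate
# the dense flow on the pixel grid of a (here 16x16) image.
rng = np.random.default_rng(0)
ax = np.linspace(0.0, 1.0, 3)
ctrl = np.stack(np.meshgrid(ax, ax), -1).reshape(-1, 2)
ctrl_dst = ctrl + 0.05 * rng.standard_normal(ctrl.shape)
px = np.linspace(0.0, 1.0, 16)
grid = np.stack(np.meshgrid(px, px), -1).reshape(-1, 2)
g = tps_warp(ctrl, ctrl_dst, grid)                        # dense flow field

# TPS interpolates the control points exactly.
assert np.allclose(tps_warp(ctrl, ctrl_dst, ctrl), ctrl_dst, atol=1e-6)
```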
We train our models on the extensive CelebA [26]\ndataset of over 200k faces as in [39], excluding MAFL [49] test overlap from the given training split.\nIt has annotations of the eyes, nose and mouth corners. Note that we do not use these to train our\nmodel. We also use AFLW [23], testing on 2995 faces [49, 42, 48] with 5 landmarks. Like [39] we\nuse 10,122 faces for training. We additionally evaluate qualitatively on a dataset of cat faces [47],\nusing 8609 images for training.\n\nFigure 6: Sphere equivariant labelling. Top: video frames of a rotating textured sphere. Middle:\nlearned dense labels, which change equivariantly with the sphere. Bottom: re-projection of the video\nframes on the object frame (also spherical). Except for occlusions, the reprojections are approximately\ninvariant, correctly mapping the blue and orange sides to different regions of the label space.\n\nQualitative assessment. We find that for network SIMPLE the negative log-likelihood loss, while\nperforming best for the simple example of the arm, performs poorly on faces. Specifically, this model\nfails to disambiguate the left and right eye, as shown in Figure 9 (right). The distance-based loss (5)\nproduces a more coherent embedding, as seen in Figure 9 (left). Using DILATIONS this problem\ndisappears, giving qualitatively smooth and unambiguous labels for both the distance loss (Figure 7)\nand the log-likelihood loss (Figure 8). For cats our method is able to learn a consistent object frame\ndespite large variations in appearance (Figure 8).\n\nFigure 7: Faces. DILATIONS network with Ldist, \u03b3 = 0.5. Top: Input images, Middle: Predicted\ndense labels mapped to colours, Bottom: Image pixels mapped to label sphere and flattened.\n\nRegressing semantic landmarks. 
We would like to quantify the accuracy of our model in terms of its\nability to consistently locate manually annotated points, specifically the eyes, nose, and mouth corners\ngiven in the CelebA dataset. We use the standard test split for evaluation of the MAFL dataset [49],\ncontaining 1000 images. We also use the MAFL training subset of 19k images for learning to predict\nthe ground truth landmarks, which gives a quantitative measure of the consistency of our object frame\nfor detecting facial features. These are reported as Euclidean error normalized as a percentage of\ninter-ocular distance.\nIn order to map the object frame to the semantic landmarks, as in the case of the robot arm centers, we\nlearn the vectors zk \u2208 Z corresponding to the position of each point in our canonical reference space\nand then, for any given image, find the nearest z and its corresponding pixel location u. We report\nthe localization performance of this model in Table 3 (\u201cError Nearest\u201d). We empirically validate\nthat with the SIMPLE network the negative log-likelihood is not ideal for this task (Figure 9) and\nwe obtain higher performance for the robust distance with power 0.5. However, after switching to\nDILATIONS to increase the receptive field both methods perform comparably.\n\nFigure 8: Cats. DILATIONS network with Llog. Top: Input images, Middle: Labels mapped to\ncolours, Bottom: Images mapped to the spherical object frames.\n\nFigure 9: Annotated landmark prediction from the shown unsupervised label maps (SIMPLE\nnetwork). Left: Trained with Ldist, \u03b3 = 0.5, Right: Failure to disambiguate eyes with Llog.\n(Prediction: green, Ground truth: blue)\n\nThe method of [39] learns to regress P ground truth coordinates based on M > P unsupervised\nlandmarks. By regressing from multiple points it is not limited to integer pixel coordinates. 
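The nearest-point localisation used for the "Error Nearest" column can be sketched as follows (an illustrative toy with a planted label map rather than a trained network; the pixel location and landmark vector are assumptions for the example):

```python
import numpy as np

def locate_landmark(phi_map, z_k):
    """Nearest-point landmark prediction: return the pixel u whose label
    vector Phi_u(x) is most similar (by inner product) to the learned
    landmark vector z_k in the object coordinate space Z.

    phi_map: (H, W, 3) dense label map of an image.
    z_k: (3,) learned vector for one landmark.
    """
    scores = phi_map @ z_k         # <Phi_u(x), z_k> for every pixel u
    return np.unravel_index(np.argmax(scores), scores.shape)

# Toy check: plant a distinctive label at a known pixel.
rng = np.random.default_rng(0)
phi_map = 0.1 * rng.standard_normal((32, 32, 3))
z_k = np.array([0.0, 0.0, 1.0])
phi_map[11, 23] = 5.0 * z_k        # the "left eye" lives at this pixel
assert locate_landmark(phi_map, z_k) == (11, 23)
```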
While we\nare not predicting landmarks as network output, we can emulate this method by allowing multiple\npoints in our object coordinate space to be predictive for a single ground truth landmark. We learn\none regressor per ground truth point, each formulated as a linear regressor R^{2M} \u2192 R\u00b2 on top of\ncoordinates from M = 50 learned intermediate points. This allows the regression to say which points\nin Z are most useful for predicting each ground truth point.\nWe also report results after unsupervised finetuning of a CelebA network to the more challenging\nAFLW followed by regressor training on AFLW. As shown in Tables 3 and 4, we outperform other\nunsupervised methods on both datasets, and are comparable to fully supervised methods.\n\nTable 3: Nearest neighbour and regression landmark prediction on MAFL.\nNetwork, Unsup. Loss: Error Nearest / Error Regress\nSIMPLE, Llog: 75.02 % / \u2014\nSIMPLE, Ldist, \u03b3 = 1: 14.57 % / 7.94 %\nSIMPLE, Ldist, \u03b3 = 0.5: 13.29 % / 7.18 %\nDILATIONS, Llog: 11.05 % / 5.83 %\nDILATIONS, Ldist, \u03b3 = 0.5: 10.53 % / 5.87 %\nUnsup. Landmarks [39]: \u2014 / 6.67 %\n\nTable 4: Comparison with supervised and unsupervised methods on AFLW.\nRCPR [5]: 11.6 %\nCascaded CNN [37]: 8.97 %\nCFAN [45]: 10.94 %\nTCDCN [49]: 7.65 %\nRAR [42]: 7.23 %\nUnsup. Landmarks [39]: 10.53 %\nDILATIONS, Ldist, \u03b3 = 0.5 (ours): 8.80 %\n\n5 Conclusions\nBuilding on the idea of viewpoint factorization, we have introduced a new method that can endow\nan object or object category with an invariant dense geometric embedding automatically, by simply\nobserving a large dataset of unlabelled images. Our learning framework combines in a novel way\nthe concept of equivariance with that of distinctiveness. We have also proposed a concrete\nimplementation using novel losses to learn a deep dense image labeller. 
We have shown empirically that the method can learn a consistent geometric embedding for a simple articulated synthetic robotic arm, as well as for a 3D sphere model and real faces. The resulting embeddings are invariant to deformations and, importantly, to intra-category variations.

Acknowledgments: This work acknowledges the support of the AIMS CDT (EPSRC EP/L015897/1) and ERC 677195-IDIU. Clipart: FreePik.

References

[1] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proc. ICCV, 2015.
[2] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[3] Fred L. Bookstein. Principal Warps: Thin-Plate Splines and the Decomposition of Deformations. PAMI, 1989.
[4] H. Bourlard and Y. Kamp. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics, 1988.
[5] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation under occlusion. In Proc. ICCV, 2013.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: their training and application. CVIU, 1995.
[7] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In Proc. CVPR, 2005.
[8] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised Visual Representation Learning by Context Prediction. In Proc. ICCV, 2015.
[9] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In Proc. ICLR, 2017.
[10] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. In Proc. ICLR, 2017.
[11] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI, 2010.
[12] Rob Fergus, Pietro Perona, and Andrew Zisserman. 
Object class recognition by unsupervised scale-invariant learning. In Proc. CVPR, 2003.
[13] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Proc. CVPR, 2017.
[14] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning Optical Flow with Convolutional Networks. In Proc. ICCV, 2015.
[15] Ravi Garg, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proc. ECCV, pages 740–756, 2016.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[17] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 2006.
[18] Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artificial Intelligence, 1981.
[19] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. arXiv preprint arXiv:1612.01925, 2016.
[20] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial Transformer Networks. In Proc. NIPS, 2015.
[21] A. Kanazawa, D. W. Jacobs, and M. Chandraker. WarpNet: Weakly supervised matching for single-view reconstruction. In Proc. CVPR, 2016.
[22] Ira Kemelmacher-Shlizerman and Steven M. Seitz. Collection flow. In Proc. CVPR, 2012.
[23] Martin Koestinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.
[24] Erik G. Learned-Miller. 
Data driven image models through continuous joint alignment. PAMI, 2006.
[25] Ce Liu, Jenny Yuen, and Antonio Torralba. SIFT Flow: Dense correspondence across scenes and its applications. PAMI, 2011.
[26] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proc. ICCV, 2015.
[27] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[28] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In Proc. ECCV, 2016.
[29] Hossein Mobahi, Ce Liu, and William T. Freeman. A Compositional Model for Low-Dimensional Image Set Representation. In Proc. CVPR, 2014.
[30] Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. CVPR, 2015.
[31] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proc. ECCV, 2016.
[32] D. Novotny, D. Larlus, and A. Vedaldi. Learning 3D object categories by looking around them. In Proc. ICCV, 2017.
[33] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proc. CVPR, 2017.
[34] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context Encoders: Feature Learning by Inpainting. In Proc. CVPR, 2016.
[35] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, 2017.
[36] Tanner Schmidt, Richard Newcombe, and Dieter Fox. Self-supervised visual descriptor learning for dense correspondence. 
IEEE Robotics and Automation Letters, 2(2):420–427, 2017.
[37] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In Proc. CVPR, 2013.
[38] Yuval Tassa. CapSim - the MATLAB physics engine. https://mathworks.com/matlabcentral/fileexchange/29249-capsim-the-matlab-physics-engine.
[39] James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In Proc. ICCV, 2017.
[40] James Thewlis, Shuai Zheng, Philip H. S. Torr, and Andrea Vedaldi. Fully-Trainable Deep Matching. In Proc. BMVC, 2016.
[41] Markus Weber, Max Welling, and Pietro Perona. Towards automatic discovery of object categories. In Proc. CVPR, 2000.
[42] Shengtao Xiao, Jiashi Feng, Junliang Xing, Hanjiang Lai, Shuicheng Yan, and Ashraf Kassim. Robust Facial Landmark Detection via Recurrent Attentive-Refinement Networks. In Proc. ECCV, 2016.
[43] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In Proc. ICLR, 2016.
[44] Xiang Yu, Feng Zhou, and Manmohan Chandraker. Deep Deformation Network for Object Landmark Localization. In Proc. ECCV, 2016.
[45] Jie Zhang, Shiguang Shan, Meina Kan, and Xilin Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In Proc. ECCV, 2014.
[46] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful Image Colorization. In Proc. ECCV, 2016.
[47] Weiwei Zhang, Jian Sun, and Xiaoou Tang. Cat head detection - How to effectively exploit shape and texture features. In Proc. ECCV, 2008.
[48] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In Proc. ECCV, 2014.
[49] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Learning Deep Representation for Face Alignment with Auxiliary Attributes. 
PAMI, 2016.
[50] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. CVPR, 2017.
[51] Tinghui Zhou, Philipp Krähenbühl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros. Learning Dense Correspondences via 3D-guided Cycle Consistency. In Proc. CVPR, 2016.
[52] Tinghui Zhou, Yong Jae Lee, Stella X. Yu, and Alexei A. Efros. FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In Proc. CVPR, 2015.