{"title": "Learning to Align from Scratch", "book": "Advances in Neural Information Processing Systems", "page_first": 764, "page_last": 772, "abstract": "Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the {\\em congealing} alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization on the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve a significantly higher accuracy in face verification than obtained using the original face images, prior work in unsupervised alignment, and prior work in supervised alignment. We also match the accuracy for the best available, but unpublished method.", "full_text": "Learning to Align from Scratch\n\nGary B. Huang1 Marwan A. 
Mattar1 Honglak Lee2 Erik Learned-Miller1\n\n1University of Massachusetts, Amherst, MA\n\n{gbhuang,mmattar,elm}@cs.umass.edu\n\n2University of Michigan, Ann Arbor, MI\n\nhonglak@eecs.umich.edu\n\nAbstract\n\nUnsupervised joint alignment of images has been demonstrated to improve per-\nformance on recognition tasks such as face veri\ufb01cation. Such alignment reduces\nundesired variability due to factors such as pose, while only requiring weak su-\npervision in the form of poorly aligned examples. However, prior work on unsu-\npervised alignment of complex, real-world images has required the careful selec-\ntion of feature representation based on hand-crafted image descriptors, in order to\nachieve an appropriate, smooth optimization landscape. In this paper, we instead\npropose a novel combination of unsupervised joint alignment with unsupervised\nfeature learning. Speci\ufb01cally, we incorporate deep learning into the congealing\nalignment framework. Through deep learning, we obtain features that can repre-\nsent the image at differing resolutions based on network depth, and that are tuned\nto the statistics of the speci\ufb01c data being aligned.\nIn addition, we modify the\nlearning algorithm for the restricted Boltzmann machine by incorporating a group\nsparsity penalty, leading to a topographic organization of the learned \ufb01lters and\nimproving subsequent alignment results. We apply our method to the Labeled\nFaces in the Wild database (LFW). Using the aligned images produced by our\nproposed unsupervised algorithm, we achieve higher accuracy in face veri\ufb01cation\ncompared to prior work in both unsupervised and supervised alignment. We also\nmatch the accuracy for the best available commercial method.\n\nIntroduction\n\n1\nOne of the most challenging aspects of image recognition is the large amount of intra-class vari-\nability, due to factors such as lighting, background, pose, and perspective transformation. 
For tasks\ninvolving a speci\ufb01c object category, such as face veri\ufb01cation, this intra-class variability can often be\nmuch larger than inter-class differences. This variability can be seen in Figure 1, which shows sam-\nple images from Labeled Faces in the Wild (LFW), a data set used for benchmarking unconstrained\nface veri\ufb01cation performance. The task in LFW is, given a pair of face images, determine if both\nfaces are of the same person (matched pair), or if each shows a different person (mismatched pair).\n\nFigure 1: Sample images from LFW: matched pairs (top row) and mismatched pairs (bottom row)\nRecognition performance can be signi\ufb01cantly improved by removing undesired intra-class variabil-\nity, by \ufb01rst aligning the images to some canonical pose or con\ufb01guration. For instance, face veri\ufb01-\ncation accuracy can be dramatically increased through image alignment, by detecting facial feature\npoints on the image and then warping these points to a canonical con\ufb01guration. This alignment\nprocess can lead to signi\ufb01cant gains in recognition accuracy on real-world face veri\ufb01cation, even\n\n1\n\n\ffor algorithms that were explicitly designed to be robust to some misalignment [1]. Therefore,\nthe majority of face recognition systems evaluated on LFW currently make use of a preprocessed\nversion of the data set known as LFW-a,1 where the images have been aligned by a commercial \ufb01du-\ncial point-based supervised alignment method [2]. Fiducial point (or landmark-based) alignment\nalgorithms [1, 3\u20135], however, require a large amount of supervision or manual effort. One must de-\ncide which \ufb01ducial points to use for the speci\ufb01c object class, and then obtain many example image\npatches of these points. 
These methods are thus hard to apply to new object classes, since all of this\nmanual collection of data must be re-done, and the alignment results may be sensitive to the choice\nof \ufb01ducial points and quality of training examples.\nAn alternative to this supervised approach is to take a set of poorly aligned images (e.g., images\ndrawn from approximately the same distribution as the inputs to the recognition system) and attempt\nto make the images more similar to each other, using some measure of joint similarity such as\nentropy. This framework of iteratively transforming images to reduce the entropy of the set is known\nas congealing [6], and was originally applied to speci\ufb01c types of images such as binary handwritten\ncharacters and magnetic resonance image volumes [7\u20139]. Congealing was extended to work on\ncomplex, real-world object classes such as faces and cars [10]. However, this required a careful\nselection of hand-crafted feature representation (SIFT [11]) and soft clustering, and does not achieve\nas large of an improvement in veri\ufb01cation accuracy as supervised alignment (LFW-a).\nIn this work, we propose a novel combination of unsupervised alignment and unsupervised fea-\nture learning, speci\ufb01cally by incorporating deep learning [12\u201314] into the congealing framework.\nThrough deep learning, we can obtain a feature representation tuned to the statistics of the speci\ufb01c\nobject class we wish to align, and capture the data at multiple scales by using multiple layers of a\ndeep learning architecture. Further, we incorporate a group sparsity constraint into the deep learn-\ning algorithm, leading to a topographic organization on the learned \ufb01lters, and show that this leads\nto improved alignment results. 
We apply our method to unconstrained face images and show that,\nusing the aligned images, we achieve a signi\ufb01cantly higher face veri\ufb01cation accuracy than obtained\nboth using the original face images and using the images produced by prior work in unsupervised\nalignment [10]. In addition, the accuracy surpasses that achieved using supervised \ufb01ducial points\nbased alignment [3], and matches the accuracy using the LFW-a images produced by commercial\nsupervised alignment.\n2 Related Work\nWe review relevant work in unsupervised joint alignment and deep learning.\n2.1 Unsupervised Joint Alignment\nCox et al. presented a variation of congealing for unsupervised alignment, where the entropy sim-\nilarity measure is replaced with a least-squares similarity measure [15, 16]. Liu et al. extended\ncongealing by modifying the objective function to allow for simultaneous alignment and cluster-\ning [17]. Mattar et al. developed a transformed variant of Bayesian in\ufb01nite models that can also\nsimultaneously align and cluster complex data sets [18]. Zhu et al. developed a method for non-rigid\nalignment using a model parameterized by mesh vertex coordinates in a deformable Lucas-Kanade\nformulation [19]. However, this technique requires additional supervision in the form of object part\n(e.g., eye) detectors speci\ufb01c to the data to be aligned.\nIn this work, we chose to extend the original congealing method, rather than other alignment frame-\nworks, for several reasons. The algorithm uses entropy as a measure of similarity, rather than vari-\nance or least squares, thus allowing for the alignment of data with multiple modes. Unlike other joint\nalignment procedures [15], the main loop scales linearly with the number of images to be aligned,\nallowing for a greater number of images to be jointly aligned, smoothing the optimization landscape.\nFinally, congealing requires only very weak supervision in the form of poorly aligned images. 
However, our proposed extensions, using features obtained from deep learning, could also be applied to other alignment algorithms that have only been used with a pixel intensity representation, such as [15, 16, 19].
2.2 Deep Learning
A deep belief network (DBN) is a generative graphical model consisting of a layer of visible units and multiple layers of hidden units, where each layer encodes statistical dependencies in the units in the layer below [12]. DBNs and related unsupervised learning algorithms such as auto-encoders [13] and sparse coding [20, 21] have been used to learn higher-level feature representations from unlabeled data, suitable for use in tasks such as classification. These methods have been successfully applied to computer vision tasks [22-26], as well as audio recognition [27], natural language processing [28], and information retrieval [29]. To the best of our knowledge, our proposed method is the first to apply deep learning to the alignment problem.
DBNs are generally trained using images drawn from the same distribution as the test images, which in our case corresponds to learning from faces in the LFW training set. In many machine learning problems, however, we are given only a limited amount of labeled data, which can cause an overfitting problem. Thus, we also examine the strategy of self-taught learning [30] (related to semi-supervised learning [31]). The idea of self-taught learning is to use a large amount of unlabeled data from a distribution different from the labeled data, and transfer low-level structures that can be shared between unlabeled and labeled data. For generic object categorization, Raina et al. [30] and Lee et al. [23] have shown successful applications of self-taught learning, using sparse coding and deep belief networks to learn feature representations from natural images.

1http://www.openu.ac.il/home/hassner/data/lfwa/
In this paper, we examine\nwhether self-taught learning can be successful for alignment tasks.\nIn addition, we augment the training procedure of DBNs by adding a group sparsity regularization\nterm, leading to a set of learned \ufb01lters with a linear topographic organization. This idea is closely\nrelated to the Group Lasso for regression [32] and Topographic ICA [33], and has been applied to\nsparse coding with basis functions that form a generally two-dimensional topological map [34]. We\nextend this method to basis functions that are learned in a convolutional manner, and to higher-order\nfeatures obtained from a multi-layer convolutional DBN.\n3 Methodology\nWe begin with a review of the congealing framework. We then show how deep learning can be\nincorporated into this framework using convolutional DBNs, and how the learning algorithm can be\nmodi\ufb01ed through group sparsity regularization to improve congealing performance.\n3.1 Congealing\nWe \ufb01rst de\ufb01ne two terms used in congealing, the distribution \ufb01eld (DF) and the location stack. Let\nX = {1, 2, . . . , M} be the set of all feature values. For example, letting the feature space be intensity\nvalues, M = 2 for binary images and M = 256 for 8-bit grayscale images. A distribution \ufb01eld is\na distribution over X at each location in the image representation; e.g., for binary images, a DF\nwould be a distribution over {0, 1} at each pixel in the image. One can view the DF as a generative\nindependent pixel model of images, by placing a random variable Xi at each pixel location i. An\nimage then consists of a draw from the alphabet X for each Xi according to the distribution over X\nat the ith pixel of the DF. Given a set of images, the location stack is de\ufb01ned as the set of values,\nwith domain X , at a speci\ufb01c location across a set of images. 
Thus, the empirical distribution at a given location of a DF is determined by the corresponding location stack.
Congealing proceeds by iteratively computing the empirical distribution defined by a set of images, then for each image, choosing a transformation (we use the set of similarity transformations) that reduces the entropy of the distribution field. Figure 2 illustrates congealing on one-dimensional binary images. Under an independent pixel model and uniform distribution over transformations, minimizing the entropy of the distribution field is equivalent to maximizing the likelihood according to the distribution field [6].

Figure 2: Schematic illustration of congealing of one-dimensional binary images, where the transformation space is left-right translation

Once congealing has been performed on a set of images (e.g., a training set), funneling [6, 10] can be used to quickly align additional images, such as from a new test set. This is done by maintaining the sequence of DFs from each iteration of congealing. A new image is then aligned by transforming it iteratively according to the sequence of saved DFs, thereby approximating the results of congealing on the original set of images as well as the new test image. As mentioned earlier, congealing was extended to work on complex object classes, such as faces, by using soft clustering of SIFT descriptors as the feature representation [10]. We will refer to this congealing algorithm as SIFT congealing. We now describe our proposed extension, which we refer to as deep congealing.
3.2 Deep Congealing
To incorporate deep learning within congealing, we use the convolutional restricted Boltzmann machine (CRBM) [23, 35] and convolutional deep belief network (CDBN) [23].

Figure 3: Illustration of convolutional RBM with probabilistic max-pooling. For illustration, we used pooling ratio C = 2 and number of filters K = 4. See text for details.
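As an illustration, the congealing loop just reviewed can be sketched for one-dimensional binary images, as in Figure 2 (a minimal sketch, not the paper's implementation; the translation-only transformation space, the greedy per-image update, and all function names are our simplifications):

```python
import numpy as np

def df_entropy(images):
    """Total entropy of the distribution field (DF) defined by a stack of
    binary images: one Bernoulli distribution per pixel location."""
    p = images.mean(axis=0)                     # empirical P(pixel = 1)
    p = np.clip(p, 1e-12, 1 - 1e-12)            # avoid log(0)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)).sum())

def congeal(images, shifts=(-1, 0, 1), n_iters=10):
    """Greedy congealing over left-right translations: for each image, keep
    the shift that most reduces the entropy of the DF of the whole set.
    Note np.roll wraps around, which is fine for this toy setting."""
    images = images.copy()
    for _ in range(n_iters):
        for n in range(len(images)):
            original = images[n].copy()
            best_shift, best_h = 0, df_entropy(images)
            for s in shifts:
                images[n] = np.roll(original, s)
                h = df_entropy(images)
                if h < best_h:
                    best_shift, best_h = s, h
            images[n] = np.roll(original, best_shift)
    return images
```

For three copies of a binary bar, each offset by one pixel, the loop translates them onto a common position and drives the DF entropy toward zero.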
The CRBM is an exten-\nsion of the restricted Boltzmann machine, which is a Markov random \ufb01eld with a hidden layer and\na visible layer (corresponding to image pixels in computer vision problems), where the connection\nbetween layers is bipartite. In the CRBM, rather than fully connecting the hidden layer and visible\nlayer, the weights between the hidden units and the visible units are local (i.e., 10\u00d7 10 pixels instead\nof full image) and shared among all hidden units. An illustration of CRBM can be found in Figure 3.\nThe CRBM has three sets of parameters: (1) K convolution \ufb01lter weights between the hidden nodes\nand the visible nodes, where each \ufb01lter is NW \u00d7 NW pixels (i.e., W k \u2208 RNW \u00d7NW , k = 1, ..., K);\n(2) hidden biases bk \u2208 R that are shared among hidden nodes; and (3) visible bias c \u2208 R that is\nshared among visible nodes.\nTo make CRBMs more scalable, Lee et al. developed probabilistic max-pooling, a technique for\nincorporating local translation invariance. Max-pooling refers to operations where a local neighbor-\nhood (e.g., 2 \u00d7 2 grid) of feature detection outputs is shrunk to a pooling node by computing the\nmaximum of the local neighbors. 
Max-pooling makes the feature representation more invariant to local translations in the input data, and has been shown to be useful in computer vision [23, 25, 36]. Letting $P(v, h) = \frac{1}{Z} \exp(-E(v, h))$, we define the energy function of the probabilistic max-pooling CRBM as follows:2

$$E(v, h) = -\sum_{k=1}^{K} \sum_{i,j=1}^{N_H} h^k_{ij} (\tilde{W}^k \ast v)_{ij} + \frac{1}{2} \sum_{r,s=1}^{N_V} v_{rs}^2 - \sum_{k=1}^{K} b_k \sum_{i,j=1}^{N_H} h^k_{ij} - c \sum_{r,s=1}^{N_V} v_{rs}$$

$$\text{s.t.} \quad \sum_{(i,j) \in B_\alpha} h^k_{ij} \le 1, \quad \forall k, \alpha$$

Here, $\tilde{W}^k$ refers to flipping the original filter $W^k$ in both upside-down and left-right directions, and $\ast$ denotes convolution. $B_\alpha$ refers to a $C \times C$ block of locally neighboring hidden units (i.e., pooling region) $h^k_{ij}$ that are pooled to a pooling node $p^k_\alpha$. The CRBM can be trained by approximately maximizing the log-likelihood of the unlabeled data via contrastive divergence [37]. For details on learning and inference in CRBMs, see [23].
After training a CRBM, we can use it to compute the posterior of the pooling units given the input data. These pooling unit activations can be used as input to further train the next-layer CRBM. By stacking the CRBMs, the algorithm can capture high-level features, such as hierarchical object-part decompositions. After constructing a convolutional deep belief network, we perform (approximate) inference of the whole network in a feedforward (bottom-up) manner.

2We use real-valued visible units in the first-layer CRBM; however, we use binary-valued visible units when constructing the second-layer CRBM. See [23] for details.
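The probabilistic max-pooling computation described above can be sketched as follows (a minimal illustration for a single filter with non-overlapping blocks; the function names and the numerical-stability shift are ours):

```python
import numpy as np

def pool_activation(I_block):
    """P(p_alpha = 1 | v) for one C x C pooling block, given the
    detection-unit inputs I(h_ij) = b_k + (flipped W_k convolved with v)_ij.
    The block's units compete, via softmax, with a single 'off' state,
    which contributes the 1 in the denominator."""
    m = max(0.0, float(I_block.max()))      # shift for numerical stability
    e = np.exp(I_block - m)
    return float(e.sum() / (np.exp(-m) + e.sum()))

def pool_layer(I, C):
    """Pooling-unit activations over all non-overlapping C x C blocks of a
    single filter's detection input map I (side length divisible by C)."""
    NP = I.shape[0] // C
    out = np.empty((NP, NP))
    for a in range(NP):
        for b in range(NP):
            out[a, b] = pool_activation(I[a*C:(a+1)*C, b*C:(b+1)*C])
    return out
```

With all inputs zero, each 2x2 block has four units tied with the off state, so each pooling unit activates with probability 4/5; strongly negative inputs drive the activation toward zero.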
Specifically, letting $I(h^k_{ij}) \triangleq b_k + (\tilde{W}^k \ast v)_{ij}$, we can infer the pooling unit activations as a softmax function:

$$P(p^k_\alpha = 1 \mid v) = \frac{\sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i'j'}))}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i'j'}))}$$

Given a set of poorly aligned face images, our goal is to iteratively transform each image to reduce the total entropy over the pooling layer outputs of a CDBN applied to each of the images. For a CDBN with K pooling layer groups, we now have K location stacks at each image location (after max-pooling), over a binary distribution for each location stack. Given N unaligned face images, let P be the number of pooling units in each group in the top-most layer of the CDBN. We use the pooling unit probabilities, with the interpretation that the pooling unit can be considered as a mixture of sub-units that are on and off [6]. Letting $p^{k,(n)}_\alpha$ be the pooling unit $\alpha$ in group $k$ for image $n$ under some transformation $T^n$, we define $D^k_\alpha(1) = \frac{1}{N}\sum_{n=1}^{N} p^{k,(n)}_\alpha$ and $D^k_\alpha(0) = 1 - D^k_\alpha(1)$. Then, the entropy for a specific pooling unit is $H(D^k_\alpha) = -\sum_{s \in \{0,1\}} D^k_\alpha(s) \log(D^k_\alpha(s))$. At each iteration of congealing, we find a transformation for each image that decreases the total entropy $\sum_{k=1}^{K} \sum_{\alpha=1}^{P} H(D^k_\alpha)$. Note that if K = 1, this reduces to the traditional congealing formulation on the binary output of the single pooling layer.
3.3 Learning a Topology
As congealing reduces entropy by performing local hill-climbing in the transformation parameters, a key factor in the success of congealing is the smoothness of this optimization landscape.
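The deep congealing objective just defined, the total entropy summed over filters and pooling locations, can be computed as follows (a minimal sketch; the array shape and names are our conventions, and in the full algorithm each image's similarity transformation is updated by local search to decrease this quantity):

```python
import numpy as np

def total_pooling_entropy(P_pool):
    """Deep congealing objective: the sum over filters k and pooling
    locations alpha of H(D_k_alpha), where D_k_alpha(1) is the mean
    pooling activation across the N (transformed) images.
    P_pool: array of shape (N, K, P) of pooling probabilities."""
    D1 = np.clip(P_pool.mean(axis=0), 1e-12, 1 - 1e-12)   # D_k_alpha(1)
    H = -(D1 * np.log(D1) + (1.0 - D1) * np.log(1.0 - D1))
    return float(H.sum())
```

A set of images whose pooling maps agree has near-zero objective, while images whose maps disagree location-by-location push each $D^k_\alpha(1)$ toward 0.5 and the objective toward its maximum.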
In SIFT\ncongealing, smoothness is achieved through soft clustering and the properties of the SIFT descriptor.\nSpeci\ufb01cally, to compute the descriptor, the gradient is computed at each pixel location and added\nto a weighted histogram over a \ufb01xed number of angles. The histogram bins have a natural circular\ntopology. Therefore, the gradient at each location contributes to two neighboring histogram bins,\nweighted using linear interpolation. This leads to a smoother optimization landscape when congeal-\ning. For instance, if a face is rotated a fraction of the correct angle to put it into a good alignment,\nthere will be a corresponding partial decrease in entropy due to this interpolated weighting.\nIn contrast, there is no topology on the \ufb01lters produced using standard learning of a CRBM. This\nmay lead to plateaus or local minima in the optimization landscape with congealing, for instance,\nif one \ufb01lter is a small rotation of another \ufb01lter, and a rotation of the image causes a section of the\nface to be between these two \ufb01lters. This problem may be particularly severe for \ufb01lters learned at\ndeeper layers of a CDBN. For instance, a second-layer CDBN trained on face images would likely\nlearn multiple \ufb01lters that resemble eye detectors, capturing slightly different types and scales of\neyes. If these \ufb01lters are activating independently, then the resulting entropy of a set of images may\nnot decrease even if eyes in different images are brought into closer alignment.\nA CRBM is generally trained with sparsity regularization [38], such that each \ufb01lter responds to\na sparse set of input stimuli. A smooth optimization for congealing requires that, as an image\npatch is transformed from one such sparse set to another, the change in pooling unit activations is\nalso gradual rather than abrupt. 
Therefore, we would like to learn filters with a linear topological ordering, such that when a particular pooling unit $p^k_\alpha$ at location $\alpha$ and associated with filter $k$ is activated, the pooling units at the same location, associated with nearby filters, i.e., $p^{k'}_\alpha$ for $k'$ close to $k$, will also have partial activation. To learn a topology on the learned filters, we add the following group sparsity penalty to the learning objective function (i.e., negative log-likelihood):

$$L_{\text{sparsity}} = \lambda \sum_{k,\alpha} \sqrt{\sum_{k'} \omega_{k'-k} \, (p^{k'}_\alpha)^2},$$

where $\omega_d$ is a Gaussian weighting, $\omega_d \propto \exp(-\frac{d^2}{2\sigma^2})$. Let the term array be used to refer to the set of pooling units associated with a particular filter, i.e., $p^k_\alpha$ for all locations $\alpha$. This regularization penalty is a sum (L1 norm) of L2 norms, each of which is a Gaussian weighting, centered at a particular array, of the pooling units across each array at a specific location. In practice, rather than weighting every array in each summand, we use a fixed kernel covering five consecutive filters, i.e., $\omega_d = 0$ for $|d| > 2$.
The rationale behind such a regularization term is that, unlike an L2 norm, an L1 norm encourages sparsity. This sum of L2 norms thus encourages sparsity at the group level, where a group is a set of Gaussian-weighted activations centered at a particular array. Therefore, if two filters are similar and tend to both activate for the same visible data, a smaller penalty will be incurred if these filters are nearby in the topological ordering, as this will lead to a more sparse representation at the group L2 level. To account for this penalty term, we augment the learning algorithm by taking a step in the negative derivative with respect to the CRBM weights. We define $\alpha(i,j)$ as the pooling location associated with position $(i,j)$, and $J$ as

$$J^{k,k'}_{ij} = \frac{1}{\sqrt{\sum_{k''} \omega_{k''-k'} \, (p^{k''}_{\alpha(i,j)})^2}} \; p^k_{\alpha(i,j)} \left(1 - p^k_{\alpha(i,j)}\right) h^k_{ij}.$$

We can write the full gradient as $\nabla_{W^k} L_{\text{sparsity}} = \lambda \sum_{k'} \omega_{k-k'} (v \ast \tilde{J}^{k,k'})$, where $\ast$ denotes convolution and $\tilde{J}^{k,k'}$ means $J^{k,k'}$ flipped horizontally and vertically. Thus we can efficiently compute the gradient as a sum of convolutions.
Following the procedure given by Sohn et al. [39], we initialize the filters using expectation-maximization under a mixture of Gaussians/Bernoullis, before proceeding with CRBM learning. Therefore, when learning with the group sparsity penalty, we periodically reorder the filters using the following greedy strategy. Taking the first filter, we iteratively add filters one by one to the end of the filter set, picking the filter that minimizes the group sparsity penalty.

(a) Without topology    (b) With topology
Figure 4: Visualization of second-layer filters learned from face images, without topology (left) and with topology (right). By learning with a linear topology, nearby filters (in row-major order) have correlated activations. This leads to filters for particular facial features being grouped together, such as eye detectors at the end of the row third from the bottom.

4 Experiments
We learn three different convolutional DBN models to use as the feature representation for deep congealing. First, we learn a one-layer CRBM from the Kyoto images,3 a standard natural image data set, to evaluate the performance of congealing with self-taught CRBM features.
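The group sparsity penalty of Section 3.3 can be sketched numerically as follows (an illustration only; the function name, array shape, and truncation of the kernel at the ends of the filter ordering are our assumptions):

```python
import numpy as np

def group_sparsity_penalty(P_pool, sigma2=1.0, lam=1.0, kernel_radius=2):
    """L_sparsity = lam * sum_{k,alpha} sqrt( sum_{k'} w_{k'-k} p[k',alpha]^2 )
    with Gaussian weights w_d proportional to exp(-d^2 / (2*sigma2)),
    truncated to |d| <= kernel_radius (five consecutive filters, as in
    the paper).  P_pool has shape (K, P): one row of pooling activations
    per filter."""
    K, P = P_pool.shape
    ds = np.arange(-kernel_radius, kernel_radius + 1)
    w = np.exp(-ds.astype(float) ** 2 / (2.0 * sigma2))
    total = 0.0
    for k in range(K):
        acc = np.zeros(P)
        for d, wd in zip(ds, w):
            kp = k + d
            if 0 <= kp < K:        # filters outside the ordering contribute 0
                acc += wd * P_pool[kp] ** 2
        total += np.sqrt(acc).sum()
    return lam * total
```

As the text argues, two co-activating filters incur a smaller penalty when they are adjacent in the ordering than when they are far apart, which is exactly the incentive that produces the topographic organization of Figure 4.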
Next, we learn\na one-layer CRBM from LFW face images, to compare performance when learning the features\ndirectly on images of the object class to be aligned. Finally, we learn a two-layer CDBN from LFW\nface images, to evaluate performance using higher-order features. For all three models, we also\ncompare learning the weights using the standard sparse CDBN learning, as well as learning with\ngroup sparsity regularization. Visualizations of the top layer weights of the two-layer CDBN are\ngiven in Figure 4, demonstrating the effect of adding the sparsity regularization term.\nWe used K = 32 \ufb01lters for the one-layer models and K = 96 in the top layer of the two-layer\nmodels. During learning, we used a pooling size of 5x5 for the one-layer models and 3x3 in both\nlayers of the two-layer model. We used \u03c32 = 1 in the Gaussian weighting for group sparsity\nregularization. For computing the pooling layer representation to use in congealing, we modi\ufb01ed\nthe pooling size to 3x3 for the one-layer models and 2x2 for the second layer in the two-layer\nmodel, and adjusted the hidden biases to give an expected activation of 0.025 for the hidden units.\nIn Figure 5, we show a selection of images under several alignment methods. Each image is shown\nin its original form, and aligned using SIFT Congealing, Deep Congealing with topology, using a\none-layer and two-layer CDBN trained on faces, and the LFW-a alignment.\nWe evaluate the effect of alignment on veri\ufb01cation accuracy using View 1 of LFW. For the con-\ngealing methods, 400 images from the training set were congealed and used to form a funnel to\nsubsequently align all of the images in both the training and test sets. To obtain veri\ufb01cation accu-\nracy, we use a variation on the method of Cosine Similarity Metric Learning (CSML) [40], one of\nthe top-performing methods on LFW. As in CSML, we \ufb01rst apply whitening PCA and reduce the\nrepresentation to 500 dimensions. 
We then normalize each image feature vector, and apply a linear SVM to an image pair by combining the image feature vectors using element-wise multiplication.

3http://www.cnbc.cmu.edu/cplab/data_kyoto.html

Figure 5: Sample images from LFW produced by different alignment algorithms. For each set of five images, the alignments are, from left to right: original images; SIFT Congealing; Deep Congealing, Faces, layer 1, with topology; Deep Congealing, Faces, layer 2, with topology; Supervised (LFW-a).

Note that if the weights of the SVM are 1 and the bias is 0, then this is equivalent to cosine similarity. We find that this procedure yields comparable accuracy to CSML but is much faster and less sensitive to the regularization parameters.4 As our goal is to improve verification accuracy through better alignment, we focus on performance using a single feature representation, and only use the square-root LBP features [40, 41] on 150x80 croppings of the full LFW images.
Table 1 gives the verification accuracy for this verification system using images produced by a number of alignment algorithms. Deep congealing gives a significant improvement over SIFT congealing. Using a CDBN representation learned with a group sparsity penalty, leading to learned filters with topographic organization, consistently improves accuracy by one to two percentage points. We compare with two supervised alignment systems: the fiducial points based system of [3],5 and LFW-a. Note that LFW-a was produced by a commercial alignment system, in the spirit of [3], but with important differences that have not been published [2].
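The verification pipeline described above, in its cosine-similarity special case (unit SVM weights, zero bias), can be sketched as follows (a simplified illustration; the whitening details and all names are our assumptions, and in the full system a linear SVM trained on pair features replaces the unit-weight scoring):

```python
import numpy as np

def whiten_pca(X, n_components):
    """Fit a whitening PCA on rows of X (one feature vector per image):
    project onto the top principal directions and rescale each to
    unit variance."""
    mu = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
    # columns of W: principal directions divided by per-component std dev
    W = Vt[:n_components].T / (S[:n_components] / np.sqrt(len(X) - 1))
    return mu, W

def pair_features(x1, x2, mu, W):
    """Project, L2-normalize, and combine a pair element-wise; a linear SVM
    with unit weights and zero bias applied to this vector is exactly the
    cosine similarity of the two normalized projections."""
    z1 = (x1 - mu) @ W
    z2 = (x2 - mu) @ W
    z1 = z1 / np.linalg.norm(z1)
    z2 = z2 / np.linalg.norm(z2)
    return z1 * z2

def cosine_score(x1, x2, mu, W):
    return float(pair_features(x1, x2, mu, W).sum())
```

The score is symmetric in the pair and bounded in [-1, 1], with identical images scoring exactly 1 after normalization.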
Congealing with a one-layer CDBN6\ntrained on faces, with topology, gives veri\ufb01cation accuracy signi\ufb01cantly higher than using images\nproduced by [3], and comparable to the accuracy using LFW-a images.\nMoreover, we can combine the veri\ufb01cation scores using images from the one-layer and two-layer\nCDBN trained on faces, learning a second SVM on these scores. By doing so, we achieve a further\ngain in veri\ufb01cation performance, achieving an accuracy of 0.831, exceeding the accuracy using\nLFW-a. This suggests that the two-layer CDBN alignment is somewhat complementary to the one-\nlayer alignment. That is, although the two-layer CDBN alignment produces a lower veri\ufb01cation\naccuracy, it is not strictly worse than the one-layer CDBN alignment for all images, but rather\nis aligning according to a different set of statistics, and achieves success on a different subset of\nimages than the one-layer CDBN model. As a control, we performed the same score combination\nusing the scores produced from images from the one-layer CDBN alignment trained on faces, with\ntopology, and the original images. This gave a veri\ufb01cation accuracy of 0.817, indicating that the\nimprovement from combining two-layer scores is not merely obtained from using two different sets\nof alignments.\n\n4We note that the accuracy published in [40] was higher than we were able to obtain in our own imple-\nmentation. After communicating with the authors, we found that they used a different training procedure than\ndescribed in the paper, which we believe inadvertently uses some test data as training, due to View 1 and View\n2 of LFW not being mutually exclusive. 
Following the training procedure detailed in the paper, which we view to be correct, we find the accuracy to be about 3% lower than the published results.

5Using code available at http://www.robots.ox.ac.uk/~vgg/research/nface/
6Technically speaking, the term "one-layer CDBN" denotes a CRBM.

Table 1: Unconstrained face verification accuracy on View 1 of LFW using images produced by different alignment algorithms. By combining the classifier scores produced by layers 1 and 2 using a linear SVM, we achieve higher accuracy using unsupervised alignment than obtained using the widely used LFW-a images, generated using a commercial supervised fiducial-points algorithm.

Alignment                                                    Accuracy
Original                                                     0.742
SIFT Congealing                                              0.758
Deep Congealing, Kyoto, layer 1                              0.807
Deep Congealing, Kyoto, layer 1, with topology               0.815
Deep Congealing, Faces, layer 1                              0.802
Deep Congealing, Faces, layer 1, with topology               0.820
Deep Congealing, Faces, layer 2                              0.780
Deep Congealing, Faces, layer 2, with topology               0.797
Combining Scores of Faces, layers 1 and 2, with topology     0.831
Fiducial Points-based Alignment [3] (supervised)             0.805
LFW-a (commercial)                                           0.823

5 Conclusion
We have shown how to combine unsupervised joint alignment with unsupervised feature learning. By congealing on the pooling layer representation of a CDBN, we are able to achieve significant gains in verification accuracy over existing methods for unsupervised alignment. By adding a group sparsity penalty to the CDBN learning algorithm, we can learn filters with a linear topology, providing a smoother optimization landscape for congealing. Using face images aligned by this method, we obtain higher verification accuracy than the supervised fiducial points based method of [3].
Further, despite being unsupervised, our method still achieves accuracy comparable to that of the widely used LFW-a images, which were produced by a commercial fiducial-point-based alignment system whose detailed procedure is unpublished. We believe that our proposed method is an important step toward generic alignment systems that do not require domain-specific fiducial points.

References
[1] L. Wolf, T. Hassner, and Y. Taigman. Similarity scores based on background samples. In ACCV, 2009.
[2] Y. Taigman, L. Wolf, and T. Hassner. Multiple one-shots for utilizing class label information. In BMVC, 2009.
[3] M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" - automatic naming of characters in TV video. In BMVC, 2006.
[4] T. L. Berg, A. C. Berg, M. Maire, R. White, Y. W. Teh, E. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In CVPR, 2004.
[5] Y. Zhou, L. Gu, and H.-J. Zhang. Bayesian tangent shape model: Estimating shape and pose parameters via Bayesian inference. In CVPR, 2003.
[6] E. Learned-Miller. Data driven image models through continuous joint alignment. PAMI, 2005.
[7] E. Miller, N. Matsakis, and P. Viola. Learning from one example through shared densities on transforms. In CVPR, 2000.
[8] L. Zollei, E. Learned-Miller, E. Grimson, and W. Wells. Efficient population registration of 3D data. In Workshop on Computer Vision for Biomedical Image Applications: Current Techniques and Future Trends, at ICCV, 2005.
[9] E. Learned-Miller and V. Jain. Many heads are better than one: Jointly removing bias from multiple MRIs using nonparametric maximum likelihood. In Proceedings of Information Processing in Medical Imaging, pages 615–626, 2005.
[10] G. B. Huang, V. Jain, and E. Learned-Miller. Unsupervised joint alignment of complex images. In ICCV, 2007.
[11] D. G. Lowe. Distinctive image features from scale-invariant keypoints.
IJCV, 60(2):91–110, 2004.
[12] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[13] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.
[14] M. Ranzato, Y.-L. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In NIPS, 2007.
[15] M. Cox, S. Lucey, S. Sridharan, and J. Cohn. Least squares congealing for unsupervised alignment of images. In CVPR, 2008.
[16] M. Cox, S. Sridharan, S. Lucey, and J. Cohn. Least squares congealing for large numbers of images. In ICCV, 2009.
[17] X. Liu, Y. Tong, and F. W. Wheeler. Simultaneous alignment and clustering for an image ensemble. In ICCV, 2009.
[18] M. A. Mattar, A. R. Hanson, and E. G. Learned-Miller. Unsupervised joint alignment and clustering using Bayesian nonparametrics. In UAI, 2012.
[19] J. Zhu, L. V. Gool, and S. C. Hoi. Unsupervised face alignment by nonrigid mapping. In ICCV, 2009.
[20] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[21] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, 2007.
[22] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, 2010.
[23] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 54(10):95–103, 2011.
[24] J. Yang, K. Yu, Y. Gong, and T. S. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801, 2009.
[25] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[26] G. B. Huang, H.
Lee, and E. Learned-Miller. Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR, 2012.
[27] H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, 2009.
[28] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
[29] R. Salakhutdinov and G. E. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50:969–978, 2009.
[30] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.
[31] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. MIT Press, 2006.
[32] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Technical report, University of Wisconsin, 2004.
[33] A. Hyvärinen, P. O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527–1558, 2001.
[34] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In CVPR, 2009.
[35] M. Norouzi, M. Ranjbar, and G. Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In CVPR, pages 2735–2742, 2009.
[36] Y.-L. Boureau, F. R. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR, 2010.
[37] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[38] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In NIPS, 2008.
[39] K. Sohn, D. Y. Jung, H. Lee, and A. O. Hero III.
Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.
[40] H. V. Nguyen and L. Bai. Cosine similarity metric learning for face verification. In ACCV, 2010.
[41] T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29(1):51–59, 1996.