{"title": "Learning invariant representations and applications to face verification", "book": "Advances in Neural Information Processing Systems", "page_first": 3057, "page_last": 3065, "abstract": "One approach to computer object recognition and modeling the brain's ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation-invariance, we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model's wiring can be learned from videos of transforming objects---or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions for the case of 2D affine transformations. Next, we apply the model to non-affine transformations: as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter transformations'' which map an image of a face on one background to an image of the same face on a different background. 
Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig and a new dataset we gathered---achieving strong performance in these highly unconstrained cases as well.\"", "full_text": "Learning invariant representations and applications to face verification

Qianli Liao, Joel Z Leibo, and Tomaso Poggio
Center for Brains, Minds and Machines
McGovern Institute for Brain Research
Massachusetts Institute of Technology
Cambridge MA 02139
lql@mit.edu, jzleibo@mit.edu, tp@ai.mit.edu

Abstract

One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation-invariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction.
Surprisingly, it can also tolerate “clutter transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well.

1 Introduction

In the real world, two images of the same object may only be related by a very complicated and highly nonlinear transformation. Far beyond the well-studied 2D affine transformations, objects may rotate in depth, receive illumination from new directions, or become embedded on different backgrounds; they might even break into pieces or deform—melting like Salvador Dali’s pocket watch [5]—and still maintain their identity. Two images of the same face could be related by the transformation from frowning to smiling or from youth to old age. This notion of an identity-preserving transformation is considerably more expansive than those normally considered in computer vision. We argue that there is much to be gained from pushing the theory (and practice) of transformation-invariant recognition to accommodate this unconstrained notion of a transformation.

Throughout this paper we use the formalism for describing transformation-invariant hierarchical architectures developed by Poggio et al. (2012). In [1], the authors propose a theory which, they argue, is general enough to explain the strong performance of convolutional architectures across a wide range of tasks (e.g. [6, 7, 8]) and possibly also the ventral stream. The theory is based on the premise that invariance to identity-preserving transformations is the crux of object recognition.

The present paper has two primary points.
First, we provide empirical support for Poggio et al.’s theory of invariance (which we review in section 2) and show how various pooling methods for convolutional networks can all be understood as building invariance, since they are all equivalent to special cases of the model we study here. We also measure the model’s invariance/discriminability with face-matching tasks. Our use of computer-generated image datasets lets us completely control the transformations appearing in each test, thereby allowing us to measure properties of the representation for each transformation independently. We find that the representation performs well even when it is applied to transformations for which there are no theoretical guarantees—e.g., the clutter “transformation” which maps an image of a face on one background to the same face on a different background.

Motivated by the empirical finding of strong performance with far less constrained transformations than those captured by the theory, in the paper’s second half we apply the same approach to face-verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4], and a new dataset we gathered. All of these datasets consist of photographs taken under natural conditions (gathered from the internet). We find that, despite the use of a very simple classifier—thresholding the angle between face representations—our approach still achieves results that compare favorably with the current state of the art and even exceed it in some cases.

2 Template-based invariant encodings for objects unseen during training

We conjecture that achieving invariance to identity-preserving transformations without losing discriminability is the crux of object recognition.
In the following we will consider a very expansive notion of “transformation”, but first, in this section, we develop the theory for 2D affine transformations¹.

Our aim is to compute a unique signature for each image x that is invariant with respect to a group of transformations G. We consider the orbit {gx | g ∈ G} of x under the action of the group. In this section, G is the 2D affine group, so its elements correspond to translations, scalings, and in-plane rotations of the image (notice that we use g to denote both elements of G and their representations, acting on vectors). We regard two images as equivalent if they are part of the same orbit, that is, if they are transformed versions of one another (x′ = gx for some g ∈ G).

The orbit of an image is itself invariant with respect to the group. For example, the set of images obtained by rotating x is exactly the same as the set of images obtained by rotating gx. The orbit is also unique for each object: the set of images obtained by rotating x only intersects with the set of images obtained by rotating x′ when x′ = gx. Thus, an intuitive method of obtaining an invariant signature for an image, unique to each object, is just to check which orbit it belongs to. We can assume access to a stored set of orbits of template images τ_k; these template orbits could have been acquired by unsupervised learning—possibly by observing objects transform and associating temporally adjacent frames (e.g. [9, 10]).

The key fact enabling this approach to object recognition is this: it is not necessary to have all the template orbits beforehand. Even with a small, sampled set of template orbits, not including the actual orbit of x, we can still compute an invariant signature. Observe that when g is unitary, ⟨gx, τ_k⟩ = ⟨x, g⁻¹τ_k⟩.
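This identity is easy to check numerically. Below is a minimal sketch (ours, not the paper’s code) using a 1-D circular shift as a stand-in for a unitary transformation g; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(v, s):
    """A stand-in 'transformation' g_s: circular shift by s pixels.
    Shifts are unitary (they permute coordinates), so they preserve inner products."""
    return np.roll(v, s)

x = rng.standard_normal(64)    # test image, flattened to a vector
tau = rng.standard_normal(64)  # stored template

s = 13
lhs = np.dot(g(x, s), tau)     # <g x, tau>
rhs = np.dot(x, g(tau, -s))    # <x, g^-1 tau>
assert np.isclose(lhs, rhs)

# The orbit is itself invariant: the set of all shifts of x equals
# the set of all shifts of any shifted copy of x.
orbit = lambda v: {tuple(g(v, t)) for t in range(len(v))}
assert orbit(x) == orbit(g(x, 5))
```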
That is, the inner product of the transformed image with a template is the same as the inner product of the image with a transformed template. This is true regardless of whether x is in the orbit of τ_k or not. In fact, the test image need not resemble any of the templates (see [11, 12, 13, 1]).

Consider g_tτ_k to be a realization of a random variable. For a set {g_tτ_k | t = 1, ..., T} of images sampled from the orbit of the template τ_k, the distribution of ⟨x, g_tτ_k⟩ is invariant and unique to each object. See [1] for a proof of this fact in the case that G is the group of 2D affine transformations.

¹ See [1] for a more complete exposition of the theory.

Thus, the empirical distribution of the inner products ⟨x, g_tτ_k⟩ is an estimate of an invariant. Following [1], we can use the empirical distribution function (CDF) as the signature:

    μ_n^k(x) = (1/T) Σ_{t=1}^{T} σ(⟨x, g_tτ_k⟩ + nΔ)    (1)

where σ is a smooth version of the step function (σ(x) = 0 for x ≤ 0, σ(x) = 1 for x > 0), Δ is the resolution (bin-width) parameter, and n = 1, ..., N. Figure 1 shows the results of an experiment demonstrating that the μ_n^k(x) are invariant to translation and in-plane rotation. Since each face has its own characteristic empirical distribution function, it also shows that these signatures could be used to discriminate between them. Table 1 reports the average Kolmogorov-Smirnov (KS) statistics comparing signatures for images of the same face and for different faces: Mean(KS_same) ∼ 0 ⟹ invariance, and Mean(KS_different) > 0 ⟹ discriminability.

Figure 1: Example signatures (empirical distribution functions—CDFs) of images depicting two different faces under affine transformations. (A) shows in-plane rotations.
Signatures for the upper and lower face are shown in red and purple, respectively. (B) shows the analogous experiment with translated faces. Note: in order to highlight the difference between the two distributions, the axes do not start at 0.

Since the distribution of the ⟨x, g_tτ_k⟩ is invariant, we have many choices of possible signature. Most notably, we can choose any of its statistical moments and these may also be invariant—or nearly so. In order to be discriminative and “invariant for a task” it only need be the case that for each k, the distributions of the ⟨x, g_tτ_k⟩ have different moments. It turns out that many different convolutional networks can be understood in this framework². The differences between them correspond to different choices of 1. the set of template orbits (which group), 2. the inner product (more generally, we consider the template response function Δ_{gτ_k}(·) := f(⟨·, g_tτ_k⟩), for a possibly non-linear function f—see [1]), and 3. the moment used for the signature. For example, a simple neural-networks-style convolutional net with one convolutional layer and one subsampling layer (no bias term) is obtained by choosing G = translations and μ^k(x) = mean(·). The k-th filter is the template τ_k. The network’s nonlinearity could be captured by choosing Δ_{gτ_k}(x) = tanh(x · gτ_k); note the similarity to Eq. (1). Similar descriptions could be given for modern convolutional nets, e.g. [6, 7, 11]. It is also possible to capture HMAX [14, 15] and related models (e.g. [16]) with this framework.
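The signature of Eq. (1), and the moment-pooling alternatives just described, can be sketched numerically. The snippet below is our own illustration (not the paper’s code); the logistic sigmoid standing in for the smooth step σ, and the particular bin width Δ, slope, and image sizes, are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def cdf_signature(x, template_orbit, n_bins=20, delta=0.1, beta=50.0):
    """Eq. (1): mu_n^k(x) = (1/T) sum_t sigma(<x, g_t tau_k> + n*Delta).
    A logistic sigmoid with slope beta stands in for the smooth step sigma
    (an assumption; the paper does not pin down the exact smoothing)."""
    dots = template_orbit @ x                      # <x, g_t tau_k>, t = 1..T
    thresholds = delta * np.arange(1, n_bins + 1)  # n * Delta, n = 1..N
    z = np.clip(beta * (dots[:, None] + thresholds[None, :]), -500, 500)
    return np.mean(1.0 / (1.0 + np.exp(-z)), axis=0)

# One template's orbit: T transformed copies (here, circular shifts).
tau = rng.standard_normal(64)
template_orbit = np.stack([np.roll(tau, t) for t in range(64)])

x = rng.standard_normal(64)
sig_x = cdf_signature(x, template_orbit)
sig_gx = cdf_signature(np.roll(x, 9), template_orbit)  # a transformed version of x

# Transforming x merely permutes the inner products, so the signature is invariant.
assert np.allclose(sig_x, sig_gx)

# Choosing a statistical moment of the same distribution, instead of its CDF,
# gives the pooling rules named in the text:
mean_pool = np.mean(template_orbit @ x)  # mean pooling (convnet-style subsampling)
max_pool = np.max(template_orbit @ x)    # max pooling (HMAX-style complex cell)
```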
The “simple cells” compute normalized dot products or Gaussian radial basis functions of their inputs with stored templates, and “complex cells” compute, for example, μ^k(x) = max(·). The templates are normally obtained by translation or scaling of a set of fixed patterns, often Gabor functions at the first layer and patches of natural images in subsequent layers.

² The computation can be made hierarchical by using the signature as the input to a subsequent layer.

3 Invariance to non-affine transformations

The theory of [1] only guarantees that this approach will achieve invariance (and discriminability) in the case of affine transformations. However, many researchers have shown good performance of related architectures on object recognition tasks that seem to require invariance to non-affine transformations (e.g. [17, 18, 19]). One possibility is that achieving invariance to affine transformations is itself a larger-than-expected part of the full object recognition problem. While not dismissing that possibility, we emphasize here that approximate invariance to many non-affine transformations can be achieved as long as the system’s operation is restricted to certain nice object classes [20, 21, 22]. A nice class with respect to a transformation G (not necessarily a group) is a set of objects that all transform similarly to one another under the action of G. For example, the 2D transformation mapping a profile view of one person’s face to its frontal view is similar to the analogous transformation of another person’s face in this sense.
The two transformations will not be exactly the same, since any two faces differ in their exact 3D structure, but all faces do approximately share a gross 3D structure, so the transformations of two different faces will not be as different from one another as would, for example, the image transformations evoked by 3D rotation of a chair versus the analogous rotation of a clock. Faces are the prototypical example of a class of objects that is nice with respect to many transformations³.

Figure 2: Example signatures (empirical distribution functions) of images depicting two different faces under non-affine transformations: (A) rotation in depth. (B) changing the illumination direction (lighting from above or below).

Figure 2 shows that, unlike in the affine case, the signature of a test face with respect to template faces at different orientations (3D rotation in depth) or illumination conditions is not perfectly invariant (KS_same > 0), though it still tolerates substantial transformations. These signatures are also useful for discriminating faces, since the empirical distribution functions are considerably more varied between faces than they are across images of the same face (Mean(KS_different) > Mean(KS_same), table 1). Table 2 reports the ratios of within-class discriminability (negatively related to invariance) to between-class discriminability for moment-signatures.
Lower values indicate both better transformation-tolerance and stronger discriminability.

Transformation          Mean(KS_same)   Mean(KS_different)
Translation             0.0000          1.9420
In-plane rotation       0.2160          19.1897
Out-of-plane rotation   2.8698          5.2950
Illumination            1.9636          2.8809

Table 1: Average Kolmogorov-Smirnov statistics comparing the distributions of normalized inner products across transformations and across objects (faces).

Transformation          MEAN     L1       L2       L5       MAX
Translation             0.0000   0.0000   0.0000   0.0000   0.0000
In-plane rotation       0.0031   0.0031   0.0033   0.0042   0.0030
Out-of-plane rotation   0.3045   0.3045   0.3016   0.2923   0.1943
Illumination            0.7197   0.7197   0.6994   0.6405   0.2726

Table 2: Ratios of “within-class discriminability” to “between-class discriminability” for one template, ‖μ(x_i) − μ(x_j)‖₂; within: x_i, x_j depict the same face; between: x_i, x_j depict different faces. Columns are different statistical moments used for pooling (computing μ(x)).

³ It is interesting to consider the possibility that faces co-evolved along with natural visual systems in order to be highly recognizable.

4 Towards the fully unconstrained task

The finding that this templates-and-signatures approach works well even in the difficult cases of 3D-rotation and illumination motivates us to see how far we can push it. We would like to accommodate a totally-unconstrained notion of invariance to identity-preserving transformations. In particular, we investigate the possibility of computing signatures that are invariant to all the task-irrelevant variability in the datasets used for serious computer vision benchmarks. In the present paper we focus on the problem of face-verification (also called pair-matching).
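The within/between ratio reported in Table 2 can be sketched on toy data. The snippet below is illustrative only: random vectors stand in for rendered faces and circular shifts for the transformation group. It reproduces the qualitative pattern of the Translation row: for an exact group, the within-class distance of mean-pooled signatures is numerically zero.

```python
import numpy as np

rng = np.random.default_rng(2)
D, T, K = 64, 64, 10  # image dimension, orbit samples, number of templates

templates = rng.standard_normal((K, D))
# Each template's orbit under the (exact) group of circular shifts: shape (K, T, D).
orbits = np.stack([np.stack([np.roll(tk, t) for t in range(T)]) for tk in templates])

def signature(x, moment=np.mean):
    """Pooled signature: one moment of the <x, g_t tau_k> per template k."""
    return np.array([moment(orbits[k] @ x) for k in range(K)])

# Two "objects", each seen under two transformations.
a, b = rng.standard_normal(D), rng.standard_normal(D)
sig = {name: signature(v) for name, v in
       {"a0": a, "a1": np.roll(a, 7), "b0": b, "b1": np.roll(b, 21)}.items()}

within = np.linalg.norm(sig["a0"] - sig["a1"]) + np.linalg.norm(sig["b0"] - sig["b1"])
between = np.linalg.norm(sig["a0"] - sig["b0"])

# For an exact group, mean-pooled signatures are exactly invariant, so the
# within/between ratio vanishes, as in Table 2's Translation row.
assert within / between < 1e-6
```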
Given two images of new faces, never encountered during training, the task is to decide if they depict the same person or not.

We used the following procedure to test the templates-and-signatures approach on face verification problems using a variety of different datasets (see fig. 4A). First, all images were preprocessed with low-level features (e.g., histograms of oriented gradients (HOG) [23]), followed by PCA using all the images in the training set and z-score-normalization⁴. At test time, the k-th element of the signature of an image x is obtained by first computing all the ⟨x, g_tτ_k⟩, where g_tτ_k is the t-th image of the k-th template person—both encoded by their projection onto the training set’s principal components—then pooling the results. We used ⟨·,·⟩ = normalized dot product and μ^k(x) = mean(·).

At test time, the classifier receives images of two faces and must classify them as either depicting the same person or not. We used a simple classifier that merely computes the angle between the signatures of the two faces (via a normalized dot product) and responds “same” if it is above a fixed threshold or “different” if below threshold. We chose such a weak classifier since the goal of these simulations was to assess the value of the signature as a feature representation. We expect that the overall performance levels could be improved for most of these tasks by using a more sophisticated classifier⁵. We also note that, after extracting low-level features, the entire system only employs two operations: normalized dot products and pooling.

The images in the Labeled Faces in the Wild (LFW) dataset vary along so many different dimensions that it is difficult to give an exhaustive list.
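The procedure just described can be sketched end-to-end. This is a minimal illustration, not the released implementation: raw random vectors stand in for HOG features, and the dataset sizes, the number of principal components, and the 0.9 threshold are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def zscore(A, mu, sd):
    return (A - mu) / sd

def normalized_dot(u, V):
    """Cosine similarity of vector u with each row of V."""
    return (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u) + 1e-12)

# "Training" images of the template people. 256-d random vectors stand in for
# HOG descriptors; 20 people x 15 images each, as in the ~15-per-face regime.
train = rng.standard_normal((300, 256))
person_ids = np.repeat(np.arange(20), 15)

mu, sd = train.mean(0), train.std(0) + 1e-12
pcs = np.linalg.svd(zscore(train, mu, sd), full_matrices=False)[2][:40]  # top 40 PCs

def encode(x):
    return pcs @ zscore(x, mu, sd)

train_enc = np.array([encode(x) for x in train])

def signature(x):
    """k-th element: mean normalized dot product with person k's encoded images."""
    e = encode(x)
    return np.array([np.mean(normalized_dot(e, train_enc[person_ids == k]))
                     for k in range(20)])

def same_person(x1, x2, threshold=0.9):
    """Threshold the angle (normalized dot product) between the two signatures."""
    s1, s2 = signature(x1), signature(x2)
    cos = s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2) + 1e-12)
    return cos > threshold

# Two mildly perturbed views of one unseen "face" should match.
face = rng.standard_normal(256)
view1 = face + 0.05 * rng.standard_normal(256)
view2 = face + 0.05 * rng.standard_normal(256)
assert same_person(view1, view2)
```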
It contains natural variability in, at least, pose, lighting, facial expression, and background [2] (example images in fig. 3). We argue here that LFW and the controlled synthetic data problems we studied up to now are different in two primary ways.

First, in unconstrained tasks like LFW, you cannot rely on having seen all the transformations of any template. Recall, the theory of [1] relies on previous experience with all the transformations of template images in order to recognize test images invariantly to the same transformations. Since LFW is totally unconstrained, any subset of it used for training will never contain all the transformations that will be encountered at test time. Continuing to abuse the notation from section 2, we can say that the LFW database only samples a small subset of G, which is now the set of all transformations that occur in LFW. That is, for any two images in LFW, x and x′, only a small (relative to |G|) subset of their orbits are in LFW. Moreover, {g | gx ∈ LFW} and {g′ | g′x′ ∈ LFW} almost surely do not overlap with one another⁶.

The second important way in which LFW differs from our synthetic image sets is the presence of clutter. Each LFW face appears on many different backgrounds. It is common to consider clutter to be a separate problem from that of achieving transformation-invariance; indeed, [1] conjectures that the brain employs separate mechanisms, quite different from templates and pooling—e.g.

⁴ PCA reduces the final algorithm’s memory requirements. Additionally, it is much more plausible that the brain could store principal components than directly memorizing frames of past visual experience. A network of neurons with Hebbian synapses (modeled by Oja’s rule)—changing its weights online as images are presented—converges to the network that projects new inputs onto the eigenvectors of its past input’s covariance [24].
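The convergence claim in footnote 4 is easy to demonstrate. A minimal sketch of Oja’s rule (illustrative data and learning rate; not the paper’s code):

```python
import numpy as np

rng = np.random.default_rng(4)

# Inputs with one dominant direction of variance (the first axis).
n, d = 8000, 10
X = rng.standard_normal((n, 1)) * np.r_[2.0, np.zeros(d - 1)] \
    + 0.1 * rng.standard_normal((n, d))

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
eta = 0.002
for x in X:                      # present "frames" online
    y = w @ x                    # Hebbian response
    w += eta * y * (x - y * w)   # Oja's rule: Hebb term with built-in normalization

# The weight vector aligns (up to sign) with the top eigenvector of the
# input covariance, i.e. the first principal component.
top = np.linalg.eigh(np.cov(X.T))[1][:, -1]
assert abs(w @ top) > 0.95
```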
See also [1] for discussion of this point in the context of the templates-and-signatures approach.

attention—toward achieving clutter-tolerance. We set aside those hypotheses for now since the goal of the present work is to explore the limits of the totally unconstrained notion of identity-preserving transformation. Thus, for the purposes of this paper, we consider background-variation as just another transformation.

⁵ Our classifier is unsupervised in the sense that it doesn’t have any free parameters to fit on training data. However, our complete system is built using labeled data for the templates, so from that point of view it may be considered supervised. On the other hand, we also believe that it could be wired up by an unsupervised process—probably involving the association of temporally-adjacent frames—so there is also a sense in which the entire system could be considered, at least in principle, to be unsupervised. We might say that, insofar as our system models the ventral stream, we intend it as a (strong) claim about what the brain could learn via unsupervised mechanisms.

⁶ The brain also has to cope with sampling, and its effects can be strikingly counterintuitive. For example, Afraz et al. showed that perceived gender of a face is strongly biased toward male or female at different locations in the visual field, and that the spatial pattern of these biases was distinctive and stable over time for each individual [25]. These perceptual heterogeneity effects could be due to the templates supporting the task differing in the precise positions (transformations) at which they were encountered during development.
That is, “clutter-transformations” map images of an object on one background to images of the same object on different backgrounds.

We explicitly tested the effects of non-uniform transformation-sampling and background-variation using two new fully-controlled synthetic image sets for face-verification⁷. Figure 3B shows the results of the test of robustness to non-uniform transformation-sampling for 3D rotation-in-depth-invariant face verification. It shows that the method tolerates substantial differences between the transformations used to build the feature representation and the transformations on which the system is tested. We tested two different models of natural non-uniform transformation sampling: in one case (blue curve) we sampled the orbits at a fixed rate when preparing templates; in the other case, we removed connected subsets of each orbit. In both cases the test used the entire orbit and never contained any of the same faces as the training phase. It is arguable which case is a better model of the real situation, but we note that even in the worst case, performance is surprisingly high—even with large percentages of the orbit discarded. Figure 3C shows that signatures produced by pooling over clutter conditions give good performance on a face-verification task with faces embedded on backgrounds. Using templates with the appropriate background size for each test, we show that our models continue to perform well as we increase the size of the background, while the performance of standard HOG features declines.

Figure 3: (A) Example images from Labeled Faces in the Wild. (B) Non-uniform sampling simulation. The abscissa is the percentage of frames discarded from each template’s transformation sequence; the ordinate is the accuracy on the face verification task. (C) Pooling over variation in the background.
The abscissa is the background size (10 scales), and the ordinate is the area under the ROC curve (AUC) for the face verification task.

5 Computer vision benchmarks: LFW, PubFig, and SUFR-W

An implication of the argument in sections 2 and 4 is that there needs to be a reasonable number of images sampled from each template’s orbit. Despite the fact that we are now considering a totally unconstrained set of transformations, i.e. any number of samples is going to be small relative to |G|, we found that approximately 15 images g_tτ_k per face is enough for all the face verification tasks we considered. 15 is a surprisingly manageable number; however, it is still more images than LFW has for most individuals. We also used the PubFig83 dataset, which has the same problem as LFW, and a subset of the original PubFig dataset. In order to ensure we would have enough images from each template orbit, we gathered a new dataset—SUFR-W⁸—with ∼12,500 images depicting 450 individuals. The new dataset contains similar variability to LFW and PubFig but tends to have more images per individual than LFW (there are at least 15 images of each individual). The new dataset does not contain any of the same individuals that appear in either LFW or PubFig/PubFig83.

⁷ We obtained 3D models of faces from FaceGen (Singular Inversions Inc.) and rendered them with Blender (www.blender.org).

⁸ See paper [26] for details. Data available at http://cbmm.mit.edu/

[Figure 3 panels: (A) LFW images; (B) non-uniform sampling (percentage discarded vs. accuracy; non-consecutive and consecutive curves); (C) background variation task (background size vs. AUC; our model vs. HOG).]

Figure 4: (A) Illustration of the model’s processing pipeline. (B) ROC curves for the new dataset using templates from the training set. The second model (red) is a control model that uses HOG features directly.
The third (control) model pools over random images in the dataset (as opposed to images depicting the same person). The fourth model pools over random noise images.

Figure 5: (A) The complete pipeline used for all experiments. (B) The performance of four different models on PubFig83, our new dataset, PubFig, and LFW. For these experiments, Local Binary Patterns (LBP), Local Phase Quantization (LPQ), and Local Ternary Patterns (LTP) were used [27, 28, 29]; they all perform very similarly to HOG—just slightly better (∼1%). These experiments used non-detected and non-aligned face images as inputs—thus the errors include detection and alignment errors (about 1.5% of faces are not detected and 6-7% of the detected faces are significantly misaligned). In all cases, templates were obtained from our new dataset (excluding 30 images for a testing set). This sacrifices some performance (∼1%) on each dataset but prevents overfitting: we ran the exact same model on all 4 datasets. (C) The ROC curves of the best model in each dataset.

Figure 4B shows ROC curves for face verification with the new dataset. The blue curve is our model. The purple and green curves are control experiments that pool over images depicting different individuals and random noise templates, respectively. Both control models performed worse than raw HOG features (red curve).

For all our PubFig, PubFig83 and LFW experiments (Fig. 5), we ignored the provided training data. Instead, we obtained templates from our new dataset. For consistency, we applied the same detection/alignment to all images.
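The figures report accuracy and area under the ROC curve (AUC). For reference, AUC can be computed directly from same-pair and different-pair similarity scores via the Mann-Whitney statistic; the scores below are hypothetical stand-ins, not the paper’s data.

```python
import numpy as np

rng = np.random.default_rng(5)

def roc_auc(pos, neg):
    """AUC via the Mann-Whitney statistic: the probability that a randomly
    chosen 'same' pair scores above a randomly chosen 'different' pair."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Hypothetical similarity scores (signature angles) for verification pairs.
same_scores = rng.normal(0.7, 0.15, size=500)  # same-person pairs
diff_scores = rng.normal(0.4, 0.15, size=500)  # different-person pairs

auc = roc_auc(same_scores, diff_scores)
assert 0.8 < auc < 1.0

# Perfectly separated scores give AUC = 1.
assert roc_auc([0.9, 0.8], [0.1, 0.2]) == 1.0
```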
The alignment method we used ([30]) produced images that were somewhat more variable than the method used by the authors of the LFW dataset (LFW-a)—the performance of our simple classifier using raw HOG features on LFW is 73.3%, while on LFW-a it is 75.6%.

Even with the very simple classifier, our system’s performance still compares favorably with the current state of the art. In the case of LFW, our model’s performance exceeds the current state-of-the-art for an unsupervised system (86.2% using LQP—Local Quantized Patterns [31]—note: these features are not publicly available; otherwise we would have tried using them for preprocessing), though the best supervised systems do better⁹. The strongest result in the literature for face verification with PubFig83¹⁰ is 70.2% [4]—which is 6.2% lower than our best model.

[Figure 4 panels: (B) ROC legend — Our Model, AUC 0.817; HOG, AUC 0.707; Our Model w/ scrambled identities, AUC 0.681; Our Model w/ random noise templates, AUC 0.649. Figure 5 panels: (A) pipeline — inputs, HOG features, PCA projection, normalized dot products with templates, pooled signatures (histogram and/or statistical moments, e.g. mean pooling), threshold on normalized dot product; (B) accuracies (%) for LBP, LBP Signatures, LPQ+LBP+LTP, and LPQ+LBP+LTP Signatures on PubFig83, our data, PubFig, and LFW; (C) ROC curves — LFW AUC 0.937, Acc. 87.1%; PubFig AUC 0.897, Acc. 81.7%; Our data AUC 0.856, Acc. 78.0%; PubFig83 AUC 0.847, Acc. 76.4%.]

6 Discussion

The templates-and-signatures approach to recognition permits many seemingly-different convolutional networks (e.g.
ConvNets and HMAX) to be understood in a common framework. We have argued here that the recent strong performance of convolutional networks across a variety of tasks (e.g., [6, 7, 8]) is explained because all these problems share a common computational crux: the need to achieve representations that are invariant to identity-preserving transformations.

We argued that when studying invariance, the appropriate mathematical objects to consider are the orbits of images under the action of a transformation and their associated probability distributions. The probability distributions (and hence the orbits) can be characterized by one-dimensional projections—thus justifying the choice of the empirical distribution function of inner products with template images as a representation for recognition. In this paper, we systematically investigated the properties of this representation for two affine and two non-affine transformations (tables 1 and 2). The same probability distribution could also be characterized by its statistical moments. Interestingly, we found that when we considered more difficult tasks in the second half of the paper, representations based on statistical moments tended to outperform the empirical distribution function. There is a sense in which this result is surprising, since the empirical distribution function contains more invariant “information” than the moments; on the other hand, it could also be expected that the moments ought to be less noisy estimates of the underlying distribution. This is an interesting question for further theoretical and experimental work.

Unlike most convolutional networks, our model has essentially no free parameters. In fact, the pipeline we used for most experiments actually has no operations at all besides normalized dot products and pooling (also PCA when preparing templates). These operations are easily implemented by neurons [32].
We could interpret the former as the operation of \u201csimple cells\u201d and the latter as \u201ccomplex cells\u201d, thus obtaining a view of the ventral stream similar to the one given by [33, 16, 14] (and many others).\nDespite the classifier\u2019s simplicity, our model\u2019s strong performance on face verification benchmark tasks is quite encouraging (Fig. 5). Future work could extend this approach to other objects and other tasks.\nAcknowledgments This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.\nReferences\n[1] T. Poggio, J. Mutch, F. Anselmi, J. Z. Leibo, L. Rosasco, and A. Tacchetti, \u201cThe computational magic of the ventral stream: sketch of a theory (and why some deep architectures work),\u201d MIT-CSAIL-TR-2012-035, 2012.\n[2] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, \u201cLabeled faces in the wild: A database for studying face recognition in unconstrained environments,\u201d in Workshop on Faces in Real-Life Images: Detection, Alignment and Recognition (ECCV), (Marseille, France), 2008.\n[3] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, \u201cAttribute and Simile Classifiers for Face Verification,\u201d in IEEE International Conference on Computer Vision (ICCV), (Kyoto, Japan), pp. 365-372, Oct. 2009.\n[4] N. Pinto, Z. Stone, T. Zickler, and D. D. Cox, \u201cScaling-up Biologically-Inspired Computer Vision: A Case-Study on Facebook,\u201d in IEEE Computer Vision and Pattern Recognition, Workshop on Biologically Consistent Vision, 2011.\n[5] S. Dali, \u201cThe persistence of memory (1931).\u201d Museum of Modern Art, New York, NY.\n[6] A. Krizhevsky, I. Sutskever, and G. Hinton, \u201cImageNet classification with deep convolutional neural networks,\u201d in Advances in Neural Information Processing Systems, vol. 
25, (Lake Tahoe, CA), 2012.\nFootnote 9: Our method of testing does not strictly conform to the protocol recommended by the creators of LFW [2]: we re-aligned the faces (with a method that performs worse than theirs), and we also use the identities of the individuals during training.\nFootnote 10: The original PubFig dataset was only provided as a list of URLs from which the images could be downloaded. Now only half the images remain available. On the original dataset, the strongest performance reported is 78.7% [3]. The authors of that study also made their features available, so we estimated the performance of their features on the available subset of images (using an SVM). We found that an SVM classifier, using their features and our cross-validation splits, gets 78.4% correct, 3.3% lower than our best model.\n[7] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, \u201cApplying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,\u201d in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277-4280, 2012.\n[8] C. F. Cadieu, H. Hong, D. Yamins, N. Pinto, N. J. Majaj, and J. J. DiCarlo, \u201cThe neural representation benchmark and its evaluation on brain and machine,\u201d arXiv preprint arXiv:1301.3530, 2013.\n[9] P. F\u00f6ldi\u00e1k, \u201cLearning invariance from transformation sequences,\u201d Neural Computation, vol. 3, no. 2, pp. 194-200, 1991.\n[10] L. Wiskott and T. Sejnowski, \u201cSlow feature analysis: Unsupervised learning of invariances,\u201d Neural Computation, vol. 14, no. 4, pp. 715-770, 2002.\n[11] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, \u201cWhat is the best multi-stage architecture for object recognition?,\u201d IEEE International Conference on Computer Vision, pp. 2146-2153, 2009.\n[12] J. Z. Leibo, J. Mutch, L. Rosasco, S. Ullman, and T. 
Poggio, \u201cLearning Generic Invariances in Object Recognition: Translation and Scale,\u201d MIT-CSAIL-TR-2010-061, CBCL-294, 2010.\n[13] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng, \u201cOn random weights and unsupervised feature learning,\u201d Proceedings of the International Conference on Machine Learning (ICML), 2011.\n[14] M. Riesenhuber and T. Poggio, \u201cHierarchical models of object recognition in cortex,\u201d Nature Neuroscience, vol. 2, pp. 1019-1025, Nov. 1999.\n[15] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, \u201cRobust Object Recognition with Cortex-Like Mechanisms,\u201d IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 3, pp. 411-426, 2007.\n[16] K. Fukushima, \u201cNeocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,\u201d Biological Cybernetics, vol. 36, pp. 193-202, Apr. 1980.\n[17] Y. LeCun, F. J. Huang, and L. Bottou, \u201cLearning methods for generic object recognition with invariance to pose and lighting,\u201d in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 90-97, IEEE, 2004.\n[18] E. Bart and S. Ullman, \u201cClass-based feature matching across unrestricted transformations,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1618-1631, 2008.\n[19] N. Pinto, Y. Barhomi, D. Cox, and J. J. DiCarlo, \u201cComparing state-of-the-art visual features on invariant object recognition tasks,\u201d in IEEE Workshop on Applications of Computer Vision (WACV), 2011.\n[20] T. Vetter, A. Hurlbert, and T. Poggio, \u201cView-based models of 3D object recognition: invariance to imaging transformations,\u201d Cerebral Cortex, vol. 5, no. 3, p. 261, 1995.\n[21] J. Z. Leibo, J. Mutch, and T. 
Poggio, \u201cWhy The Brain Separates Face Recognition From Object Recognition,\u201d in Advances in Neural Information Processing Systems (NIPS), (Granada, Spain), 2011.\n[22] H. Kim, J. Wohlwend, J. Z. Leibo, and T. Poggio, \u201cBody-form and body-pose recognition with a hierarchical model of the ventral stream,\u201d MIT-CSAIL-TR-2013-013, CBCL-312, 2013.\n[23] N. Dalal and B. Triggs, \u201cHistograms of oriented gradients for human detection,\u201d IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886-893, 2005.\n[24] E. Oja, \u201cSimplified neuron model as a principal component analyzer,\u201d Journal of Mathematical Biology, vol. 15, no. 3, pp. 267-273, 1982.\n[25] A. Afraz, M. V. Pashkam, and P. Cavanagh, \u201cSpatial heterogeneity in the perception of face and form attributes,\u201d Current Biology, vol. 20, no. 23, pp. 2112-2116, 2010.\n[26] J. Z. Leibo, Q. Liao, and T. Poggio, \u201cSubtasks of Unconstrained Face Recognition,\u201d in International Joint Conference on Computer Vision, Imaging and Computer Graphics (VISIGRAPP), (Lisbon, Portugal), 2014.\n[27] T. Ojala, M. Pietik\u00e4inen, and T. M\u00e4enp\u00e4\u00e4, \u201cMultiresolution gray-scale and rotation invariant texture classification with local binary patterns,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.\n[28] X. Tan and B. Triggs, \u201cEnhanced local texture feature sets for face recognition under difficult lighting conditions,\u201d in Analysis and Modeling of Faces and Gestures, pp. 168-182, Springer, 2007.\n[29] V. Ojansivu and J. Heikkil\u00e4, \u201cBlur insensitive texture classification using local phase quantization,\u201d in Image and Signal Processing, pp. 236-243, Springer, 2008.\n[30] X. Zhu and D. 
Ramanan, \u201cFace detection, pose estimation, and landmark localization in the wild,\u201d in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2012.\n[31] S. u. Hussain, T. Napoleon, and F. Jurie, \u201cFace recognition using local quantized patterns,\u201d in Proceedings of the British Machine Vision Conference (BMVC), vol. 1, (Guildford, UK), pp. 52-61, 2012.\n[32] M. Kouh and T. Poggio, \u201cA canonical neural circuit for cortical nonlinear operations,\u201d Neural Computation, vol. 20, no. 6, pp. 1427-1451, 2008.\n[33] D. Hubel and T. Wiesel, \u201cReceptive fields, binocular interaction and functional architecture in the cat\u2019s visual cortex,\u201d The Journal of Physiology, vol. 160, no. 1, p. 106, 1962.\n", "award": [], "sourceid": 1393, "authors": [{"given_name": "Qianli", "family_name": "Liao", "institution": "MIT"}, {"given_name": "Joel", "family_name": "Leibo", "institution": "MIT"}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": "MIT"}]}