{"title": "Max-Margin Invariant Features from Transformed Unlabelled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1439, "page_last": 1447, "abstract": "The study of representations invariant to common transformations of the data is important to learning. Most techniques have focused on local approximate invariance implemented within expensive optimization frameworks lacking explicit theoretical guarantees. In this paper, we study kernels that are invariant to a unitary group while having theoretical guarantees in addressing the important practical issue of the unavailability of transformed versions of labelled data, a problem we call the Unlabeled Transformation Problem, which is a special form of semi-supervised learning and one-shot learning. We present a theoretically motivated alternate approach to the invariant kernel SVM, based on which we propose Max-Margin Invariant Features (MMIF) to solve this problem. As an illustration, we design a framework for face recognition and demonstrate the efficacy of our approach on a large-scale semi-synthetic dataset with 153,000 images and a new challenging protocol on Labelled Faces in the Wild (LFW), while out-performing strong baselines.", "full_text": "Max-Margin Invariant Features from Transformed Unlabeled Data\n\nDipan K. Pal, Ashwin A. Kannan\u2217, Gautam Arakalgud\u2217, Marios Savvides\n\nDepartment of Electrical and Computer Engineering\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\n{dipanp,aalapakk,garakalgud,marioss}@cmu.edu\n\nAbstract\n\nThe study of representations invariant to common transformations of the data is important to learning. Most techniques have focused on local approximate invariance implemented within expensive optimization frameworks lacking explicit theoretical guarantees. 
In this paper, we study kernels that are invariant to a unitary group while having theoretical guarantees in addressing the important practical issue of the unavailability of transformed versions of labelled data, a problem we call the Unlabeled Transformation Problem, which is a special form of semi-supervised learning and one-shot learning. We present a theoretically motivated alternate approach to the invariant kernel SVM, based on which we propose Max-Margin Invariant Features (MMIF) to solve this problem. As an illustration, we design a framework for face recognition and demonstrate the efficacy of our approach on a large-scale semi-synthetic dataset with 153,000 images and a new challenging protocol on Labelled Faces in the Wild (LFW), while out-performing strong baselines.\n\n1 Introduction\n\nIt is becoming increasingly important to learn well-generalizing representations that are invariant to many common nuisance transformations of the data. Indeed, being invariant to intra-class transformations while being discriminative to between-class transformations can be said to be one of the fundamental problems in pattern recognition. The nuisance transformations can give rise to many \u2018degrees of freedom\u2019 even in a constrained task such as face recognition (e.g. pose, age-variation, illumination etc.). Explicitly factoring them out leads to improvements in recognition performance, as found in [10, 7, 6]. It has also been shown that features that are explicitly invariant to intra-class transformations allow the sample complexity of the recognition problem to be reduced [2]. To this end, the study of invariant representations and machinery built on the concept of explicit invariance is important.\n\nInvariance through Data Augmentation. Many approaches in the past have enforced invariance by generating transformed labelled training samples in some form, such as [13, 17, 19, 9, 15, 4]. 
Perhaps one of the most popular methods for incorporating invariances in SVMs is the virtual support vector (VSV) method in [18], which used sequential runs of SVMs in order to find and augment the support vectors with transformed versions of themselves.\n\nIndecipherable transformations in data lead to a shortage of transformed labelled samples. The above approaches, however, assume that one has explicit knowledge about the transformation. This is a strong assumption. Indeed, in most general machine learning applications, the transformation\n\n\u2217Authors contributed equally\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Max-Margin Invariant Features (MMIF) can solve an important problem we call the Unlabeled Transformation Problem. In the figure, a traditional classifier F(x) \"learns\" invariance to nuisance transformations directly from the labeled dataset X. On the other hand, our approach (MMIF) can incorporate additional invariance learned from any unlabeled data that undergoes the nuisance transformation of interest.\n\npresent in the data is not clear and cannot be modelled easily, e.g. transformations between different views of a general 3D object and between different sentences articulated by the same person. Methods which work on generating invariance by explicitly transforming or augmenting labelled training data cannot be applied to these scenarios. Further, in cases where we do know the transformations that exist and we actually can model them, it is difficult to generate transformed versions of very large labelled datasets. 
Hence there arises an important problem: how do we train models to be invariant to transformations in test data when we do not have access to transformed labelled training samples?\n\nAvailability of unlabeled transformed data. Although it is difficult to obtain or generate transformed labelled data (due to the reasons mentioned above), unlabeled transformed data is more readily available. For instance, if different views of specific objects of interest are not available, one can simply collect views of general objects. Also, if different sentences spoken by a specific group of people are not available, one can simply collect those spoken by members of the general population. In both these scenarios, no explicit knowledge or model of the transformation is needed, thereby bypassing the problem of indecipherable transformations. This situation is common in vision, e.g. only unlabeled transformed images are observed, but has so far mostly been addressed by the community through intense efforts in large-scale data collection. Note that the transformed data that is collected is not required to be labelled. We are now in a position to state the central problem that this paper addresses.\n\nThe Unlabeled Transformation (UT) Problem: Having access to transformed versions of the training unlabeled data but not of the labelled data, how do we learn a discriminative model of the labelled data, while being invariant to transformations present in the unlabeled data?\n\nOverall approach. The approach presented in this paper (see Fig. 1), however, can solve this problem and learn invariance to transformations observed only through unlabeled samples, and does not need labelled training data augmentation. We explicitly and simultaneously address both problems of generating invariance to intra-class transformations (through invariant kernels) and being discriminative to inter- or between-class transformations (through max-margin classifiers). 
Given a new test sample, the final extracted feature is invariant to the transformations observed in the unlabeled set, and thereby generalizes using just a single example. This is an example of one-shot learning.\n\nPrior Art: Invariant Kernels. Kernel methods in machine learning have long been studied to considerable depth. Nonetheless, the study of invariant kernels and techniques to extract invariant features has received much less attention. An invariant kernel allows the kernel product to remain invariant under transformations of the inputs. Most instances of incorporating invariances focused on local invariances through regularization and optimization, such as [18, 19, 3, 21]. Some other techniques were jittering kernels [17, 3] and tangent-distance kernels [5], both of which sacrificed the positive semi-definite property of their kernels and were computationally expensive. Though these methods have had some success, most of them still lack explicit theoretical guarantees towards invariance. The proposed invariant kernel SVM formulation, on the other hand, develops a valid PSD kernel that is guaranteed to be invariant. [4] used group integration to arrive at invariant kernels but did not address the Unlabeled Transformation Problem, which our proposed kernels do address. Further, our proposed kernels allow for the formulation of the invariant SVM and application to large-scale problems. Recently, [14] presented some work with invariant kernels. However, unlike our non-parametric formulation, they do not learn the group transformations from the data itself and assume known parametric transformations (i.e. they assume that the transformation is computable).\n\nKey ideas. The key ideas in this paper are twofold.\n\n1. 
The first is to model transformations using unitary groups (or sub-groups), leading to unitary-group invariant kernels. Unitary transforms preserve the dot product, allow for interesting generalization properties leading to low sample complexity, and also allow learning transformation invariance from unlabeled examples (thereby solving the Unlabeled Transformation Problem). Classes of learning problems, such as vision, often have transformations belonging to a unitary group that one would like to be invariant towards (such as translation and rotation). In practice, however, [8] found that invariance to much more general transformations not captured by this model can be achieved.\n\n2. Secondly, we combine max-margin classifiers with invariant kernels, leading to non-linear max-margin unitary-group invariant classifiers. These theoretically motivated invariant non-linear SVMs form the foundation upon which Max-Margin Invariant Features (MMIF) are based. MMIF features can effectively solve the important Unlabeled Transformation Problem. To the best of our knowledge, this is the first theoretically proven formulation of this nature.\n\nContributions. In contrast to many previous studies on invariant kernels, we study non-linear positive semi-definite unitary-group invariant kernels guaranteeing invariance that can address the UT Problem. One of our central theoretical results applies group integration in the RKHS. It builds on the observation that, under unitary restrictions on the kernel map, group action in the input space is reciprocated in the RKHS. Using the proposed invariant kernel, we present a theoretically motivated approach towards a non-linear invariant SVM that can solve the UT Problem with explicit invariance guarantees. As our main theoretical contribution, we showcase a result on the generalization of max-margin classifiers in group-invariant subspaces. 
We propose Max-Margin Invariant Features (MMIF) to learn highly discriminative non-linear features that also solve the UT Problem. On the practical side, we propose an approach to face recognition that combines MMIFs with a pre-trained deep-learning feature extractor (in our case VGG-Face [12]). MMIF features can be used with deep learning whenever there is a need to focus on a particular transformation in the data (in our application, pose in face recognition) and can further improve performance.\n\n2 Unitary-Group Invariant Kernels\n\nPremise: Consider a dataset of normalized samples along with labels X = {x_i}, Y = {y_i} \u2200i \u2208 1...N with x \u2208 R^d and y \u2208 {+1, \u22121}. We now introduce into the dataset a number of unitary transformations g part of a locally compact unitary group G. We note again that the set of transformations under consideration need not be the entire unitary group; it could very well be a subgroup. Our augmented normalized dataset becomes {gx_i, y_i} \u2200g \u2208 G, \u2200i. For clarity, we denote by gx the action of group element g \u2208 G on x, i.e. gx = g(x). We also define the orbit of x under G as the set X_G = {gx | g \u2208 G}. Clearly, X \u2286 X_G. An invariant function is defined as follows.\n\nDefinition 2.1 (G-Invariant Function). For any group G, we define a function f : X \u2192 R^n to be G-invariant if f(x) = f(gx) \u2200x \u2208 X, \u2200g \u2208 G.\n\nOne method of generating an invariant towards a group is through group integration. Group integration has stemmed from classical invariant theory and can be shown to be a projection onto a G-invariant subspace for vector spaces. In such a space x = gx \u2200g \u2208 G, and thus the representation x is invariant under the transformation of any element from the group G. This is ideal for recognition problems where one would want to be discriminative to between-class transformations (e.g. 
between distinct subjects in face recognition) but be invariant to within-class transformations (e.g. different images of the same subject). The set of transformations we model as G are the within-class transformations that we would like to be invariant towards. An invariant to any group G can be generated through the following basic (previously known) property (Lemma 2.1) based on group integration.\n\nLemma 2.1. (Invariance Property) Given a vector \u03c9 \u2208 R^d, and any affine group G, for any fixed g' \u2208 G and a normalized Haar measure dg, we have g' \u222b_G g\u03c9 dg = \u222b_G g\u03c9 dg.\n\nThe Haar measure (dg) exists for every locally compact group and is unique up to a positive multiplicative constant (hence normalized). A similar property holds for discrete groups. Lemma 2.1 results in the quantity \u222b_G g\u03c9 dg enjoying global invariance (encompassing all elements) to the group G. This property allows one to generate a G-invariant subspace in the inherent space R^d through group integration. In practice, the integral corresponds to a summation over transformed samples. The following two lemmas (novel results, and part of our contribution) (Lemma 2.2 and 2.3) showcase elementary properties of the operator \u03a8 = \u222b_G g dg for a unitary group G.\u00b2 These properties will prove useful in the analysis of unitary-group invariant kernels and features.\n\nLemma 2.2. If \u03a8 = \u222b_G g dg for unitary G, then \u03a8^T = \u03a8.\n\nLemma 2.3. (Unitary Projection) If \u03a8 = \u222b_G g dg for any affine G, then \u03a8\u03a8 = \u03a8, i.e. it is a projection operator. Further, if G is unitary, then \u27e8\u03c9, \u03a8\u03c9'\u27e9 = \u27e8\u03a8\u03c9, \u03c9'\u27e9 \u2200\u03c9, \u03c9' \u2208 R^d.\n\nSample Complexity and Generalization. 
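The lemmas above, and the orbit-collapse behind the sample-complexity argument that follows, can be checked numerically. This is a minimal sketch under illustrative assumptions: the finite unitary group here is the set of cyclic-shift permutation matrices of R^d, and the dimension is arbitrary.

```python
import numpy as np

d = 6
# A finite unitary group: the d cyclic-shift permutation matrices of R^d
# (an illustrative stand-in for the unitary nuisance group G of the text).
G = [np.roll(np.eye(d), k, axis=0) for k in range(d)]

# Group integration with the normalized (discrete) Haar measure:
# Psi = (1/|G|) sum_g g.
Psi = sum(G) / len(G)

# Lemmas 2.2/2.3: Psi is a symmetric projection operator.
assert np.allclose(Psi @ Psi, Psi)
assert np.allclose(Psi.T, Psi)

rng = np.random.default_rng(0)
x, w = rng.standard_normal(d), rng.standard_normal(d)

# Lemma 2.1: the whole orbit {gx} collapses to the single point Psi x.
for g in G:
    assert np.allclose(Psi @ (g @ x), Psi @ x)

# Lemma 2.6 (used later for one-shot testing): <Psi x, Psi w> = <g'x, Psi w>
# for any single g' in G, including the identity.
for g in G:
    assert np.isclose((Psi @ x) @ (Psi @ w), (g @ x) @ (Psi @ w))
print("all group-integration checks passed")
```

Any other finite unitary group (e.g. a discrete rotation group) would behave identically; the cyclic shifts are chosen only because their matrices are easy to construct.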
On applying the operator \u03a8 to the dataset X, all points in the set {gx | g \u2208 G} for any x \u2208 X map to the same point \u03a8x in the G-invariant subspace, thereby reducing the number of distinct points by a factor of |G| (the cardinality of G, if G is finite). Theoretically, this would drastically reduce sample complexity while preserving linear feasibility (separability). It is trivial to observe that a perfect linear separator learned in X_\u03a8 = {\u03a8x | x \u2208 X} would also be a perfect separator for X_G, thus in theory achieving perfect generalization. Generalization here refers to the ability to perform correct classification even in the presence of the set of transformations G. We prove a similar result for Reproducing Kernel Hilbert Spaces (RKHS) in Section 2.2. This property is theoretically powerful since the cardinality of G can be large. A classifier can avoid having to observe transformed versions {gx} of any x and yet generalize perfectly.\n\nThe case of Face Recognition. As an illustration, if the group G of transformations considered is pose (it is hypothesized that small changes in pose can be modeled as unitary [10]), then \u03a8 = \u222b_G g dg represents a pose-invariant subspace. In theory, all poses of a subject will converge to the same point in that subspace, leading to near perfect pose-invariant recognition.\n\nWe have not yet leveraged the power of the unitary structure of the groups, which is also critical in generalization to test cases as we will see later. We now present our central result showcasing that unitary kernels allow the unitary group action to reciprocate in a Reproducing Kernel Hilbert Space. This is critical to set the foundation for our core method called Max-Margin Invariant Features.\n\n2.1 Group Actions Reciprocate in a Reproducing Kernel Hilbert Space\n\nGroup integration provides exact invariance as seen in the previous section. 
However, it requires the group structure to be preserved, i.e. if the group structure is destroyed, group integration does not provide an invariant function. In the context of kernels, it is imperative that the group relation between the samples in X_G be preserved in the kernel Hilbert space H corresponding to some kernel k with a mapping \u03c6. If the kernel k is unitary in the following sense, then this is possible.\n\nDefinition 2.2 (Unitary Kernel). A kernel k(x, y) = \u27e8\u03c6(x), \u03c6(y)\u27e9 is a unitary kernel if, for a unitary group G, the mapping \u03c6(x) : X \u2192 H satisfies \u27e8\u03c6(gx), \u03c6(gy)\u27e9 = \u27e8\u03c6(x), \u03c6(y)\u27e9 \u2200g \u2208 G, \u2200x, y \u2208 X.\n\nThe unitary condition is fairly general; a common class of unitary kernels is the RBF kernel. We now define a transformation within the RKHS itself as g_H : \u03c6(x) \u2192 \u03c6(gx) \u2200\u03c6(x) \u2208 H for any g \u2208 G where G is a unitary group. We then have the following result of significance.\n\nTheorem 2.4. (Covariance in the RKHS) If k(x, y) = \u27e8\u03c6(x), \u03c6(y)\u27e9 is a unitary kernel in the sense of Definition 2.2, then g_H is a unitary transformation, and the set G_H = {g_H | g_H : \u03c6(x) \u2192 \u03c6(gx), \u2200g \u2208 G} is a unitary group in H.\n\nTheorem 2.4 shows that the unitary-group structure is preserved in the RKHS. This paves the way for new theoretically motivated approaches to achieve invariance to transformations in the RKHS. There have been a few studies on group-invariant kernels [4, 10]. However, [4] does not examine whether the unitary group structure is actually preserved in the RKHS, which is critical. Also, DIKF was recently proposed as a method utilizing group structure under the unitary kernel [10]. Our result is a generalization of the theorems they present. 
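Definition 2.2 is easy to verify numerically for the RBF kernel, since it depends only on the distance ||x \u2212 y||, which any unitary map preserves. A minimal sketch (the dimension and bandwidth are arbitrary choices for illustration):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    # The RBF kernel depends only on ||x - y||, which unitary maps
    # preserve, so it is a unitary kernel in the sense of Definition 2.2.
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(1)
d = 5
x, y = rng.standard_normal(d), rng.standard_normal(d)

# A random orthogonal (real unitary) matrix via QR decomposition.
g, _ = np.linalg.qr(rng.standard_normal((d, d)))
assert np.allclose(g.T @ g, np.eye(d))

# Definition 2.2: <phi(gx), phi(gy)> = <phi(x), phi(y)>.
assert np.isclose(rbf(g @ x, g @ y), rbf(x, y))
```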
Theorem 2.4 shows that since the unitary group structure is preserved in the RKHS, any method involving group integration in the RKHS would be invariant in the original space. The preservation of the group structure allows more direct group-invariance results to be applied in the RKHS. It also directly allows one to formulate a non-linear SVM while guaranteeing invariance theoretically, leading to Max-Margin Invariant Features.\n\n\u00b2All proofs are presented in the supplementary material.\n\n2.2 Invariant Non-linear SVM: An Alternate Approach Through Group Integration\n\nWe now apply the group integration approach to the kernel SVM. The decision function of SVMs can be written in the general form as f_\u03b8(x) = \u03c9^T \u03c6(x) + b for some bias b \u2208 R (we agglomerate all parameters of f in \u03b8), where \u03c6 is the kernel feature map, i.e. \u03c6 : X \u2192 H. Reviewing the SVM, a maximum-margin separator is found by minimizing loss functions such as the hinge loss along with a regularizer. In order to invoke invariance, we can now utilize group integration in the kernel space H using Theorem 2.4. All points in the set {gx \u2208 X_G} get mapped to \u03c6(gx) = g_H\u03c6(x) for a given g \u2208 G in the input space X. Group integration then results in a G-invariant subspace within H through \u03a8_H = \u222b_{G_H} g_H dg_H using Lemma 2.1. Introducing Lagrange multipliers \u03b1 = (\u03b1_1, \u03b1_2, ..., \u03b1_N) \u2208 R^N, the dual formulation (utilizing Lemma 2.2 and Lemma 2.3) then becomes\n\nmin_\u03b1 \u2212\u2211_i \u03b1_i + (1/2) \u2211_{i,j} y_i y_j \u03b1_i \u03b1_j \u27e8\u03a8_H\u03c6(x_i), \u03a8_H\u03c6(x_j)\u27e9    (1)\n\nunder the constraints \u2211_i \u03b1_i y_i = 0 and 0 \u2264 \u03b1_i \u2264 1/N \u2200i. 
The SVM separator is then given by\ni yi\u03b1i\u03a8H\u03c6(xi) thereby existing in the GH-invariant (or equivalently G-invariant)\n\u03c9\u2217\nsubspace \u03a8H within H (since g \u2192 gH is a bijection). Effectively, the SVM observes samples from\nX\u03a8H = {x | \u03c6(x) = \u03a8H\u03c6(u), \u2200u \u2208 XG} and therefore \u03c9\u2217\nH enjoys exact global invariance to G.\nFurther, \u03a8H\u03c9\u2217 is a maximum-margin separator of {\u03c6(XG)} (i.e. the set of all transformed samples).\nThis can be shown by the following result.\nTheorem 2.5. (Generalization) For a unitary group G and unitary kernel k(x, y) = (cid:104)\u03c6(x), \u03c6(y)(cid:105),\nGH gH dgH) \u03c9\u2217 is a perfect separator for {\u03a8H\u03c6(X )} = {\u03a8H\u03c6(x) | \u2200x \u2208 X},\nif \u03c9\u2217\nthen \u03a8H\u03c9\u2217 is also a perfect separator for {\u03c6(XG)} = {\u03c6(x) | x \u2208 XG} with the same margin.\nFurther, a max-margin separator of {\u03a8H\u03c6(X )} is also a max-margin separator of {\u03c6(XG)}.\nThe invariant non-linear SVM in objective 1, observes samples in the form of \u03a8H\u03c6(x) and obtains a\nmax-margin separator \u03a8H\u03c9\u2217. This allows for the generalization properties of max-margin classi\ufb01ers\nto be combined with those of group invariant classi\ufb01ers. While being invariant to nuisance transfor-\nmations, max-margin classi\ufb01ers can lead to highly discriminative features (more robust than DIKF\n[10] as we \ufb01nd in our experiments) that are invariant to within-class transformations.\nTheorem 2.5 shows that the margins of \u03c6(XG) and {\u03a8H\u03c6(XG)} are deeply related and implies that\n\u03a8H\u03c6(x) is a max-margin separator for both datasets. Theoretically, the invariant non-linear SVM is\nable to generalize to XG on just observing X and utilizing prior information in the form of G for all\nunitary kernels k. This is true in practice for linear kernels. 
For non-linear kernels in practice, the invariant SVM still needs to observe and integrate over transformed training inputs.\n\nLeveraging unitary group properties. During test time, to achieve invariance the SVM would need to observe and integrate over all possible transformations of the test sample. This is a huge computational and design bottleneck. We would ideally want to achieve invariance and generalize by observing just a single test sample, in effect performing one-shot learning. This would not only be computationally much cheaper but also make the classifier powerful, owing to generalization to full transformed orbits of test samples by observing just that single sample. This is where the unitarity of g helps, and we leverage it in the form of the following Lemma.\n\nLemma 2.6. (Invariant Projection) If \u03a8 = \u222b_G g dg for any unitary group G, then for any fixed g' \u2208 G (including the identity element) we have \u27e8\u03a8x', \u03a8\u03c9'\u27e9 = \u27e8g'x', \u03a8\u03c9'\u27e9 \u2200x', \u03c9' \u2208 R^d.\n\nAssuming \u03a8\u03c9' is the learned SVM classifier, Lemma 2.6 shows that for any test x', the invariant dot product \u27e8\u03a8x', \u03a8\u03c9'\u27e9, which involves observing all transformations of x', is equivalent to the quantity \u27e8g'x', \u03a8\u03c9'\u27e9, which involves observing only one transformation of x'. Hence one can model the entire orbit of x' under G by a single sample g'x', where g' \u2208 G can be any particular transformation, including the identity. 
This drastically reduces sample complexity and vastly increases the generalization capabilities of the classifier, since one only needs to observe one test sample to achieve invariance. Lemma 2.6 also helps us in saving computation, allowing us to apply the computationally expensive \u03a8 (group integration) operation only once, on the classifier and not on the test sample. Thus, the kernel in the invariant SVM formulation can be replaced by the form k_\u03a8(x, y) = \u27e8\u03c6(x), \u03a8_H\u03c6(y)\u27e9.\n\nFor kernels in general, the G_H-invariant subspace cannot be explicitly computed since it lies in the RKHS. It is only implicitly projected upon through \u03a8_H\u03c6(x_i) = \u222b_G \u03c6(gx_i)dg_H. It is important to note, however, that during testing the SVM formulation will be invariant to transformations of the test sample regardless of a linear or non-linear kernel.\n\nFigure 2: MMIF Feature Extraction. (a) Invariant kernel feature extraction: l(x) denotes the invariant kernel feature of any x, which is invariant to the transformation group G. Invariance is generated by group integration (or pooling). The invariant kernel feature learns invariance from the unlabeled transformed template set T_G. Also, the faces depicted are actual samples from the large-scale mugshot data (\u223c153,000 images). (b) SVM feature extraction leading to MMIF features: once the invariant features have been extracted for the labelled non-transformed dataset X, the learned SVMs act as feature extractors. Each binary-class SVM (different color) was trained on the invariant kernel feature of a random subset of l(X) with random class assignments. The final MMIF feature for x is the concatenation of all SVM inner products with l(x).\n\nPositive Semi-Definiteness. The G-invariant kernel map is now of the form k_\u03a8(x, y) = \u27e8\u03c6(x), \u222b_G \u03c6(gy)dg_H\u27e9. 
This preserves the positive semi-definite property of the kernel k while guaranteeing global invariance to unitary transformations, unlike jittering kernels [17, 3] and tangent-distance kernels [5]. If we wish to include invariance to scaling, however (in the sense of scaling an image), then we would lose positive semi-definiteness (it is also not a unitary transform). Nonetheless, [20] show that conditionally positive definite kernels still exist for transformations including scaling, although we focus on unitary transformations in this paper.\n\n3 Max-Margin Invariant Features\n\nThe previous section utilized a group integration approach to arrive at a theoretically invariant non-linear SVM. It does not, however, address the Unlabeled Transformation Problem, i.e. the kernel k_\u03a8(x, y) = \u27e8\u03a8_H\u03c6(x), \u03a8_H\u03c6(y)\u27e9 = \u27e8\u222b_G \u03c6(gx)dg_H, \u222b_G \u03c6(gy)dg_H\u27e9 still requires observing transformed versions of the labelled input sample, namely {gx | gx \u2208 X_G} (or at least one of the labelled samples if we utilize Lemma 2.6). We now present our core approach, called Max-Margin Invariant Features (MMIF), which does not require the observation of any transformed labelled training sample whatsoever.\n\nAssume that we have access to an unlabeled set of M templates T = {t_i}, i = {1, ..., M}. We assume that we can observe all transformations under a unitary group G, i.e. we have access to T_G = {gt_i | \u2200g \u2208 G}, i = {1, ..., M}. Also, assume we have access to a set X = {x_j}, j = {1, ..., D}, of labelled data with N classes, which are not transformed. We can extract an M-dimensional invariant kernel feature for each x_j \u2208 X as follows. Let the invariant kernel feature be l(x) \u2208 R^M, written to explicitly show the dependence on x. 
Then the i-th dimension of l for any particular x is computed as\n\nl(x)_i = \u27e8\u03c6(x), \u03a8_H\u03c6(t_i)\u27e9 = \u27e8\u03c6(x), \u222b_G g_H\u03c6(t_i)dg_H\u27e9 = \u27e8\u03c6(x), \u222b_G \u03c6(gt_i)dg_H\u27e9    (2)\n\nThe first equality utilizes Lemma 2.6 and the third equality uses Theorem 2.4. This is equivalent to observing all transformations of x, since \u27e8\u03c6(x), \u03a8_H\u03c6(t_i)\u27e9 = \u27e8\u03a8_H\u03c6(x), \u03c6(t_i)\u27e9 using Lemma 2.3. Thereby we have constructed a feature l(x) which is invariant to G without ever needing to observe transformed versions of the labelled vector x. We now briefly describe the training of the MMIF feature extractor. The matching metric we use for this study is normalized cosine distance.\n\nTraining MMIF SVMs. To learn a K-dimensional MMIF feature (potentially independent of N), we learn K independent binary-class linear SVMs. Each SVM trains on the labelled dataset l(X) = {l(x_j) | j = {1, ..., D}}, with each sample being labelled +1 for some subset of the N classes (potentially just one class) and the rest being labelled \u22121. This leads us to a classifier of the form \u03c9_k = \u2211_j y_j \u03b1_j l(x_j). Here, y_j is the label of x_j for the k-th SVM. It is important to note that the unlabeled data was only used to extract l(x_j). Having multiple classes randomly labelled as positive allows the SVM to extract some feature that is common between them. This increases generalization by forcing the extracted feature to be more general (shared between multiple classes) rather than being highly tuned to a single class. 
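The recipe so far can be sketched end-to-end for a toy finite group. Everything below (the 90\u00b0 rotation group, the template count M = 8, the RBF bandwidth, the toy classes, and the random class groupings) is an illustrative assumption, not the paper's actual face pipeline:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

G = [rot(k * np.pi / 2) for k in range(4)]       # illustrative unitary group
T = rng.standard_normal((8, 2))                  # M = 8 unlabeled templates
TG = [np.stack([g @ t for g in G]) for t in T]   # observed template orbits T_G

def l_feature(x, gamma=0.5, pool=np.mean):
    # l(x)_i pools k(x, g t_i) over the orbit of template t_i. Mean pooling
    # is group integration; pool=np.max gives the max-pooling variant
    # discussed later -- both are G-invariant.
    return np.array([pool(np.exp(-gamma * ((orb - x) ** 2).sum(1))) for orb in TG])

# Labelled, non-transformed data: four toy classes of 2-D points.
centers = np.array([[2.0, 0], [0, 2.0], [-2.0, 0], [0, -2.0]])
X = np.concatenate([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])
y = np.repeat(np.arange(4), 20)
L = np.stack([l_feature(x) for x in X])

# K binary linear SVMs on random class groupings; their scores are the MMIF.
K = 6
svms = []
for _ in range(K):
    pos = rng.choice(4, size=2, replace=False)   # random subset labelled +1
    svms.append(LinearSVC().fit(L, np.isin(y, pos).astype(int)))

def mmif(x):
    lx = l_feature(x).reshape(1, -1)
    return np.array([s.decision_function(lx)[0] for s in svms])

# Theorem 3.1 in miniature: MMIF(gx) = MMIF(x), although no labelled
# sample was ever observed under the transformation.
assert np.allclose(mmif(G[1] @ X[0]), mmif(X[0]))
```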
Any K-dimensional MMIF feature can be trained through this technique, leading to a higher-dimensional feature vector useful in cases where one has limited labelled samples and classes (N is small). During feature extraction, the K inner products (scores) of the test sample x' with the K distinct binary-class SVMs provide the K-dimensional MMIF feature vector. This feature vector is highly discriminative due to the max-margin nature of SVMs while being invariant to G due to the invariant kernels.\n\nMMIF. Given T_G and X, the MMIF feature is defined as MMIF(x') \u2208 R^K for any test x', with each dimension k being computed as \u27e8l(x'), \u03c9_k\u27e9 for \u03c9_k = \u2211_j y_j \u03b1_j l(x_j), x_j \u2208 X. Further, l(x') \u2208 R^M with each dimension i being l(x')_i = \u27e8\u03c6(x'), \u03a8_H\u03c6(t_i)\u27e9. The process is illustrated in Fig. 2.\n\nInheriting transformation invariance from transformed unlabeled data: a special case of semi-supervised learning. MMIF features can learn to be invariant to transformations (G) by observing them only through T_G. They can then transfer the invariance knowledge to new unseen samples from X, thereby becoming invariant to X_G despite never having observed any samples from X_G. This is a special case of semi-supervised learning where we leverage the specific transformations present in the unlabeled data. This is a very useful property of MMIFs, allowing one to learn transformation invariance from one source and sample points from another source while having powerful discrimination and generalization properties. The property can be formally stated as the following Theorem.\n\nTheorem 3.1. (MMIF is invariant to learnt transformations) MMIF(x') = MMIF(gx') \u2200x', \u2200g \u2208 G, where G is observed only through T_G = {gt_i | \u2200g \u2208 G}, i = {1, ..., M}.\n\nThus we find that MMIF can solve the Unlabeled Transformation Problem. MMIFs have an invariant and a discriminative component. 
The invariant component of MMIF allows it to generalize to new transformations of the test sample, whereas the discriminative component allows for robust classification due to the max-margin classifiers. These two properties make MMIFs very useful, as we find in our experiments on face recognition.
Max and Mean Pooling in MMIF. Group integration in practice directly results in mean pooling. Recent work, however, showed that group integration can be treated as a subset of I-theory, where one tries to measure moments (or a subset thereof) of the distribution of $\langle x, g\omega \rangle$, $g \in G$, since the distribution itself is also an invariant [1]. Group integration can be seen as measuring the mean, or first moment, of the distribution. One can also characterize the distribution using the infinite moment, i.e. its max. We find in our experiments that max pooling outperforms mean pooling in general. All results in this paper, however, still hold under the I-theory framework.
MMIF on external feature extractors (deep networks). MMIF does not make any assumptions regarding its input, and hence one can apply it to features extracted from any feature extractor in general. The goal of any feature extractor is (ideally) to be invariant to within-class transformations while maximizing between-class discrimination. However, most feature extractors are not trained to explicitly factor out specific transformations. If we have access to even a small dataset exhibiting the transformation we would like to be invariant to, we can transfer that invariance using MMIFs (e.g. it is unlikely that all poses of a person are observed in a dataset, yet pose is an important nuisance transformation).
Modelling general non-unitary transformations. General non-linear transformations such as out-of-plane rotation or pose variation are challenging to model.
Nonetheless, a small variation in these transformations can be approximated by some unitary $G$, assuming piecewise linearity through transformation-dependent sub-manifold unfolding [11]. Further, it was found that in practice, integrating over general transformations produces approximate invariance [8].

[Figure 2 panel titles: (a) Invariant kernel feature extraction; (b) SVM feature extraction leading to MMIF features]

Figure 3: (a) Pose-invariant face recognition results on the semi-synthetic large-scale mugshot database (testing on 114,750 images). Operating on pixels: MMIF (Pixels) outperforms the invariance-based methods DIKF [10] and invariant NDP [8]. Operating on deep features: MMIF trained on VGG-Face features [12] (MMIF-VGG) produces a significant improvement in performance. The numbers in brackets represent VR at 0.1% FAR. (b) Face recognition results on LFW with raw VGG-Face features and MMIF trained on VGG-Face features. The values in brackets show VR at 0.1% FAR.

4 Experiments on Face Recognition

As an illustration, we apply MMIFs in two modalities overall: 1) on raw pixels and 2) on deep features from the pre-trained VGG-Face network [12]. We provide more implementation details and results discussion in the supplementary.
A. MMIF on a large-scale semi-synthetic mugshot database (raw pixels and deep features). We utilize a large-scale semi-synthetic face dataset to generate the sets $T_G$ and $\mathcal{X}$ for MMIF. In this dataset, only two major transformations exist: pose variation and subject variation. All other transformations, such as illumination, translation and rotation, are strictly and synthetically controlled. This provides a very good benchmark for face recognition, where we want to be invariant to pose variation and discriminative for subject variation.
The experiment follows the exact protocol and data described in [10].³ We test on 750 subject identities with 153 pose-varied, real-textured gray-scale images each (a total of 114,750 images) against each other, resulting in about 13 billion pair-wise comparisons (compared to 6,000 for the standard LFW protocol). Results are reported as ROC curves along with VR at 0.1% FAR. Fig. 3(a) shows the ROC curves for this experiment. We find that MMIF features outperform all baselines, including VGG-Face features (pre-trained) and the DIKF and NDP approaches, thereby demonstrating superior discriminability while effectively capturing pose-invariance from the transformed template set $T_G$. MMIF is able to solve the Unlabeled Transformation Problem by extracting transformation information from the unlabeled $T_G$.
B. MMIF on LFW (deep features): unseen subject protocol. In order to effectively train under the scenario of general transformations and to challenge our algorithms, we define a new, much harder protocol on LFW. We choose the top 500 subjects, with a total of 6,300 images, for training MMIF on VGG-Face features, and test on the remaining subjects with 7,000 images. We perform all-versus-all matching, totalling up to 49 million matches (4 orders of magnitude more than the official protocol). The evaluation metric is the standard ROC curve, with verification rate reported at 0.1% false accept rate. We split the 500 subjects into two sets of 250 and use them as $T_G$ and $\mathcal{X}$. We do not use any alignment for this experiment, and the faces were cropped according to [16]. Fig. 3(b) shows the results of this experiment. We see that MMIF on VGG features significantly outperforms raw VGG on this protocol, boosting the VR at 0.1% FAR from 0.56 to 0.71.
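The reported metric, verification rate (VR) at a fixed false accept rate (FAR), can be computed from genuine-pair and impostor-pair similarity scores as in the following sketch. This is the standard construction of the metric on synthetic scores, not code or data from the paper.

```python
import numpy as np

def vr_at_far(genuine, impostor, far=1e-3):
    """Verification rate at a fixed false-accept rate.

    The threshold is chosen so that a fraction `far` of impostor scores
    is (wrongly) accepted; VR is the fraction of genuine scores above it.
    """
    impostor = np.sort(impostor)
    thr = impostor[int(np.ceil((1 - far) * len(impostor))) - 1]
    return np.mean(genuine > thr)

# Synthetic scores: genuine pairs score higher on average than impostors.
rng = np.random.default_rng(3)
gen = rng.normal(2.0, 1.0, 5000)
imp = rng.normal(0.0, 1.0, 100000)
print(round(vr_at_far(gen, imp, far=1e-3), 3))
```

At strict operating points such as 0.1% FAR, the threshold sits deep in the impostor tail, which is why this metric separates methods more sharply than the equal-error-rate region of the ROC.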
This demonstrates that MMIF is able to generate invariance for highly non-linear transformations that are not well defined, rendering it useful in real-world scenarios where transformations are unknown but observable.

³We provide more details in the supplementary. Also note that we do not need to utilize identity information; all that is required is that a set of pose-varied images belongs to the same subject. Such data can be obtained through temporal sampling.

[Figure 3 legends, VR at 0.1% FAR in brackets: (a) ℓ∞-DIKF (0.74), ℓ1-DIKF (0.61), NDP-ℓ∞ (0.41), NDP-ℓ1 (0.32), MMIF (Ours) (0.78), VGG Features (0.55), MMIF-VGG (Ours) (0.61); (b) MMIF-VGG (Ours) (0.71), VGG (0.56)]

References
[1] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Magic materials: a theory of deep hierarchical architectures for learning sensory representations. MIT, CBCL paper, 2013.
[2] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Unsupervised learning of invariant representations in hierarchical architectures. CoRR, abs/1311.4158, 2013.
[3] D. Decoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46(1-3):161–190, 2002.
[4] B. Haasdonk and H. Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine Learning, pages 35–61, 2007.
[5] B. Haasdonk and D. Keysers. Tangent distance kernels for support vector machines. In Proceedings of the 16th International Conference on Pattern Recognition, volume 2, pages 864–868, 2002.
[6] G. E. Hinton. Learning translation invariant recognition in a massively parallel networks. In PARLE Parallel Architectures and Languages Europe, pages 1–13. Springer, 1987.
[7] J. Z. Leibo, Q. Liao, and T. Poggio.
Subtasks of unconstrained face recognition. In International Joint Conference on Computer Vision, Imaging and Computer Graphics (VISIGRAPP), 2014.
[8] Q. Liao, J. Z. Leibo, and T. Poggio. Learning invariant representations and applications to face verification. In Advances in Neural Information Processing Systems (NIPS), 2013.
[9] P. Niyogi, F. Girosi, and T. Poggio. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE, pages 2196–2209, 1998.
[10] D. K. Pal, F. Juefei-Xu, and M. Savvides. Discriminative invariant kernel features: a bells-and-whistles-free approach to unsupervised face recognition and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5590–5599, 2016.
[11] S. W. Park and M. Savvides. An extension of multifactor analysis for face recognition based on submanifold learning. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2645–2652. IEEE, 2010.
[12] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.
[13] T. Poggio and T. Vetter. Recognition and structure from one 2d model view: Observations on prototypes, object classes and symmetries. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1992.
[14] A. Raj, A. Kumar, Y. Mroueh, T. Fletcher, and B. Schölkopf. Local group invariant representations via orbit embeddings. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), volume 54 of Proceedings of Machine Learning Research, pages 1225–1235, 2017.
[15] M. Reisert. Group integration techniques in pattern analysis – a kernel view. PhD thesis, 2008.
[16] C. Sanderson and B. C. Lovell. Multi-region probabilistic histograms for robust and scalable identity inference. In International Conference on Biometrics, pages 199–208. Springer, 2009.
[17] B.
Sch\u00f6lkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimiza-\n\ntion, and beyond. MIT press, 2002.\n\n[18] B. Sch\u00f6lkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines.\n\npages 47\u201352. Springer, 1996.\n\n[19] B. Sch\u00f6lkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. Advances in\n\nNeural Information Processing Systems (NIPS), 1998.\n\n[20] C. Walder and O. Chapelle. Learning with transformation invariant kernels. In Advances in Neural\n\nInformation Processing Systems, pages 1561\u20131568, 2007.\n\n[21] X. Zhang, W. S. Lee, and Y. W. Teh. Learning with invariance via linear functionals on reproducing kernel\n\nhilbert space. In Advances in Neural Information Processing Systems, pages 2031\u20132039, 2013.\n\n9\n\n\f", "award": [], "sourceid": 920, "authors": [{"given_name": "Dipan", "family_name": "Pal", "institution": "Carnegie Mellon University"}, {"given_name": "Ashwin", "family_name": "Kannan", "institution": "Carnegie Mellon University"}, {"given_name": "Gautam", "family_name": "Arakalgud", "institution": "Carnegie Mellon University"}, {"given_name": "Marios", "family_name": "Savvides", "institution": "Carnegie Mellon University"}]}