{"title": "Multiresolution Tangent Distance for Affine-invariant Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 843, "page_last": 849, "abstract": null, "full_text": "Multiresolution Tangent Distance for \n\nAffine-invariant Classification \n\nNuno Vasconcelos \n\nAndrew Lippman \n\nMIT Media Laboratory, 20 Ames St, E15-320M, \n\nCambridge, MA 02139, {nuno,lip }@media.mit.edu \n\nAbstract \n\nThe ability to rely on similarity metrics invariant to image transforma(cid:173)\ntions is an important issue for image classification tasks such as face or \ncharacter recognition. We analyze an invariant metric that has performed \nwell for the latter - the tangent distance - and study its limitations when \napplied to regular images, showing that the most significant among these \n(convergence to local minima) can be drastically reduced by computing \nthe distance in a multiresolution setting. This leads to the multi resolution \ntangent distance, which exhibits significantly higher invariance to im(cid:173)\nage transformations, and can be easily combined with robust estimation \nprocedures. \n\n1 Introduction \n\nImage classification algorithms often rely on distance metrics which are too sensitive to \nvariations in the imaging environment or set up (e.g. the Euclidean and Hamming distances), \nor on metrics which, even though less sensitive to these variations, are application specific \nor too expensive from a computational point of view (e.g. deformable templates). \n\nA solution to this problem, combining invariance to image transformations with computa(cid:173)\ntional simplicity and general purpose applicability was introduced by Simard et al in [7]. 
\nThe key idea is that, when subject to spatial transformations, images describe manifolds in a \nhigh dimensional space, and an invariant metric should measure the distance between those \nmanifolds instead of the distance between other properties of (or features extracted from) \nthe images themselves. Because these manifolds are complex, minimizing the distance be(cid:173)\ntween them is a difficult optimization problem which can, nevertheless, be made tractable \nby considering the minimization of the distance between the tangents to the manifolds -the \ntangent distance (TO) - instead of that between the manifolds themselves. While it has led \nto impressive results for the problem of character recognition [8] , the linear approximation \ninherent to the TO is too stringent for regular images, leading to invariance over only a very \nnarrow range of transformations. \n\n\f844 \n\nN. Vasconcelos and A. Lippman \n\nIn this work we embed the distance computation in a multi resolution framework [3], \nleading to the multiresolution tangent distance (MRTD). Multiresolution decompositions \nare common in the vision literature and have been known to improve the performance of \nimage registration algorithms by extending the range over which linear approximations \nIn particular, the MRTD has several appealing properties: 1) maintains \nhold [5, 1]. \nthe general purpose nature of the TD; 2) can be easily combined with robust estimation \nprocedures, exhibiting invariance to moderate non-linear image variations (such as caused \nby slight variations in shape or occlusions); 3) is amenable to computationally efficient \nscreening techniques where bad matches are discarded at low resolutions; and 4) can be \ncombined with several types of classifiers. 
Face recognition experiments show that the \nMRTD exhibits a significantly extended invariance to image transformations, originating \nimprovements in recognition accuracy as high as 38%, for the hardest problems considered. \n\n2 The tangent distance \n\nConsider the manifold described by all the possible linear transformations that a pattern \nlex) may be subject to \n\nTp [lex)] = 1('ljJ(x, p)), \n\n(1) \nwhere x are the spatial coordinates over which the pattern is defined, p is the set of \nparameters which define the transformation, and 'ljJ is a function typically linear on p, but \nnot necessarily linear on x. Given two patterns M(x) and N(x), the distance between the \nassociated manifolds - manifold distance (MD) - is \n\nT(M, N) = min IITq[M(x)] - Tp[N(x)]W. \n\np,q \n\n(2) \n\nFor simplicity, we consider a version of the distance in which only one of the patterns is \nsubject to a transformation, i.e. \n\nT(M, N) = min IIM(x) - Tp[N(x)]lf, \n\np \n\n(3) \n\nbut all results can be extended to the two-sided distance. Using the fact that \n\\7p Tp[N(x)] = \\7pN('ljJ(x, p)) = \\7p '\u00a2(x, p)\\7xN('\u00a2(x, p)), \n\n(4) \nwhere \\7pTp is the gradient of Tp with respect to p, Tp[N(x)] can, for small p, be \napproximated by a first order Taylor expansion around the identity transformation \n\nTp[N(x)] = N(x) + (p - If\\7p 'ljJ(x,p)\\7x N(x). \n\nThis is equivalent to approximating the manifold by a tangent hyper-plane, and leads to the \nTD. Substituting this expression in equation 3, setting the gradient with respect to p to zero, \nand solving for p leads to \n\np ~ [~'VP;6(X' P ) 'Vx N(x) 'V); N(X)'V~;6(x, P)]-' ~ D(x)'Vp;6(x, P l'VxN(x) + I, \n(5) \nwhere D(x) = M(x) - N(x). Given this optimal p, the TD between the two patterns \nis computed using equations I and 3. The main limitation of this formulation is that it \nrelies on a first-order Taylor series approximation, which is valid only over a small range \nof variation in the parameter vector p . 
\n\n2.1 Manifold distance via Newton's method \n\nThe minimization of the MD of equation 3 can also be performed through Newton's method, which consists of the iteration \n\np^{n+1} = p^n - α [∇_p^2 T|_{p=p^n}]^{-1} ∇_p T|_{p=p^n},   (6) \n\nwhere ∇_p T and ∇_p^2 T are, respectively, the gradient and Hessian of the cost function of equation 3 with respect to the parameter p, \n\n∇_p T = -2 Σ_x [M(x) - T_p[N(x)]] ∇_p T_p[N(x)] \n\n∇_p^2 T = 2 Σ_x [∇_p T_p[N(x)] ∇_p^T T_p[N(x)] - [M(x) - T_p[N(x)]] ∇_p^2 T_p[N(x)]]. \n\nDisregarding the term which contains second-order derivatives (∇_p^2 T_p[N(x)]), choosing p^0 = I and α = 1, using 4, and substituting in 6 leads to equation 5. I.e., the TD corresponds to a single iteration of the minimization of the MD by a simplified version of Newton's method, where second-order derivatives are disregarded. This reduces the rate of convergence of Newton's method, and a single iteration may not be enough to achieve the local minimum, even for simple functions. It is, therefore, possible to achieve improvement if the iteration described by equation 6 is repeated until convergence. \n\n3 The multiresolution tangent distance \n\nThe iterative minimization of equation 6 suffers from two major drawbacks [2]: 1) it may require a significant number of iterations for convergence and 2) it can easily get trapped in local minima. Both these limitations can be, at least partially, avoided by embedding the computation of the MD in a multiresolution framework, leading to the multiresolution manifold distance (MRMD). For its computation, the patterns to classify are first subject to a multiresolution decomposition, and the MD is then iteratively computed for each layer, using the estimate obtained from the layer above as a starting point, \n\np_l^{n+1} = p_l^n + α [Σ_x ∇_p ψ(x, p) ∇_x N^l(x) ∇_x^T N^l(x) ∇_p^T ψ(x, p)]^{-1} Σ_x D^l(x) ∇_p ψ(x, p) ∇_x N^l(x),   (7) \n\nwhere D^l(x) = M^l(x) - T_{p_l^n}[N^l(x)]. 
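The repeated simplified Newton step can be sketched, again for the pure-translation case, by alternating warping and the linear solve until the update is negligible. This is our own sketch with hypothetical names, not the paper's code:

```python
import numpy as np

def warp(img, p):
    # Evaluate img(x + p) for a 2-D translation p, bilinearly interpolated.
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs, ys = xs + p[0], ys + p[1]
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    fx = np.clip(xs - x0, 0.0, 1.0)
    fy = np.clip(ys - y0, 0.0, 1.0)
    return ((1 - fy) * ((1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1])
            + fy * ((1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]))

def manifold_distance(M, N, iters=25, tol=1e-4):
    # Gauss-Newton minimization of ||M - T_p[N]||^2 over translations p:
    # each pass is one simplified Newton step (second-order terms dropped),
    # i.e. the iteration of equation 6 without the Hessian's second term.
    p = np.zeros(2)
    for _ in range(iters):
        Np = warp(N, p)                       # current warp of N
        gy, gx = np.gradient(Np)
        G = np.stack([gx.ravel(), gy.ravel()])
        D = (M - Np).ravel()
        step = np.linalg.solve(G @ G.T, G @ D)
        p += step
        if np.linalg.norm(step) < tol:        # converged
            break
    return np.sum((M - warp(N, p)) ** 2), p
```

A single pass of the loop is the tangent distance; iterating drives the residual toward the manifold-distance minimum when the initial guess lies inside the basin of attraction.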
If only one iteration is allowed at each image resolution, the MRMD becomes the multiresolution extension of the TD, i.e. the multiresolution tangent distance (MRTD). \n\nTo illustrate the benefits of minimization over different scales consider the signal J(t) = Σ_k sin(ω_k t), and the manifold generated by all its possible translations J'(t, d) = J(t + d). Figure 1 depicts the multiresolution Gaussian decomposition of J(t), together with the Euclidean distance to the points on the manifold as a function of the translation associated with each of them (d). Notice that as the resolution increases, the distance function has more local minima, and the range of translations over which an initial guess is guaranteed to lead to convergence to the global minimum (at d = 0) is smaller. I.e., at higher resolutions, a better initial estimate is necessary to obtain the same performance from the minimization algorithm. \n\nNotice also that, since the function to minimize is very smooth at the lowest resolutions, the minimization will require few iterations at these resolutions if a procedure such as Newton's method is employed. Furthermore, since the minimum at one resolution is a good guess for the minimum at the next resolution, the computational effort required to reach that minimum will also be small. Finally, since a minimum at low resolutions is based on coarse, or global, information about the function or patterns to be classified, it is likely to be the global minimum of at least a significant region of the parameter space, if not the true global minimum. 
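The multiresolution Gaussian decomposition used here can be sketched as a standard blur-and-subsample pyramid. The 5-tap binomial kernel is our choice for illustration; the paper does not specify its filter:

```python
import numpy as np

def gaussian_pyramid(img, levels=3):
    # Blur-and-subsample decomposition: level 0 is the finest resolution;
    # each subsequent level is low-pass filtered and decimated by 2.
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # 5-tap binomial kernel
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        a = pyr[-1]
        # separable filtering: rows, then columns
        a = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, a)
        a = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, a)
        pyr.append(a[::2, ::2])
    return pyr
```

Each level halves the resolution, so the distance surfaces computed on the coarser levels are the smoother ones shown in the bottom row of figure 1.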
\n\nFigure 1: Top: Three scales of the multiresolution decomposition of J(t). Bottom: Euclidean distance vs. translation for each scale. Resolution decreases from left to right. \n\n4 Affine-invariant classification \n\nThere are many linear transformations which can be used in equation 1. In this work, we consider manifolds generated by affine transformations \n\nψ(x, p) = [ x y 1 0 0 0 ; 0 0 0 x y 1 ] p = Φ(x) p,   (8) \n\nwhere p is the vector of parameters which characterize the transformation. Taking the gradient of equation 8 with respect to p, ∇_p ψ(x, p) = Φ(x)^T, using equation 4, and substituting in equation 7, \n\np_l^{n+1} = p_l^n + α [Σ_x Φ(x)^T ∇_x N'(x) ∇_x^T N'(x) Φ(x)]^{-1} Σ_x D'(x) Φ(x)^T ∇_x N'(x),   (9) \n\nwhere N'(x) = N(ψ(x, p_l)) and D'(x) = M(x) - N'(x). For a given level l of the multiresolution decomposition, the iterative process of equation 9 can be summarized as follows. \n\n1. Compute N'(x) by warping the pattern to classify N(x) according to the best current estimate of p_l, and compute its spatial gradient ∇_x N'(x). \n2. Update the estimate of p_l according to equation 9. \n3. Stop if convergence, otherwise go to 1. \n\nOnce the final p_l is obtained, it is passed to the multiresolution level below (by doubling the translation parameters), where it is used as the initial estimate. Given the values of p_l which minimize the MD between a pattern to classify and a set of prototypes in the database, a K-nearest neighbor classifier is used to find the pattern's class. \n\n5 Robust classifiers \n\nOne issue of importance for pattern recognition systems is that of robustness to outliers, i.e. errors which occur with low probability, but which can have large magnitude. Examples are errors due to variation of facial features (e.g. 
faces shot with or without glasses) in face recognition, errors due to undesired blobs of ink or uneven line thickness in character recognition, or errors due to partial occlusions (such as a hand in front of a face) or partially missing patterns (such as an undotted i). It is well known that a few (maybe even one) outliers of high leverage are sufficient to throw mean squared error estimators completely off-track [6]. \n\nSeveral robust estimators have been proposed in the statistics literature to avoid this problem. In this work we consider M-estimators [4], which can be very easily incorporated in the MD classification framework. M-estimators are an extension of least squares estimators where the square function is substituted by a functional ρ(x) which weighs large errors less heavily. The robust-estimator version of the tangent distance is then to minimize the cost function \n\nT(M, N) = min_p Σ_x ρ(M(x) - T_p[N(x)]),   (10) \n\nand it is straightforward to show that the \"robust\" equivalent to equation 9 is \n\np_l^{n+1} = p_l^n + α [Σ_x ρ''(D(x)) Φ(x)^T ∇_x N'(x) ∇_x^T N'(x) Φ(x)]^{-1} [Σ_x ρ'(D(x)) Φ(x)^T ∇_x N'(x)],   (11) \n\nwhere D(x) = M(x) - N'(x) and ρ'(x) and ρ''(x) are, respectively, the first and second derivatives of the function ρ(x) with respect to its argument. \n\n6 Experimental results \n\nIn this section, we report on experiments carried out to evaluate the performance of the MD classifier. The first set of experiments was designed to test the invariance of the TD to affine transformations of the input. The second set was designed to evaluate the improvement obtained under the multiresolution framework. \n\n6.1 Affine invariance of the tangent distance \n\nStarting from a single view of a reference face, we created an artificial dataset composed of 441 affine transformations of it. 
These transformations consisted of combinations of all rotations in the range from -30 to 30 degrees with increments of 3 degrees, with all scaling transformations in the range from 70% to 130% with increments of 3%. The faces associated with the extremes of the scaling/rotation space are represented on the left portion of figure 2. \n\nOn the right of figure 2 are the distance surfaces obtained by measuring the distance associated with several metrics at each of the points in the scaling/rotation space. Five metrics were considered in this experiment: the Euclidean distance (ED), the TD, the MD computed through Newton's method, the MRMD, and the MRTD. \n\nWhile the TD exhibits some invariance to rotation and scaling, this invariance is restricted to a small range of the parameter space, and its performance is only slightly better than that obtained with the ED. The performance of the MD computed through Newton's method is dramatically superior, but still inferior to those achieved with the MRTD (which is very close to zero over the entire parameter space considered in this experiment) and the MRMD. The performance of the MRTD is in fact impressive given that it involves a computational increase of less than 50% with respect to the TD, while each iteration of Newton's method requires an increase of 100%, and several iterations are typically necessary to attain the minimum MD. \n\nFigure 2: Invariance of the tangent distance. On the right, the surfaces shown correspond to the ED, TD, MD through Newton's method, MRTD, and MRMD. This ordering corresponds to that of the nesting of the surfaces, i.e. the ED is the cup-shaped surface in the center, while the MRMD is the flat surface which is approximately zero everywhere. 
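The coarse-to-fine mechanism behind the extended invariance can be made concrete for the 1-D translation manifold of Section 3: estimate the shift at the coarsest scale, double the estimate when moving to a finer level, and refine. This is a sketch of ours with hypothetical names, not the authors' code:

```python
import numpy as np

def blur_decimate(sig):
    # One pyramid level: binomial low-pass followed by subsampling by 2.
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    return np.convolve(sig, k, mode='same')[::2]

def estimate_shift(a, b, d0=0.0, iters=30):
    # Gauss-Newton minimization of ||a(t) - b(t + d)||^2 over the shift d.
    t = np.arange(len(a), dtype=float)
    d = d0
    for _ in range(iters):
        bw = np.interp(t + d, t, b)                 # warped b
        g = np.gradient(bw)
        step = float(g @ (a - bw)) / float(g @ g)   # 1-D normal equation
        d += step
        if abs(step) < 1e-6:
            break
    return d

def coarse_to_fine_shift(a, b, levels=3):
    # Multiresolution strategy: solve at the coarsest level first, then
    # double the translation estimate when passing it to the finer level.
    pa, pb = [a], [b]
    for _ in range(levels - 1):
        pa.append(blur_decimate(pa[-1]))
        pb.append(blur_decimate(pb[-1]))
    d = 0.0
    for l in range(levels - 1, -1, -1):
        d = estimate_shift(pa[l], pb[l], d0=d)
        if l > 0:
            d *= 2.0
    return d
```

At the coarsest level the distance-versus-shift curve is smooth (figure 1), so the estimate converges from a zero initial guess even for shifts that would trap the single-resolution iteration in a local minimum.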
\n\n6.2 Face recognition \n\nTo evaluate the performance of the multiresolution tangent distance on a real classification task, we conducted a series of face recognition experiments, using the Olivetti Research Laboratories (ORL) face database. This database is composed of 400 images of 40 subjects, 10 images per subject, and contains variations in pose, lighting conditions, expressions and facial features, but small variability in terms of scaling, rotation, or translation. To correct this limitation we created three artificial datasets by applying to each image three random affine transformations drawn from three multivariate normal distributions centered on the identity transformation with different covariances. A small sample of the faces in the database is presented in figure 3, together with its transformed version under the set of transformations of higher variability. \n\nFigure 3: Left: sample of the ORL face database. Right: transformed version. \n\nWe next designed three experiments with increasing degree of difficulty. In the first, we selected the first view of each subject as the test set, using the remaining nine views as training data. In the second, the first five faces were used as test data while the remaining five were used for training. Finally, in the third experiment, we reversed the roles of the datasets used in the first. The recognition accuracy for each of these experiments and each of the datasets is reported on figure 4 for the ED, the TD, the MRTD, and a robust version of this distance (RMRTD) with ρ(x) = x^2/2 if x ≤ σT and ρ(x) = (σT)^2/2 otherwise, where T is a threshold (set to 2.0 in our experiments), and σ a robust version of the error standard deviation defined as σ = median |e_i - median(e_i)| / 0.6745. \n\nSeveral conclusions can be taken from this figure. 
First, it can be seen that the MRTD provides a significantly higher invariance to linear transformations than the ED or the TD, increasing the recognition accuracy by as much as 37.8% on the hardest datasets. In fact, for the easier tasks of experiments one and two, the performance of the multiresolution classifier is almost constant and always above the level of 90% accuracy. It is only for the harder experiment that the invariance of the MRTD classifier starts to break down. But even in this case, the degradation is graceful: the recognition accuracy only drops below 75% for considerable values of rotation and scaling (dataset D3). \n\nOn the other hand, the ED and the single-resolution TD break down even for the easier tasks, and fail dramatically when the hardest task is performed on the more difficult datasets. Furthermore, their performance does not degrade gracefully; they seem to be more invariant when the training set has five views than when it is composed of nine faces of each subject in the database. \n
_ \n\n10>00_ -- - - -- -- t-- -- -- -- -~ - -- ----t\" -\n\n, \n, \n, \n_ _ __ ____ L __ _ ____ j _ ~ __ __ _ _ L _ \n, \n, \n..,CV _ _ __ ___ _ _ ~ -- - ---- ~ -- --- -- - ~ -\n\n, \n, \n, \n, \n\nJOQI _\n\n, \n\nl'W'> __ __ ___ _ ~ _ _ ___ __ ... ______ _ _ ... \n\n, \n\nI \n\n, \n, \n, \n, \n, \n, , \n\n, \n\n_\n\n_\n\nr \n\n\u2022 \n\n11041 \n\nI \nI \n, \n, \nI \n\n! \n! \n, \nI \nI \n\niii\"\" \n.OII IIL. ----- - - - j- ------ - -t - - - - - - - - r -TD\"\"\"\" \nI 1Mb \nIIRm \nI \nI \n, \nI \n\n_ _ ______ L _____ __ ...1 _ _ _ \u2022 _ _ _ _ .l- . \n\n-~?~?~:~j?~~~~~~~}~~~~~~~~~ \n: \n: '~ \nI ____ _ ___ + ____ _____ t _ \n, \n\n_ ___ L ___ _ _ __ ...l ___ ___ ___ _ \n\n__ _ _ _ \n\n, \n\n110m \u2022\n\n:1111,1,1 . \n\n\u2022 _\n\n, \n, \n\nI \nI \n\n\u2022 \n\n-\n\n, \nI _ _ _ _ _ ___ ~ \n\n\u00ab>.011 _\n\n_ _ __ ___ I _____ _ _ \n\n_\n\n_ ~ __ __ ___ L _____ ____ _ \n\nI \n\n, , , \n\nFigure 4: Recognition accuracy. From left to right: results from the first, second, and third \nexperiments. Oatasets are ordered by degree of variability: 00 is the ORL database 03 is subject to \nthe affine transfonnations of greater amplitude. \n\nAcknowledgments \n\nWe would like to thank Federico Girosi for first bringing the tangent distance to our attention, \nand for several stimulating discussions on the topic. \n\nReferences \n\n[1J P. Anandan, J. Bergen, K. Hanna, and R. Hingorani. Hierarchical Model-Based Mo(cid:173)\n\ntion Estimation. In M. Sezan and R. Lagendijk, editors, Motion Analysis and Image \nSequence Processing, chapter 1. Kluwer Academic Press, 1993. \n[2J D. Bertsekas. Nonlinear Programming. Athena Scientific, 1995. \n[3J P. Burt and E. Adelson. The Laplacian Pyramid as a Compact Image Code. IEEE \n\nTrans. on Communications, Vol. 31:532-540,1983. \n\n[4] P. Huber. Robust Statistics. John Wiley, 1981 . \n[5] B. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application \n\nto Stereo Vision. In Proc. 
DARPA Image Understanding Workshop, 1981. \n[6] P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. John Wiley, 1987. \n[7] P. Simard, Y. Le Cun, and J. Denker. Efficient Pattern Recognition Using a New Transformation Distance. In Proc. Neural Information Processing Systems, Denver, USA, 1994. \n[8] P. Simard, Y. Le Cun, and J. Denker. Memory-based Character Recognition Using a Transformation Invariant Metric. In Int. Conference on Pattern Recognition, Jerusalem, Israel, 1994. \n", "award": [], "sourceid": 1474, "authors": [{"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}, {"given_name": "Andrew", "family_name": "Lippman", "institution": null}]}