{"title": "A Comparison of Image Processing Techniques for Visual Speech Recognition Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 939, "page_last": 945, "abstract": null, "full_text": "A comparison of Image Processing \n\nTechniques for Visual Speech Recognition \n\nApplications \n\nMichael S. Gray \n\nComputational Neurobiology Laboratory \n\nThe Salk Institute \n\nSan Diego, CA 92186-5800 \n\nTerrence J. Sejnowski \n\nComputational Neurobiology Laboratory \n\nThe Salk Institute \n\nSan Diego, CA 92186-5800 \n\nJavier R. Movellan* \n\nDepartment of Cognitive Science \nInstitute for Neural Computation \nUniversity of California San Diego \n\nAbstract \n\nWe examine eight different techniques for developing visual rep(cid:173)\nresentations in machine vision tasks. In particular we compare \ndifferent versions of principal component and independent com(cid:173)\nponent analysis in combination with stepwise regression methods \nfor variable selection. We found that local methods, based on the \nstatistics of image patches, consistently outperformed global meth(cid:173)\nods based on the statistics of entire images. This result is consistent \nwith previous work on emotion and facial expression recognition. \nIn addition, the use of a stepwise regression technique for selecting \nvariables and regions of interest substantially boosted performance. \n\n1 \n\nIntroduction \n\nWe study the performance of eight different methods for developing image repre(cid:173)\nsentations based on the statistical properties of the images at hand. These methods \nare compared on their performance on a visual speech recognition task. While \nthe representations developed are specific to visual speech recognition, the meth(cid:173)\nods themselves are general purpose and applicable to other tasks. 
Our focus is on low-level data-driven methods based on the statistical properties of relatively untouched images, as opposed to approaches that work with contours or highly processed versions of the image. Padgett [8] and Bartlett [1] systematically studied statistical methods for developing representations on expression recognition tasks. They found that local wavelet-like representations consistently outperformed global representations, like eigenfaces. In this paper we also compare local versus global representations. The main differences between our work and that in [8] and [1] are: (1) We use image sequences while they used static images; (2) Our work involves images of the mouth region while their work involves images of the entire face; (3) Our recognition engine is a bank of hidden Markov models while theirs is a backpropagation network [8] and a nearest neighbor classifier [1]. In addition to the comparison of local and global representations, we propose an unsupervised method for automatically selecting regions and variables of interest. \n\n* To whom correspondence should be addressed. \n\nFigure 1: The normalization procedure. In each panel, the \"+\" indicates the center of the lips, and the \"0\" indicates the center of the image. The location of the lips was automatically determined using Luettin et al.'s point distribution model for lip tracking: (1) Original image; (2) The center of the lips was translated to the center of the image; (3) The image was rotated in the plane to horizontal; (4) The lips were scaled to a constant reference width; (5) The image was symmetrized relative to the vertical midline; (6) The intensity was normalized using a logistic gain control procedure. \n\n2 Preprocessing and Recognition Engine \n\nThe task was recognition of the words \"one\", \"two\", \"three\" and \"four\" from the Tulips1 [7] database. 
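Steps (5) and (6) of the normalization procedure in Figure 1 (symmetrization and logistic gain control) can be sketched as follows. This is a generic NumPy illustration of the operations as described, not the authors' code; the gain constant k is an assumed free parameter, not a value from the paper.

```python
import numpy as np

def symmetrize(img):
    # Step (5): make the image symmetric about the vertical midline by
    # averaging it with its left-right mirror.
    return 0.5 * (img + img[:, ::-1])

def logistic_gain(img, k=1.0):
    # Step (6): logistic gain control. Intensities are squashed through a
    # logistic function matched to the image's mean and standard deviation,
    # a soft form of contrast normalization. k is an illustrative constant.
    mu = img.mean()
    sigma = img.std() + 1e-8
    return 1.0 / (1.0 + np.exp(-k * (img - mu) / sigma))
```

The same logistic squashing, matched per-component to means and variances, reappears later in the gain-control step applied to the projection coefficients.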
The database consists of movies of 12 subjects each uttering the digits in English twice. While the number of words is limited, the database is challenging due to differences in illumination conditions, ethnicity and gender of the subjects. Image preprocessing consisted of the following steps: First the contour of the outer lips was tracked using point distribution models, a data-driven technique based on analysis of the gray-level statistics around lip contours [5]. The lip images were then normalized for translation and rotation. This was accomplished by first padding the image on all sides with 25 rows or columns of zeros, and modulating the images in the spatial frequency domain. The images were symmetrized with respect to the vertical axis going through the center of the lips. This makes the final representation more robust to horizontal changes in illumination. The images were cropped to 65 pixels vertically x 87 pixels horizontally (see Figure 1) and their intensity was normalized using logistic gain control [7]. Eight different techniques were used on the normalized database, each of which developed a different image basis. For each of these techniques the following steps were followed: (1) Projection: For each image in the database we compute the coordinates x(t) of the image with respect to the image bases developed using each of the eight techniques; (2) Temporal differentiation: For each time step we compute the vectors δ(t) = x(t) - x(t-1), where x(t) represents the coordinate vector of the image presented at time t; (3) Gain control: Each component of x(t) and δ(t) is independently scaled using a logistic gain control function matched to the mean and variance of each component across an entire movie [7]. This results in a form of soft histogram equalization; (4) Recognition: The scaled x(t) and δ(t) coefficients are fed to the HMM recognition engine. \n\nFigure 2: Global decompositions for the normalized image dataset. Row 1: Global kernels of principal component analysis ordered with first eigenimage on left. Row 2: Log magnitude spectrum of eigenimages. Row 3: Global pixel space independent component kernels ordered according to projected variance. Row 4: Log magnitude spectrum of global independent components. \n\n3 Global Methods \n\nWe first evaluated the performance of techniques based on the statistics of the entire lip images as opposed to portions of them. This global approach has been shown to provide good performance on face recognition [9], expression recognition [2], and gender recognition tasks [4]. In particular we compared the performance of principal component analysis (PCA) and two different versions of independent component analysis (ICA). \n\n3.1 Global PCA \n\nWe tried image bases that consisted of the first 50, 100 and 150 eigenvectors of the pixelwise covariance matrix. Best results were obtained with the first 50 principal components (which accounted for 94.6% of the variance) and are the only ones reported here. The top row of Figure 2 shows the first 5 eigenvectors displayed as images; their magnitude spectra are shown in the second row. These eigenimages have most of their energy localized in low and horizontal spatial frequencies and are typically non-local in the spatial domain (i.e., have non-zero energy distributed over the whole image). \n\n3.2 Global ICA \n\nThe goal of Infomax ICA is to transform an input random vector such that the entropy of the output vector is maximized [3]. The main differences between ICA and PCA are: (1) ICA maximizes the joint entropy of the outputs, while PCA maximizes the sum of their variances; (2) PCA provides orthogonal basis vectors, while ICA basis vectors need not be orthogonal; (3) PCA outputs are always uncorrelated, but may not be statistically independent. 
ICA attempts to extract independent outputs, not just uncorrelated ones. We tried two different ICA approaches: \n\nICA I: This method results in a non-orthogonal transformation of the bases developed via PCA. While such transformations do not change the underlying space of the representation, they may facilitate the job of the recognition engine by decreasing the statistical dependency among the coordinates. First each image in the database was projected onto the space spanned by the first 50 eigenvectors of the pixelwise covariance matrix. Then ICA was performed on the 50 PCA coordinate variables to obtain a new 50-dimensional non-orthogonal basis. \n\nFigure 3: Upper left: Lip patches (12 pixels x 12 pixels) from randomly chosen locations used to develop local PCA and local ICA kernels. Lower left: Four orthogonal images generated from a single local PCA kernel. Right: Top 10 local PCA and ICA kernels ordered according to projected variance (highest at top left). Note how the ICA vectors tend to be more local and consistent with the receptive fields found in V1. \n\nICA II: A different approach to ICA was explored in [1] for face recognition tasks and by [6] for fMRI images. While in ICA-I the goal is to develop independent image coordinates, in ICA-II the goal is for the image bases themselves to be independent. Here independence of images is defined with respect to a probability space in which pixels are seen as outcomes and images as random vectors of such outcomes. The approach, which is described in detail in [6], resulted in a set of 50 images which were a non-orthogonal linear transformation of the first 50 eigenvectors of the pixelwise covariance matrix. The first 5 images (accounting for the largest amounts of projected variance) obtained via this approach to ICA are shown in the third row of Figure 2. The fourth row shows their magnitude spectra. 
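The ICA I pipeline (PCA projection, then infomax ICA on the coordinates) can be sketched numerically as follows. This is a minimal NumPy illustration, not the authors' implementation: the natural-gradient infomax update with a logistic nonlinearity follows [3], but the learning rate, iteration count and initialization are arbitrary illustrative choices.

```python
import numpy as np

def pca_project(X, k=50):
    # Project images (rows of X) onto the first k eigenvectors of the
    # pixelwise covariance matrix, computed here via SVD of centered data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, Vt[:k]  # coordinates (n, k), basis (k, n_pixels)

def infomax_ica(Z, n_iter=100, lr=0.001, seed=0):
    # Natural-gradient infomax ICA on PCA coordinates Z (n, k):
    #   W <- W + lr * (I + (1 - 2*g(U))^T U / n) W,  U = Z W^T,
    # with g the logistic function. lr and n_iter are illustrative.
    rng = np.random.default_rng(seed)
    n, k = Z.shape
    W = np.eye(k) + 0.01 * rng.standard_normal((k, k))
    for _ in range(n_iter):
        U = Z @ W.T
        Y = 1.0 / (1.0 + np.exp(-U))
        W += lr * (np.eye(k) + (1.0 - 2.0 * Y).T @ U / n) @ W
    return W  # generally non-orthogonal transform of the PCA coordinates
```

The rows of W define the new 50-dimensional non-orthogonal basis in PCA coordinates; composing W with the PCA eigenvectors gives the corresponding (non-orthogonal) image basis.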
As reported in [1], the images obtained using this method are more local than those obtained via PCA. \n\n4 Local Methods \n\nPadgett et al. [8] reported surprisingly good results on an emotion recognition task using PCA on random patches of the face instead of the entire face. Recent theoretical work also places emphasis on spatially localized, wavelet-like image bases. One potential advantage of spatially localized image bases is that they provide explicit information about where things are happening, not just about what is happening. This facilitates the work of recognition engines on some tasks, but the theoretical reasons for this are unclear at this point. \n\nFigure 4 (panel titles: PCA Kernel 1, PCA Kernel 2, ICA Kernel 1, ICA Kernel 9): Kernel-location combinations chosen using unblocked variable selection. Top of each quadrant: Local ICA or PCA kernel. Bottom of each quadrant: Lip image convolved with the corresponding local kernel, then downsampled. The numbers on the lip image indicate the order in which variables were chosen for the multiple regression procedure. There are no numbers on the right side of the lip images because only half of each lip image was used for the representation (since the images are symmetrized). \n\nLocal PCA and ICA kernels were developed based on a database of 18680 small patches (12 pixels x 12 pixels) chosen from random locations in the Tulips1 database. A sample of these random patches (superimposed on a lip image) is shown in the top panel of Figure 3. Hereafter we refer to the 12 pixel x 12 pixel images obtained via PCA or ICA as \"kernels\". Image bases were generated by centering a local PCA or ICA kernel onto different locations and padding the rest of the matrix with zeros, as displayed in Figure 3 (lower left panel). 
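Generating one of these zero-padded basis images can be sketched as follows; a generic NumPy illustration, with the 65 x 87 image size taken from the text and the placement arguments assumed.

```python
import numpy as np

def local_basis_image(kernel, top, left, shape=(65, 87)):
    # Build one basis image by placing a small local kernel at a given
    # location and zero-padding the rest of the matrix, as in Figure 3
    # (lower left). Shifted copies of the kernel give the rest of the basis.
    basis = np.zeros(shape)
    h, w = kernel.shape
    basis[top:top + h, left:left + w] = kernel
    return basis
```

Projecting a lip image onto such a basis image reduces to a dot product between the kernel and the image patch at that location, which is why the coordinates can equivalently be computed by filtering the image with the kernel and subsampling the output.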
This results in basis images which are local in space (the energy is localized about a single patch) and shifted versions of each other. The process of obtaining image coordinates can be seen as a filtering operation followed by subsampling: First the images are filtered using a bank of filters whose impulse responses are the kernels obtained via PCA (or ICA). The relevant coordinates are obtained by subsampling at 300 uniformly distributed locations (15 locations vertically by 20 locations horizontally). We explored four different filtering approaches: (1) Single linear shift invariant (LSI) filter; (2) Single linear shift variant (LSV) filter; (3) Bank of LSI filters with blocked selection; (4) Bank of LSI filters combined with unblocked selection. \n\nFor the single-filter LSI approach, the images were convolved with a single local ICA kernel or a local PCA kernel. The top 5 local PCA and ICA kernels were each tested separately and the results obtained with the best of the 5 kernels were reported. For the single-filter LSV approach, different local PCA kernels were derived for a total of 117 non-overlapping regions, each of which occupied 5 x 5 pixels. Each region of the 934 images was projected onto the first principal component corresponding to that location. This effectively resulted in an LSV filtering operation. \n\n4.1 Automatic Selection of Focal Points \n\nPadgett's [8] most successful method was based on outputs of local filters at manually selected focal regions. Their task was emotion recognition and the focal regions were the eyes and mouth. In visual speech recognition, once the lips are chosen it is unclear which regions would be most informative. Thus we developed a method for automatic selection of focal regions. \n\nTable 1: Best generalization performance (% correct) ± standard error of the mean for all image representations. \nGlobal Methods: Global PCA 79.2 ± 4.7; Global ICA I 61.5 ± 4.5; Global ICA II 74.0 ± 5.4. \nLocal Methods: Single-Filter LSI PCA 90.6 ± 3.1; Single-Filter LSI ICA 89.6 ± 3.0; Blocked Filter Bank PCA 85.4 ± 3.7; Blocked Filter Bank ICA 85.4 ± 3.0; Unblocked Filter Bank PCA 91.7 ± 2.8; Unblocked Filter Bank ICA 91.7 ± 3.2. \n\nFirst 10 filters were developed via local ICA (or PCA). Each image was filtered using the 10-filter bank and the outputs were subsampled at 150 locations for a 1500-dimensional representation (10 filters x 150 locations) of each of the images in the dataset. Regions and variables of interest were then selected using a stepwise forward multiple regression procedure. First we choose the variable that, when averaging across the entire database, best reconstructs the original images. Here best reconstruction is defined in terms of least squares using a multiple regression model. Once a variable is selected, it is \"tenured\" and we search for the variable which in combination with the tenured ones best reconstructs the image database. The procedure is stopped when the number of tenured variables reaches a criterion point. We compared performance using 50, 100, and 150 tenured variables and report results with the best of those three numbers. We tested two different selection procedures, one blocked by location and one in which location was not blocked. In the first method the selection was done in blocks of 10 variables, where each block contained the outputs of all the filters at a specific location. If a location was chosen, the outputs of the 10 filters at that location were automatically included in the final image representation. In the second method selection of variables was not blocked by location. 
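The unblocked stepwise forward selection described above can be sketched as a greedy loop: at each step, tenure the variable that most reduces the least-squares error of reconstructing the images from the tenured set. A minimal NumPy sketch, not the authors' code; brute-force refitting at every step is used for clarity, whereas a practical implementation would update the regression incrementally.

```python
import numpy as np

def forward_select(F, Y, n_select=10):
    # Greedy forward (stepwise) selection. F is (n_samples, n_vars) of
    # filter outputs; Y is (n_samples, n_pixels) of images to reconstruct.
    # At each step, tenure the not-yet-chosen variable whose addition gives
    # the smallest least-squares reconstruction error of Y.
    chosen = []
    for _ in range(n_select):
        best_var, best_err = None, np.inf
        for j in range(F.shape[1]):
            if j in chosen:
                continue
            Xj = F[:, chosen + [j]]
            coef, _, _, _ = np.linalg.lstsq(Xj, Y, rcond=None)
            err = ((Y - Xj @ coef) ** 2).sum()
            if err < best_err:
                best_var, best_err = j, err
        chosen.append(best_var)
    return chosen  # variable indices in the order they were tenured
```

The blocked variant differs only in that candidates are groups of 10 variables (all filters at one location) rather than single variables, and a chosen location tenures its whole block at once.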
\n\nFigure 4 shows, for 2 local PCA and 2 local ICA kernels, the first 10 variables chosen for each particular kernel using the forward selection multiple regression procedure. The numbers on the lip images in this figure indicate the order in which particular kernel/location variables were chosen using the sequential regression procedure: \"1\" indicates the first variable chosen, \"2\" the second, etc. \n\n5 Results and Conclusions \n\nTable 1 shows the best generalization performance (out of the 9 HMM architectures tested) for each of the eight image representation methods. The local decompositions significantly outperformed the global ones (t(106) = 4.10, p < 0.001). The improved performance of local representations is consistent with current ideas on the importance of localized wavelet-like representations. However, it is unclear why local decompositions work better. One possibility is that these results apply only to this particular recognition engine and the problem at hand (i.e., hidden Markov models for speechreading). Yet similar results with local representations were reported in [8] on an emotion classification task with a 3-layer backpropagation network and in [1] on an expression classification task with a nearest neighbor classifier. Another possible explanation for the advantage of local representations is that global unsupervised decompositions emphasize subject identity while local decompositions tend to hide it. We found some evidence consistent with this idea by testing global and local representations on a subject identification task (i.e., recognizing which person the lip images belong to). For this task the global representations outperformed the local ones. However, this result is inconsistent with [8], which found local representations were better on both emotion classification and subject identification tasks. 
Another possibility is that local representations make more explicit information about where things are happening, not just what is happening, and such information turns out to be important for the task at hand. \n\nThe image representations obtained using the filter-bank methods with unblocked selection yielded the best results. The stepwise regression technique used to select kernels and regions of interest led to substantial gains in recognition performance. In fact, the highest generalization performance reported here (91.7% with the bank of filters using unblocked variable selection) surpassed the best published performance on this dataset [5]. \n\nReferences \n\n[1] M.S. Bartlett. Face Image Analysis by Unsupervised Learning and Redundancy Reduction. PhD thesis, University of California, San Diego, 1998. \n\n[2] M.S. Bartlett, P.A. Viola, T.J. Sejnowski, J. Larsen, J. Hager, and P. Ekman. Classifying facial action. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 823-829. Morgan Kaufmann, San Mateo, CA, 1996. \n\n[3] A.J. Bell and T.J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995. \n\n[4] G. Cottrell and J. Metcalfe. Face, gender and emotion recognition using holons. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 3, pages 564-571, San Mateo, CA, 1991. Morgan Kaufmann. \n\n[5] Juergen Luettin. Visual Speech and Speaker Recognition. PhD thesis, University of Sheffield, 1997. \n\n[6] M.J. McKeown, S. Makeig, G.G. Brown, T-P. Jung, S.S. Kindermann, A.J. Bell, and T.J. Sejnowski. Analysis of fMRI data by decomposition into independent components. Proc. Nat. Acad. Sci., in press. \n\n[7] J.R. Movellan. Visual speech recognition with stochastic networks. In G. Tesauro, D.S. Touretzky, and T. 
Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 851-858. MIT Press, Cambridge, MA, 1995. \n\n[8] C. Padgett and G. Cottrell. Representing face images for emotion classification. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, Cambridge, MA, 1997. MIT Press. \n\n[9] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991. \n", "award": [], "sourceid": 1877, "authors": [{"given_name": "Michael", "family_name": "Gray", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}, {"given_name": "Javier", "family_name": "Movellan", "institution": null}]}