{"title": "Surface Learning with Applications to Lipreading", "book": "Advances in Neural Information Processing Systems", "page_first": 43, "page_last": 50, "abstract": null, "full_text": "Surface Learning with Applications to Lipreading \n\nChristoph Bregler *,** \n*Computer Science Division \nUniversity of California \nBerkeley, CA 94720 \n\nStephen M. Omohundro ** \n**Int. Computer Science Institute \n1947 Center Street Suite 600 \nBerkeley, CA 94704 \n\nAbstract \n\nMost connectionist research has focused on learning mappings from one space to another (e.g. classification and regression). This paper introduces the more general task of learning constraint surfaces. It describes a simple but powerful architecture for learning and manipulating nonlinear surfaces from data. We demonstrate the technique on low-dimensional synthetic surfaces and compare it to nearest-neighbor approaches. We then show its utility in learning the space of lip images in a system for improving speech recognition by lipreading. This learned surface is used to improve the visual tracking performance during recognition. \n\n1 Surface Learning \n\nMappings are an appropriate representation for systems whose variables naturally decompose into \"inputs\" and \"outputs\". To use a learned mapping, the input variables must be known and error-free, and a single output value must be estimated for each input. Many tasks in vision, robotics, and control must maintain relationships between variables which don't naturally decompose in this way. Instead, there is a nonlinear constraint surface on which the values of the variables are jointly restricted to lie. We propose a representation for such surfaces which supports a wide range of queries and which can be naturally learned from data. \n\nThe simplest queries are \"completion queries\". 
In these queries, the values of certain variables are specified and the values (or constraints on the values) of the remaining variables are to be determined. \n\nFigure 1: Using a constraint surface to reduce uncertainty in two variables \n\nFigure 2: Finding the closest point in a surface to a given point. \n\nThis reduces to a conventional mapping query if the \"input\" variables are specified and the system reports the values of the corresponding \"output\" variables. Such queries can also be used to invert mappings, however, by specifying the \"output\" variables in the query. Figure 1 shows a generalization in which the variables are known to lie within certain ranges and the constraint surface is used to further restrict these ranges. \n\nFor recognition tasks, \"nearest point\" queries, in which the system must return the surface point which is closest to a specified sample point, are important (Figure 2). For example, symmetry-invariant classification can be performed by taking the surface to be generated by applying all symmetry operations to class prototypes (e.g. translations, rotations, and scalings of exemplar characters in an OCR system). In our representation we are able to efficiently find the globally nearest surface point in this kind of query. \n\nOther important classes of queries are \"interpolation queries\" and \"prediction queries\". For these, two or more points on a curve are specified and the goal is to interpolate between them or extrapolate beyond them. Knowledge of the constraint surface can dramatically improve performance over \"knowledge-free\" approaches like linear or spline interpolation. \n\nIn addition to supporting these and other queries, one would like a representation which can be efficiently learned. The training data is a set of points randomly drawn from the surface. 
The system should generalize from these training points to form a representation of the surface (Figure 3). This task is more difficult than mapping learning for several reasons: 1) The system must discover the dimension of the surface, 2) The surface may be topologically complex (e.g. a torus or a sphere) and may not support a single set of coordinates, 3) The broader range of queries discussed above must be supported. \n\nFigure 3: Surface Learning \n\nOur approach starts from the observation that if the data points were drawn from a linear surface, then a principal components analysis could be used to discover the dimension of the linear space and to find the best-fit linear space of that dimension. The largest principal vectors would span the space and there would be a precipitous drop in the principal values at the dimension of the surface. A principal components analysis will no longer work, however, when the surface is nonlinear, because even a 1-dimensional curve could be embedded so as to span all the dimensions of the space. \n\nIf a nonlinear surface is smooth, however, then each local piece looks more and more linear under magnification. If we consider only those data points which lie within a local region, then to a good approximation they come from a linear surface patch. The principal values can be used to determine the most likely dimension of the surface, and that number of the largest principal components span its tangent space (Omohundro, 1988). The key idea behind our representation is to \"glue\" these local patches together using a partition of unity. 
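The local-PCA-plus-partition-of-unity construction described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the function names, the plain k-means loop, and the choice of the RMS neighbor distance as the Gaussian width are assumptions.

```python
import numpy as np

def fit_patches(points, n_prototypes=16, n_neighbors=20, dim=1, seed=0):
    """Fit local linear patches: k-means prototype centers plus a local PCA
    tangent basis and a Gaussian influence width at each center."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), n_prototypes, replace=False)].copy()
    for _ in range(20):  # a few rounds of plain k-means
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_prototypes):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(0)
    patches = []
    for c in centers:
        nbrs = points[np.argsort(((points - c) ** 2).sum(-1))[:n_neighbors]]
        mu = nbrs.mean(0)
        # local PCA: the leading right-singular vectors span the tangent space
        _, _, vt = np.linalg.svd(nbrs - mu, full_matrices=False)
        sigma = np.sqrt(((nbrs - mu) ** 2).sum(-1).mean())  # width from sample spread
        patches.append((mu, vt[:dim], sigma))
    return patches

def project(x, patches):
    """Blend the local linear projections with a Gaussian partition of unity."""
    num, den = 0.0, 0.0
    for mu, basis, sigma in patches:
        w = np.exp(-((x - mu) ** 2).sum() / (2.0 * sigma ** 2))
        num = num + w * (mu + basis.T @ (basis @ (x - mu)))  # affine patch projection
        den += w
    return num / den
```

Projecting a point near the surface then blends the nearby tangent-plane projections, which is the behavior equation-style formulations of this construction rely on.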
\n\nWe are exploring several implementations, but all the results reported here come from a representation based on the \"nearest point\" query. The surface is represented as a mapping from the embedding space to itself which takes each point to the nearest surface point. K-means clustering is used to determine an initial set of \"prototype centers\" from the data points. A principal components analysis is performed on a specified number of the nearest neighbors of each prototype. These \"local PCA\" results are used to estimate the dimension of the surface and to find the best linear projection in the neighborhood of prototype i. The influence of these local models is determined by Gaussians centered on the prototype locations with a variance determined by the local sample density. The projection onto the surface is determined by forming a partition of unity from these Gaussians and using it to form a convex linear combination of the local linear projections: \n\nF(x) = sum_i [ G_i(x) / sum_j G_j(x) ] P_i(x)   (1) \n\nwhere G_i is the Gaussian associated with prototype i and P_i is the local linear projection at prototype i. This initial model is then refined to minimize the mean squared error between the training samples and the nearest surface point using EM optimization and gradient descent. \n\nFigure 4: Learning a 1-dimensional surface. a) The surface to learn, b) The local patches and the range of their influence functions, c) The learned surface \n\n2 Synthetic Examples \n\nTo see how this approach works, consider 200 samples drawn from a 1-dimensional curve in a two-dimensional space (Figure 4a). 16 prototype centers are chosen by k-means clustering. At each center, a local principal components analysis is performed on the closest 20 training samples. Figure 4b shows the prototype centers and the two local principal components as straight lines. In this case, the larger principal value is several times larger than the smaller one. The system therefore attempts to construct a one-dimensional learned surface. 
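The \"precipitous drop\" criterion for choosing the surface dimension from the local principal values can be sketched as follows. The factor-of-three gap threshold is a hypothetical stand-in for the paper's \"several times larger\" observation, and the function name is illustrative.

```python
import numpy as np

def estimate_dim(principal_values, ratio=3.0):
    """Pick the surface dimension at the first large drop in the local
    principal values (hypothetical threshold: a factor-of-`ratio` gap)."""
    s = np.asarray(principal_values, dtype=float)
    for d in range(1, len(s)):
        if s[d - 1] > ratio * s[d]:
            return d
    return len(s)  # no clear drop: assume a full-dimensional patch
```

For the 1-dimensional curve above, the first local principal value dominates the second, so the estimate would be 1; for the sphere example it would be 2.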
The circles in Figure 4b show the extent of the Gaussian influence functions for each prototype. Figure 4c shows the resulting learned surface. It was generated by randomly selecting 2000 points in the neighborhood of the surface and projecting them according to the learned model. \n\nFigure 5 shows the same process applied to learning a two-dimensional surface embedded in three dimensions. \n\nTo quantify the performance of this learning algorithm, we studied the effect of the different parameters on learning a two-dimensional sphere in three dimensions. It is easy to compare the learned results with the correct ones in this case. Figure 6a shows how the empirical error in the nearest point query decreases as a function of the number of training samples. We compare it against the error made by a nearest-neighbor algorithm. With 50 training samples our approach produces an error which is one-fourth as large. Figure 6b shows how the average size of the local principal values depends on the number of nearest neighbors included. Because this is a two-dimensional surface, the two largest values are well separated from the third largest. The rate of growth of the principal values is useful for determining the dimension of the surface in the presence of noise. \n\nFigure 5: Learning a two-dimensional surface in three dimensions. a) 1000 random samples on the surface, b) The two largest local principal components at each of 100 prototype centers based on 25 nearest neighbors. \n\n
Figure 6: Quantitative performance on learning a two-dimensional sphere in three dimensions. a) Mean squared error of closest point queries as a function of the number of samples for the learned surface vs. the nearest training point, b) The mean square root of the three principal values as a function of the number of neighbors included in each local PCA. \n\nFigure 7: Snakes for finding the lip contours. a) A correctly placed snake, b) A snake which has gotten stuck in a local minimum of the simple energy function. \n\n3 Modelling the space of lips \n\nWe are using this technique as part of a system to do \"lipreading\". To provide features for \"viseme classification\" (visemes are the visual analog of phonemes), we would like the system to reliably track the shape of a speaker's lips in video images. It should be able to identify the corners of the lips and to estimate the bounding curves robustly under a variety of imaging and lighting conditions. Two approaches to this kind of tracking task are \"snakes\" (Kass et al., 1987) and \"deformable templates\" (Yuille, 1991). Both of these approaches minimize an \"energy function\" which is a sum of an internal model energy and an energy measuring the match to external image features. \n\nFor example, to use the \"snake\" approach for lip tracking, we form the internal energy from the first and second derivatives of the coordinates along the snake, preferring smoother snakes to less smooth ones. The external energy is formed from an estimate of the negative image gradient along the snake. Figure 7a shows a snake which has correctly relaxed onto a lip contour. This energy function is not very specific to lips, however. 
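The internal/external split of the snake energy just described can be written discretely as a sum of squared first and second differences minus the image-gradient magnitude accumulated along the contour. This is a generic sketch, not the system's code; the alpha/beta weights and the grad_mag callable are illustrative assumptions.

```python
import numpy as np

def snake_energy(pts, grad_mag, alpha=1.0, beta=1.0):
    """Discrete snake energy: smoothness terms from first and second
    differences of the contour points, plus an external term that rewards
    strong image gradients along the contour."""
    d1 = np.diff(pts, 1, axis=0)               # first differences (tension)
    d2 = np.diff(pts, 2, axis=0)               # second differences (stiffness)
    internal = alpha * (d1 ** 2).sum() + beta * (d2 ** 2).sum()
    external = -sum(grad_mag(p) for p in pts)  # strong edges lower the energy
    return internal + external
```

Minimizing this energy prefers smooth contours lying on strong edges, which is exactly why it can relax onto any nearby edge, lip or not.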
For example, the internal energy just causes the snake to be a controlled-continuity spline. The \"lip-snakes\" sometimes relax onto undesirable local minima like that shown in Figure 7b. Models based on deformable templates allow a researcher to more strongly constrain the shape space (typically with hand-coded quadratic linking polynomials), but are difficult to use for representing fine-grain lip features. \n\nOur approach is to use surface learning as described here to build a model of the space of lips. We can then replace the internal energy described above by a quantity computed from the distance to the learned surface in lip feature space. \n\nOur training set consists of 4500 images of a speaker uttering random words1. The training images are initially \"labeled\" with the conventional snake algorithm. Incorrectly aligned snakes are removed from the database by hand. The contour shape is parameterized by the x and y coordinates of 40 evenly spaced points along the snake. All values are normalized to give a lip width of 1. Each lip contour is therefore a point in an 80-dimensional \"lip space\". The lip configurations which actually occur lie on a lower-dimensional surface embedded in this space. Our experiments show that a 5-dimensional surface in the 80-dimensional lip space is sufficient to describe the contours with single-pixel accuracy in the image. \n\n1The data was collected for an earlier lipreading system described in (Bregler, Hild, Manke, Waibel 1993). \n\nFigure 8: Two principal axes in a local patch in lip space. a, b, and c are configurations along the first principal axis, while d, e, and f are along the third axis. \n\nFigure 9: a) Initial crude estimate of the contour, b) An intermediate step in the relaxation, c) The final contour. 
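The contour parameterization described above (40 evenly spaced points, normalized to a lip width of 1, giving a point in an 80-dimensional lip space) can be sketched as follows. Centering at the mean is an added assumption here, since the text only specifies the width normalization.

```python
import numpy as np

def contour_to_feature(points):
    """Flatten a lip contour of 40 (x, y) points into an 80-dim feature
    vector, normalized so the horizontal lip extent is 1."""
    pts = np.asarray(points, dtype=float)
    assert pts.shape == (40, 2)
    width = pts[:, 0].max() - pts[:, 0].min()  # lip width in pixels
    pts = (pts - pts.mean(0)) / width          # centering is an assumption
    return pts.reshape(-1)                     # a point in 80-dim "lip space"
```

Each training snake becomes one such vector, and the collection of vectors is what the surface-learning algorithm is fit to.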
Figure 8 shows some lip models along two of the principal axes in the local neighborhood of one of the patches. The lip recognition system uses this learned surface to improve the performance of tracking on new image sequences. \n\nThe tracking algorithm starts with a crude initial estimate of the lip position and size. It chooses the closest model in the lip surface and maps the corresponding resized contour back onto the estimated image position (Figure 9a). The external image energy is taken to be the cumulative magnitude of graylevel gradient estimates along the current contour. This term has its maximum value when the curve is aligned exactly on the lip boundary. We perform gradient ascent in the contour space, but constrain the contour to lie in the learned lip surface. This is achieved by reprojecting the contour onto the lip surface after each gradient step. The surface thereby acts as the analog of the internal energy in the snake and deformable template approaches. Figure 9b shows the result after a few steps and Figure 9c shows the final contour. The image gradient is estimated using an image filter whose width is gradually reduced as the search proceeds. \n\nThe lip contours in successive images in the video sequence are found by starting with the relaxed contour from the previous image and performing gradient ascent with the altered external image energies. Empirically, surface-based tracking is far more robust than the \"knowledge-free\" approaches. While we have described the approach in the context of contour finding, it is much more general, and we are currently extending the system to model more complex aspects of the image. \n\nThe full lipreading system which combines the described tracking algorithm and a hybrid connectionist speech recognizer (MLP/HMM) is described in (Bregler and Konig 1994). 
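The constrained relaxation loop, gradient ascent on the external image energy with a reprojection onto the learned surface after every step, can be sketched generically. The callable interface is hypothetical; in the real system the projection would be the learned lip-surface model and the gradient would come from image filtering.

```python
import numpy as np

def track(contour, image_energy_grad, project_to_surface, steps=50, lr=0.1):
    """Gradient ascent on the external image energy, constrained to the
    learned surface by reprojecting after each step (the surface plays the
    role of the snake's internal energy)."""
    x = project_to_surface(np.asarray(contour, dtype=float))  # closest model
    for _ in range(steps):
        x = x + lr * image_energy_grad(x)  # ascend the image-gradient term
        x = project_to_surface(x)          # constrain to the learned surface
    return x
```

On a toy problem where the \"surface\" is the unit circle and the energy pulls toward a target point, the loop converges to the surface point nearest the target, mirroring how the lip contour settles onto the lip boundary while staying a valid lip shape.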
Additionally, we will use the lip surface to interpolate visual features to match them with the higher-rate auditory features. \n\n4 Conclusions \n\nWe have presented the task of learning surfaces from data and described several important queries that the learned surfaces should support: completion, nearest point, interpolation, and prediction. We have described an algorithm which is capable of efficiently performing these tasks and demonstrated it on both synthetic data and on a real-world lip-tracking problem. The approach can be made computationally efficient using the \"bumptree\" data structure described in (Omohundro, 1991). We are currently studying the use of \"model merging\" to improve the representation and are also applying it to robot control. \n\nAcknowledgements \n\nThis research was funded in part by Advanced Research Projects Agency contract #N00014-93-C-0249 and by the International Computer Science Institute. The database was collected with a grant from Land Baden-Wuerttemberg (Landesschwerpunkt Neuroinformatik) at Alex Waibel's institute. \n\nReferences \n\nC. Bregler, H. Hild, S. Manke & A. Waibel. (1993) Improving Connected Letter Recognition by Lipreading. In Proc. of Int. Conf. on Acoustics, Speech, and Signal Processing, Minneapolis. \n\nC. Bregler & Y. Konig. (1994) \"Eigenlips\" for Robust Speech Recognition. In Proc. of Int. Conf. on Acoustics, Speech, and Signal Processing, Adelaide. \n\nM. Kass, A. Witkin & D. Terzopoulos. (1987) SNAKES: Active Contour Models. In Proc. of the First Int. Conf. on Computer Vision, London. \n\nS. Omohundro. (1988) Fundamentals of Geometric Learning. University of Illinois at Urbana-Champaign Technical Report UIUCDCS-R-88-1408. \n\nS. Omohundro. (1991) Bumptrees for Efficient Function, Constraint, and Classification Learning. In Lippmann, Moody, and Touretzky (eds.), Advances in Neural Information Processing Systems 3. 
San Mateo, CA: Morgan Kaufmann. \n\nA. Yuille. (1991) Deformable Templates for Face Recognition. Journal of Cognitive Neuroscience, Volume 3, Number 1. \n", "award": [], "sourceid": 814, "authors": [{"given_name": "Christoph", "family_name": "Bregler", "institution": null}, {"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}