{"title": "Nonlinear Image Interpolation using Manifold Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 973, "page_last": 980, "abstract": null, "full_text": "Nonlinear Image Interpolation using Manifold Learning \n\nChristoph Bregler \nComputer Science Division \nUniversity of California \nBerkeley, CA 94720 \nbregler@cs.berkeley.edu \n\nStephen M. Omohundro* \nInt. Computer Science Institute \n1947 Center Street, Suite 600 \nBerkeley, CA 94704 \nom@research.nj.nec.com \n\nAbstract \n\nThe problem of interpolating between specified images in an image sequence is a simple, but important task in model-based vision. We describe an approach based on the abstract task of \"manifold learning\" and present results on both synthetic and real image sequences. This problem arose in the development of a combined lip-reading and speech recognition system. \n\n1 Introduction \n\nPerception may be viewed as the task of combining impoverished sensory input with stored world knowledge to predict aspects of the state of the world which are not directly sensed. In this paper we consider the task of image interpolation, by which we mean hypothesizing the structure of images which occurred between given images in a temporal sequence. This task arose during the development of a combined lip-reading and speech recognition system [3], because the time windows for auditory and visual information are different (30 frames per second for the camera vs. 100 feature vectors per second for the acoustic information). It is an excellent visual test domain in general, however, because it is easy to generate large amounts of test and training data and the performance measure is largely \"theory independent\". 
\nThe test consists of simply presenting two frames from a movie and comparing the hypothesized intermediate frames to the actual ones. It is easy to use footage of a particular visual domain as training data in the same way. \n\n*New address: NEC Research Institute, Inc., 4 Independence Way, Princeton, NJ 08540 \n\nFigure 1: Linear interpolated lips. \n\nFigure 2: Desired interpolation. \n\nMost current approaches to model-based vision require hand-constructed CAD-like models. We are developing an alternative approach in which the vision system builds up visual models automatically by learning from examples. One of the central components of this kind of learning is the abstract problem of inducing a smooth nonlinear constraint manifold from a set of examples from the manifold. We call this \"manifold learning\" and have developed several approaches closely related to neural networks for doing it [2]. In this paper we apply manifold learning to the image interpolation problem and numerically compare the results of this \"nonlinear\" process with simple linear interpolation. We find that the approach works well when the underlying model space is low-dimensional. In more complex examples, manifold learning cannot be directly applied to images, but it is still a central component in a more complex system (not discussed here). \n\nWe present several approaches to using manifold learning for this task and compare their performance to that of simple linear interpolation. Figure 1 shows the results of linear interpolation of lip images from the lip-reading system. Even over the short period of 33 milliseconds, linear interpolation can produce an unnatural lip image. The problem is that linear interpolation of two images just averages the two pictures: the interpolated image in Fig. 1 has two lower lip parts instead of just one. 
The desired interpolated image is shown in Fig. 2, and consists of a single lower lip positioned at a location between the lower lip positions in the two input pictures. \n\nOur interpolation technique is nonlinear, and is constrained to produce only images from an abstract manifold in \"lip space\" induced by learning. Section 2 describes the manifold learning procedure, Section 3 the linear preprocessing, Section 4 introduces the interpolation technique based on the induced manifold, and Sections 5 and 6 describe our experiments on artificial and natural images. \n\n2 Manifold Learning \n\nEach n * m gray level image may be thought of as a point in an n * m-dimensional space. A sequence of lip images produced by a speaker uttering a sentence lies on a 1-dimensional trajectory in this space (figure 3). If the speaker were to move her lips in all possible ways, the images would define a low-dimensional submanifold (or nonlinear surface) embedded in the high-dimensional space of all possible graylevel images. \n\nFigure 3: Linear vs. nonlinear interpolation in graylevel dimensions (16x16 pixels = 256-dim. space). \n\nIf we could compute this nonlinear manifold, we could limit any interpolation algorithm to generate only images contained in it. Images not on the manifold cannot be generated by the speaker under normal circumstances. Figure 3 compares a curve of interpolated images lying on this manifold to straight line interpolation, which generally leaves the manifold and enters the domain of images which violate the integrity of the model. \n\nTo represent this kind of nonlinear manifold embedded in a high-dimensional feature space, we use a mixture model of local linear patches. Any smooth nonlinear manifold can be approximated arbitrarily well in each local neighborhood by a linear \"patch\". 
In our representation, local linear patches are \"glued\" together with smooth \"gating\" functions to form a globally defined nonlinear manifold [2]. We use the \"nearest-point query\" to define the manifold: given an arbitrary point near the manifold, it returns the closest point on the manifold. We answer such queries with a weighted sum of the linear projections of the point onto each local patch. The weights are defined by an \"influence function\" associated with each linear patch, which we usually define by a Gaussian kernel. The weight for each patch is the value of its influence function at the point divided by the sum of all influence functions (\"partition of unity\"). Figure 4 illustrates the nearest-point query. Because Gaussian kernels die off quickly, the effect of distant patches may be ignored, improving computational performance. The linear projections themselves consist of a dot product and so are computationally inexpensive. \n\nFor learning, we must fit such a mixture of local patches to the training data. An initial estimate of the patch centers is obtained from k-means clustering. We fit a patch to each local cluster using a local principal components analysis. Fine tuning of the model is done using the EM (expectation-maximization) procedure. \n\nFigure 4: Local linear patches glued together into a nonlinear manifold by influence functions: P(x) = sum_i G_i(x) P_i(x) / sum_i G_i(x). \n\nThis approach is related to the mixture-of-experts architecture [4], and to the manifold representation in [6]. Our EM implementation is related to [5], which uses a hierarchical gating function and local experts that compute linear mappings from one space to another space. 
In contrast, our approach uses a \"one-level\" gating function and local patches that project a space into itself. \n\n3 Linear Preprocessing \n\nDealing with very high-dimensional domains (e.g. 256 * 256 gray level images) requires large memory and computational resources. Much of this computation is not relevant to the task, however. Even if the space of images is nonlinear, the nonlinearity does not necessarily appear in all of the dimensions. Earlier experiments in the lip domain [3] have shown that images projected onto a 10-dimensional linear subspace still accurately represent all possible lip configurations. We therefore first project the high-dimensional images into such a linear subspace and then induce the nonlinear manifold within this lower-dimensional linear subspace. This preprocessing is similar to purely linear techniques [7, 10, 9]. \n\n4 Constraint Interpolation \n\nGeometrically, linear interpolation between two points in n-space may be thought of as moving along the straight line joining the two points. In our nonlinear approach to interpolation, the point moves along a curve joining the two points which lies in the manifold of legal images. We have studied several algorithms for estimating the shortest manifold trajectory connecting two given points. For the performance results, we studied the point which is halfway along the shortest trajectory. \n\n4.1 \"Free-Fall\" \n\nThe computationally simplest approach is to simply project the linearly interpolated point onto the nonlinear manifold. The projection is accurate when the point is close to the manifold. In cases where the linearly interpolated point is far away (i.e. no weight of the partition of unity dominates all the other weights), the closest-point query does not result in a good interpolant. 
For a worst case, consider a point in the middle of a circle or sphere. All local patches have the same weight and the weighted sum of all projections is the center point itself, which is not a manifold point. Furthermore, near such \"singular\" points, the final result is sensitive to small perturbations in the initial position. \n\n4.2 \"Manifold-Walk\" \n\nA better approach is to \"walk\" along the manifold itself rather than relying on the linear interpolant. Each step of the walk is linear and in the direction of the target point, but the result is immediately projected onto the manifold. This new point is then moved toward the target point and projected onto the manifold, etc. When the target is finally reached, the arc length of the curve is approximated by the accumulated lengths of the individual steps. The point half way along the curve is chosen as the interpolant. This algorithm is far more robust than the first one, because it only uses local projections, even when the two input points are far from each other. Figure 5b illustrates this algorithm. \n\n4.3 \"Manifold-Snake\" \n\nThis approach combines aspects of the first two algorithms. It begins with the linearly interpolated points and iteratively moves the points toward the manifold. The Manifold-Snake is a sequence of n points preferentially distributed along a smooth curve with equal distances between them. An energy function is defined on such sequences of points so that the energy minimum tries to satisfy these constraints (smoothness, equidistance, and nearness to the manifold): \n\nE = sum_i ( alpha ||v_{i+1} - 2 v_i + v_{i-1}||^2 + beta ||P(v_i) - v_i||^2 )    (1) \n\nwhere P(v_i) is the nearest manifold point to v_i. E has value 0 if all v_i are evenly distributed on a straight line and also lie on the manifold. In general E can never be 0 if the manifold is nonlinear, but a minimum of E represents an optimizing solution. We begin with a straight line between the two input points and perform gradient descent in E to find this optimizing solution. 
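As an illustration, the nearest-point query of Section 2 and the Manifold-Walk of Section 4.2 can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the PatchManifold class, the parameter values, and the hand-built patches in the usage example are illustrative choices, not the paper's implementation.

```python
import numpy as np

class PatchManifold:
    # Mixture of local linear patches glued by Gaussian influence functions.
    # centers: (k, d) array of patch centers; bases: list of (d, m) orthonormal bases.
    def __init__(self, centers, bases, sigma):
        self.centers, self.bases, self.sigma = centers, bases, sigma

    def nearest_point(self, x):
        # Partition-of-unity weighted sum of the linear projections onto each patch.
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        w = np.exp(-d2 / (2 * self.sigma ** 2))
        w /= w.sum()
        out = np.zeros_like(x)
        for wi, c, B in zip(w, self.centers, self.bases):
            out += wi * (c + B @ (B.T @ (x - c)))  # project x onto the patch
        return out

def manifold_walk(man, a, b, step=0.05, max_steps=2000):
    # Step toward the target, re-projecting onto the manifold after each step;
    # return the point halfway along the accumulated arc length.
    pts = [man.nearest_point(a)]
    for _ in range(max_steps):
        gap = b - pts[-1]
        dist = np.linalg.norm(gap)
        if dist <= step:
            break
        pts.append(man.nearest_point(pts[-1] + step * gap / dist))
    pts.append(man.nearest_point(b))
    arc = np.cumsum([np.linalg.norm(q - p) for p, q in zip(pts, pts[1:])])
    return pts[int(np.searchsorted(arc, arc[-1] / 2)) + 1]
```

For example, with 16 patches tangent to the unit circle, walking from (1, 0) to (0, 1) yields an interpolant near the 45-degree point of the circle, whereas linear interpolation would cut through the interior, exactly the failure mode of Fig. 1.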
\n\n5 Synthetic Examples \n\nTo quantify the performance of these approaches to interpolation, we generated a database of 16 * 16 pixel images consisting of rotated bars. The bars were rotated for each image by a specific angle. The images lie on a one-dimensional nonlinear manifold embedded in a 256-dimensional image space. A nonlinear manifold represented by 16 local linear patches was induced from the 256 images. Figure 6a shows two bars and their linear interpolation. Figure 6b shows the nonlinear interpolation using the Manifold-Walk algorithm. \n\nFigure 5: Proposed interpolation algorithms. a) \"Free Fall\", b) \"Surface Walk\", c) \"Surface Snake\". \n\nFigure 6: a) Linear interpolation, b) nonlinear interpolation. \n\nFigure 7 shows the average pixel mean squared error of linear and nonlinear interpolated bars. The x-axis represents the relative angle between the two input points. \n\nFigure 8 shows some iterations of a Manifold-Snake interpolating 7 points along a 1-dimensional manifold embedded in a 2-dimensional space. \n\nFigure 7: Average pixel mean squared error of linear and nonlinear interpolated bars. \n\nFigure 8: Manifold-Snake iterations (0, 1, 2, 5, 10, and 30 iterations) on an induced 1-dimensional manifold embedded in 2 dimensions. \n\nFigure 9: 16x16 images. Top row: linear interpolation. Bottom row: nonlinear \"manifold-walk\" interpolation. \n\n6 Natural Lip Images \n\nWe experimented with two databases of natural lip images taken from two different subjects. 
\nFigure 9 shows a case of linearly interpolated and nonlinearly interpolated 16 * 16 pixel lip images using the Manifold-Walk algorithm. The manifold consists of 16 4-dimensional local linear patches. It was induced from a training set of 1931 lip images recorded with a 30 frames per second camera from a subject uttering various sentences. The nonlinear interpolated image is much closer to a realistic lip configuration than the linear interpolated image. \n\nFigure 10 shows a case of linearly interpolated and nonlinearly interpolated 45 * 72 pixel lip images using the Manifold-Snake algorithm. The images were recorded with a high-speed 100 frames per second camera(1). Because of the much higher dimensionality of the images, we projected the images into a 16-dimensional linear subspace. Embedded in this subspace we induced a nonlinear manifold consisting of 16 4-dimensional local linear patches, using a training set of 2560 images. The linearly interpolated lip image shows upper and lower teeth, but with smaller contrast, because it is the average image of the open mouth and the closed mouth. The nonlinearly interpolated lip images show only the upper teeth and the lips half way closed, which is closer to the real lip configuration. \n\n(1) The images were recorded in the UCSD Perceptual Science Lab by Michael Cohen. \n\nFigure 10: 45x72 images projected into a 16-dimensional subspace. Top row: linear interpolation. Bottom row: nonlinear \"manifold-snake\" interpolation. \n\n7 Discussion \n\nWe have shown how induced nonlinear manifolds can be used to constrain the interpolation of gray level images. Several interpolation algorithms were proposed, and experimental studies have shown that constrained nonlinear interpolation works well both in artificial domains and natural lip images. 
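The linear preprocessing that these experiments rely on (Section 3), projecting images onto a low-dimensional subspace before inducing the manifold, can be sketched as follows. This is a minimal principal-components sketch; the function names and dimensions are our own illustrative choices, not the paper's code.

```python
import numpy as np

def fit_subspace(images, k):
    # images: (N, d) rows of flattened gray-level images.
    # Returns the mean image and the top-k principal directions as a (d, k) matrix.
    mean = images.mean(axis=0)
    _, _, Vt = np.linalg.svd(images - mean, full_matrices=False)
    return mean, Vt[:k].T

def project(x, mean, W):
    # Map an image into the k-dimensional linear subspace.
    return W.T @ (x - mean)

def reconstruct(z, mean, W):
    # Map a subspace point back into image space.
    return mean + W @ z
```

Manifold learning then operates entirely on the k-dimensional coordinates z, e.g. k = 16 for the 45x72 images of Section 6, and interpolated subspace points are mapped back to images for display.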
\n\nAmong various other nonlinear image interpolation techniques, the work of [1], using a Gaussian Radial Basis Function network, is most closely related to our approach. Their approach is based on feature locations found by pixelwise correspondence, whereas our approach directly interpolates graylevel images. \n\nAnother related approach is presented in [8]. Their images are also first projected into a linear subspace and then modelled by a nonlinear surface, but they require their training examples to lie on a grid in parameter space so that they can use spline methods. \n\nReferences \n\n[1] D. Beymer, A. Shashua, and T. Poggio. Example Based Image Analysis and Synthesis. M.I.T. A.I. Memo No. 1431, Nov. 1993. \n\n[2] C. Bregler and S. Omohundro. Surface Learning with Applications to Lip-Reading. In Advances in Neural Information Processing Systems 6, Morgan Kaufmann, 1994. \n\n[3] C. Bregler and Y. Konig. \"Eigenlips\" for Robust Speech Recognition. In Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Adelaide, Australia, 1994. \n\n[4] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3, 79-87. \n\n[5] M.I. Jordan and R.A. Jacobs. Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, Vol. 6, Issue 2, March 1994. \n\n[6] N. Kambhatla and T.K. Leen. Fast Non-Linear Dimension Reduction. In Advances in Neural Information Processing Systems 6, Morgan Kaufmann, 1994. \n\n[7] M. Kirby, F. Weisser, and G. Dangelmayr. A Model Problem in Representation of Digital Image Sequences. Pattern Recognition, Vol. 26, No. 1, 1993. \n\n[8] H. Murase and S.K. Nayar. Learning and Recognition of 3-D Objects from Brightness Images. Proc. AAAI, Washington D.C., 1993. \n\n[9] P. Simard, Y. Le Cun, and J. 
Denker. Efficient Pattern Recognition Using a New Transformation Distance. In Advances in Neural Information Processing Systems 5, Morgan Kaufmann, 1993. \n\n[10] M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, Volume 3, Number 1, MIT 1991. \n", "award": [], "sourceid": 879, "authors": [{"given_name": "Christoph", "family_name": "Bregler", "institution": null}, {"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}