{"title": "A Variational Principle for Model-based Morphing", "book": "Advances in Neural Information Processing Systems", "page_first": 267, "page_last": 273, "abstract": null, "full_text": "A variational principle for model-based morphing\n\nLawrence K. Saul* and Michael I. Jordan\n\nCenter for Biological and Computational Learning\nMassachusetts Institute of Technology\n79 Amherst Street, E10-034D\nCambridge, MA 02139\n\nAbstract\n\nGiven a multidimensional data set and a model of its density, we consider how to define the optimal interpolation between two points. This is done by assigning a cost to each path through space, based on two competing goals: one to interpolate through regions of high density, the other to minimize arc length. From this path functional, we derive the Euler-Lagrange equations for extremal motion; given two points, the desired interpolation is found by solving a boundary value problem. We show that this interpolation can be done efficiently, in high dimensions, for Gaussian, Dirichlet, and mixture models.\n\n1 Introduction\n\nThe problem of non-linear interpolation arises frequently in image, speech, and signal processing. Consider the following two examples: (i) given two profiles of the same face, connect them by a smooth animation of intermediate poses[1]; (ii) given a telephone signal masked by intermittent noise, fill in the missing speech. Both these examples may be viewed as instances of the same abstract problem. In qualitative terms, we can state the problem as follows[2]: given a multidimensional data set, and two points from this set, find a smooth adjoining path that is consistent with available models of the data. We will refer to this as the problem of model-based morphing.\n\nIn this paper, we examine this problem as it arises from statistical models of multidimensional data. 
Specifically, our focus is on models that have been derived from some form of density estimation.\n\n*Current address: AT&T Labs, 600 Mountain Ave 2D-439, Murray Hill, NJ 07974\n\nThough there exists a large body of work on the use of statistical models for regression and classification, there has been comparatively little work on the other types of operations that these models support. Non-linear morphing is an example of such an operation, one that has important applications to video email[3], low-bandwidth teleconferencing[4], and audiovisual speech recognition[2].\n\nA common way to describe multidimensional data is some form of mixture modeling. Mixture models represent the data as a collection of two or more clusters; thus, they are well-suited to handling complicated (multimodal) data sets. Roughly speaking, for these models the problem of interpolation can be divided into two tasks: how to interpolate between points in the same cluster, and how to interpolate between points in different clusters. Our paper will therefore be organized along these lines.\n\nPrevious studies of morphing have exploited the properties of radial basis function networks[1] and locally linear models[2]. We have been influenced by both these works, especially in the abstract formulation of the problem. New features of our approach include: the fundamental role played by the density, the treatment of non-Gaussian models, the use of a continuous variational principle, and the description of the interpolant by a differential equation.\n\n2 Intracluster interpolation\n\nLet Q = {q(1), q(2), ..., q(|Q|)} denote a set of multidimensional data points, and let P(q) denote a model of the distribution from which these points were generated. Given two points, our problem is to find a smooth adjoining path that respects the statistical model of the data. 
In particular, the desired interpolant should not pass through regions of space that the modeled density P(q) assigns low probability.\n\n2.1 Clusters and metrics\n\nTo develop these ideas further, we begin by considering a special class of models, namely, those that represent clusters. We say that P(q) models a data cluster if P(q) has a unique (global) maximum; in turn, we identify the location of this maximum, q*, as the prototype.\n\nLet us now consider the geometry of the space inhabited by the data. To endow this space with a geometric structure, we must define a metric, g_αβ(q), that provides a measure of the distance between two nearby points:\n\nD[q, q + dq] = [Σ_αβ g_αβ(q) dq_α dq_β]^(1/2) + O(|dq|²).   (1)\n\nIntuitively speaking, the metric should reflect the fact that as one moves away from the center of the cluster, the density of the data dies off more quickly in some directions than in others. A natural choice for the metric, one that meets the above criteria, is the negative Hessian of the log-likelihood:\n\ng_αβ(q) = −∂² ln P(q) / ∂q_α ∂q_β.   (2)\n\nThis metric is positive-definite if ln P(q) is concave; this will be true for all the examples we discuss.\n\n2.2 From densities to paths\n\nThe problem of model-based interpolation is to balance two competing goals: one to interpolate through regions of high density, the other to avoid excessive deformations. Using the metric in eq. (1), we can now assign a cost (or penalty) to each path based on these competing goals.\n\nConsider the path parameterized by q(t). We begin by dividing the path into segments, each of which is traversed in some small time interval, dt. We assign a value to each segment by\n\nφ(t) = { [P(q(t)) / P(q*)] e^(−ℓ) }^D[q(t), q(t+dt)],   (3)\n\nwhere ℓ ≥ 0. For reasons that will become clear shortly, we refer to ℓ as the line tension. 
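As a concrete (hypothetical) numerical illustration of how these segment values combine, the sketch below accumulates the discretized cost of a path under a one-dimensional standard Gaussian cluster, for which q* = 0 and ln[P(q)/P(q*)] = −q²/2. A Euclidean metric is assumed, and the names `path_cost` and `log_ratio` are illustrative, not from the paper.

```python
import numpy as np

def path_cost(path, log_ratio, tension=1.0):
    """Discretized path cost: each segment of length ds contributes
    (-ln[P(q)/P(q*)] + tension) * ds, the negative log of its segment
    value.  Euclidean metric assumed; path has shape (T, dim)."""
    path = np.asarray(path, dtype=float)
    mid = 0.5 * (path[:-1] + path[1:])                  # density evaluated mid-segment
    ds = np.linalg.norm(np.diff(path, axis=0), axis=1)  # segment lengths
    depth = -np.array([log_ratio(q) for q in mid])      # >= 0, zero at the prototype
    return float(np.sum((depth + tension) * ds))

# Straight path from q = -1 to q = +1 through the prototype at q* = 0:
straight = np.linspace(-1.0, 1.0, 1001)[:, None]
cost = path_cost(straight, lambda q: -0.5 * q[0] ** 2, tension=1.0)
# cost approximates the integral of (q^2/2 + 1) dq over [-1, 1], i.e. 1/3 + 2
```

Longer detours and excursions into low-density regions both raise the cost, which is exactly the trade-off the segment values are designed to encode.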
The value assigned to each segment depends on two terms: a ratio of probabilities, P(q(t))/P(q*), which favors points near the prototype, and the constant multiplier, e^(−ℓ). Both these terms are upper bounded by unity, and hence so is their product. The value of the segment also decays with its length, as a result of the exponent, D[q(t), q(t+dt)].\n\nWe derive a path functional by piecing these segments together, multiplying their individual contributions, and taking the continuum limit. A value for the entire path is obtained from the product:\n\ne^(−S) = Π_t φ(t).   (4)\n\nTaking the logarithm of both sides, and considering the limit dt → 0, we obtain the path functional\n\nS[q(t)] = ∫ { −ln[P(q(t))/P(q*)] + ℓ } [Σ_αβ g_αβ(q) q̇_α q̇_β]^(1/2) dt.   (5)\n\nConsider now data in the form of probability vectors, w, whose elements satisfy w_α > 0 for all α, and Σ_α w_α = 1. Clearly, the multivariate Gaussian is not suited to data of this form, since no matter what the mean and covariance matrix, it cannot assign zero probability to vectors outside the simplex. Instead, a more natural model is the Dirichlet distribution:\n\nP(w) = [Γ(θ) / Π_α Γ(θ_α)] Π_α w_α^(θ_α − 1),   (10)\n\nwhere θ_α > 0 for all α, and θ ≡ Σ_α θ_α. Here, Γ(·) is the gamma function, and θ_α are parameters that determine the statistics of P(w). Note that P(w) = 0 for vectors that are not probability vectors; in particular, the simplex constraints on w are implicit assumptions of the model.\n\nWe can rewrite the Dirichlet distribution in a more revealing form as follows. First, let w* denote the probability vector with elements w*_α = θ_α/θ. Then, making a change of variables from w to ln w, we have:\n\nP(ln w) = (1/Z_θ) exp{ −θ [KL(w*||w)] },   (11)\n\nwhere Z_θ is a normalization factor that depends on θ_α (but not w), and the quantity in the exponent is θ times the Kullback-Leibler (KL) divergence,\n\nKL(w*||w) = Σ_α w*_α ln[w*_α / w_α].   (12)
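The exponential form of P(ln w) can be verified by a short computation not spelled out in the text: changing variables from w to ln w multiplies the density by the Jacobian factor Π_α w_α, so that

```latex
\ln P(\ln w)
  = \mathrm{const} + \sum_\alpha (\theta_\alpha - 1)\,\ln w_\alpha + \sum_\alpha \ln w_\alpha
  = \mathrm{const} + \theta \sum_\alpha w_\alpha^{*}\,\ln w_\alpha
  = \mathrm{const} - \theta\,\mathrm{KL}(w^{*}\,\|\,w),
```

where the second equality uses w*_α = θ_α/θ, and the last absorbs the w-independent sum θ Σ_α w*_α ln w*_α into the constant.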
\n\nThe KL divergence measures the mismatch between w and w*, with KL(w*||w) = 0 if and only if w = w*. Since KL(w*||w) has no other minima besides the one at w*, we shall say that P(ln w) models a data cluster in the variable ln w.\n\nThe metric induced by this modeled density is computed by following the prescription of eq. (2). For two nearby points inside the simplex, w and w + dw, the result of this prescription is that the squared distance is given by\n\nds² = θ Σ_α dw_α² / w_α.   (13)\n\nUp to a multiplicative factor of 2θ, eq. (13) measures the infinitesimal KL divergence between w and w + dw. This is a natural metric for vectors whose elements can be interpreted as probabilities.\n\nThe functional for non-linear interpolation is found by substituting the modeled density and the induced metric into eq. (5). For the Dirichlet distribution, this gives:\n\nS[w(t)] = ∫ { θ [KL(w*||w)] + ℓ } [ θ Σ_α ẇ_α² / w_α ]^(1/2) dt.   (14)\n\nOur problem is to find the path that minimizes this functional. Because the functional is parameterization-invariant, it again suffices to consider paths that are traversed at a constant rate, or Σ_α ẇ_α²/w_α = 1. In addition to this, however, we must also enforce the constraint that w remains inside the simplex; this is done by introducing a Lagrange multiplier. Following this procedure, we find that the optimal path is described by:\n\n[ θ KL(w*||w) + ℓ ] { ẅ_α − ẇ_α²/(2w_α) + w_α/2 } − θ [ Σ_β (w*_β/w_β) ẇ_β ] ẇ_α = θ (w_α − w*_α).   (15)\n\nGiven two endpoints, this differential equation defines a boundary value problem for the optimal path. Unlike before, however, in this case the motion of w is not confined to a plane. Hence, the boundary value problem for eq. (15) does not collapse to one dimension, as does its Gaussian counterpart, eq. (9). 
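The relation between the induced metric ds² = θ Σ_α dw_α²/w_α and the KL divergence is easy to confirm numerically. The following check (with illustrative numbers, not from the paper) compares the squared distance to 2θ times KL(w || w + dw) for a small step tangent to the simplex:

```python
import numpy as np

theta = 12.0                               # concentration, theta = sum_a theta_a
w = np.array([0.2, 0.3, 0.5])              # a point inside the simplex
dw = 1e-4 * np.array([1.0, -2.0, 1.0])     # small step with sum(dw) = 0

ds2 = theta * np.sum(dw**2 / w)            # squared distance from the induced metric
kl = np.sum(w * np.log(w / (w + dw)))      # KL(w || w + dw)

# ds2 matches 2 * theta * KL(w || w + dw) to leading order in dw
```

Because Σ_α dw_α = 0 on the simplex, the first-order terms of the KL expansion cancel and the quadratic term ½ Σ_α dw_α²/w_α survives, giving agreement up to O(dw³).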
\n\nTo remedy this situation, we have developed an efficient approximation that finds \na near-optimal interpolant, in lieu of the optimal one. This is done in two steps: \nfirst, by solving eq. (15) exactly in the limit \u00a3 -+ 00; second, by using this limiting \nsolution, WOO(t), to find the lowest-cost path that can be expressed as the convex \ncombination: \n\nw(t) = m(t)w'\" + [1 - m(t)) WOO(t). \n\n(16) \n\nThe lowest-cost path of this form is found by substituting eq. (16) into the Dirichlet \nfunctional, eq. (14), and solving the Euler-Lagrange equations for m(t). The moti(cid:173)\nvation for eq. (16) is that for finite \u00a3, we expect the optimal interpolant to deviate \nfrom WOO(t) and bend toward the prototype at w*. In practice, this approximation \nworks very well, and by collapsing the boundary value problem to one dimension, it \nallows cheap computation of the Dirichlet interpolants. Some paths from eq. (16), \nas well as the \u00a3 -+ 00 paths on which they are based, are shown in figure lb. These \npaths were computed for the twelve dimensional simplex (N = 12), then projected \nonto the WI w2-plane. \n\n3 \n\nIntercluster interpolation \n\nThe Gaussian and Dirichlet distributions of the previous section are clearly in(cid:173)\nadequate for modeling for multimodal data sets. In this section, we extend the \nvariational principle to mixture models, which describe the data as a collection of \nk 2: 2 clusters. In particular, suppose the data is modeled by \n\nk \n\nP(q) = L 7rz P(qlz). \n\nz=1 \n\n(17) \n\nHere, we have assumed that the conditional densities P( qlz) model data clusters as \ndefined in section 2.1, and the coefficients 7rz = P(z) define prior probabilities for \nthe latent variable, z E {I, 2, ... , k}. \n\nThe crucial step for mixture models is to develop the appropriate generalization of \neq. (5). 
To this end, let ℒ_z(q, q̇) denote the Lagrangian derived from the conditional density, P(q|z), and ℓ_z the line tension¹ that appears in this Lagrangian. We now combine these Lagrangians into a single functional:\n\nS[q(t), z(t)] = ∫ dt ℒ_{z(t)}(q, q̇).   (18)\n\nNote that eq. (18) is a functional of two arguments, not one. For mixture models, which define a joint density P(q, z) = π_z P(q|z), our goal is to find the optimal path in the joint space q ⊗ z. Here, z(t) is a piecewise-constant function of time that assigns a discrete label to each point along the path; in other words, it provides a temporal segmentation of the path, q(t). The purpose of z(t) in eq. (18) is to select which Lagrangian is used to compute the contribution from the interval [t, t + dt].\n\n¹To respect the weighting of the mixture components in eq. (17), we set the line tensions according to ℓ_z = ℓ − ln π_z. Thus, components with higher weights have lower line tensions.\n\nFigure 1: Model-based morphs for (a) Gaussian distribution; (b) Dirichlet distribution; (c) mixture of Gaussians. The prototypes are shown as asterisks; ℓ denotes the line tension. Figure 1c shows the convergence of the iterative algorithm; n denotes the number of iterations.\n\nAs before, we define the model-based interpolant as the path q(t) that minimizes eq. (18). In this case, however, both q(t) and z(t) must be simultaneously optimized to recover this path. 
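For a fixed, discretized path and segmentation, the joint functional can be evaluated directly. The sketch below is only an evaluator, not the paper's optimization scheme; it assumes unit-variance Gaussian components, for which −ln[P(q|z)/P(μ_z|z)] = |q − μ_z|²/2, and all names are illustrative.

```python
import numpy as np

def mixture_path_cost(path, z, means, priors, tension=1.0):
    """Discretized S[q(t), z(t)]: each segment is scored by the Lagrangian
    of its assigned component, with line tension l_z = l - ln(pi_z)."""
    path, means = np.asarray(path, float), np.asarray(means, float)
    mid = 0.5 * (path[:-1] + path[1:])                  # one point per segment
    ds = np.linalg.norm(np.diff(path, axis=0), axis=1)  # segment lengths
    depth = 0.5 * np.sum((mid - means[z]) ** 2, axis=1) # -ln[P(q|z)/P(mu_z|z)]
    l_z = tension - np.log(np.asarray(priors, float)[z])
    return float(np.sum((depth + l_z) * ds))

# Two unit Gaussians at (0,0) and (4,0); straight path between their means.
path = np.stack([np.linspace(0.0, 4.0, 401), np.zeros(401)], axis=1)
means, priors = [(0.0, 0.0), (4.0, 0.0)], [0.5, 0.5]
mid_x = 0.5 * (path[:-1, 0] + path[1:, 0])
z_good = (mid_x > 2.0).astype(int)   # switch clusters halfway
z_bad = 1 - z_good                   # assign each half to the farther cluster
good = mixture_path_cost(path, z_good, means, priors)
bad = mixture_path_cost(path, z_bad, means, priors)
```

Scoring candidate segmentations this way is the inner step of any alternating optimization over q(t) and z(t): the assignment that switches clusters near the midpoint scores lower than the reversed one.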
We have implemented an iterative scheme to perform this optimization, one that alternately (i) estimates the segmentation z(t), (ii) computes the model-based interpolant within each cluster based on this segmentation, and (iii) reestimates the points (along the cluster boundaries) where z(t) changes value. In short, the strategy is to optimize z(t) for fixed q(t), then optimize q(t) for fixed z(t).\n\nFigure 1c shows how this algorithm operates on a simple mixture of Gaussians. In this example, the covariance matrices were set equal to the identity matrix, and the means of the Gaussians were distributed along a circle in the x₁x₂-plane. Note that with each iteration, the interpolant converges more closely to the path that traverses this circle. The effect is similar to the manifold-snake algorithm of Bregler and Omohundro[2].\n\n4 Discussion\n\nIn this paper we have proposed a variational principle for model-based interpolation. Our framework handles Gaussian, Dirichlet, and mixture models, and the resulting algorithms scale well to high dimensions. Future work will concentrate on the application to real images.\n\nReferences\n\n[1] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9) (1990).\n\n[2] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. In G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems 7, 973-980. MIT Press, Cambridge, MA (1995).\n\n[3] T. Ezzat. Example based analysis and synthesis for images of faces. MIT EECS M.S. thesis (1996).\n\n[4] D. Beymer, A. Shashua, and T. Poggio. Example based image analysis and synthesis. AI Memo 1161, MIT (1993).\n\n[5] H. Goldstein. Classical Mechanics. Addison-Wesley, London (1980).
", "award": [], "sourceid": 1283, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}