{"title": "Correlation and Interpolation Networks for Real-time Expression Analysis/Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 909, "page_last": 916, "abstract": null, "full_text": "Correlation and Interpolation Networks for \nReal-time Expression Analysis/Synthesis. \n\nTrevor Darrell, Irfan Essa, Alex Pentland \n\nPerceptual Computing Group \n\nMIT Media Lab \n\nAbstract \n\nWe  describe  a  framework  for  real-time  tracking  of facial  expressions \nthat  uses  neurally-inspired  correlation  and  interpolation  methods.  A \ndistributed view-based representation is used to characterize facial state, \nand is  computed using a  replicated correlation network.  The ensemble \nresponse of the set of view correlation scores is input to a network based \ninterpolation method, which maps perceptual state to motor control states \nfor  a  simulated  3-D  face  model.  Activation levels  of the  motor  state \ncorrespond to muscle activations in an  anatomically derived model.  By \nintegrating fast and robust 2-D processing with 3-D models, we obtain a \nsystem that is able to quickly track and interpret complex facial motions \nin real-time. \n\n1  INTRODUCTION \n\nAn important task for natural and artificial vision systems is the analysis and interpretation \nof faces.  To  be useful in interactive systems and in other settings where the information \nconveyed is of a time critical nature, analysis of facial expressions must occur quickly, or be \noflittle value.  However, many of the traditional computer vision methods for estimating and \nmodeling facial  state have proved difficult to perform fast enough for interactive settings. \nWe  have  therefore  investigated  neurally  inspired  mechanisms  for  the  analysis  of facial \nexpressions.  
We use neurally plausible distributed pattern recognition mechanisms to make fast and robust assessments of facial state, and multi-dimensional interpolation networks to connect these measurements to a facial model. \n\nThere are many potential applications of a system for facial expression analysis: personalized interfaces which sense a user's emotional state, ultra-low-bitrate video conferencing which transmits only facial muscle activations, and enhanced face recognition systems. We have focused on an application in computer graphics which stresses both the analysis and synthesis components of our system: interactive facial animation. \n\nIn the next sections we develop a computational framework for neurally plausible expression analysis, and the connection to a physically-based face model using a radial basis function method. Finally we show the results of these methods applied to the interactive animation task, in which a computer graphics model of a face is rendered in real time and matches the state of the user's face as sensed through a conventional video camera. \n\n2 EXPRESSION MODELING/TRACKING \n\nThe modeling and tracking of expressions and faces has been a topic of increasing interest recently. In the neural network field, several successful models of character expression modeling have been developed by Poggio and colleagues. These models apply multi-dimensional interpolation techniques, using the radial basis function method, to the task of interpolating 2-D images of different facial expressions. Librande [4] and Poggio and Brunelli [9] applied the Radial Basis Function (RBF) method to facial expression modeling, using a line drawing representation of cartoon faces. 
In this model a small set of canonical expressions is defined, and intermediate expressions are constructed via the interpolation technique. The representation used is a generic \"feature vector\", which in the case of cartoon faces consists of the contour endpoints. Recently, Beymer et al. [1] extended this approach to use real images, relying on optical flow and image warping techniques to solve the correspondence and prediction problems, respectively. \n\nRBF-based techniques have the advantage of allowing for the efficient and fast computation of intermediate states in a representation. Since the representation is simple and the interpolation computation straightforward, real-time implementations are practical on conventional systems. These methods interpolate between a set of 2-D views, so the need for an explicit 3-D representation is sidestepped. For many applications this is not a problem, and may even be desirable since it allows extrapolation to \"impossible\" figures or expressions, which may be of creative value. However, for realistic rendering and recognition tasks, the use of a 3-D model may be desirable since it can detect such impossible states. \n\nIn the field of computer graphics, much work has been done on the 3-D modeling of faces and facial expression. These models focus on the geometric and physical qualities of facial structure. Platt and Badler [7], Pieper [6], Waters [11] and others have developed models of facial structure, skin dynamics, and muscle connections, respectively, based on available anatomical data. These models provide strong constraints for the tracking of feature locations on a face. Williams [12] developed a method in which explicit feature marks are tracked on a 3-D face by use of two cameras. 
Terzopoulos and Waters [10] developed a similar method to track linear facial features, estimate corresponding parameters of a three-dimensional wireframe face model, and reproduce facial expression. A significant limitation of these systems is that successful tracking requires facial markings. Essa and Pentland [3] applied optical flow methods (see also Mase [5]) for the passive tracking of facial motion, and integrated the flow measurement method into a dynamic system model. Their method allowed for completely passive estimation of facial expressions, using all the constraints provided by a full 3-D model of facial expression. \n\nFigure 1: (a) Frame of video being processed to extract view model. Outlined rectangle indicates area of image used for model. (b) View models found via clustering method on training sequence consisting of neutral, smile, and surprise expressions. \n\nBoth the view-based method of Beymer et al. and the 3-D model of Essa and Pentland rely on estimates of optic flow, which are difficult to compute reliably, especially in real-time. Our approach here is to combine interpolated view-based measurements with physically based models, to take advantage of the fast interpolation capability of the RBF and the powerful constraints imposed by physically based models. We construct a framework in which perceptual states are estimated from real video sequences and are interpolated to control the motor control states of a physically based face model. \n\n3 VIEW-BASED FACE PERCEPTION \n\nTo make reliable real-time measurements of a complex dynamic object, we use a distributed representation corresponding to distinct views of that object. Previously, we demonstrated the use of this type of representation for the tracking and recognition of hand gestures [2]. 
Like faces, hands are complex objects with both non-rigid and rigid dynamics. Direct use of a 3-D model for recognition has proved difficult for such objects, so we developed a view-based method for representation. Here we apply this technique to the problem of facial representation, but extend the scheme to connect to a 3-D model for high-level modeling and generation/animation. With this, we gain the representational power and constraints implied by the 3-D model as a high-level representation; however, the 3-D model is only indirectly involved in the perception stage, so we still have the speed and reliability afforded by the view-based representation. \n\nIn our method each view characterizes a particular aspect or pose of the object being represented. The view is stored iconically; that is, it is a literal image or template (but with some point-wise statistics) of the appearance of the object in that aspect or pose. A match criterion is defined between views and input images; usually a normalized correlation function is used, but other criteria are possible. An input image is represented by the ensemble of match scores from that image to the stored views. \n\nTo achieve invariance across a range of transformations (for example translation, rotation, and/or scale), units which compute the match score for each view are replicated at different values of each transformation.(1) The unit which has maximal response across all values of the transformation is selected, and the ensemble response of the view units which share the same transformation values as the selected unit is stored as the representation for the input image. We set the perceptual state X to be a vector containing this ensemble response. \n\n(1) In a computer implementation this exhaustive sampling may be impractical due to the number of units needed, in which case this stage may be approximated by hybrid sampling/search methods. \n\nIf the object to be represented is fully known a priori, then methods to generate views can be constructed by analysis of the aspect graph if the object is polyhedral, or in general by rendering images of the object at evenly spaced rotations. However, in practice good 3-D models that are useful for describing image intensity values are rare,(2) so we look to data-driven methods of acquiring object views. \n\nAs described in [2], a simple clustering algorithm can find a set of views that \"span\" a training sequence of images, in the sense that for each image in the sequence at least one view is within some threshold similarity to that image. The algorithm is as follows. Let V be the current set of views for an object (initially one view is specified manually). For each frame I of a training sequence, if at least one v ∈ V has a match value M(v, I) that is greater than a threshold θ, then no action is performed and the next frame is processed. If no view is close, then I is used to construct a new view which is added to the view set. A view v' is created using a window of I centered at the location in the previous image where the closest view was located. (All views usually share the same window size, determined by the initial view.) The view set is then augmented to include the new view: V = V ∪ {v'}. This algorithm will find a set of views which well characterizes an object across the range of poses or expressions contained in the training sequence. For example, in the domain of hand gestures, inputting a training sequence consisting of a waving hand will yield views which contain images of the hand at several different rotations. 
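The view-acquisition procedure above can be sketched in a few lines; this is a minimal illustration under our own assumptions (grayscale image arrays, exhaustive translation-only search, and function names of our choosing), not the original implementation:

```python
import numpy as np

def normalized_correlation(view, window):
    """Normalized correlation between a view template and an
    equally sized image window; the score lies in [-1, 1]."""
    v = view - view.mean()
    w = window - window.mean()
    denom = np.linalg.norm(v) * np.linalg.norm(w)
    return float((v * w).sum() / denom) if denom > 0 else 0.0

def best_match(view, frame):
    """Slide the view over the frame (the replicated-unit stage,
    translation only) and return the best score and its location."""
    vh, vw = view.shape
    best_score, best_loc = -1.0, (0, 0)
    for y in range(frame.shape[0] - vh + 1):
        for x in range(frame.shape[1] - vw + 1):
            s = normalized_correlation(view, frame[y:y + vh, x:x + vw])
            if s > best_score:
                best_score, best_loc = s, (y, x)
    return best_score, best_loc

def acquire_views(frames, initial_view, theta=0.7):
    """Grow a view set V that 'spans' the training sequence: a new
    view is cut from the current frame whenever no stored view
    matches above the threshold theta."""
    views = [initial_view]
    vh, vw = initial_view.shape
    prev_loc = (0, 0)  # where the closest view was last found
    for frame in frames:
        score, loc = max((best_match(v, frame) for v in views),
                         key=lambda r: r[0])
        if score >= theta:
            prev_loc = loc
        else:
            y, x = prev_loc  # window at the previously matched location
            views.append(frame[y:y + vh, x:x + vw].copy())
    return views
```

Frames that already resemble a stored view add nothing to V, while a sufficiently novel frame contributes a new view, as in the hand-waving example above.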
In the domain of faces, when given a training sequence consisting of a user performing 3 different expressions (neutral, smile, and surprise), this algorithm (with normalized correlation and θ = 0.7) found three views corresponding to these expressions to represent the face, as shown in Figure 1(b). These 3 views serve as a good representation for the face of this user as long as his expression is similar to one in the training set. \n\nThe major advantage of this type of distributed view-based representation lies in the reduction of the dimensionality of the processing that needs to occur for recognition, tracking, or control tasks. In the gesture recognition domain, this dimensionality reduction allowed conventional recognition strategies to be applied successfully and in real-time, on examples where it would have been infeasible to evaluate the recognition criteria on the full signal. In the domain explored in this paper it makes the interpolation problem of much lower order: rather than interpolating from thousands of input dimensions, as would be required when the input is the image domain, the view domain for expression modeling tasks typically has on the order of a dozen dimensions. \n\n4 3-D MODELING/MOTOR CONTROL \n\nTo model the structure of the face and the dynamics of expression performance, we use the physically based model of Essa et al. This model captures how expressions are generated by muscle actuations and the resulting skin and tissue deformations. The model is capable of controlled nonrigid deformations of various facial regions, in a fashion similar to how humans generate facial expressions by muscle actuations attached to facial tissue. Finite element methods are used to model the dynamics of the system. \n\n(2) As opposed to modeling forces and shape deformations, for which 3-D models are useful and indeed are used in the method presented here. 
\n\nFigure 2: (a) Face images used as input, (b) normalized correlation scores X(t) for each view model, (c) resulting muscle control parameters Y(t), (d) rendered images of facial model corresponding to muscle parameters. \n\nThis model is based on the mesh developed by Platt and Badler [7], extended into a topologically invariant physics-based model through the addition of a dynamic skin and muscle model [6, 11]. These methods give the facial model an anatomically-based facial structure by modeling facial tissue/skin and muscle actuators, with a geometric model to describe force-based deformations and control parameters. \n\nThe muscle model provides us with a set of control knobs to drive the facial state, defined to be a vector Y. These serve to define the motor state of the animated face. Our task now is to connect the perceptual states of the observed face to these motor states. \n\n5 CONNECTING PERCEPTION WITH ACTION \n\nWe need to establish a mapping from the perceptual view scores to the appropriate muscle activations on the 3-D face model. To do this, we use multidimensional interpolation strategies implemented in network form. \n\nInterpolation requires a set of control points or exemplars from which to derive the desired mapping. Example pairs of real faces and model faces for different expressions are presented to the interpolation method during a training phase. This can be done in one of two ways, with either a user-driven or model-driven paradigm. 
In the model-driven case the muscle states are set to generate a particular expression by an animator/programmer, and the user is then asked to make the equivalent expression. The resulting perceptual (view-model) scores are recorded and paired with the muscle activation levels. In the user-driven case, the user makes an expression of his/her own choosing, and the optic flow method of Essa et al. is used to derive the corresponding muscle activation levels. The model-driven paradigm is simpler and faster, but the user-driven paradigm yields more detailed and authentic facial expressions. \n\nWe use the Radial Basis Function (RBF) method presented in [8], and define the interpolated motor controls to be a weighted sum of radial functions centered at each example: \n\nY = Σ_{i=1}^{n} c_i g(X - X_i)    (1) \n\nwhere Y are the muscle states, X are the observed view-model scores, X_i are the example scores, g is an RBF (in our case simply a linear ramp, g(x) = ||x||), and the weights c_i are computed from the example motor values Y_i using the pseudo-inverse method [8]. \n\n6 INTERACTIVE ANIMATION SYSTEM \n\nThe correlation network, RBF interpolator, and facial model described above have been combined into a single system for interactive animation. The entire system can be updated at over 5 Hz, using a dedicated single-board accelerator to compute the correlation network and an SGI workstation to render the facial mesh. Here we present two examples of the processing performed by the system, using different strategies for coupling perceptual and motor state. \n\nFigure 2 illustrates one example of real-time facial expression tracking using this system, using a full-coupling paradigm. Across the top, labeled (a), are five frames of a video sequence of a user making a smile expression. 
This was one of the expressions used in the training sequence for the view models shown in Figure 1(b), so they were applicable to be used here. \n\nFigure 3: (a) Processing of video frame with independent view-model regions for the eyes, eyebrows, and mouth region. (b) Overview shot of full system. User is on the left, vision system and camera are on the right, and the animated face is in the center of the scene. The animated face matches the state of the user's face in real-time, including eye-blinks (as is the case in this shot). \n\nFigure 2(b) shows the correlation scores computed for each of the 3 view models for each frame of the sequence. This constituted the perceptual state representation, X(t). In this example the full face is coupled with the full suite of motor control parameters. An RBF interpolator was trained using perceptual/motor state pairs for three example full-face expressions (neutral, smile, surprise); the resulting (interpolated) motor control values, Y(t), for the entire sequence are shown in Figure 2(c). Finally, the rendered facial mesh for five frames of these motor control values is shown in Figure 2(d). \n\nWhen there are only a few canonical expressions that need to be tracked/matched, this full-face template approach is robust and simple. However, if the user wishes to exercise independent control of the various regions of the face, then the full-coupling paradigm will be overly restrictive. For example, if the user trains two expressions, eyes closed and eyes open, and then runs the system and attempts to blink only one eye, the rendered face will be unable to match it. (In fact closing one eye leads to the rendered face half-closing both eyes.) 
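The full-face coupling just described amounts to fitting the interpolator of Eq. (1) on only a handful of perceptual/motor pairs. A minimal sketch of that fit, using the linear-ramp basis g(x) = ||x|| and the pseudo-inverse solve of [8] (array shapes and names here are our own assumptions, not the system's code):

```python
import numpy as np

def linear_rbf(d):
    """g(x) = ||x||: the linear-ramp basis used in Eq. (1)."""
    return np.linalg.norm(d, axis=-1)

def fit_rbf(X_examples, Y_examples):
    """Solve for coefficients c_i so that Y = sum_i c_i g(X - X_i)
    reproduces the example pairs.  G[j, i] = g(X_j - X_i), and the
    coefficient matrix is pinv(G) @ Y."""
    G = linear_rbf(X_examples[:, None, :] - X_examples[None, :, :])
    return np.linalg.pinv(G) @ Y_examples

def interpolate(C, X_examples, x):
    """Map an observed view-score vector x to motor control values."""
    g = linear_rbf(x[None, :] - X_examples)  # g(x - X_i) for each example
    return g @ C
```

At a training score vector the mapping returns the paired muscle values exactly; any other score vector yields a blend of the examples, which is why attempting to blink one eye under full coupling half-closes both eyes on the model.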
\n\nA solution to this is to decouple the regions of the face which are independent geometrically (and, to some degree, in terms of muscle effect). Under this paradigm, a separate correlation network is computed for each facial region, and a separate RBF interpolation is performed for each. Each interpolator drives a distinct subset of the motor state vector. Figure 3(a) shows the regions used for decoupled local templates. In these examples independent regions were used for each eye, each eyebrow, and the mouth region. \n\nFinally, Figure 3(b) shows a picture of the set-up of the system as it is being run in an interactive setting. The animated face mimics the facial state of the user, matching in real time the position of the eyes, eyelids, eyebrows, and mouth of the user. In the example shown in this picture, the user's eyes are closed, so the animated face's eyes are similarly closed. Realistic performance of animated facial expressions and gestures is possible through this method, since the timing and levels of the muscle activations react immediately to changes in the user's face. \n\n7 CONCLUSION \n\nWe have explored the use of correlation networks and Radial Basis Function techniques for the tracking of real faces in video sequences. A distributed view-based representation is computed using a network of replicated normalized correlation units, and offers a fast and robust assessment of perceptual state. 3-D constraints on facial shape are achieved through the use of an anatomically derived facial model, whose muscle activations are controlled via interpolated perceptual states using the RBF method. \n\nWith this framework we have been able to achieve the fast and robust analysis and synthesis of facial expressions. 
A modeled face mimics the expression of a user in real-time, using only a conventional video camera and no special markings on the face of the user. This system has promise as a new approach in the interactive animation, video teleconferencing, and personalized interface domains. \n\nReferences \n\n[1] D. Beymer, A. Shashua, and T. Poggio. Example Based Image Analysis and Synthesis. MIT AI Lab TR-1431, 1993. \n\n[2] T. Darrell and A. Pentland. Classification of hand gestures using a view-based distributed representation. In NIPS-6, 1993. \n\n[3] I. Essa and A. Pentland. A vision system for observing and extracting facial action parameters. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1994. \n\n[4] S. Librande. Example-based Character Drawing. S.M. Thesis, Media Arts and Sciences, MIT Media Lab, 1992. \n\n[5] K. Mase. Recognition of facial expressions from optical flow. IEICE Transactions, Special Issue on Computer Vision and its Applications, E74(10), 1991. \n\n[6] S. Pieper, J. Rosen, and D. Zeltzer. Interactive graphics for plastic surgery: A task-level analysis and implementation. Proc. SIGGRAPH-92, pages 127-134, 1992. \n\n[7] S. M. Platt and N. I. Badler. Animating facial expression. ACM SIGGRAPH Conference Proceedings, 15(3):245-252, 1981. \n\n[8] T. Poggio and F. Girosi. A theory of networks for approximation and learning. MIT AI Lab TR-1140, 1989. \n\n[9] T. Poggio and R. Brunelli. A Novel Approach to Graphics. MIT AI Lab TR-1354, 1992. \n\n[10] D. Terzopoulos and K. Waters. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Trans. PAMI, 15(6):569-579, June 1993. \n\n[11] K. Waters and D. Terzopoulos. Modeling and animating faces using scanned data. The Journal of Visualization and Computer Animation, 2:123-128, 1991. \n\n[12] L. Williams. 
Performance-driven facial animation. ACM SIGGRAPH Conference Proceedings, 24(4):235-242, 1990.", "award": [], "sourceid": 999, "authors": [{"given_name": "Trevor", "family_name": "Darrell", "institution": null}, {"given_name": "Irfan", "family_name": "Essa", "institution": null}, {"given_name": "Alex", "family_name": "Pentland", "institution": null}]}