Michael Gray, Javier Movellan, Terrence J. Sejnowski
Humans use visual as well as auditory speech signals to recognize spoken words. A variety of systems have been investigated for performing this task. The main purpose of this research was to systematically compare the performance of a range of dynamic visual features on a speechreading task. We found that normalization of images to eliminate variation due to translation, scale, and planar rotation yielded substantial improvements in generalization performance regardless of the visual representation used. In addition, the dynamic information in the difference between successive frames yielded better performance than optical-flow based approaches, and compression by local low-pass filtering performed surprisingly better than global principal components analysis (PCA). These results are examined and possible explanations are explored.
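Two of the representations compared above can be illustrated concretely. The following is a minimal sketch, not the authors' implementation: it computes difference images between successive frames and compresses a frame by local low-pass filtering, here approximated as block averaging (a box filter followed by downsampling). The function names, the toy video, and the compression factor are illustrative assumptions.

```python
import numpy as np

def difference_frames(video):
    """Temporal difference images: frame t+1 minus frame t.
    Input shape (T, H, W); output shape (T-1, H, W)."""
    return np.diff(video.astype(float), axis=0)

def local_lowpass_compress(frame, factor=4):
    """Compress a frame by local low-pass filtering, approximated
    by averaging non-overlapping factor x factor pixel blocks."""
    h, w = frame.shape
    h2, w2 = h - h % factor, w - w % factor  # crop to a multiple of factor
    blocks = frame[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))

# Toy video: 5 frames of 16x16 pixels (stand-in for normalized mouth images)
rng = np.random.default_rng(0)
video = rng.random((5, 16, 16))

diffs = difference_frames(video)          # shape (4, 16, 16)
coded = local_lowpass_compress(diffs[0])  # shape (4, 4)
```

Unlike a global PCA projection, which must be fit to a training set, the block-averaging step is fixed and purely local, which is one plausible reason a local low-pass code can generalize differently from a global linear basis.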