Part of Advances in Neural Information Processing Systems 14 (NIPS 2001)
John Hershey, Michael Casey
It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information.