Overview
Explore a PhD candidate's research on enhancing speech recognition models with visual inputs in this 24-minute talk from MIT. Andrew Rouditchenko presents Whisper-Flamingo, a multi-modal model that injects lip-based visual features into the Whisper speech recognition framework via gated cross attention. Learn how this approach substantially improves robustness in noisy environments, both for English speech recognition and for English-to-X translation into six languages. The presentation also covers mWhisper-Flamingo, a multilingual extension trained on videos in nine languages that employs a novel decoder modality dropout technique, which proves crucial for strong noisy multilingual performance. Gain insight into cutting-edge approaches to multi-modal and multilingual speech processing from this MIT CSAIL researcher.
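To make the two techniques named in the description concrete, here is a minimal single-head numpy sketch. It is a simplification for illustration, not the talk's actual implementation: the real model uses multi-head attention inside Whisper's decoder blocks, and all names here (`GatedCrossAttention`, `modality_dropout`) are hypothetical. The key ideas it does capture: audio tokens attend to visual (lip) tokens, the result is scaled by a `tanh` gate initialized to zero so training starts from unmodified Whisper behaviour, and modality dropout occasionally zeroes the visual stream so the decoder also learns to operate audio-only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedCrossAttention:
    """Single-head cross-attention from audio tokens to visual tokens,
    scaled by a zero-initialized tanh gate (hypothetical simplification)."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.Wq = rng.standard_normal((dim, dim)) * scale  # query proj (audio)
        self.Wk = rng.standard_normal((dim, dim)) * scale  # key proj (visual)
        self.Wv = rng.standard_normal((dim, dim)) * scale  # value proj (visual)
        self.gate = 0.0  # learnable scalar; tanh(0)=0, so output = plain Whisper at init

    def __call__(self, audio, visual):
        # audio: (T_audio, dim), visual: (T_visual, dim)
        q, k, v = audio @ self.Wq, visual @ self.Wk, visual @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(audio.shape[-1]))
        # residual connection, gated fusion of the attended visual features
        return audio + np.tanh(self.gate) * (attn @ v)

def modality_dropout(visual, p_drop, rng):
    """With probability p_drop, zero out the visual stream during training
    (sketch of the decoder modality dropout idea)."""
    return np.zeros_like(visual) if rng.random() < p_drop else visual
```

With the gate at its zero initialization, the layer returns the audio tokens unchanged, which is why this style of fusion lets a pretrained model be extended without disturbing its initial behaviour; the gate then grows during fine-tuning as the visual stream becomes useful.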
Syllabus
Andrew Rouditchenko - Giving Sight to Speech Models
Taught by
MIT Embodied Intelligence