

Giving Sight to Speech Models

Massachusetts Institute of Technology via YouTube

Overview

Explore a PhD candidate's research on enhancing speech recognition models with visual inputs in this 24-minute talk from MIT. Andrew Rouditchenko presents Whisper-Flamingo, a multi-modal model that incorporates lip-based visual features into the Whisper speech recognition framework through gated cross attention. Learn how this approach significantly improves performance in noisy environments for both English speech recognition and English-to-other-language translation across six languages. The presentation also covers mWhisper-Flamingo, a multilingual extension trained on videos in nine languages that uses a novel decoder modality dropout technique, which proves crucial for robust multilingual recognition in noise. Gain insights into cutting-edge approaches to multi-modal and multilingual speech processing from this MIT CSAIL researcher.
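To make the two techniques named above concrete, here is a minimal, hedged sketch in PyTorch of what a Flamingo-style gated cross-attention block and decoder modality dropout could look like. This is an illustration only, not the code presented in the talk; the class names, dimensions, and the dropout probability are assumptions, and the actual Whisper-Flamingo implementation may differ.

```python
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Sketch of a Flamingo-style gated cross-attention block.

    Decoder hidden states attend to lip-based visual features. The tanh
    gates are zero-initialized, so the block starts as an identity map and
    leaves the pretrained Whisper decoder untouched at the start of training.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        # Learnable scalar gates; tanh(0) = 0, so the block is initially a no-op.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x:      (batch, text_len, d_model) decoder hidden states
        # visual: (batch, vis_len, d_model)  visual (lip) features
        attn_out, _ = self.attn(self.norm_attn(x), visual, visual)
        x = x + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(self.norm_ffn(x))
        return x


def modality_dropout(visual: torch.Tensor, p_drop: float = 0.5,
                     training: bool = True) -> torch.Tensor:
    """Sketch of decoder modality dropout: occasionally zero out the visual
    stream during training so the decoder also learns to work audio-only."""
    if training and torch.rand(()) < p_drop:
        return torch.zeros_like(visual)
    return visual
```

In this sketch, randomly dropping the visual stream forces the decoder not to over-rely on lip features, which is one plausible reading of why modality dropout helps noisy multilingual performance.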

Syllabus

Andrew Rouditchenko - Giving Sight to Speech Models

Taught by

MIT Embodied Intelligence
