

Qwen 2.5 Omni - The Most Multi-modal Model for Video, Text and Audio Processing

Trelis Research via YouTube

Overview

Explore the capabilities of Qwen 2.5 Omni, a multi-modal model that accepts text, audio, and video inputs and produces text and audio outputs, in this 30-minute video from Trelis Research. Learn about the architecture, including TMRoPE (Time-aligned Multimodal RoPE) position embeddings, and see how the model compares to competitors such as Llama 3, Moshi, GPT-4o, and Gemini 2.5 Pro. Discover practical implementation details for running the model on GPU hardware, and watch demonstrations of its versatility across input/output combinations spanning text, audio, and video. The presentation includes a detailed architecture analysis and guidance on when this multi-modal model is the right choice for different use cases. Repository access is available at Trelis.com/ADVANCED-transcription.
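
For readers who want to try the GPU walkthrough themselves, the sketch below follows the pattern of the published Hugging Face example for the Qwen/Qwen2.5-Omni-7B checkpoint: loading the model onto GPU, packing a mixed video-plus-text prompt, and generating both a text reply and a spoken waveform. It is a minimal sketch, not the exact code used in the video; it assumes a transformers build with Qwen2.5-Omni support and the Qwen team's qwen_omni_utils helper package, and class names can shift between releases. The file path "clip.mp4" is a placeholder.

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # Qwen team's helper for packing multimodal inputs

# Load the 7B Omni checkpoint; device_map="auto" places it on the available GPU(s).
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# One chat turn mixing video and text; audio and image entries use the same shape.
conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "clip.mp4"},
        {"type": "text", "text": "Describe what happens in this clip."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# generate() returns token ids plus a 24 kHz waveform from the talker head.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)

Per the model card, spoken output also expects the model's default system prompt in the conversation, and text-only generation can be requested (e.g. with return_audio=False) to save memory; check the current model card for the exact flags in your transformers version.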

Syllabus

0:00 Qwen 2.5 Omni - Video, Text and Audio Inputs; Text and Audio Outputs
0:24 Qwen 2.5 Omni Architecture, incl. TMRoPE
6:29 Qwen Omni vs Llama 3
7:43 Qwen Omni vs Moshi
9:32 Comparison with GPT-4o and Gemini 2.5 Pro
13:09 How to run Qwen 2.5 Omni on a GPU?
18:19 Inference with Audio Inputs and Audio + Text Outputs
22:48 Inference with Video Input and Audio + Text Outputs
27:22 Qwen 2.5 Omni Model Architecture Print-out
29:20 When should you use Qwen 2.5 Omni?

Taught by

Trelis Research
