Overview
Explore the capabilities of Qwen 2.5 Omni, a highly multimodal AI model, in this 30-minute video from Trelis Research. Learn about the architecture, including TMRoPE (Time-aligned Multimodal RoPE), and see how the model compares to competitors such as Llama 3, Moshi, GPT-4o, and Gemini 2.5 Pro. Discover practical implementation details for running the model on GPU hardware, and watch demonstrations of its versatility across input/output combinations spanning text, audio, and video. The presentation includes a detailed architecture analysis and guidance on when this multimodal model is most appropriate for different use cases. Repository access is available at Trelis.com/ADVANCED-transcription.
Syllabus
0:00 Qwen 2.5 Omni - Video, Text and Audio Inputs, Text and Audio Outputs
0:24 Qwen2.5 Architecture, incl. TMRoPE
6:29 Qwen Omni vs Llama 3
7:43 Qwen Omni vs Moshi
9:32 Comparison with GPT-4o and Gemini 2.5 Pro
13:09 How to run Qwen 2.5 Omni on a GPU
18:19 Inference with Audio Inputs and Audio + Text Outputs
22:48 Inference with Video Input and Audio + Text Outputs
27:22 Qwen 2.5 Model Architecture Print-out
29:20 When should you use Qwen 2.5 Omni?
Taught by
Trelis Research