

Top Vision Language Models 2025 - Comparing Qwen 2.5 VL, Moondream, and SmolVLM

Trelis Research via YouTube

Overview

This technical video surveys Vision Language Models (VLMs) in 2025, focusing on three models: Qwen 2.5 VL, Moondream, and SmolVLM. It covers choosing a model by size, running inference with Moondream's latest features, SmolVLM's architecture and a step-by-step fine-tuning walkthrough, and an introduction to Qwen 2.5 VL. Along the way it troubleshoots common problems: FlashAttention installation, token limits, and VRAM and memory constraints. The models are evaluated on a chess piece recognition task, with a comparison against Florence 2, and the video walks through setting up the training loop and data collator and adjusting image resolution. It closes with inference comparisons between the models and a WebGPU demo.
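
As a rough illustration of why image resolution ties into token limits and VRAM, the sketch below scales an image down to fit a pixel budget and estimates how many visual tokens it would produce. It is not taken from the video: the function names are hypothetical, and the 28-pixel patch size and the one-token-per-patch rule are illustrative assumptions rather than the behavior of any particular model.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels, patch=28):
    """Scale (width, height) down to fit max_pixels, snapping to patch multiples.

    Hypothetical helper: `patch=28` is an assumed patch size, not a value
    confirmed for any specific model.
    """
    if width <= 0 or height <= 0 or max_pixels <= 0:
        raise ValueError("width, height and max_pixels must be positive")
    # Never upscale; shrink both sides by the same factor to keep aspect ratio.
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    # Round each side down to a multiple of the patch size, with a floor of
    # one patch (so a very small budget can still be exceeded slightly).
    new_w = max(patch, math.floor(width * scale / patch) * patch)
    new_h = max(patch, math.floor(height * scale / patch) * patch)
    return new_w, new_h


def approx_visual_tokens(width, height, patch=28):
    """Estimate visual tokens, assuming one token per patch x patch tile."""
    return (width // patch) * (height // patch)


if __name__ == "__main__":
    w, h = fit_to_pixel_budget(1920, 1080, max_pixels=512 * 512)
    print(w, h, approx_visual_tokens(w, h))
```

Lowering `max_pixels` reduces the estimated token count roughly in proportion, which is the trade-off between detail and memory that the resolution section of the video deals with.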

Syllabus

00:00 Introduction to Vision Language Models
00:55 Model Recommendations: Small vs Large
02:02 Exploring Moondream's Latest Features
03:00 Inference with Moondream
12:20 Fine-Tuning SmolVLM
12:55 Understanding SmolVLM Architecture
17:22 Fine-Tuning SmolVLM: Step-by-Step
32:54 Introducing Qwen 2.5 VL
37:48 Troubleshooting FlashAttention Installation
38:42 Updating Transformers and Restarting Kernel
39:50 Handling Token Limits and VRAM Issues
40:44 Evaluating Model Performance on Chess Pieces
42:48 Comparing Performance with Florence 2
44:46 Training Loop and Data Collator Setup
50:34 Addressing Memory Issues and Image Resolution
55:39 Final Training and Evaluation
01:04:22 Inference and Model Comparison
01:08:27 Conclusion and WebGPU Demo

Taught by

Trelis Research
