Visually Explaining Mixture of Experts LLMs like DeepSeek and Mixtral - How to Code
Neural Breakdown with AVB via YouTube
Overview
Dive into a comprehensive video tutorial that visually explains Mixture of Experts (MoE) Transformers, the architecture behind cutting-edge LLMs like DeepSeek V3 and Mixtral 8x22B. Learn essential concepts including dense MoEs, sparse MoEs, top-k routing, noisy routing, expert capacity, Switch Transformers, and auxiliary load-balancing losses. Follow along with visual explanations that clarify complex concepts, complemented by practical code snippets for implementation. The tutorial progresses from basic intuition about MoEs through Transformer fundamentals, then explores advanced routing mechanisms, techniques for preventing router collapse, and analysis of real-world implementations like Mixtral and DeepSeek. Perfect for anyone who wants to understand both the theory and the practical implementation of state-of-the-art LLM architectures.
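To ground the routing ideas covered in the tutorial, here is a minimal PyTorch sketch of a sparse MoE feed-forward layer with top-k routing: a linear router scores each token against every expert, the k highest-scoring experts are selected, and their outputs are mixed with softmax weights. This is an illustrative sketch rather than the video's own code; the names (SparseMoE, Expert, d_model, n_experts, top_k) are assumptions chosen for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard Transformer feed-forward block used as one expert."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class SparseMoE(nn.Module):
    """Routes each token to its top-k experts and mixes their outputs."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)  # produces routing logits
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to (num_tokens, d_model)
        batch, seq_len, d_model = x.shape
        tokens = x.reshape(-1, d_model)

        logits = self.router(tokens)                      # (num_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        # Softmax over only the selected experts' logits gives the mixing weights.
        weights = F.softmax(top_vals, dim=-1)             # (num_tokens, top_k)

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Find the (token, slot) positions routed to this expert.
            token_ids, slot_ids = (top_idx == expert_id).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            expert_out = expert(tokens[token_ids])
            out[token_ids] += weights[token_ids, slot_ids].unsqueeze(-1) * expert_out

        return out.reshape(batch, seq_len, d_model)


# Example: 8 experts, each token routed to its 2 highest-scoring experts.
moe = SparseMoE(d_model=64, d_hidden=256, n_experts=8, top_k=2)
y = moe(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

With top_k equal to the number of experts this reduces to a dense MoE; keeping top_k small (e.g. 2, as in Mixtral) is what makes the layer sparse, since only the selected experts run for each token.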
Syllabus
0:00 - Intro
1:52 - Mixture of Experts Intuition
4:53 - Transformers 101
9:20 - Dense MOEs
14:50 - Sparse MOEs
16:34 - Router Collapse and Top-K Routing
19:20 - Noisy TopK, Load Balancing
20:56 - Routing Analysis by Mixtral
22:30 - Auxiliary Losses & DeepSeek
24:05 - Expert Capacity
26:07 - 6 Points to Remember
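For the load-balancing and auxiliary-loss chapters above, a common reference point is the Switch Transformer formulation: the auxiliary loss is alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability assigned to expert i. The sketch below is a minimal PyTorch illustration of that formula under top-1 dispatch, not code from the video; load_balancing_loss and alpha are names chosen here.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """router_logits: (num_tokens, n_experts) raw scores from the router."""
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # (num_tokens, n_experts)

    # f_i: fraction of tokens whose top-1 choice is expert i.
    top1 = probs.argmax(dim=-1)                           # (num_tokens,)
    dispatch_frac = F.one_hot(top1, n_experts).float().mean(dim=0)

    # P_i: mean router probability mass assigned to expert i.
    mean_prob = probs.mean(dim=0)

    # Minimized when both distributions are uniform (1 / n_experts each).
    return alpha * n_experts * torch.sum(dispatch_frac * mean_prob)


# Example: balanced vs. collapsed routing over 4 experts.
balanced = torch.eye(4).repeat(25, 1) * 5.0    # tokens spread evenly across experts
collapsed = torch.full((100, 4), -10.0)
collapsed[:, 0] = 10.0                         # every token prefers expert 0
print(load_balancing_loss(balanced).item())    # ~0.01 (alpha, the minimum)
print(load_balancing_loss(collapsed).item())   # ~0.04 (n_experts times larger)
```

The loss grows as routing concentrates on a few experts, which is why adding it to the training objective discourages the router collapse discussed in the video.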
Taught by
Neural Breakdown with AVB