Overview
This video presentation traces the evolution of reinforcement learning algorithms from DeepSeek's GRPO to the more advanced DAPO and VAPO frameworks developed by researchers from ByteDance Seed and Tsinghua University. It covers the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and the four key techniques that make reinforcement learning more effective and efficient for complex Chain-of-Thought (CoT) reasoning tasks. The 26-minute talk details how these innovations overcome the limitations of traditional RL approaches on advanced reasoning problems, and highlights the collaborative work of a large research team led by Yu Yue under the supervision of senior researchers at ByteDance.
Syllabus
DeepSeek's GRPO evolved to VAPO (CoT Reasoning)
Taught by
Discover AI