Overview
Dive deep into the mathematical foundations of the KL divergence term in DeepSeek R1's GRPO in this 20-minute technical tutorial. Learn the key differences between GRPO's and PPO's approaches to the KL divergence penalty, starting with a comprehensive refresher on the concept. Follow along with detailed explanations of Monte Carlo estimation and explore three estimators: the log-ratio (k1), the squared log-ratio (k2), and the difference-based estimator (k3). Examine practical benchmarking results and gain valuable insights from Schulman's influential blog post on approximating KL divergence. Perfect for machine learning practitioners seeking to understand the mathematical underpinnings of modern reinforcement learning algorithms.
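For reference, here is a minimal Python sketch of the three estimators covered in the video, written for KL(q || p) with samples drawn from q, following the conventions of Schulman's blog post. This is an illustration rather than code from the tutorial; the function names and NumPy signatures are assumptions for this example.

```python
import numpy as np

# Estimators of KL(q || p), evaluated at samples x ~ q.
# logp and logq are the log-densities of p and q at those samples.

def k1(logp: np.ndarray, logq: np.ndarray) -> np.ndarray:
    # k1 = log(q/p): unbiased, but high variance and can go negative
    return logq - logp

def k2(logp: np.ndarray, logq: np.ndarray) -> np.ndarray:
    # k2 = 0.5 * (log(p/q))^2: always non-negative and lower variance, but biased
    return 0.5 * (logp - logq) ** 2

def k3(logp: np.ndarray, logq: np.ndarray) -> np.ndarray:
    # k3 = p/q - 1 - log(p/q): unbiased and non-negative
    # (this is the per-token form used in GRPO's KL penalty)
    log_ratio = logp - logq
    return np.exp(log_ratio) - 1.0 - log_ratio
```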
Syllabus
- Introduction: 0:00
- KL Divergence in GRPO vs PPO: 1:00
- KL Divergence refresher: 2:30
- Monte Carlo estimation of KL divergence: 6:42
- Schulman blog: 7:58
- k1 = log(q/p): 8:55
- k2 = 0.5 * (log(p/q))^2: 11:23
- k3 = p/q - 1 - log(p/q): 13:35
- Benchmarking (a small Monte Carlo comparison is sketched after this syllabus): 15:58
- Takeaways: 18:43
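As a companion to the benchmarking segment, the snippet below is a small, self-contained Monte Carlo comparison under an assumed toy setup (not the video's exact experiment): two unit-variance Gaussians whose true KL divergence is known in closed form, so the sample mean and spread of each estimator can be checked numerically.

```python
import numpy as np

# Assumed toy setup: q = N(0, 1), p = N(0.1, 1), so
# KL(q || p) = (mu_q - mu_p)^2 / (2 * sigma^2) = 0.005 exactly.
rng = np.random.default_rng(0)
mu_q, mu_p, sigma = 0.0, 0.1, 1.0
true_kl = (mu_q - mu_p) ** 2 / (2 * sigma**2)

x = rng.normal(mu_q, sigma, size=500_000)  # samples from q
# log p(x) - log q(x); normalization constants cancel for equal variances
log_ratio = -0.5 * ((x - mu_p) ** 2 - (x - mu_q) ** 2) / sigma**2

estimates = {
    "k1": -log_ratio,                              # log(q/p)
    "k2": 0.5 * log_ratio**2,                      # 0.5 * (log(p/q))^2
    "k3": np.exp(log_ratio) - 1.0 - log_ratio,     # p/q - 1 - log(p/q)
}
print(f"true KL = {true_kl:.4f}")
for name, values in estimates.items():
    print(f"{name}: mean = {values.mean():.4f}  std = {values.std():.4f}")
```

Since k1 and k3 are unbiased, their sample means should land near the true value, while the printed standard deviations make the variance gap between k1 and the other two estimators visible.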
Taught by
Yacine Mahdid