Overview
Explore a detailed technical video lecture examining the advanced reasoning capabilities of large language models, focusing on OpenAI's o1 and DeepSeek-R1, models trained with large-scale reinforcement learning. Delve into key concepts including test-time compute, "thinking" tokens, and the reinforcement learning techniques behind them. Learn about the R1-Zero methodology, covering its prompt template, reward mechanisms, and the technical details of GRPO (Group Relative Policy Optimization). Understand DeepSeek-R1's four-step training process, from supervised fine-tuning with Chain-of-Thought data through to the final reinforcement learning and RLHF stages. Gain practical insights on accessing DeepSeek models, and consider the broader implications of these advances for AI reasoning.
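To make the R1-Zero pieces mentioned above concrete, here is a minimal Python sketch of the two ideas the lecture covers: a rule-based reward that checks the <think>/<answer> output format, and GRPO's group-relative advantage, which normalizes each sampled completion's reward against the other completions for the same prompt instead of using a learned value function. The regex, reward weights, and function names are illustrative assumptions, not the lecture's or DeepSeek's exact implementation, and the full GRPO objective (PPO-style clipped ratios plus a KL penalty) is omitted.

```python
import re
import statistics

# Illustrative rule-based reward in the spirit of R1-Zero: a small format
# reward for wrapping reasoning in <think></think> before <answer></answer>,
# plus an accuracy reward when the extracted answer matches a reference.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, reference: str) -> float:
    match = FORMAT_RE.search(completion)
    if match is None:
        return 0.0                    # no reward if the format is violated
    reward = 0.1                      # format reward (weight is assumed)
    if match.group(1).strip() == reference.strip():
        reward += 1.0                 # accuracy reward for a correct answer
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core idea: score each completion relative to its sampled
    group, replacing a learned value-function baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: four completions sampled for one prompt, scored and normalized.
completions = [
    "<think>2 + 2 equals 4</think> <answer>4</answer>",
    "<think>guessing</think> <answer>5</answer>",
    "4",                              # missing tags -> zero reward
    "<think>simple addition</think> <answer>4</answer>",
]
rewards = [rule_based_reward(c, "4") for c in completions]
print(group_relative_advantages(rewards))
```

Completions that are both well-formatted and correct receive positive advantages, while malformed or wrong ones receive negative advantages, which is what pushes the policy toward the desired reasoning format during RL.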
Syllabus
Intro - 0:00
OpenAI's o1 - 0:33
Test-time Compute - 1:33
"Thinking" Tokens - 3:50
DeepSeek Paper - 5:58
Reinforcement Learning - 7:22
R1-Zero: Prompt Template - 9:28
R1-Zero: Reward - 10:53
R1-Zero: GRPO technical - 12:53
R1-Zero: Results - 20:00
DeepSeek R1 - 23:32
Step 1: SFT with CoT - 24:47
Step 2: R1-Zero Style RL - 26:14
Step 3: SFT with Mixed Data - 27:03
Step 4: RL & RLHF - 28:26
Accessing DeepSeek Models - 29:18
Conclusions - 30:10
Taught by
Shaw Talebi