GRPO - Group Relative Policy Optimization: How DeepSeek Trains Reasoning Models

Serrano.Academy via YouTube

Overview

Explore Group Relative Policy Optimization (GRPO), the reinforcement learning technique DeepSeek uses to train its reasoning model, in this 22-minute educational video from Serrano.Academy. See how GRPO differs from self-supervised learning: instead of only imitating existing text, the model improves by generating its own responses and learning from how they are scored. Compare DeepSeek's reasoning with ChatGPT's through practical examples, then work through the GRPO scoring mechanism and its key concepts: context-based answering, the quality advantage, response probability calculations, and response clipping. The video breaks these technical ideas into accessible explanations and is part of a broader series on reinforcement learning for large language models. It suits viewers interested in the technical foundations of current AI reasoning systems.

Syllabus

00:00 Introduction
00:26 Answering with context
01:40 DeepSeek vs ChatGPT
05:30 The GRPO score
07:05 Averaging over answers and steps
07:38 Quality Advantage
10:30 Probability of responses
15:36 Clipping the response
18:21 Not changing the model too much
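
For readers who want a concrete picture of two syllabus topics, "Quality Advantage" and "Clipping the response", here is a minimal Python sketch. It follows the commonly described GRPO formulation (rewards normalized within a group of sampled responses, plus a PPO-style clipped probability ratio) and is not taken from the video, whose exact presentation may differ. The function names, the sample rewards, and the `eps=0.2` default are illustrative assumptions, and the penalty for drifting from the reference model ("Not changing the model too much") is left out.

```python
import math


def group_relative_advantages(rewards):
    """Score each response relative to its group: (reward - mean) / std.

    Hypothetical helper: assumes one scalar reward per sampled response.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    if std == 0:
        # Every response scored the same, so none stands out.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def clipped_term(ratio, advantage, eps=0.2):
    """PPO-style clipped objective term for one response.

    ratio: new-policy probability divided by old-policy probability.
    eps:   assumed clipping width; 0.2 is only an illustrative default.
    """
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # further outside [1 - eps, 1 + eps] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


# Four sampled answers to one prompt: two judged correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
terms = [clipped_term(1.5, a) for a in advantages]
```

In this sketch the two correct answers receive a positive advantage and the two incorrect ones a negative advantage, so an update would raise the probability of the former and lower that of the latter, with clipping limiting how much credit a single step can claim.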

Taught by

Serrano.Academy
