Overview
Learn how to implement the rule-based AI reward modeling used by DeepSeek and O1 with verifiers in a comprehensive 45-minute video tutorial. Master the creation of verifiers, including format, accuracy, boxed, and limerick types, that shape AI-generated outputs. Explore sampling strategies such as greedy versus top-p sampling, and discover how to generate custom verifier datasets. Follow along to understand how to collect prompts from a DeepSeek teacher model and apply the same techniques used in the DeepSeek paper to evoke long chains of thought. Gain practical experience in fine-tuning models through supervised fine-tuning (SFT) on the collected prompts and evaluating their performance and chain-of-thought quality. Access companion GitHub repositories for fine-tuning, verifiers, and synthetic math data generation to build your own reasoning model locally. Perfect for both AI enthusiasts and experienced practitioners looking to enhance their understanding of AI reward strategies and verifier implementation.
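To make the verifier idea concrete before the syllabus, here is a minimal sketch of two rule-based rewards in the spirit of the video: a format verifier that checks for DeepSeek-R1-style <think>/<answer> tags, and a boxed-accuracy verifier that compares the last \boxed{...} expression against a reference answer. The function names, reward values, and exact tag convention shown here are illustrative assumptions, not the API of the companion repositories.

```python
import re

# Format verifier: reward 1.0 when the completion wraps its reasoning and
# final answer in <think>...</think><answer>...</answer> tags, else 0.0.
THINK_ANSWER_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    return 1.0 if THINK_ANSWER_RE.match(completion.strip()) else 0.0

# Boxed-accuracy verifier: pull the last \boxed{...} expression out of the
# completion and compare it to the reference answer after trimming whitespace.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def boxed_accuracy_reward(completion: str, reference: str) -> float:
    matches = BOXED_RE.findall(completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0

if __name__ == "__main__":
    sample = (
        "<think>17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.</think>"
        "<answer>\\boxed{391}</answer>"
    )
    print(format_reward(sample))                 # 1.0
    print(boxed_accuracy_reward(sample, "391"))  # 1.0
```

For the greedy versus top-p comparison, the sketch below uses the Hugging Face transformers generate() API; the checkpoint name is a placeholder and not necessarily the DeepSeek teacher model used in the video.

```python
# Contrast greedy decoding with top-p (nucleus) sampling using transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "Solve step by step: what is 17 * 23?"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy: always take the most probable next token (deterministic output).
greedy_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Top-p: sample from the smallest set of tokens whose cumulative probability
# exceeds p, which tends to produce more varied chains of thought.
top_p_ids = model.generate(
    **inputs, do_sample=True, top_p=0.9, temperature=0.7, max_new_tokens=128
)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(top_p_ids[0], skip_special_tokens=True))
```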
Syllabus
00:00 - intro
00:53 - deepseek reward modelling
03:20 - format reward verifier
07:31 - accuracy reward verifier
09:43 - boxed reward verifier
12:11 - verifier answer verifier
13:39 - limerick verifier
16:25 - llm verifiers
18:29 - evoking outputs deepseek style
19:07 - greedy sampling
23:10 - top p sampling
30:00 - generating verifier datasets
33:00 - collecting prompts from teacher model deepseek
37:00 - sft training on collected prompts
37:33 - inferring from trained model
38:50 - chain of thought quality
Taught by
Chris Hay