Overview
This tutorial explores techniques for fine-tuning Large Language Models (LLMs) using human feedback, covering both Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). Learn about the evolution from base models to InstructGPT, understand the mechanics of RLHF and its limitations, and discover how DPO offers a more efficient alternative. Follow along with a practical example of fine-tuning the Qwen model on title preferences through a three-step process: curating preference data, fine-tuning with DPO, and evaluating the resulting model. The video includes timestamps for easy navigation through topics like Proximal Policy Optimization (PPO) and provides links to relevant research papers, datasets, and the fine-tuned model on Hugging Face.
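To make the three-step workflow concrete, here is a minimal sketch of DPO fine-tuning using Hugging Face TRL's DPOTrainer. It is not the tutorial's exact code: the Qwen checkpoint name, the toy title-preference examples, and the hyperparameters are illustrative assumptions, and argument names may differ across TRL versions.

```python
# Minimal sketch (not the tutorial's exact code): DPO fine-tuning with Hugging Face TRL.
# The checkpoint name, dataset, and hyperparameters below are illustrative assumptions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; the video uses its own Qwen checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Step 1: curate preference data as (prompt, chosen, rejected) triples -- toy examples only.
preference_data = Dataset.from_dict({
    "prompt": ["Write a title for a post about fine-tuning LLMs with human feedback."],
    "chosen": ["Fine-Tuning LLMs on Human Preferences with DPO"],
    "rejected": ["A Post About LLMs"],
})

# Step 2: fine-tune with DPO. beta controls how far the policy may drift from the reference model.
config = DPOConfig(
    output_dir="qwen-dpo-titles",
    beta=0.1,
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_data,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()

# Step 3: quick check -- generate a title with the fine-tuned model and inspect it.
inputs = tokenizer("Write a title for a post about fine-tuning LLMs with human feedback.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In practice the preference dataset would hold many title pairs rather than a single toy example, and evaluation would compare generations from the base and fine-tuned models side by side, as the tutorial's third step describes.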
Syllabus
Intro - 0:00
Base Models - 0:25
InstructGPT - 2:20
RL from Human Feedback (RLHF) - 5:18
Proximal Policy Optimization (PPO) - 9:20
Limitations of RLHF - 10:30
Direct Preference Optimization (DPO) - 11:50
Example: Fine-tuning Qwen on Title Preferences - 14:29
Step 1: Curate preference data - 17:49
Step 2: Fine-tuning with DPO - 20:53
Step 3: Evaluate the fine-tuned model - 25:27
Taught by
Shaw Talebi