
YouTube

Fine-tuning LLMs on Human Feedback (RLHF + DPO)

Shaw Talebi via YouTube

Overview

This tutorial explores techniques for fine-tuning Large Language Models (LLMs) using human feedback, covering both Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). Learn about the evolution from base models to InstructGPT, understand the mechanics of RLHF and its limitations, and discover how DPO offers a more efficient alternative. Follow along with a practical example of fine-tuning a Qwen model on title preferences through a three-step process: curating preference data, fine-tuning with DPO, and evaluating the resulting model. The video includes timestamps for easy navigation through topics such as Proximal Policy Optimization (PPO) and provides links to relevant research papers, datasets, and the fine-tuned model on Hugging Face.
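As a rough illustration of the three steps described above, here is a minimal sketch of DPO fine-tuning using the Hugging Face TRL library. The Qwen checkpoint, the toy preference rows, and the hyperparameters are placeholders chosen for this sketch, not the exact setup used in the video.

```python
# Minimal DPO fine-tuning sketch (illustrative only; model name, data, and
# hyperparameters are assumptions, not the video's exact configuration).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical Qwen checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Step 1: curate preference data -- each row pairs a prompt with a preferred
# ("chosen") and a less-preferred ("rejected") completion (here, titles).
train_dataset = Dataset.from_dict({
    "prompt": ["Write a title for a video about DPO fine-tuning."],
    "chosen": ["Fine-tuning LLMs on Human Feedback (RLHF + DPO)"],
    "rejected": ["A video about language models"],
})

# Step 2: fine-tune with DPO -- the trainer optimizes the policy directly on
# preference pairs, with no separately trained reward model.
training_args = DPOConfig(output_dir="qwen-dpo-titles", beta=0.1, num_train_epochs=1)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()

# Step 3: evaluate -- e.g., generate titles with the fine-tuned model and
# compare them against the base model's outputs on held-out prompts.
```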

Syllabus

Intro - 0:00
Base Models - 0:25
InstructGPT - 2:20
Reinforcement Learning from Human Feedback (RLHF) - 5:18
Proximal Policy Optimization (PPO) - 9:20
Limitations of RLHF - 10:30
Direct Preference Optimization (DPO) - 11:50
Example: Fine-tuning Qwen on Title Preferences - 14:29
Step 1: Curate preference data - 17:49
Step 2: Fine-tuning with DPO - 20:53
Step 3: Evaluate the fine-tuned model - 25:27

Taught by

Shaw Talebi

