
Test-Time Preference Optimization: On-the-Fly AI Alignment via Iterative Feedback

Discover AI via YouTube

Overview

Learn about Test-Time Preference Optimization (TPO), a groundbreaking approach to AI alignment presented in this 14-minute video from Discover AI. Explore how large language models can adapt dynamically through natural language critiques without requiring retraining or parameter updates. Delve into the process by which textual feedback is converted into "textual gradients," allowing a model to iteratively refine its responses in real time. Understand how an unaligned model can potentially outperform fine-tuned versions through self-generated critiques, combining symbolic reasoning in natural language with lightweight inference-time computation. Discover how this research from Shanghai AI Laboratory and The Chinese University of Hong Kong represents a significant advancement in on-the-fly AI alignment techniques.
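The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the paper's actual prompts or implementation: the `llm` callable, the iteration count, and the prompt wording are all assumptions introduced here for clarity.

```python
from typing import Callable


def test_time_preference_optimization(
    llm: Callable[[str], str],  # any text-in/text-out model endpoint (placeholder)
    query: str,
    num_iterations: int = 3,
) -> str:
    """Iteratively refine a response using self-generated textual critiques.

    No model parameters are updated; each critique acts as a "textual
    gradient" that steers the next revision at inference time.
    """
    # Initial draft from the (possibly unaligned) model.
    response = llm(f"Answer the following query:\n{query}")

    for _ in range(num_iterations):
        # 1. Critique: the model judges its own draft against preference
        #    criteria expressed in natural language.
        critique = llm(
            "Critique the response below for helpfulness, accuracy, and "
            "safety. Point out concrete weaknesses.\n\n"
            f"Query: {query}\n\nResponse: {response}"
        )

        # 2. Refine: the critique is fed back as a natural-language
        #    "gradient" and the model produces an improved response.
        response = llm(
            "Revise the response so that it addresses every point in the "
            "critique while staying faithful to the query.\n\n"
            f"Query: {query}\n\nResponse: {response}\n\nCritique: {critique}"
        )

    return response
```

Because only prompts change between iterations, the same frozen model serves as generator, critic, and reviser, which is what makes the alignment "on-the-fly."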

Syllabus

DPO to TPO: Test-Time Preference Optimization (RL)

Taught by

Discover AI
