Test-Time Preference Optimization: On-the-Fly AI Alignment via Iterative Feedback
Discover AI via YouTube
Overview
Learn about Test-Time Preference Optimization (TPO), an approach to AI alignment presented in this 14-minute video from Discover AI. Explore how large language models can adapt to human preferences at inference time through natural language critiques, without retraining or parameter updates. Delve into the process by which textual feedback is converted into "textual gradients" that let a model iteratively refine its responses in real time. Understand how an unaligned model can potentially outperform fine-tuned versions through self-generated critiques, fusing symbolic (text-based) reasoning with lightweight test-time computation. Finally, discover how this research from Shanghai AI Laboratory and The Chinese University of Hong Kong advances on-the-fly AI alignment.
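The refinement loop the video describes can be pictured in a few lines of Python. The sketch below is a minimal illustration, assuming a generic text-in/text-out `llm` callable; the prompts and the critique → textual-gradient → update decomposition are illustrative assumptions drawn from the overview above, not the authors' exact prompts or implementation.

```python
from typing import Callable

def tpo(llm: Callable[[str], str], query: str, num_iterations: int = 3) -> str:
    """Sketch of a TPO-style test-time refinement loop (assumptions noted above)."""
    response = llm(query)  # initial draft from the (possibly unaligned) model
    for _ in range(num_iterations):
        # "Textual loss": a natural-language critique of the current draft,
        # standing in for a numeric reward signal.
        critique = llm(
            f"Query: {query}\nResponse: {response}\n"
            "Critique this response for helpfulness, harmlessness, and honesty."
        )
        # "Textual gradient": concrete edit directions derived from the critique,
        # analogous to a gradient pointing toward a better response.
        gradient = llm(
            f"Critique: {critique}\n"
            "List concrete revisions that would address this critique."
        )
        # Update step: apply the revisions to produce the next iterate.
        # No model parameters change; only the response text is updated.
        response = llm(
            f"Query: {query}\nDraft: {response}\nRevisions: {gradient}\n"
            "Rewrite the draft, applying the revisions. Return only the rewrite."
        )
    return response
```

Note that the model's weights stay frozen throughout; only the response text is optimized across iterations, which is what makes the alignment happen at test time rather than during training.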
Syllabus
DPO to TPO: Test-Time Preference Optimization (RL)
Taught by
Discover AI