Overview
In this 46-minute talk from the Simons Institute's Safety-Guaranteed LLMs workshop, Jacob Hilton of the Alignment Research Center examines how analyzing the internal mechanisms of large language models can support probabilistic safety guarantees for their behavior. The presentation covers technical approaches to deriving more reliable safety assurances for advanced AI systems from model internals.
Syllabus
Probabilistic Safety Guarantees Using Model Internals
Taught by
Simons Institute