Overview
Explore a 59-minute lecture by Siva Reddy from IVADO - Mila - McGill University, presented at the Simons Institute, examining the vulnerability of aligned language models to jailbreaking attacks. Investigate how these jailbreaks transfer across different types of AI systems, including standard large language models, reasoning-enhanced models, and autonomous agents. The presentation, part of the Safety-Guaranteed LLMs series, offers critical insights into the robustness challenges facing AI safety mechanisms and their implications for developing more secure AI systems.
Syllabus
Robustness of jailbreaking across aligned LLMs, reasoning models and agents
Taught by
Simons Institute