This talk by Geoffrey Irving of the UK AI Safety Institute explores the theoretical foundations and practical applications of scalable oversight for AI alignment. Learn about recent advances in computational complexity, multi-agent training dynamics, and learning theory that aim to provide theoretical safety guarantees under simplified assumptions about human feedback. Discover the "prover-predictor game" variant of debate, which addresses the "obfuscated arguments" problem observed in earlier debate experiments while letting ML systems operate more efficiently by using ML-checkable arguments. Examine how these methods might extend to more realistic human-feedback scenarios and stronger solution requirements, drawing on underused tools from theoretical computer science. Understand how these approaches, structured as zero-sum adversarial team games, might translate into practical, convergent training methods that offer asymptotic safety guarantees with real-world applicability.
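As a rough intuition for the zero-sum framing mentioned above, the sketch below runs fictitious play on a tiny two-player zero-sum matrix game. This toy example is not from the talk: the payoff matrix, player roles, and iteration count are illustrative assumptions, but it shows the kind of convergent dynamics one can hope for in zero-sum games, where each player's empirical strategy approaches an equilibrium mixture.

```python
import numpy as np

# Hypothetical toy example (not the talk's method): fictitious play on a
# zero-sum matrix game. Row player maximizes, column player minimizes the
# same payoff, so the game is zero-sum. The matching-pennies structure
# forces a mixed equilibrium at (0.5, 0.5) for each player.
payoffs = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])

row_counts = np.ones(2)  # empirical play counts, uniform prior
col_counts = np.ones(2)

for _ in range(20000):
    # Each player best-responds to the opponent's empirical mixture.
    col_mix = col_counts / col_counts.sum()
    row_play = int(np.argmax(payoffs @ col_mix))   # maximize payoff
    row_mix = row_counts / row_counts.sum()
    col_play = int(np.argmin(row_mix @ payoffs))   # minimize payoff
    row_counts[row_play] += 1
    col_counts[col_play] += 1

# Empirical mixtures converge toward the equilibrium (approximately 0.5 each).
print(row_counts / row_counts.sum())
print(col_counts / col_counts.sum())
```

In zero-sum games, fictitious play's empirical frequencies are known to converge to an equilibrium; this simple convergence property is part of why framing oversight as a zero-sum game is attractive for training guarantees.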