Overview
This talk by Daniel Beaglehole from UC San Diego explores a general method for detecting semantic concepts in the internal activations of Large Language Models (LLMs) and for adapting this methodology to steer LLMs toward desirable outputs. Learn about innovative approaches, including nonlinear feature learning to identify important linear directions for predicting concepts from each layer, and techniques for aggregating features across layers to build powerful concept detectors and steering mechanisms. The presentation showcases state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content across seven benchmarks. Discover how this general approach can steer LLMs toward new concepts, including semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, poetic/Shakespearean English, multiple concepts simultaneously, and even concepts with numerical attributes such as product reviews.
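To make the detect-and-steer pipeline concrete, here is a minimal illustrative sketch, not the presenter's actual method: it substitutes plain logistic-regression probes for the nonlinear feature learning described in the talk, and the function names, the `acts_by_layer` input format, and the add-a-vector steering step are assumptions made for illustration only.

```python
# Illustrative sketch (not the talk's exact algorithm): fit a linear concept
# probe on the hidden activations of each layer, aggregate the per-layer
# scores into one detector, and reuse a probe's direction as a steering vector.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_layer_probes(acts_by_layer, labels):
    """acts_by_layer: list of (n_examples, hidden_dim) arrays, one per layer.
    labels: (n_examples,) binary concept labels (e.g. hallucinated vs. not)."""
    return [LogisticRegression(max_iter=1000).fit(acts, labels)
            for acts in acts_by_layer]

def layer_scores(probes, acts_by_layer):
    """Stack each layer probe's concept probability into an
    (n_examples, n_layers) score matrix."""
    cols = [p.predict_proba(a)[:, 1] for p, a in zip(probes, acts_by_layer)]
    return np.stack(cols, axis=1)

def fit_aggregator(probes, acts_by_layer, labels):
    """Combine the per-layer scores into a single cross-layer concept detector."""
    return LogisticRegression(max_iter=1000).fit(
        layer_scores(probes, acts_by_layer), labels)

def steering_vector(probe, scale=1.0):
    """Unit-norm concept direction taken from one layer's probe.
    Adding (or subtracting) it to that layer's activations at generation
    time nudges outputs toward (or away from) the concept."""
    w = probe.coef_.ravel()
    return scale * w / np.linalg.norm(w)
```

In this sketch, detection comes from aggregating evidence across all layers, while steering uses a single layer's learned direction applied additively during generation; both reuse the same per-layer concept directions, which mirrors the detect-then-steer framing of the talk at a high level.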
Syllabus
Detection and Steering in LLMs using Feature Learning
Taught by
Simons Institute