Detection and Steering in LLMs using Feature Learning

Simons Institute via YouTube

Overview

In this talk, Daniel Beaglehole of UC San Diego presents a general method for detecting semantic concepts in the internal activations of Large Language Models (LLMs) and shows how the same methodology can be adapted to steer LLMs toward desirable outputs. The method uses nonlinear feature learning to identify the linear directions that matter most for predicting a concept at each layer, then aggregates these features across layers to build concept detectors and steering mechanisms. The presentation reports state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content across seven benchmarks. It also shows that the approach can steer LLMs toward a range of new concepts: semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, poetic/Shakespearean English, multiple concepts simultaneously, and concepts with numerical attributes such as product reviews.
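The detect-then-steer idea described above can be illustrated with a small, self-contained sketch. It is not the method from the talk: it runs on synthetic arrays rather than real LLM activations, and it uses a plain difference-of-means direction per layer in place of nonlinear feature learning. All function names, the planted concept direction, and the least-squares aggregator are assumptions made for illustration only.

```python
import numpy as np


def make_synthetic_activations(n_layers=4, n_samples=200, dim=16, seed=0):
    """Hypothetical stand-in for per-layer LLM activations.

    Each layer gets Gaussian noise plus a planted unit "concept" direction
    that is added only to concept-positive samples.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)
    true_dirs = rng.normal(size=(n_layers, dim))
    true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)
    acts = rng.normal(size=(n_layers, n_samples, dim))
    acts += 3.0 * labels[None, :, None] * true_dirs[:, None, :]
    return acts, labels, true_dirs


def concept_directions(acts, labels):
    """One unit direction per layer via difference of class means.

    This is a simple substitute for learned features, not the talk's method.
    """
    pos = acts[:, labels == 1].mean(axis=1)
    neg = acts[:, labels == 0].mean(axis=1)
    dirs = pos - neg
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)


def layer_scores(acts, dirs):
    """Project activations onto each layer's direction: (n_samples, n_layers)."""
    return np.einsum("lnd,ld->nl", acts, dirs)


def fit_aggregator(scores, labels):
    """Combine per-layer scores with a least-squares linear model plus bias."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X, labels.astype(float), rcond=None)
    return w


def detect(scores, w):
    """Threshold the aggregated score at 0.5 to predict concept presence."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return (X @ w > 0.5).astype(int)


def steer(acts, dirs, alpha):
    """Shift every activation by alpha along its layer's concept direction."""
    return acts + alpha * dirs[:, None, :]


acts, labels, true_dirs = make_synthetic_activations()
dirs = concept_directions(acts, labels)
scores = layer_scores(acts, dirs)
weights = fit_aggregator(scores, labels)
accuracy = (detect(scores, weights) == labels).mean()
steered_scores = layer_scores(steer(acts, dirs, alpha=2.0), dirs)
print(f"training accuracy of aggregated detector: {accuracy:.3f}")
print(f"mean score shift after steering: {(steered_scores - scores).mean():.3f}")
```

Because each direction has unit norm, steering with `alpha=2.0` raises every per-layer projection by 2.0, which makes the effect of the intervention easy to check. In a real setting the activations would come from a model's hidden states and the detector would be evaluated on held-out data.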

Taught by

Simons Institute
