
Efficient LLM Inference with HiRE and Tandem Transformers

Centre for Networked Intelligence, IISc via YouTube

Overview

Join Dr. Praneeth Netrapalli, Research Scientist at Google, for an in-depth talk on improving Large Language Model (LLM) inference efficiency through innovative techniques.

Learn about HiRE, a novel approach that combines dimensionality reduction and quantization to predict significant elements with high recall, enabling more efficient computation in memory-bound LLM operations. Discover how this technique can be applied to softmax and group-sparse FFN layers to reduce computational costs while maintaining accuracy.

The talk also explores the tandem transformers architecture, which pairs a small autoregressive model with a large block-mode model to overcome sequential generation bottlenecks, resulting in faster inference and improved prediction quality.

This presentation is based on collaborative research with colleagues from Google and provides valuable insights for those interested in advancing LLM efficiency and capabilities. Dr. Netrapalli brings extensive expertise as a researcher at Google Research India, an adjunct professor at multiple institutions, and a recipient of several prestigious awards, including the IEEE Signal Processing Society Best Paper Award.
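To make the HiRE idea concrete, the sketch below illustrates the general pattern the description suggests: use a cheap low-rank approximation of the output projection to predict which logits are likely to be large (over-retrieving candidates to keep recall high), then compute exact scores only for those candidates. The rank-r SVD here is a stand-in for whatever learned low-dimensional predictor the talk actually uses, and all sizes (`d`, `vocab`, `r`, `k`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, r, k = 256, 10_000, 32, 50  # hidden dim, vocab size, sketch rank, top-k

W = rng.normal(size=(vocab, d))  # full output-projection matrix
x = rng.normal(size=d)           # hidden state for one decoding step

# Offline: build a cheap rank-r approximation of W (stand-in for the
# learned low-dimensional projection; r << d, so W_lr @ x is much cheaper).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_lr = (U[:, :r] * S[:r]) @ Vt[:r]

# Online step 1: approximate scores predict likely top-k entries.
# Over-retrieve 2x candidates so that recall of the true top-k stays high.
approx = W_lr @ x
cand = np.argpartition(-approx, 2 * k)[: 2 * k]

# Online step 2: exact computation only on the candidate rows,
# instead of the full vocab-sized matrix-vector product.
exact = W[cand] @ x
top = cand[np.argsort(-exact)[:k]]
```

The payoff is that the expensive `vocab x d` matrix-vector product is replaced by a rank-r product plus an exact product over only `2k` rows, which matters precisely in the memory-bound regime the talk targets.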

Syllabus

Time: 5:00 PM - 6:00 PM IST

Taught by

Centre for Networked Intelligence, IISc

