Join Dr. Praneeth Netrapalli, Research Scientist at Google, for an in-depth talk on improving Large Language Model (LLM) inference efficiency. Learn about HiRE, a technique that combines dimensionality reduction and quantization to predict the significant elements of a computation with high recall, enabling more efficient execution of memory-bound LLM operations. Discover how this idea applies to the softmax and group-sparse FFN layers to reduce computational cost while maintaining accuracy. The talk also covers the tandem transformers architecture, which pairs a small autoregressive model with a large model running in block mode to relieve the sequential-generation bottleneck, yielding faster inference and improved prediction quality. The presentation is based on collaborative research with colleagues at Google and offers valuable insights for those interested in advancing LLM efficiency and capabilities. Dr. Netrapalli brings extensive expertise as a researcher at Google Research India, an adjunct professor at multiple institutions, and a recipient of several awards, including the IEEE Signal Processing Society Best Paper Award.
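To make the HiRE idea concrete, here is a minimal NumPy sketch (not the paper's implementation): a cheap low-rank, quantized surrogate of a matrix-vector product predicts which outputs are likely to be large, and the exact computation is then restricted to an over-provisioned candidate set so that recall of the true top-k stays high. All sizes, the random projection, and the int8 scheme below are illustrative assumptions; in HiRE the approximation is learned/calibrated, which is what makes high recall achievable in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 512, 4096, 64   # input dim, output dim, low-rank dim (assumed sizes)

W = rng.standard_normal((n, d)).astype(np.float32)   # full weight matrix
x = rng.standard_normal(d).astype(np.float32)        # input activation

# Cheap surrogate: low-rank projection of W, then int8 quantization.
# (A random projection here; HiRE would use a learned approximation.)
P = rng.standard_normal((r, d)).astype(np.float32) / np.sqrt(r)
Wp = W @ P.T                                   # (n, r) low-rank sketch of W
scale = np.abs(Wp).max() / 127.0
Wq = np.round(Wp / scale).astype(np.int8)      # quantized sketch

# Approximate scores from the surrogate, then keep an over-provisioned
# candidate set so the true large elements are rarely missed.
approx = (Wq.astype(np.float32) * scale) @ (P @ x)
k, overshoot = 32, 4
cand = np.argpartition(-approx, k * overshoot)[: k * overshoot]

# Exact (expensive) computation only on the predicted candidates.
exact_cand = W[cand] @ x
topk_pred = cand[np.argpartition(-exact_cand, k)[:k]]

# Recall of the prediction vs. the true top-k of the full product.
true_topk = np.argpartition(-(W @ x), k)[:k]
recall = len(set(topk_pred) & set(true_topk)) / k
print(f"recall of predicted top-{k}: {recall:.2f}")
```

With a learned surrogate, recall approaches 1 and the exact computation touches only k·overshoot of the n rows, which is where the memory-bandwidth savings come from.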
Efficient LLM Inference with HiRE and Tandem Transformers
Centre for Networked Intelligence, IISc via YouTube
Time: 5:00 PM - 6:00 PM IST