
YouTube

IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference

USENIX via YouTube

Overview

This conference talk from FAST '25 presents IMPRESS, an importance-informed multi-tier prefix KV storage system designed to optimize large language model inference. Learn how researchers from Zhejiang University and Huawei Cloud address the challenge of efficiently storing and reusing prefix key-value pairs (KVs) from repeated contexts in LLM applications. Discover their innovative approach that identifies important token indices across attention heads and implements I/O-efficient algorithms to reduce time to first token (TTFT). The presentation demonstrates how IMPRESS can reduce TTFT by up to 2.8× compared to state-of-the-art systems while maintaining comparable inference accuracy, making it particularly valuable for LLM applications with limited CPU memory where disk I/O latency becomes a bottleneck.
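The core idea of selecting important token indices for the fast tier can be illustrated with a minimal sketch. This is not the paper's actual algorithm; all names (`attn`, `top_fraction`, `tier_kv_cache`) are hypothetical, and token importance is approximated here simply as attention aggregated across heads:

```python
# Hypothetical sketch of importance-informed KV tiering. Token importance is
# approximated by summing the attention each token receives across heads;
# the most important tokens' KV entries stay in the fast (CPU-memory) tier,
# the rest spill to the slow (disk) tier. Illustrative only, not IMPRESS itself.

def important_token_indices(attn, top_fraction=0.25):
    """attn: list of per-head score lists, one score per token index.
    Returns indices of the most-attended tokens, in ascending order."""
    num_tokens = len(attn[0])
    # Aggregate importance for each token index across attention heads.
    importance = [sum(head[t] for head in attn) for t in range(num_tokens)]
    k = max(1, int(num_tokens * top_fraction))
    ranked = sorted(range(num_tokens), key=lambda t: importance[t], reverse=True)
    return sorted(ranked[:k])

def tier_kv_cache(kv, attn, top_fraction=0.25):
    """Split prefix KV entries into fast-tier and slow-tier dicts by importance."""
    fast = set(important_token_indices(attn, top_fraction))
    fast_kv = {t: kv[t] for t in fast}
    slow_kv = {t: kv[t] for t in range(len(kv)) if t not in fast}
    return fast_kv, slow_kv
```

In a real system the slow-tier entries would be read back with I/O-efficient batched disk accesses, which is where the talk's TTFT savings come from.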

Syllabus

FAST '25 - IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language...

Taught by

USENIX

