
This conference talk from FAST '25 presents IMPRESS, an importance-informed multi-tier prefix KV storage system designed to optimize large language model (LLM) inference. Learn how researchers from Zhejiang University and Huawei Cloud address the challenge of efficiently storing and reusing prefix key-value pairs (KVs) from repeated contexts in LLM applications. Discover their approach, which identifies important token indices across attention heads and combines this with I/O-efficient algorithms to reduce time to first token (TTFT); a rough sketch of the ranking idea follows below. The presentation demonstrates how IMPRESS reduces TTFT by up to 2.8× compared to state-of-the-art systems while maintaining comparable inference accuracy. This makes it particularly valuable for LLM applications with limited CPU memory, where disk I/O latency becomes a bottleneck.
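The talk summary does not spell out IMPRESS's selection rule, but the general idea of ranking cached prefix tokens by attention importance can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name, the max-across-heads aggregation, and the keep_ratio parameter are all assumptions made for the example.

```python
import torch

def important_token_indices(attn_scores: torch.Tensor,
                            keep_ratio: float = 0.2) -> torch.Tensor:
    """Pick prefix token indices whose cached KVs are worth keeping in fast memory.

    attn_scores: [num_heads, seq_len] attention weights assigned to each
    cached prefix token, per head. (Hypothetical input shape for this sketch;
    IMPRESS's actual scoring is described in the paper.)
    """
    # Aggregate across heads: treat a token as important if any head attends to it.
    per_token = attn_scores.max(dim=0).values          # [seq_len]
    k = max(1, int(keep_ratio * per_token.numel()))
    # Return the indices of the top keep_ratio fraction of tokens.
    return torch.topk(per_token, k).indices

# Toy usage: 8 heads over a 1024-token prefix.
scores = torch.softmax(torch.randn(8, 1024), dim=-1)
hot = important_token_indices(scores, keep_ratio=0.1)
# KVs at `hot` indices would stay in the fast tier (GPU/CPU memory);
# the remaining, less-attended KVs could be demoted to disk.
```

In a multi-tier design like the one the talk describes, a ranking of this kind would decide which prefix KVs occupy scarce CPU memory and which can tolerate disk I/O latency, which is what allows TTFT to drop when prefixes are reused.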