Overview
This 48-minute InfoQ video explores the challenges of scaling Large Language Model (LLM) batch inference and demonstrates how to combine Ray Data with vLLM to achieve high throughput and cost-effective processing. Dive into techniques for leveraging heterogeneous computing resources, implementing fault tolerance for reliability, and optimizing inference pipelines for maximum efficiency. Examine real-world case studies that showcase significant performance improvements and cost reductions when processing large volumes of data through LLMs. Learn practical approaches to overcome common bottlenecks in batch inference workflows and implement scalable solutions for production environments.
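The talk itself covers the details, but as a rough illustration of the Ray Data + vLLM pattern it describes, the sketch below runs a vLLM engine inside a pool of Ray actors so that batches of prompts stream through replicated GPU workers. The model name, prompt dataset, batch size, and concurrency values are placeholder assumptions, not taken from the video, and the exact map_batches arguments vary across Ray versions.

```python
import numpy as np
import ray
from vllm import LLM, SamplingParams

# Placeholder model; any vLLM-supported checkpoint works here.
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

class VLLMPredictor:
    def __init__(self):
        # One vLLM engine per Ray actor; each actor pins one GPU,
        # so model weights are loaded once and reused across batches.
        self.llm = LLM(model=MODEL)
        self.params = SamplingParams(temperature=0.0, max_tokens=128)

    def __call__(self, batch):
        # Ray Data hands batches in as a dict of numpy arrays.
        outputs = self.llm.generate(list(batch["prompt"]), self.params)
        batch["response"] = np.array([o.outputs[0].text for o in outputs])
        return batch

# Toy prompt dataset standing in for a real corpus.
ds = ray.data.from_items(
    [{"prompt": f"Summarize record {i}."} for i in range(1_000)]
)

results = ds.map_batches(
    VLLMPredictor,
    batch_size=64,   # prompts handed to vLLM per call
    num_gpus=1,      # each actor reserves one GPU
    concurrency=4,   # four replicas for data parallelism
)
results.write_parquet("/tmp/llm_outputs")
```

Keeping the engine inside a class-based actor pool is what makes this pattern throughput-friendly: weights stay resident between batches, Ray restarts failed actors for fault tolerance, and raising concurrency scales the same pipeline across heterogeneous clusters.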
Syllabus
Scaling LLM Batch Inference: Ray Data & vLLM for High Throughput
Taught by
InfoQ