Optimizing Metrics Collection and Serving When Autoscaling LLM Workloads
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore strategies for optimizing metrics collection and serving when autoscaling Large Language Model (LLM) workloads in this 35-minute conference talk presented by Vincent Hou of Bloomberg and Jiří Kremser of kedify.io at a CNCF event. Learn how to balance resource provisioning for LLM workloads to maintain both cost efficiency and service quality using Kubernetes' Horizontal Pod Autoscaling capabilities. The presentation covers the fundamentals of horizontal autoscaling in Kubernetes, the challenges unique to LLM workloads, a comparative analysis of existing Kubernetes autoscaling solutions for custom metrics with their advantages and disadvantages, and techniques for improving scaling responsiveness through push-based metrics collection. The speakers demonstrate an integrated solution built on KServe, the OpenTelemetry Collector, and KEDA to show how LLM workload autoscaling can be optimized in cloud-native environments.
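The push-based approach highlighted in the talk can be approximated with the OpenTelemetry SDK: rather than waiting to be scraped, the serving process periodically exports its own metrics to an OpenTelemetry Collector, which an autoscaler such as KEDA can then consume. The Python sketch below illustrates the idea; the collector endpoint, the metric name llm_request_queue_depth, and the queue-depth source are illustrative assumptions, not details taken from the talk.

```python
# Minimal sketch of push-based metrics export from an LLM serving process.
# Assumptions (not from the talk): the OTel Collector accepts OTLP/gRPC at
# otel-collector:4317, and request queue depth is the scaling signal.
import random  # stand-in for a real queue-depth source
import time

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter


def observe_queue_depth(options: CallbackOptions):
    # A real server would read its inference request queue here;
    # a random value keeps the sketch self-contained.
    yield Observation(random.randint(0, 50), {"model": "demo-llm"})


# Push metrics every 5 seconds instead of waiting for a scrape interval,
# which is what makes scale-up decisions more responsive.
exporter = OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("llm-serving")
meter.create_observable_gauge(
    "llm_request_queue_depth",
    callbacks=[observe_queue_depth],
    description="Pending inference requests, used as the autoscaling signal",
)

if __name__ == "__main__":
    while True:
        time.sleep(60)  # keep the process alive so the reader keeps exporting
```

On the autoscaler side, a KEDA ScaledObject would then reference the metric as exposed by the collector; the exact wiring depends on how the collector pipeline is configured, which the talk covers in the context of the KServe, OpenTelemetry Collector, and KEDA integration.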
Syllabus
Optimizing Metrics Collection & Serving When Autoscaling LLM Workloads - Vincent Hou & Jiří Kremser
Taught by
CNCF [Cloud Native Computing Foundation]