Advanced Model Serving Techniques with Ray on Kubernetes
CNCF (Cloud Native Computing Foundation) via YouTube
Overview
Watch a 42-minute conference talk exploring advanced techniques for serving large language models with Ray on Kubernetes. Dive into sophisticated model serving approaches, including model composition, model multiplexing, and fractional GPU scheduling, presented by experts from Google and Anyscale. Learn about cutting-edge GPU-native communication initiatives in Ray and how they integrate with Kubernetes Dynamic Resource Allocation (DRA) to enable tensor parallelism across multiple GPUs. Experience a live demonstration showing how KubeRay puts these techniques into practice for real-world LLM deployments, highlighting Ray's ability to scale and orchestrate open-source models across diverse hardware accelerators and failure domains.
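For orientation before watching, the sketch below illustrates two of the techniques the talk covers, model multiplexing and fractional GPU scheduling, using Ray Serve's public APIs. It is a minimal example under stated assumptions, not the speakers' own code: the deployment class, the stand-in model, and its generate method are illustrative placeholders.

```python
# Minimal Ray Serve sketch: fractional GPU scheduling (num_gpus=0.5) plus
# model multiplexing. DummyModel and its generate() call are placeholders.
from ray import serve
from starlette.requests import Request


class DummyModel:
    """Stand-in for a real LLM wrapper; replace with your own loading logic."""
    def __init__(self, model_id: str):
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        return f"[{self.model_id}] echo: {prompt}"


@serve.deployment(ray_actor_options={"num_gpus": 0.5})  # two replicas can share one GPU
class MultiplexedLLM:
    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, model_id: str) -> DummyModel:
        # Ray Serve caches loaded models per replica and evicts the
        # least-recently-used one when the limit above is exceeded.
        return DummyModel(model_id)

    async def __call__(self, request: Request) -> str:
        # The target model is taken from the "serve_multiplexed_model_id"
        # request header set by the caller.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        prompt = (await request.body()).decode()
        return model.generate(prompt)


app = MultiplexedLLM.bind()
# serve.run(app)  # run locally; on Kubernetes, a KubeRay RayService manages the same app
```

In the talk, the same ideas are shown in the context of a KubeRay deployment, where the RayService custom resource manages the Serve application and the underlying Ray cluster.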
Syllabus
Advanced Model Serving Techniques with Ray on Kubernetes - Andrew Sy Kim & Kai-Hsun Chen
Taught by
CNCF (Cloud Native Computing Foundation)