More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling With Dynamic Resource Allocation
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores how Dynamic Resource Allocation (DRA) in Kubernetes addresses the challenges of scheduling multi-host GPU/TPU workloads. Learn how to manage large-scale AI/ML training jobs that require numerous nodes and accelerators, while handling potential failures in the infrastructure. Discover how DRA, which reached beta status in Kubernetes 1.32, extends beyond individual device management to support resources accessible across multiple nodes. The presenters demonstrate how combining DRA with the alpha partitionable device model in version 1.33 enables correct modeling of complex multi-host, multi-accelerator topologies, allowing workloads to be scheduled as a unified entity. Gain insights into how these technologies are being applied to massive compute clusters like NVIDIA GB200 and multi-host TPU slices, revolutionizing how AI/ML workloads run on Kubernetes.
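As an illustrative sketch of the DRA API the talk builds on (not code from the presentation itself), a workload can request accelerators through a ResourceClaim and reference that claim from a Pod. The device class name `tpu.example.com`, the claim name, and the container image below are hypothetical placeholders; real names come from the DRA driver installed in the cluster:

```yaml
# Sketch only: a DRA ResourceClaim plus a Pod that consumes it.
# Assumes the resource.k8s.io/v1beta1 API (beta as of Kubernetes 1.32)
# and a hypothetical device class "tpu.example.com".
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: multi-host-accelerators
spec:
  devices:
    requests:
    - name: accelerators
      deviceClassName: tpu.example.com   # placeholder driver-defined class
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
  - name: accel
    resourceClaimName: multi-host-accelerators
  containers:
  - name: trainer
    image: example.com/trainer:latest    # placeholder image
    resources:
      claims:
      - name: accel                      # binds the container to the claim
```

The scheduler then allocates devices to the claim before placing the Pod, which is what lets DRA treat device topology, rather than a per-node resource count, as the unit of scheduling.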
Syllabus
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling With Dynamic Resource Allocation - John Belamaric & Morten Torkildsen
Taught by
CNCF [Cloud Native Computing Foundation]