
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling With Dynamic Resource Allocation

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk explores how Dynamic Resource Allocation (DRA) in Kubernetes addresses the challenges of scheduling multi-host GPU/TPU workloads. Learn how to manage large-scale AI/ML training jobs that span many nodes and accelerators while tolerating failures in the underlying infrastructure. Discover how DRA, which reached beta status in Kubernetes 1.32, extends beyond individual device management to support resources accessible across multiple nodes. The presenters demonstrate how combining DRA with the alpha partitionable device model in version 1.33 enables correct modeling of complex multi-host, multi-accelerator topologies, so a workload can be scheduled as a single unified entity. Gain insights into how these technologies are being applied to large multi-node systems such as NVIDIA GB200 and multi-host TPU slices, changing how AI/ML workloads run on Kubernetes.
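For readers who want a concrete picture of what a multi-device request looks like, below is a minimal, illustrative Go sketch that constructs a DRA ResourceClaim asking for four devices from one device class, using the resource.k8s.io/v1beta1 types that went beta in Kubernetes 1.32. The class name gpu.example.com, the claim name, and the device count are hypothetical placeholders, not values from the talk; a real multi-host GB200 or TPU-slice deployment would use the DeviceClass and counts published by its resource driver.

```go
// Minimal sketch: building a DRA ResourceClaim that requests four devices
// from a hypothetical "gpu.example.com" DeviceClass, using the v1beta1 API
// that reached beta in Kubernetes 1.32. Field names follow k8s.io/api/resource/v1beta1.
package main

import (
	"encoding/json"
	"fmt"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	claim := &resourcev1beta1.ResourceClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "multi-accelerator-claim", // hypothetical name
			Namespace: "default",
		},
		Spec: resourcev1beta1.ResourceClaimSpec{
			Devices: resourcev1beta1.DeviceClaim{
				Requests: []resourcev1beta1.DeviceRequest{
					{
						// A single named request; the scheduler must satisfy the
						// whole allocation before the workload is placed.
						Name:            "accelerators",
						DeviceClassName: "gpu.example.com", // hypothetical DeviceClass
						AllocationMode:  resourcev1beta1.DeviceAllocationModeExactCount,
						Count:           4,
					},
				},
			},
		},
	}

	// Print the claim so the request shape is easy to inspect.
	out, _ := json.MarshalIndent(claim, "", "  ")
	fmt.Println(string(out))
}
```

In practice a Pod or Job would usually reference a ResourceClaimTemplate rather than a hand-built claim, so each replica gets its own allocation; the sketch above is only meant to show the shape of a DRA device request.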

Syllabus

More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling With Dynamic Resource Allocation - John Belamaric & Morten Torkildsen

Taught by

CNCF [Cloud Native Computing Foundation]

