
Accelerate Your AI/ML Workloads With Topology-Aware Scheduling in Kueue
CNCF [Cloud Native Computing Foundation] via YouTube
Overview

This conference talk explores how to improve AI/ML workload performance with Topology-Aware Scheduling (TAS) in Kueue. AI training and inference workloads exchange massive amounts of data between Pods, so network throughput often becomes the bottleneck, a problem that has grown more pressing in the era of Large Language Models. Kueue, a Job-level scheduler, reads cluster topology information from a proposed node labeling convention and uses it to optimize Pod placement. The presenters, Michał Woźniak from Google and Yuki Iwai from CyberAgent, explain the key concepts behind TAS, compare it with alternative approaches, and demonstrate how TAS significantly improves execution time for AI workloads by ordering Pods by index, which enhances the performance of AI frameworks that use NCCL.
Syllabus
Accelerate Your AI/ML Workloads With Topology-Aware Scheduling in Kueue - Michał Woźniak & Yuki Iwai
Taught by
CNCF [Cloud Native Computing Foundation]