Slinky: Slurm in Kubernetes - Performant AI and HPC Workload Management in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk introduces Slinky, a fully open-source toolset that integrates Slurm with Kubernetes for AI and HPC workload management. Discover how Kubernetes, originally designed for microservices, is adapting to support AI training and multi-node inference workloads. Learn about Slurm, the most widely used HPC workload manager, which has over two decades of development behind it and excels at gang scheduling, fair usage, job planning, and batch scheduling. Explore the architecture of Slinky, which includes a Slurm operator, a client library, and a metrics exporter, and understand the challenges of achieving fine-grained control in Kubernetes for AI and HPC workloads. Presented by Tim Wickberg of SchedMD at a CNCF event, this 39-minute talk is aimed at those looking to improve performance and efficiency in AI clusters.
Syllabus
Slinky: Slurm in Kubernetes, Performant AI and HPC Workload Management in Kubernetes - Tim Wickberg
Taught by
CNCF [Cloud Native Computing Foundation]