The GPUs on the Bus Go 'Round and 'Round - Large-Scale GPU Deployment Management
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores the challenges NVIDIA faces in deploying GPUs at large scale in its GPU Cloud environment, and the solutions it has developed. Gain behind-the-scenes insight into how NVIDIA handles day-2 maintenance for more than 60,000 GPUs in production. Discover the hard truths and surprising revelations the team has encountered while maximizing device uptime through years of experimentation and fine-tuning. As Kubernetes improves its support for accelerator allocation with Dynamic Resource Allocation (DRA), learn critical device management techniques, including how to identify device failures, maintain GPU health, and remediate issues transparently without disrupting running workloads. Essential viewing for anyone managing GPU infrastructure at scale in cloud native environments.
Syllabus
The GPUs on the Bus Go 'Round and 'Round - Natalie Bandel & Ryan Hallisey, NVIDIA
Taught by
CNCF [Cloud Native Computing Foundation]