The GPUs on the Bus Go 'Round and 'Round - Large-Scale GPU Deployment Management

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk examines how NVIDIA deploys and manages GPUs at scale in its GPU Cloud environment. It offers a behind-the-scenes look at day-2 maintenance for more than 60,000 GPUs in production, including the hard truths and surprises the team encountered over years of experimentation and fine-tuning to maximize device uptime. As Kubernetes improves accelerator allocation through Dynamic Resource Allocation (DRA), the talk covers key device management techniques: identifying device failures, maintaining GPU health, and remediating issues transparently without disrupting running workloads. It is aimed at anyone managing GPU infrastructure at scale in cloud native environments.

Syllabus

The GPUs on the Bus Go 'Round and 'Round - Natalie Bandel & Ryan Hallisey, NVIDIA

Taught by

CNCF [Cloud Native Computing Foundation]
