CNCF [Cloud Native Computing Foundation]

Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk explores how Kubernetes container checkpointing has been extended to support transparent checkpoint/restore of GPU-based AI/ML workloads. The approach avoids the overhead of traditional methods such as API interception, logging, and re-execution. The speakers, Radostin Stoyanov (University of Oxford), Adrian Reber (Red Hat), and Viktória Spišáková (Masaryk University), demonstrate how the new mechanisms use CRIU plugins to provide fault tolerance for long-running AI/ML workloads in cloud-native environments. These techniques can significantly improve resource utilization in large-scale GPU clusters while reducing the steady-state overhead of conventional transparent GPU checkpointing.

Syllabus

Efficient Transparent Checkpointing of AI/ML Workloads in Kub... R. Stoyanov, A. Reber, V. Spišáková

Taught by

CNCF [Cloud Native Computing Foundation]
