Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores how Kubernetes container checkpointing has been extended to support transparent checkpoint/restore of GPU-based AI/ML workloads. The approach avoids the overhead of traditional methods such as API interception, logging, and re-execution. The speakers, Radostin Stoyanov from the University of Oxford, Adrian Reber from Red Hat, and Viktória Spišáková from Masaryk University, demonstrate how the new mechanisms use CRIU plugins to provide fault tolerance for long-running AI/ML workloads in cloud-native environments. The talk also covers techniques that can improve resource utilization in large-scale GPU clusters while reducing the steady-state overhead associated with conventional transparent GPU checkpointing approaches.
Syllabus
Efficient Transparent Checkpointing of AI/ML Workloads in Kub... R. Stoyanov, A. Reber, V. Spišáková
Taught by
CNCF [Cloud Native Computing Foundation]