Transparent, Infra-Level Checkpoint and Restore for Resilient AI/ML Workloads

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk explores how infrastructure-level checkpointing can enhance resilience for AI/ML workloads beyond traditional application framework checkpointing. Learn how Checkpoint/Restore in Userspace (CRIU) can efficiently address scheduling and resilience issues in an application-agnostic way as production workloads scale. The presenters demonstrate a Kubernetes operator that leverages CRIU, CRI-O, and cuda-checkpoint to checkpoint and hot-restart distributed ML workloads. Discover synchronization mechanisms for JobSets running stateful workloads during node maintenance scenarios. The presentation covers use cases and limitations of platform-layer checkpoint/restore for stateful ML applications, provides a technical overview of the implementation, and discusses the roadmap for productionizing this emerging technology.
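To make the idea of synchronized checkpointing concrete, here is a minimal, purely illustrative sketch in Python. It is not the presenters' operator and does not call CRIU, CRI-O, or cuda-checkpoint. It only simulates the general pattern of pausing all workers of a distributed job at a common point, snapshotting each one, and resuming together. The function name `coordinated_checkpoint` and the dictionary-based worker state are assumptions made for this example.

```python
import threading


def coordinated_checkpoint(worker_states):
    """Snapshot every worker only after all workers have paused together.

    worker_states is a list of dicts, one per simulated worker (hypothetical
    stand-in for the process state a real checkpoint tool would dump).
    """
    n = len(worker_states)
    if n == 0:
        return []

    barrier = threading.Barrier(n)
    snapshots = [None] * n

    def worker(rank):
        # Phase 1: quiesce. No worker proceeds until all have arrived,
        # so every snapshot reflects the same logical point in the job.
        barrier.wait()
        # Phase 2: snapshot. Copying the dict stands in for an
        # infrastructure-level dump of the worker's state.
        snapshots[rank] = dict(worker_states[rank])
        # Phase 3: resume only after every worker has been snapshotted.
        barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snapshots


if __name__ == "__main__":
    states = [{"rank": r, "step": 100} for r in range(4)]
    print(coordinated_checkpoint(states))
```

The two barrier waits are the point of the sketch: the first keeps any worker from being captured while its peers are still advancing, and the second keeps any worker from resuming before the whole group has been captured.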

Syllabus

Transparent, Infra-Level Checkpoint and Restore for Resil... Ganeshkumar Ashokavardhanan & Bernie Wu

Taught by

CNCF [Cloud Native Computing Foundation]
