
This conference talk explores how infrastructure-level checkpointing can enhance resilience for AI/ML workloads beyond traditional application framework checkpointing. Learn how Checkpoint/Restore in Userspace (CRIU) can efficiently address scheduling and resilience issues in an application-agnostic way as production workloads scale. The presenters demonstrate a Kubernetes operator that leverages CRIU, CRI-O, and cuda-checkpoint to checkpoint and hot-restart distributed ML workloads. Discover synchronization mechanisms for JobSets running stateful workloads during node maintenance scenarios. The presentation covers use cases and limitations of platform-layer checkpoint/restore for stateful ML applications, provides a technical overview of the implementation, and discusses the roadmap for productionizing this emerging technology.