Overview
This course deepens the understanding of metastable failures in distributed systems by studying their prevalence in real-world incidents. It presents an in-depth study of metastable failures reported by various organizations, categorizes their triggers and amplification mechanisms, and develops example applications that reproduce different types of failures. Topics include the concept of metastable failures, their prevalence in severe outages, and an extended model that reflects real-world scenarios. The material is taught through a survey of incident reports, a categorization of triggers and mechanisms, and worked example applications. The intended audience is professionals and researchers in distributed systems and reliability engineering.
Syllabus
Intro
What are Metastable Failures?
Metastable Failures are Prevalent
Metastability in the Wild - Survey
Defining Metastability - System States
Survey Summary
Metastability Taxonomy - Trigger
Metastability Taxonomy - Sustaining ef
Four Metastability Scenarios - Load-spike trigger
Degrees of Vulnerabilities
Lessons
Conclusion
Taught by
USENIX