How to SRE When Everything's Already on Fire

USENIX via YouTube

Overview

This course shows how to apply Site Reliability Engineering (SRE) practices to improve system reliability and resolve incidents effectively. It covers service reliability principles, service level indicators and objectives, error budgets, the Incident Command System, data collection, and lessons learned from real-life incidents. The material is taught through a case study presented by industry experts, who walk through the practical steps taken to improve system reliability. It is aimed at anyone who wants to learn SRE practices and make their own systems more reliable.

Syllabus

Intro
A PHENOMENAL EVENING
ELK @ SQUARESPACE
SERVICE RELIABILITY PRINCIPLES
THE RELIABILITY STACK
SERVICE LEVEL INDICATORS
SERVICE LEVEL OBJECTIVES
ERROR BUDGETS ARE AWESOME
THIS RELIABILITY STUFF ISN'T NEW
THE INCIDENT COMMAND SYSTEM
PROBLEMS THE ICS ADDRESSES
OPERATIONS LEAD
INCIDENT COMMANDER 1
TIMELINE OF A 37-HOUR INCIDENT
SEE THE FOREST FOR THE TREES
THE UNSHARDENING
KEY COMPONENTS
DATA COLLECTION
LESSONS LEARNED
REPAIR ITEMS
PROGRESS IS INCREMENTAL
ALERT ON WHAT MATTERS: Put your users first

Taught by

USENIX
