In many IT organizations, incentives are not aligned between developers, who strive for agility, and operators, who focus on stability. Site reliability engineering, or SRE, is how Google aligns incentives between development and operations and how it provides mission-critical production support. Adopting SRE cultural and technical practices can help improve collaboration between the business and IT. This course introduces key practices of Google SRE and the important role IT and business leaders play in the success of SRE organizational adoption.
This course is intended for IT leaders and business leaders who are interested in embracing SRE philosophy. Roles include, but are not limited to, CTO, IT director/manager, and engineering VP/director/manager.
Other product and IT roles, such as operations managers or engineers, software engineers, service managers, or product managers, may also find this content useful as an introduction to SRE.
Module 1: Welcome to Developing a Google SRE Culture
This module provides a course overview. You will learn why this course is beneficial for IT and business leaders who want to embrace SRE culture, and what topics each module covers.
Module 2: DevOps, SRE, and Why They Exist
This module explains the components of DevOps philosophy, why Site Reliability Engineering came to exist, and who in an organization can and should practice SRE.
Module 3: SLOs with Consequences
This module covers the value of SRE to an organization, as well as the technical and cultural fundamentals related to reducing organizational silos and accepting failure as normal. Topics include the SRE technical practices of blameless postmortems, service-level objectives (SLOs), and error budgets, and the SRE cultural practices of blamelessness, psychological safety, unified vision, collaboration and communication, and knowledge sharing.
Module 4: Make Tomorrow Better than Today
Continuous, gradual testing and automation are central to SRE culture. This module covers the SRE technical concepts of continuous integration, continuous delivery, and canarying as they relate to the DevOps pillar of implementing gradual change. You'll learn about the concepts of toil and automation, and the idea of automating this year’s job away. You'll also learn about the SRE cultural practices of design thinking and prototyping, and how you can support your teams through change.
Module 5: Regulate Workload
In this module, you'll learn about the SRE practice of measuring everything, specifically reliability and toil, and about the concept of monitoring. We’ll also cover the cultural fundamentals of goal-setting, transparency, and data-driven decision making.
Module 6: Apply SRE in Your Organization
In this module, we will talk about ways you can assess and understand your organization’s maturity and readiness for adopting SRE principles, practices, and culture. We’ll also discuss the types of skills to look for when hiring new SREs and how to upskill your current workforce. Lastly, we’ll give you advice on how to start thinking about setting up an SRE organization, and on the additional support our Google Cloud Professional Services teams can provide your organization as you continue on your journey to SRE.
Module 7: Final Assessment
Test your overall knowledge of Google SRE technical and cultural practices with this summative quiz. You must score at least 80% to pass. This assessment is required to receive your course completion certificate.