Microsoft

Develop a Site Reliability Engineering (SRE) strategy

Microsoft via Microsoft Learn

Overview

  • Module 1: Learn about SRE, an engineering discipline that helps you sustainably achieve the appropriate level of reliability in your systems, services, and products.

    In this module, you will:

    • Gain a basic understanding of Site Reliability Engineering (SRE)
    • Learn how to get started with this valuable operations practice
  • Module 2: Respond to incidents and activities in your infrastructure through alerting capabilities in Azure Monitor.

    In this module, you'll:

    • Configure alerts on events in your Azure resources based on metrics, log events, and activity log events.
    • Learn how to use action groups in response to an alert, and how to use alert processing rules to override action groups when necessary.
  • Module 3: Learn how to capture trace output from your Azure web apps. View a live log stream and download log files for offline analysis.

    In this module, you will:

    • Enable application logging on an Azure Web App
    • View live application logging activity with the log streaming service
    • Retrieve application log files from an application with Kudu or the Azure CLI
  • Module 4: Learn how to manage site reliability.

    After completing this module, you'll be able to:

    • Describe how site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production.
    • Describe how Application Insights analyzes the performance of your web application and can warn you about potential problems.
    • List the processes that you can implement to monitor site reliability.
    • Build a "just culture" that balances safety and accountability.
  • Module 5: Cloud Admin course from Dr. Majd Sakr at Carnegie Mellon University. Discover what cloud elasticity means and different ways to scale your cloud resources.

    In this module, you will:

    • Describe common load patterns and how they drive the need to scale
    • Enumerate the strategies and considerations in scaling cloud applications
    • Discuss the advantages of auto-scaling and the mechanisms used to achieve it
    • Describe the importance of load balancing in cloud applications and enumerate various methods to achieve it
    • List the primary benefits of serverless computing and explain the concept of serverless functions

    This content is provided in partnership with Dr. Majd Sakr and Carnegie Mellon University.

  • Module 6: Carnegie Mellon University's Cloud Developer course. Learn how developers write programs that run on the cloud, including how to deploy applications, build in fault tolerance, balance load, scale resources, and deal with latency.

    In this module, you will:

    • Evaluate different considerations when programming applications that run on clouds
    • Evaluate different considerations when deploying applications on clouds
    • Compare and contrast proactive and reactive measures for fault tolerance in cloud applications
    • Describe the importance of load balancing in cloud applications and enumerate various methods to achieve it
    • Enumerate the strategies and considerations in scaling cloud applications
    • Motivate the case for minimizing tail latency and discuss the various strategies to reduce tail latency
    • Describe the strategies to optimize total operational cost of using cloud services

    This content is provided in partnership with Dr. Majd Sakr and Carnegie Mellon University.

  • Module 7: Learn how to troubleshoot inbound network connectivity for Azure Load Balancer.

    In this module, you will:

    • Identify common Azure Load Balancer inbound connectivity issues.
    • Identify steps to resolve issues when virtual machines aren't responding to health probes.
  • Module 8: Learn how to monitor the health of your Azure VMs by using Azure Metrics Explorer and metric alerts.

    In this module, you will:

    • Identify metrics and diagnostic data that you can collect for virtual machines
    • Configure monitoring for a virtual machine
    • Use monitoring data to diagnose problems

Syllabus

  • Module 1: Introduction to Site Reliability Engineering (SRE)
    • Introduction to Site Reliability Engineering
    • What is SRE and why does it matter?
    • SRE in context
    • Key SRE principles and practices: virtuous cycles
    • Key SRE principles and practices: The human side of SRE
    • Getting started with SRE
    • Summary
  • Module 2: Improve incident response with alerting on Azure
    • Introduction
    • Explore the different alert types that Azure Monitor supports
    • Use metric alerts for alerts about performance issues in your Azure environment
    • Exercise - Use metric alerts to alert on performance issues in your Azure environment
    • Use log alerts to alert on events in your application
    • Use activity log alerts to alert on events within your Azure infrastructure
    • Use action groups and alert processing rules to send notifications when an alert is fired
    • Exercise - Use an activity log alert and an action group to notify users about events in your Azure infrastructure
    • Summary
  • Module 3: Capture Web Application Logs with App Service Diagnostics Logging
    • Introduction
    • Enable and configure App Service application logging
    • Exercise - Enable and configure App Service application logging using the Azure portal
    • View live application logging with the log streaming service
    • Exercise - View live application logging with the log streaming service using Azure CLI
    • Retrieve application log files
    • Exercise - Retrieve Application Log Files using Azure CLI and Kudu
    • Summary
  • Module 4: Manage site reliability
    • Introduction
    • What is reliability engineering?
    • What is Application Insights?
    • Perform ongoing tuning to reduce meaningless alerts
    • Analyze alerts to establish a baseline
    • Blameless postmortems
    • Knowledge check
    • Summary
  • Module 5: Scale your cloud resources with elasticity
    • Introduction
    • Compute load patterns
    • Scaling compute resources
    • Automated scaling on the cloud
    • Load balancing
    • Serverless computing
    • Summary
  • Module 6: Build applications on the cloud
    • Introduction
    • Programming the cloud
    • Deploy applications on the cloud
    • Build fault-tolerant cloud services
    • Load balancing
    • Scale resources
    • How to deal with tail latency
    • Economics for cloud applications
    • Summary
  • Module 7: Troubleshoot inbound network connectivity for Azure Load Balancer
    • Introduction
    • Troubleshoot Azure Load Balancer
    • Diagnose issues by reviewing configurations and metrics
    • Exercise - Set up your environment
    • Exercise - Identify and resolve inbound network connectivity
    • Summary
  • Module 8: Monitor the health of your Azure virtual machine by using Azure Metrics Explorer and metric alerts
    • Introduction
    • Monitor the health of the virtual machine
    • Exercise - Set up a VM with boot diagnostics
    • View VM metrics
    • Configure the Azure Diagnostics extension
    • Exercise - Configure the Azure Diagnostics extension
    • Diagnostic data case studies
    • Exercise - Use diagnostic data
    • Summary
