Live online training

Systems Design for Site Reliability Engineers

How To Build A Reliable System in Three Hours

Salim Virji

Distributed systems form the foundation for most of our modern computing infrastructure, as well as for much of our application development, whether on-premises or mobile. Software built on distributed systems comes with distinct failure modes, and to build reliable systems you must understand how to assess these failure modes and design for them.

In this hands-on course, you’ll learn the fundamentals of systems design and evaluation so that you have the skills necessary to design, improve, and scale your own system or application using SRE best practices developed at Google.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to design a software system to meet a Service Level Objective (SLO)
  • How to incrementally improve a system
  • How to identify single points of failure (SPOFs) in a large software system
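As an illustration of the first point, an availability SLO can be translated into a downtime budget. The sketch below is not course material: the 99.9% target, the 30-day window, and the function name `allowed_downtime_minutes` are all assumptions chosen for the example.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability SLO.

    slo is a fraction such as 0.999; window_days is the measurement window.
    Both values are hypothetical inputs, not figures from the course.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


budget = allowed_downtime_minutes(0.999)
print(f"{budget:.1f} minutes")  # prints "43.2 minutes"
```

A tighter SLO shrinks the budget quickly: under the same assumptions, 99.99% leaves only about 4.3 minutes per 30 days.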

And you’ll be able to:

  • Make required resource estimates to create a bill of materials
  • Incrementally scale a system

This training course is for you because...

  • You’re a site reliability engineer (SRE), or work in a related discipline: DevOps, Systems Engineering, System Administration
  • You manage SREs
  • You want to develop an understanding of practical distributed systems


Prerequisites:

  • Participants should be familiar with “boxes and arrows” diagrams
  • Participants need to be comfortable with orders-of-magnitude math, such as “how many copies of a 1 MB file can a 1 TB drive hold?”
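The file-and-drive question above can be worked through in a few lines. This is a minimal sketch that assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and ignores filesystem overhead; the helper name `copies_that_fit` is hypothetical.

```python
MB = 10**6   # assumed decimal megabyte, in bytes
TB = 10**12  # assumed decimal terabyte, in bytes


def copies_that_fit(file_bytes: int, drive_bytes: int) -> int:
    """Return how many whole copies of a file fit on a drive."""
    return drive_bytes // file_bytes


copies = copies_that_fit(1 * MB, 1 * TB)
print(f"{copies:,} copies")  # prints "1,000,000 copies"
```

The point of the exercise is the exponent arithmetic: 10^12 / 10^6 = 10^6, or about a million copies.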

Recommended preparation:

Prepare for this session by reading selected chapters from the SRE Book and SRE Workbook:

  • Introducing Non-abstract Large System Design
  • Service Level Objectives

Recommended follow-up:

About your instructor

  • Salim Virji is a site reliability engineer at Google, where he has built distributed systems that enable planet-scale storage and datacenter-size compute loads.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Identify the Problem (50 minutes)

  • Presentation
  • Problem Statement: We are building an image-serving application
  • Terminology and Concepts
  • Service Level Objectives
  • Exercise: Design a distributed system
  • Q&A
  • Break (10 minutes)
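A first-pass resource estimate for an image-serving application might look like the sketch below. Every number in it (request rate, image size, per-server throughput, disk size) is a hypothetical input chosen for illustration, and the function name `estimate_bill_of_materials` is an assumption, not part of the course material.

```python
import math

# Hypothetical workload and hardware figures, for illustration only.
PEAK_REQUESTS_PER_SECOND = 10_000
AVERAGE_IMAGE_BYTES = 200_000        # 200 kB per image
STORED_IMAGES = 100_000_000
SERVER_REQUESTS_PER_SECOND = 500     # sustained rate of one server
DISK_BYTES = 4 * 10**12              # 4 TB per disk


def estimate_bill_of_materials() -> dict:
    """Return rough server, network, and disk counts for the workload above."""
    servers = math.ceil(PEAK_REQUESTS_PER_SECOND / SERVER_REQUESTS_PER_SECOND)
    egress_bits_per_second = PEAK_REQUESTS_PER_SECOND * AVERAGE_IMAGE_BYTES * 8
    storage_bytes = STORED_IMAGES * AVERAGE_IMAGE_BYTES
    disks = math.ceil(storage_bytes / DISK_BYTES)
    return {
        "servers": servers,
        "egress_gbit_per_second": egress_bits_per_second / 10**9,
        "storage_terabytes": storage_bytes / 10**12,
        "disks": disks,
    }


bill = estimate_bill_of_materials()
for item, quantity in bill.items():
    print(f"{item}: {quantity}")
```

This estimate has no redundancy or headroom yet, so it describes the smallest system that could carry the load, not a reliable one.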

The solution has limitations! Let’s improve it (50 minutes)

  • Presentation: Failure domains and defense in depth. This presentation describes how to quantitatively assess the failure domains in a distributed system, and how to provide defense in depth so that failures are isolated.
  • Exercise: Identify failure domains, and make the design tolerant to failure. Make a highly available image-serving system.
  • Q&A
  • Break (10 minutes)
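One simple way to assess failure domains quantitatively is to ask how likely it is that every replica is down at once. The sketch below assumes that replicas sit in fully independent failure domains and that any single replica can serve all traffic. Real systems rarely meet both assumptions. The 99% per-replica availability and the function name `combined_availability` are hypothetical.

```python
def combined_availability(replica_availability: float, replicas: int) -> float:
    """Return the availability of a service that is up if any replica is up.

    Assumes replica failures are statistically independent, so the service
    is down only when all replicas are down at the same time.
    """
    all_down = (1.0 - replica_availability) ** replicas
    return 1.0 - all_down


for n in (1, 2, 3):
    print(f"{n} replica(s): {combined_availability(0.99, n):.6f}")
```

Under these assumptions, two replicas at 99% each give roughly 99.99%. A shared dependency such as a common power feed or load balancer breaks the independence assumption and becomes the single point of failure.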

Commonly encountered limitations and how to design for them (30 minutes)

  • Presentation: Capacity limitations, bottlenecks, and compromises. We will describe the boundaries of a system, and how to decide when further scale is important.
  • Discussion: Designing for “10x” scale: why is this a good rule of thumb?
  • Q&A