Live online training

Chaos engineering: Planning and running your first game day

Russ Miles

Modern systems need to be reliable, resilient, robust…and continuously changing. Under these conditions, failure is a normal state for the infrastructure, platforms, and applications that make up a production system. Chaos engineering is a disciplined approach to turning that failure to your advantage, enabling you to inject controlled, preemptive failure into your systems so that you can surface and overcome weaknesses before your customers encounter them.

Join expert Russ Miles to learn how to adopt and apply the mindset and practices of a successful chaos engineer. Through lectures, practical examples, and hands-on exercises, you'll discover how to turn system failures into opportunities for learning as you plan and execute your first game day. A game day is a collaborative exercise in which you deliberately place your systems (people, practices, processes, and technology) under stress to explore and overcome weaknesses and improve resiliency.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • Why you can't prove system reliability in advance
  • The purpose and limitations of chaos engineering
  • How to explain the value of chaos engineering to your company
  • The purpose of game day exercises

And you’ll be able to:

  • Plan and execute a successful game day to explore system weaknesses at the infrastructure, platform, and application levels
  • Enable appropriate system observability to support chaos engineering
  • Communicate and share the findings from a game day to enable prioritized system improvement

This training course is for you because...

  • You're a software developer who needs to start taking responsibility for your code in production.
  • You're a site reliability engineer (SRE) with a little experience managing production, and you want to be proactive about finding system weaknesses before your customers do.
  • You're a system administrator who is responsible for the availability of production, and you need a proactive technique for surfacing system weaknesses before your customers experience them.
  • You're a product owner who is responsible for delivering a business-critical product or service, and you want to learn how to gain trust and confidence in your system’s reliability.
  • You're a DevSecOps engineer who needs a technique and tools to support discovering, capturing, sharing, and collaborating on security weaknesses.

Prerequisites

  • A general understanding of Java and of Kubernetes as a platform

Materials or downloads needed in advance:

  • Visit the course website and follow the precourse instructions
  • Download the game day template (link TBD)
  • Sign up for the course Slack channel #chaosengoreilly

About your instructor

  • Russ Miles is a member of the Chaos Collective, an expert group, founded by Casey Rosenthal, that runs one-day workshops for companies looking to learn about chaos engineering and establish their own in-house chaos engineering capability. He also founded and continues to build a community around the free and open source Chaos Toolkit and Hub projects. For the past three years, Russ has been a chaos engineer at both startups and enterprises. He is an internationally respected teacher, consultant, speaker, and the author of AspectJ Cookbook, Learning UML 2.0, Head First Software Development, and Antifragile Software: Building Adaptable Systems from Microservices.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

The chaos engineering mindset (20 minutes)

  • Lecture: Introduction to the sociotechnical system; the challenges of production; how the Cynefin model of systems proves we can't be sure of system reliability and resilience; trust and confidence; the chaos engineering mindset; how to distill outages into chaos engineering
  • Hands-on exercise: Use the outage template to explore and document findings from a production outage for chaos engineering

Attacks on reliability and resiliency (20 minutes)

  • Lecture: Why chaos engineering is a proactive approach to building trust and confidence in system reliability and resilience; continuous limited-scope disaster recovery; commoditized disaster recovery; the various levels of attack on reliability and resilience
  • Hands-on exercises: Brainstorm and share how you might “prove” your systems are resilient and reliable; explore various attacks on system reliability and resilience

Defining resilience and reliability (10 minutes)

  • Lecture: Reliability and resilience; the “premortem” and how it relates to incident postmortems

Break (10 minutes)

Introduction to game days (50 minutes)

  • Lecture: Game day basic concepts; deciding who attends your game day; to surprise, or not to surprise?; building a hypothesis; understanding and introducing observability; defining your method; defining remediation actions
  • Hands-on exercises: Construct a plan for assessing and improving the observability of your system; design a game day using the game day template

Break (10 minutes)

Learning from your game day (50 minutes)

  • Lecture: How to avoid resistance to game days; why collaboration and empathy are crucial characteristics of a chaos engineer; the ethics of chaos engineering and why it can mean the difference between success and failure in your organization; the 24-hour rule on ideas for solutions to discovered weaknesses; the limitations of game days and how they relate to automated chaos experiments; the power of measuring resiliency through mean time to detect, mean time to diagnose, mean time to recover, and mean time to all clear; why you should be careful not to rely too much on these statistics
  • Hands-on exercises: Take the findings from a real-world game day and convert them into a plan for system improvement, along with appropriate metrics; work through the findings from multiple game days to build a roadmap for system improvement; identify a list of candidates for further exploration through continuous chaos and automation
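
As a rough illustration of the four mean-time measures named in the lecture above, the following Python sketch averages the elapsed time from the start of each injected failure to each milestone. The `Incident` record, its field names, and the sample timestamps are hypothetical and are not taken from the course materials. Measuring every interval from the start of the failure is also an assumption; some teams measure diagnosis from the moment of detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical record of one failure injected during a game day."""
    started: datetime
    detected: datetime
    diagnosed: datetime
    recovered: datetime
    all_clear: datetime


def mean_minutes(incidents, milestone):
    """Average minutes from failure start to the named milestone.

    Assumption: every interval is measured from `started`.
    """
    return mean(
        (getattr(incident, milestone) - incident.started).total_seconds() / 60
        for incident in incidents
    )


def minutes(n):
    return timedelta(minutes=n)


# Made-up sample data for two game day experiments.
first = datetime(2024, 1, 1, 10, 0)
second = datetime(2024, 1, 8, 14, 0)
incidents = [
    Incident(first, first + minutes(4), first + minutes(10),
             first + minutes(25), first + minutes(40)),
    Incident(second, second + minutes(6), second + minutes(20),
             second + minutes(35), second + minutes(50)),
]

milestones = ("detected", "diagnosed", "recovered", "all_clear")
summary = {name: mean_minutes(incidents, name) for name in milestones}
print(summary)
# {'detected': 5.0, 'diagnosed': 15.0, 'recovered': 30.0, 'all_clear': 45.0}
```

With only a handful of game days, a single slow incident can shift these averages substantially, which is one reason to treat them as a prompt for discussion and not as a target.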

Wrap-up and Q&A (10 minutes)