Live online training

Chaos engineering: Planning, designing, and running automated chaos experiments

Russ Miles

Chaos engineering is all about exploring and overcoming weaknesses in your system. Although many think that chaos engineering must be applied to production, you can use chaos engineering methods to find weaknesses in the entire sociotechnical system of development and delivery. For example, "game days" (mock disaster recovery drills) are most commonly executed against a staging environment.

As your systems change and evolve, chaos engineering naturally becomes a continuous practice, just like continuous integration and continuous delivery: continuous chaos, if you will. The problem is that continuous, manual game days are too expensive and time-consuming. Enter automated chaos experiments and tests.

Using practical examples and hands-on exercises, Russ Miles takes you beyond manual game days to demonstrate how to construct automated chaos experiments to continuously and collaboratively explore, surface, and overcome weaknesses in your infrastructure, platforms, and applications. By the time you're through, you'll be able to explain the value of chaos engineering to your company and get started with continuous, automated chaos tests to keep an eye on current weaknesses in your system and potentially surface new weaknesses in the future.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • Why you can't prove system reliability in advance
  • The purpose and limitations of chaos engineering
  • How to explain the value of chaos engineering to your company
  • How to carefully construct chaos experiments for production so that they avoid affecting the customer experience

And you’ll be able to:

  • Design, implement, execute, and share carefully automated chaos engineering experiments to surface technical system weaknesses at the infrastructure, platform, and application levels
  • Communicate and share the findings from automated chaos experiments to enable prioritized system improvement
  • Use chaos experiments as continuous, automated chaos tests to ensure that fixed weaknesses do not return and to potentially surface new weaknesses in the future

This training course is for you because...

  • You're a software developer who needs to start taking responsibility for your code in production.
  • You're a site reliability engineer (SRE) with a little experience managing production, and you want to be proactive about finding system weaknesses before your customers do.
  • You're a system administrator who is responsible for the availability of production, and you need a proactive technique for surfacing system weaknesses before your customers experience them.
  • You're a product owner who is responsible for delivering a business-critical product or service, and you want to learn how to gain trust and confidence in your system’s reliability.
  • You're a DevSecOps engineer who needs a technique and tools to support discovering, capturing, sharing, and collaborating on security weaknesses.

Prerequisites

  • A general understanding of Java and of Kubernetes as a platform

Materials or downloads needed in advance:

  • A GitHub account
  • A machine with the Chaos Toolkit installed
  • Visit the instructor's course website and follow the precourse instructions

About your instructor

  • Russ Miles is a member of the Chaos Collective, an expert group, founded by Casey Rosenthal, that runs one-day workshops for companies looking to learn about chaos engineering and establish their own in-house chaos engineering capability. He also founded and continues to build a community around the free and open source Chaos Toolkit and Hub projects. For the past three years, Russ has been a chaos engineer at both startups and enterprises. He is an internationally respected teacher, consultant, speaker, and the author of AspectJ Cookbook, Learning UML 2.0, Head First Software Development, and Antifragile Software: Building Adaptable Systems from Microservices.

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Readying your system for chaos (60 minutes)

  • Lecture: Why there is no totally safe chaos in production; the blast radius; the need for observability; the Law of Requisite Variety; a comparison of observability and traditional management and monitoring; measuring "system normal" from the outside in; questions that can be answered with good system observability; the relationship between blue-green releases and “system normal” metrics; how observability helps define what is “system normal”—the first step in chaos engineering; the steady state hypothesis
  • Hands-on exercise: Analyze a software development system for its observability deficiencies; create an observability improvement plan by defining the metrics that indicate the system is “normal,” assessing the maturity of the system’s observability, and describing action points for improvement

Break (10 minutes)

Your first automated chaos experiment (50 minutes)

  • Lecture: The structure of a chaos experiment; a real-world example experiment to illustrate this structure; how to start with just an experimental method; probes and actions; how to distill ideas into a steady state hypothesis; what rollbacks really are; the workflow process of authoring and executing a chaos experiment
  • Hands-on exercise: Define an automated chaos experiment in JSON; define a steady state hypothesis in JSON; define an experimental method in JSON
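
As a rough illustration of the structure named in this session, the sketch below builds a small experiment with a steady state hypothesis and an experimental method, then prints it as JSON in the Chaos Toolkit's declarative layout. It is not taken from the course materials: the service name, health URL, and kubectl arguments are hypothetical placeholders, and the rollbacks section is deliberately left empty.

```python
import json


def build_experiment() -> dict:
    """Return a minimal experiment in the Chaos Toolkit declarative layout."""
    # Hypothetical service details, invented purely for illustration.
    health_url = "http://my-service.example.com/health"
    pod_selector = "app=my-service"

    # Steady state hypothesis: what "system normal" looks like from the
    # outside. It is checked before and after the method runs.
    hypothesis = {
        "title": "The service responds to health checks",
        "probes": [
            {
                "type": "probe",
                "name": "service-must-respond",
                "tolerance": 200,  # expected HTTP status code
                "provider": {"type": "http", "url": health_url, "timeout": 3},
            }
        ],
    }

    # Experimental method: the turbulent condition to inject. This action
    # deletes every pod matching the (assumed) label selector.
    method = [
        {
            "type": "action",
            "name": "delete-service-pods",
            "provider": {
                "type": "process",
                "path": "kubectl",
                "arguments": f"delete pod -l {pod_selector}",
            },
        }
    ]

    return {
        "version": "1.0.0",
        "title": "Can the service survive losing its pods?",
        "description": "Illustrative sketch only; all names are assumptions.",
        "steady-state-hypothesis": hypothesis,
        "method": method,
        "rollbacks": [],
    }


if __name__ == "__main__":
    print(json.dumps(build_experiment(), indent=2))
```

Starting with only a method and adding the hypothesis later, as the lecture outline suggests, would amount to leaving out the `steady-state-hypothesis` key at first.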

Break (10 minutes)

Executing chaos (50 minutes)

  • Lecture: How to run your chaos experiment; how to draw upon configuration variables and secrets for your experiments; the output of a chaos experiment; how to author a PDF or HTML report from your chaos experiment’s output; how to effectively present and follow through on your chaos experimental findings
  • Hands-on exercise: Execute an experiment against the case study system’s Kubernetes cluster; evaluate and define appropriate rollback instructions for the experiment and add them to the experiment definition’s JSON; produce a complete report for the experiment’s execution that can be shared with other stakeholders
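
To make the rollback step in this exercise more concrete, here is a minimal sketch that appends one rollback action to an experiment definition and prints the resulting JSON. The deployment name, replica count, and kubectl arguments are assumptions chosen for illustration; the case study system used in the course may call for different remedial actions.

```python
import json


def with_rollback(experiment: dict, deployment: str, replicas: int) -> dict:
    """Return a copy of the experiment with one rollback action appended."""
    rollback = {
        "type": "action",
        "name": "restore-replica-count",
        "provider": {
            "type": "process",
            "path": "kubectl",
            # Hypothetical remedial action: scale the deployment back up.
            "arguments": f"scale deployment/{deployment} --replicas={replicas}",
        },
    }
    updated = dict(experiment)  # shallow copy; the input is left unchanged
    updated["rollbacks"] = list(experiment.get("rollbacks", [])) + [rollback]
    return updated


if __name__ == "__main__":
    # A stripped-down experiment, invented here so the sketch is self-contained.
    base = {
        "version": "1.0.0",
        "title": "Can the service survive losing its pods?",
        "description": "Illustrative sketch only; all names are assumptions.",
        "method": [
            {
                "type": "action",
                "name": "delete-service-pods",
                "provider": {
                    "type": "process",
                    "path": "kubectl",
                    "arguments": "delete pod -l app=my-service",
                },
            }
        ],
        "rollbacks": [],
    }
    print(json.dumps(with_rollback(base, "my-service", 3), indent=2))
```

With the Chaos Toolkit installed, a definition saved as `experiment.json` is typically executed with `chaos run experiment.json`, which records the run in `journal.json`. If the reporting plugin is also installed, `chaos report --export-format=pdf journal.json report.pdf` can turn that journal into a shareable report.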