Live online training

Network Troubleshooting: Basic Theory and Process

Russ White

Troubleshooting is a fundamental skill for all network engineers, from the least to the most experienced. However, there is little material on correct and efficient troubleshooting techniques in a network engineering context, and no (apparent) live training in this area. Some book chapters exist (such as in Computer Networking Problems and Solutions, published in December 2017), as do some presentations at Cisco Live, but the level of coverage for this critical skill is far below what engineers working in the field need to develop solid troubleshooting skills.

This training focuses on the half-split system of troubleshooting, which is widely used in the electronic and civil engineering domains. To create a complete theory of troubleshooting, the course adds three things to the half-split method: tracing the path of the signal, using models to put the system in context, and using a simple troubleshooting “loop” that focuses on asking how, what, and why. Other concepts covered in this course are the difference between permanent and temporary fixes and a review of measuring reliability. The final third of the course works through several practical examples to help you apply the theory covered in the first two sections to the real world.
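As a rough illustration of the half-split idea, the sketch below treats the signal path as an ordered list of test points and repeatedly checks the midpoint between the last known-good and first known-bad point. It is a minimal sketch under stated assumptions, not the course's own material: the function name `half_split`, the `signal_ok_at` callback, and the device names are all hypothetical, and the code assumes a single fault, with the signal good at the start of the path and bad at the end.

```python
from typing import Callable, List, Sequence, Tuple


def half_split(
    path: Sequence[str],
    signal_ok_at: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return the adjacent (last good, first bad) pair of points on the path.

    Assumptions (hypothetical, for illustration only): the signal is good at
    path[0], bad at path[-1], and there is exactly one fault, so every point
    before it tests good and every point after it tests bad.
    """
    if len(path) < 2:
        raise ValueError("the path needs at least two test points")

    good, bad = 0, len(path) - 1
    while bad - good > 1:
        mid = (good + bad) // 2  # test halfway between known-good and known-bad
        if signal_ok_at(path[mid]):
            good = mid  # the fault lies after the midpoint
        else:
            bad = mid  # the fault lies at or before the midpoint
    return path[good], path[bad]


# Hypothetical path from one host to another, with the fault between r2 and r3.
example_path = ["host-a", "sw1", "r1", "r2", "r3", "sw2", "host-b"]
checked: List[str] = []


def probe(point: str) -> bool:
    """Stand-in for a real measurement, such as a ping or a table lookup."""
    checked.append(point)
    return example_path.index(point) <= example_path.index("r2")


result = half_split(example_path, probe)
print(result)   # the pair of points that brackets the fault
print(checked)  # the points that were actually tested
```

On this seven-point path the search tests only two points before isolating the faulty segment, where walking the path hop by hop from one end could take up to five. The saving grows with path length, because each test halves the remaining search space.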

What you'll learn, and how you can apply it

This course focuses on the theory of troubleshooting. By taking this course, you will develop a strong mental model of efficient troubleshooting, helping you improve MTTR, and even MTBM, in real-life deployments. The half-split method, the use of models from forwarding systems to protocol layers, and the general concepts of root cause analysis are all covered.

This training course is for you because...

  • You want to move from ad hoc styles of troubleshooting to more systematic styles
  • You want to have specific, actionable methods to use for troubleshooting network problems and to stage information to improve MTTR
  • You want to understand the relationship between redundancy and resilience better
  • You want to understand the relationship between technical debt, root causes, and problem repair better

Prerequisites

  • A basic understanding of network design and operation (perhaps at the network professional level)
  • A basic understanding of OSPF, IS-IS, BGP, and IP forwarding

About your instructor

  • Russ White began working with computers in the mid-1980s, and with computer networks in 1990. He has experience in designing, deploying, breaking, and troubleshooting large-scale networks, and is a strong communicator from the whiteboard to the boardroom. Across that time, he has co-authored more than forty software patents, participated in the development of several Internet standards, helped develop the CCDE and the CCAr, and worked in Internet governance with the Internet Society. Russ has a background covering a broad spectrum of topics, including radio frequency engineering and graphic design, and is an active student of philosophy and culture.

    Russ is a co-host at the Network Collective, serves on the Routing Area Directorate at the IETF, co-chairs the BABEL working group, serves on the Technical Services Council and as a maintainer of the open source FRRouting project, and serves on the Linux Foundation (Networking) board. His most recent works are Computer Networking Problems and Solutions, The Art of Network Architecture, Navigating Network Complexity, and the Intermediate System to Intermediate System LiveLesson.

    MSIT, Capella University; MACM, Shepherds Theological Seminary; PhD (in progress), Southeastern Baptist Theological Seminary; CCIE #2635, CCDE 2007::1, CCAr

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Foundations

Length: 50 minutes

  • MTTR, MTBM
  • Resiliency in terms of troubleshooting
  • Positive feedback loops
  • Automated processes and fragility
  • The troubleshooting process
  • Avoiding the narrows
  • Using models to dive deeper
  • Using abstraction to counter the combinatorial explosion
  • When abstractions leak
  • What, how, and why models

10 Minute Break

Segment 2: Process

Length: 50 minutes

  • The theory of half split, as seen from search trees
  • Putting it together: a simple troubleshooting loop and the half-split
  • Using manipulability theory to prove it
  • Observations on observations

10 Minute Break

Segment 3: Examples

Length: 50 minutes

  • The EIGRP case
  • The BGP case
  • IS-IS and BFD

10 minute final Question and Answer Period