Teams reach for 5 Whys because it forces discomfort. The first answer is usually tidy and wrong. The second answer starts to touch the system. By the third, you can feel whether you are moving toward something you can change or toward a story you want to believe.
Most of the “5 Whys does not work” complaints are not about the technique. They are about how it gets used in a postmortem. People run it like a checklist, produce a straight line, and end up with a moral lesson. The output feels complete, and nothing changes.
In a postmortem, the only useful end state is a cause expressed as a control problem. Something in the system made the failure likely, made it large, or made it slow to respond. If you cannot tie the root cause to a control you will add, strengthen, or enforce, you are not done.
The right mental model is not “ask why five times.” It is “keep asking why until the answer stops being a mechanism and starts being a shrug.” The number is a guardrail against premature stopping, not a goal.
The failure that 5 Whys is designed to prevent
The common postmortem failure mode is the satisfying story. It looks like insight, but it is really closure.
You see it when the root cause reads like a personality trait, a missed step, or a single bad decision. Those might be true facts in the timeline, but they are almost never the reason the incident was possible. They are the last link in a chain of conditions that made the mistake survivable or catastrophic.
If your root cause can be fixed by telling someone to be more careful, the system will hand you the same incident again. The next person will make a different mistake in the same gap.
5 Whys is useful because it pushes past attribution and toward conditions. That only happens if you enforce a standard for each answer.
Treat each “why” as a demand for a mechanism
A real answer has a mechanism you could falsify.
“Database overload” is a label. The mechanism is what created the load, why the load crossed a threshold, and why the system did not shed it. Connection churn from retries, a slow query path behind a new feature flag, a pool size that was tuned for last quarter’s traffic, a cache miss storm after a deployment. Those are mechanisms. They may still be incomplete, but you can test them.
When a team is tired, the temptation is to let labels stand in for causes. In a postmortem, labels are where you start asking questions, not where you stop.
This is also where the chain often needs to branch. Incidents in production are rarely linear. If the mechanism has two independent contributors, do not force a single line because the method is called “5 Whys.” A branching chain is still 5 Whys. It is a better reflection of the system you actually run.
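One way to keep a branching chain honest is to model it as a tree rather than a straight line, where every node carries both a claimed mechanism and the evidence behind it. This is a minimal sketch, not a tool the article prescribes; all names and the example incident are illustrative:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Why:
    """One node in a 5 Whys chain: a claimed mechanism plus its evidence."""
    statement: str
    evidence: List[str] = field(default_factory=list)      # logs, traces, tickets
    contributors: List["Why"] = field(default_factory=list)  # deeper "why" nodes

    def is_hypothesis(self) -> bool:
        # A claim with no supporting evidence stays a hypothesis, not a cause.
        return not self.evidence

    def depth(self) -> int:
        # Longest run of "why" questions below this node.
        if not self.contributors:
            return 1
        return 1 + max(c.depth() for c in self.contributors)


# A branching chain: the failure had two independent contributors,
# so forcing a single line would misrepresent the system.
root = Why(
    "API latency breached the SLO",
    evidence=["latency dashboard snapshot"],
    contributors=[
        Why("retry storm multiplied connection churn",
            evidence=["proxy logs"]),
        Why("pool size was tuned for last quarter's traffic",
            evidence=["config change history"]),
    ],
)
```

Marking evidence-free nodes as hypotheses at write time makes the later rule ("do not promote a statement to root cause because it sounds right") mechanical rather than a matter of discipline.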
Separate the technical chain from the response chain
Most postmortems conflate two different problems.
One problem is the technical failure and its propagation. The other is why it took the team as long as it did to detect, decide, and mitigate. You can have an excellent fix for the first and still keep paying the MTTR tax on the second.
When teams do 5 Whys only on the technical failure, they miss the part that compounds. Alert design, ownership ambiguity, missing decision rules, runbooks that describe components but not actions, dashboards that hide the early symptom, escalation paths that are socially understood but not operationally enforced. Those are not side notes. They are the controls that determine whether an incident stays small.
If you want 5 Whys to produce durable improvements, run it twice. One chain for the system, one chain for the response.
Stop criteria that prevent “why” from turning into philosophy
The useless end of a 5 Whys chain is the generic truth that was always true.
“Because humans make mistakes.” “Because software is complex.” “Because the business moves fast.” Those statements do not help you design controls. They are summaries of the environment.
A practical stop point is when the next why would no longer change the fix you will implement. If you already have a clear control to add, and deeper causes only move you toward generalities, stop. Your job is not to explain reality. Your job is to change the odds.
Another stop point is when the chain starts pointing outside your locus of control. That does not mean the cause is not real. It means the fix will not come from your postmortem, and pretending otherwise creates action items that never ship.
What “good” output looks like
A useful root cause statement names the missing or ineffective control.
It should be specific enough that a reader can tell whether the control exists today. It should be specific enough that an owner can implement it without interpretation. It should be specific enough that a reviewer can confirm it is in place.
If your root cause is “insufficient testing,” you do not have a control. If your root cause is “the release pipeline did not enforce a canary gate for this service and allowed a change that violated the error budget policy,” you have something you can build.
That is the bar. It is not poetic, and it is not meant to be.
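That bar can be encoded as a crude filter over root cause statements: a control needs an owner, a way to verify it exists, and phrasing that is not one of the generic labels. The marker list and field names here are illustrative assumptions, a sketch of the check rather than a complete linter:

```python
from dataclasses import dataclass


@dataclass
class Control:
    """A root cause expressed as a control: specific enough to build and verify."""
    statement: str     # e.g. "release pipeline enforces a canary gate for svc-x"
    owner: str         # who implements it without interpretation
    verification: str  # how a reviewer confirms it is in place

# Phrasings that signal a label or a moral, not a control (illustrative list).
VAGUE_MARKERS = ("insufficient", "lack of", "be more careful", "human error")


def is_actionable(control: Control) -> bool:
    # Reject statements with no owner or verification path, and reject
    # generic phrasings that were always true of the environment.
    text = control.statement.lower()
    specific = bool(control.owner) and bool(control.verification)
    return specific and not any(marker in text for marker in VAGUE_MARKERS)
```

A statement like “insufficient testing” fails the filter immediately; the canary-gate statement passes once it has an owner and a verification step attached.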
The easiest ways to break 5 Whys
5 Whys fails quickly when the group starts optimizing for speed instead of truth. That tends to show up in three patterns.
The first is person centered causality. Someone pushed the wrong button, missed a step, ignored a warning. That is a data point in the chain, not the chain. Translate it into the condition that made the action possible and the consequence large.
The second is choosing a single cause because it feels clean. Systems fail through combinations, and your fixes should be layered the same way. You want a control that reduces probability, a control that limits blast radius, and a control that improves response. If your analysis collapses everything into one fix, you are building a brittle defense.
The third is letting the chain drift away from evidence. A 5 Whys session can become a storytelling competition if you are not strict about mechanisms and sources. If you cannot point to logs, traces, configs, dashboards, tickets, or the incident record, mark the statement as a hypothesis. Do not promote it to root cause because it sounds right.
How AI can support the method without corrupting it
AI is most useful here as a strict reviewer, not an author.
Give it the incident record, the postmortem draft, and the runbooks. Ask it to flag claims that lack support in the record. Ask it to identify where a “why” answer is a label rather than a mechanism. Ask it to surface places where the chain should branch because the evidence points to multiple contributors.
If it starts inventing causes, you are using it wrong. The value is in forcing discipline, not in producing a narrative.
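The reviewer framing above can be pinned down in the prompt itself, so the model is constrained to critique rather than author. This is one possible prompt shape, assuming you pass the incident record and draft as plain text; the wording and function name are illustrative:

```python
def build_review_prompt(incident_record: str, draft: str) -> str:
    """Assemble a strict-reviewer prompt: the model flags problems in the
    draft against the record, and is told not to invent causes."""
    checks = [
        "Flag claims in the draft that lack support in the incident record.",
        "Flag any 'why' answer that is a label rather than a mechanism.",
        "Flag places where the evidence points to multiple contributors "
        "and the chain should branch.",
    ]
    return (
        "You are reviewing a postmortem draft. Do not propose new causes.\n\n"
        f"INCIDENT RECORD:\n{incident_record}\n\n"
        f"DRAFT:\n{draft}\n\n"
        "CHECKS:\n" + "\n".join(f"- {check}" for check in checks)
    )
```

Keeping the checks as an explicit list makes the review repeatable across incidents, and the “do not propose new causes” constraint is what keeps the tool a reviewer instead of a narrator.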
The standard to hold in review
If the postmortem ends with causes that map to controls, 5 Whys did its job.
If it ends with advice, reminders, or a single moral, it did not.
Image credit: “Change management whiteboard” by Birkenkrahe, licensed for reuse with attribution and share-alike (CC BY-SA). Cropped for header use.