Postmortems do not improve reliability. Most teams just hope they will.
The same incidents repeat not because teams fail to learn, but because nothing enforces follow-through. Action items get written, discussed, and agreed on. Then they enter a backlog, where they slowly lose priority, ownership, and visibility.
A few weeks later, the system fails in the same way again.
The problem is not the quality of the postmortem. The problem is that postmortems are treated as the system, when they are only the input.
If you want reliability to improve, you need a system that continuously manages risk. Not a document that describes what already went wrong.
That system is a risk registry.
Why Postmortem Action Items Fail
Postmortem action items fail for structural reasons, not cultural ones.
They are often written in a way that sounds correct but lacks precision. “Improve monitoring” or “add better alerting” does not define an outcome. Engineers cannot prioritize work that is not clearly bounded.
Backlogs then make the problem worse. A minor improvement and a systemic failure sit side by side with no way to distinguish them. Everything feels important. Nothing gets done.
Ownership does not drift by accident. It disappears because no system forces a single team to carry the risk. When work spans components or teams, responsibility becomes shared, and shared responsibility is rarely executed.
Time finishes the job. New work arrives, incidents happen, and yesterday’s action items fade into the background.
Most teams do not have a reliability problem. They have a prioritization problem disguised as reliability work.
The Shift: From Action Items to Risk
A risk registry changes the unit of work.
Instead of tracking tasks, it tracks failure modes.
An action item describes something you plan to do. A risk describes something that will happen again if nothing changes. One is optional. The other is already present in the system.
This shift forces clarity. What exactly can fail? How bad would it be? How likely is it to happen again?
Those questions turn vague follow-ups into concrete engineering problems. They also remove the illusion that reliability improves simply because work was documented.
A risk exists whether or not anyone is assigned to fix it.
What a Real Risk Looks Like
A useful risk is not a paragraph and not a vague statement. It is a precise description of a failure mode that engineers immediately recognize.
A typical postmortem action item might read:
“Improve handling of stale resources in backup system.”
That sounds reasonable, but it does not define the problem. A risk reframes it:
“Orphaned snapshot resources accumulate in management clusters, causing backup jobs to degrade over time and exceed recovery objectives under load.”
Now the failure mode is clear. The mechanism is visible. The user impact is explicit.
In practice, a single risk entry looks like this:
| Risk | Orphaned snapshot resources accumulate in management clusters |
| Impact | Backup operations exceed recovery objectives under load |
| Scope | All clusters using shared backup infrastructure |
| Likelihood | Increases over time without cleanup mechanisms |
| Score | High |
| Owner | Service Lifecycle SRE |
| Status | Active |
| Last Reviewed | March 2026 |
This is no longer a suggestion. It is a defined reliability exposure.
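In code, the same entry can be a small record rather than a ticket. Here is a minimal sketch in Python; the `Risk` dataclass and its field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Risk:
    """One entry in the risk registry: a failure mode, not a task."""
    description: str    # the failure mode, stated as a mechanism
    impact: str         # what degrades for users when it materializes
    scope: str          # how much of the system is exposed
    likelihood: str     # how often it is expected to occur
    score: str          # combined exposure, e.g. "High"
    owner: str          # a single accountable team, never shared
    status: str         # "Active" until the risk is removed
    last_reviewed: date

backup_risk = Risk(
    description="Orphaned snapshot resources accumulate in management clusters",
    impact="Backup operations exceed recovery objectives under load",
    scope="All clusters using shared backup infrastructure",
    likelihood="Increases over time without cleanup mechanisms",
    score="High",
    owner="Service Lifecycle SRE",
    status="Active",
    last_reviewed=date(2026, 3, 1),
)
```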
Scoring Risk Instead of Guessing Priority
Once risks are clearly defined, they must be comparable.
Without a model, prioritization becomes subjective. The most recent incident or the loudest failure gets attention. Quieter but more dangerous risks remain unresolved.
A simple scoring model brings consistency by evaluating three dimensions. Impact reflects how severe the outcome is if the risk materializes. Scope reflects how broadly the system is affected. Likelihood reflects how often the failure is expected to occur.
When combined, these create a score that represents real exposure.
Consider two risks. One is a rare failure that could cause a full outage. The other is a daily degradation affecting a large percentage of users. Without structure, teams debate. With scoring, the second often takes priority because it is actively burning reliability every day.
This is where most teams get stuck. They optimize for theoretical severity instead of actual impact over time. A scoring model makes that tradeoff explicit.
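Here is a minimal sketch of such a model, assuming each dimension is rated 1 to 5 and the product of the three is the exposure score. Both the scale and the multiplication are assumptions, not a standard:

```python
def risk_score(impact: int, scope: int, likelihood: int) -> int:
    """Combine the three dimensions into a single exposure score.

    Each dimension is rated 1 (low) to 5 (high); multiplying them
    means a risk must be bad on several axes to rank near the top.
    """
    for value in (impact, scope, likelihood):
        if not 1 <= value <= 5:
            raise ValueError("each dimension must be rated 1-5")
    return impact * scope * likelihood

# The two risks from above: a rare full outage vs. a daily degradation.
rare_outage = risk_score(impact=5, scope=5, likelihood=1)  # 25
daily_decay = risk_score(impact=3, scope=4, likelihood=5)  # 60

assert daily_decay > rare_outage  # the quiet, constant burn wins
```

Multiplication is a deliberate choice: a risk has to be bad on more than one axis to outrank everything else, which is exactly the tradeoff the two examples above expose.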
Turning Scores Into Decisions
A risk score is only useful if it changes behavior.
High-scoring risks should be visible in the same places as incidents and operational metrics. They should show up in planning, in leadership discussions, and in day-to-day engineering work. They are not background tasks. They represent the current state of reliability.
Lower-scoring risks still matter, but they should not compete for attention. This is where focus is created: not by ignoring problems, but by ordering them correctly.
Over time, the registry should move. High risks should be reduced or eliminated. New risks should appear as systems evolve. If the same risks sit at the top for months, the system is not working.
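That movement can be checked mechanically. Here is a sketch that flags high-scoring risks that have sat at the top for too long; the risk names, threshold, window, and dates are all assumed for illustration:

```python
from datetime import date, timedelta

# Illustrative registry rows: (risk, score, date it first reached the top tier).
registry = [
    ("Orphaned snapshots accumulate", 60, date(2025, 11, 1)),
    ("Single-AZ message broker", 48, date(2026, 4, 10)),
    ("Flaky deploy health check", 12, date(2026, 1, 5)),
]

HIGH_THRESHOLD = 40            # assumed cutoff for "competes with incident work"
STAGNATION = timedelta(days=90)
today = date(2026, 6, 1)       # fixed date so the example is deterministic

high = [(name, score, since) for name, score, since in registry
        if score >= HIGH_THRESHOLD]
stagnant = [(name, score, since) for name, score, since in high
            if today - since > STAGNATION]

for name, score, since in stagnant:
    print(f"STAGNANT: {name} (score {score}), high since {since}")
```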
Keeping the System From Rotting
A risk registry only works if it is active.
If risks are not reviewed in the same forum as incidents or sprint planning, they will rot. Every time.
The registry must be part of how the team operates, not a separate system that requires extra effort to maintain. Risks should be created directly from postmortems. Ownership must be explicit and tied to real teams. Movement should be visible so engineers can see progress and stagnation clearly.
The moment the registry becomes passive, it turns back into a backlog.
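A lightweight guard is to treat a missed review as a signal in itself. A sketch, assuming a monthly review cadence (the cadence is an arbitrary choice):

```python
from datetime import date, timedelta

REVIEW_CADENCE = timedelta(days=30)  # assumed: every active risk is revisited monthly

def is_rotting(last_reviewed: date, today: date) -> bool:
    """An active risk that misses its review window is sliding back toward backlog status."""
    return today - last_reviewed > REVIEW_CADENCE

# Example: the entry reviewed in March 2026, checked two months later.
print(is_rotting(date(2026, 3, 1), today=date(2026, 5, 1)))  # True
```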
A Real Example
Consider a backup system that gradually slows down.
Over time, orphaned resources accumulate. Each backup job processes more data than necessary. Eventually, recovery objectives are missed during peak load.
A traditional approach generates action items: clean up stale resources, improve monitoring, optimize the pipeline.
A risk registry captures the actual problem. The risk is accumulation of orphaned resources leading to degraded backup performance and missed recovery objectives. Impact is high because it affects data protection. Scope is broad because it affects all clusters using the system. Likelihood is increasing because the condition compounds over time.
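Fed into the earlier scoring sketch, the numbers make the priority hard to argue with. The 1-to-5 ratings below are assumed for illustration:

```python
# Rating the backup risk on the assumed 1-5 scale from the sketch above.
impact = 5       # missed recovery objectives threaten data protection
scope = 5        # every cluster on the shared backup infrastructure is exposed
likelihood = 4   # orphaned resources compound, so this worsens on its own
print(impact * scope * likelihood)  # 100 out of a possible 125
```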
This risk rises to the top of the registry. The conversation changes. The team is no longer deciding whether to fix it. The team is deciding how quickly it can be removed.
The Outcome That Matters
A well-run risk registry changes how a team thinks about reliability.
Postmortems stop being the end of the process and become the input. Reliability work becomes visible, measurable, and defensible. Engineering discussions shift from reacting to incidents to reducing the conditions that cause them.
Postmortems should not be the system that improves reliability. They should be the signal that your system for managing risk is working.
That is the difference between documenting failure and preventing it.