The postmortem ended with a list of lessons learned. Three weeks later, the same failure mode came back with a new ticket number.
That is the difference between learning and documentation. A lesson that does not change behavior is not a lesson. It is a story you tell yourself to make the incident feel useful.
SRE teams are especially vulnerable to this because we are good at analysis and often blocked on authority. We can explain exactly what happened and still be unable to change the conditions that made it inevitable.
The misconception: lessons are the output
The tempting belief is that a postmortem produces lessons learned, and those lessons improve the system.
It fails because the system does not change when you learn. It changes when you ship a control, remove a dependency, or change a decision rule.
If you only produce lessons, what usually breaks first is follow-through. Here’s why. Lessons compete with feature work, and feature work has a calendar. A lesson has good intentions.
What a lesson learned should be in practice
A lesson learned is a durable change that makes a class of failures less likely or less costly.
In operator terms, lessons fall into three buckets.
- Detection: we see the problem earlier.
- Containment: the blast radius is smaller.
- Recovery: the time to restore is lower.
If an action item does not map to one of those, what usually breaks first is relevance. Here’s why. You will not remember it during the next incident, and you will not prioritize it during the next planning cycle.
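If the bucket mapping is a requirement rather than a suggestion, it can be enforced mechanically at postmortem review time. A minimal sketch in Python, assuming a hypothetical action-item shape (the `bucket` field and dict layout are illustrative, not from any real tracker):

```python
# Hypothetical sketch: require every action item to name which of the
# three levers it moves before it is accepted into the postmortem.
BUCKETS = {"detection", "containment", "recovery"}

def accept_action_item(item: dict) -> bool:
    """Reject action items that do not map to detection, containment, or recovery."""
    return item.get("bucket") in BUCKETS

print(accept_action_item({"title": "Alert on retry saturation", "bucket": "detection"}))  # True
print(accept_action_item({"title": "Improve resiliency"}))                                # False
```

An item that cannot name its bucket is exactly the kind you will neither remember nor prioritize.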
A concrete trace: the repeat incident that teaches nothing
A common pattern is an overload incident in a dependency.
During the incident you see rising latency, timeouts, and cascading retries. You recover by scaling, failing open, or shedding load. The postmortem identifies the same contributing factors you identified last time: retry storms, lack of backpressure, and a brittle dependency graph.
Then the lesson learned becomes “improve resiliency.” That is not a lesson. That is a wish.
The fastest confirmation that you learned is one question: did you ship a guardrail?
If traffic spikes again, what usually breaks first is the same edge. Here’s why. Load amplifies the weakest coupling. If you did not change the coupling, the edge does not move.
The operator move: turn lessons into decision rules
The highest leverage lessons are decision rules, not tasks.
Example rule: when error budget burn accelerates, pause feature releases and ship only burn reducers. That is a rule. It survives the next deadline.
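A rule like that can be made executable so the release gate does not depend on anyone's judgment under deadline pressure. A minimal sketch, with hypothetical inputs: the fraction of error budget consumed, the fraction of the SLO window elapsed, and an illustrative burn-rate threshold of 2.0.

```python
# Hypothetical sketch of the decision rule: if the error budget is burning
# faster than a threshold multiple of the sustainable rate, the release
# gate closes and only burn reducers ship.
def release_allowed(budget_consumed: float, window_elapsed: float,
                    burn_threshold: float = 2.0) -> bool:
    """budget_consumed and window_elapsed are fractions in [0, 1].
    Burn rate > burn_threshold means the budget will be exhausted
    before the SLO window ends."""
    if window_elapsed == 0:
        return True
    burn_rate = budget_consumed / window_elapsed
    return burn_rate <= burn_threshold

# Halfway through the window with 90% of the budget spent: burn rate 1.8, gate open.
print(release_allowed(0.9, 0.5))  # True
# 10% into the window with 30% of the budget spent: burn rate ~3.0, gate closed.
print(release_allowed(0.3, 0.1))  # False
```

The point is not the arithmetic; it is that the rule runs the same way during a crunch as it does during a quiet quarter.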
If your organization ignores the rule, what usually breaks first is the claim that you have an SRE practice. Here’s why. SRE is not a title. It is the willingness to let reliability constrain change.
The operational artifact: the “lesson learned” quality gate
Use this to decide whether an action item is real.
- Mechanism: does the item name the causal mechanism it addresses?
- Change: does it change code, configuration, architecture, or a decision policy?
- Verification: how will you confirm it worked without waiting for the next outage?
- Owner: who owns it, and do they have the authority to ship it?
- Deadline: what date forces prioritization, not aspiration?
If verification is “we will see fewer incidents,” what usually breaks first is your memory. Here’s why. You will not know whether you improved or whether the failure simply has not recurred yet.
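The five-question gate can be applied mechanically before an action item leaves the postmortem review. A sketch with hypothetical field names, including a crude check for the "we will see fewer incidents" trap:

```python
# Hypothetical sketch: the quality gate as a validation function.
# Field names are illustrative, not from any real tracking system.
REQUIRED = ("mechanism", "change", "verification", "owner", "deadline")

def gate_failures(item: dict) -> list[str]:
    """Return the gate questions this action item fails."""
    failures = [field for field in REQUIRED if not item.get(field)]
    # "Fewer incidents" as verification means waiting for the next outage.
    if "incident" in str(item.get("verification", "")).lower():
        failures.append("verification-requires-an-outage")
    return failures

item = {
    "mechanism": "retry storm amplifies dependency overload",
    "change": "add jittered exponential backoff with a retry budget",
    "verification": "load test shows retry volume capped at 1.2x baseline",
    "owner": "payments-platform",
    "deadline": "end of next sprint",
}
print(gate_failures(item))  # []
```

An empty list means the item is real. Anything else means it is still a wish wearing a ticket number.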
Failure signature: postmortems that do not compound
Symptoms are recognizable.
- Action items are vague and long-lived.
- Owners rotate and the work resets.
- Repeat incidents look familiar but still surprise people.
Fastest confirmation is to sample the last five postmortems and count how many action items changed a decision rule or shipped a guardrail. If the answer is close to zero, you are producing narrative, not change.
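If your postmortems live in any structured form, that sampling exercise is a few lines of code. A hedged sketch, assuming a hypothetical data shape where each postmortem carries a list of action items with flags for what actually shipped:

```python
# Hypothetical sketch: measure what fraction of action items across a
# sample of postmortems shipped a guardrail or changed a decision rule.
def compounding_ratio(postmortems: list[dict]) -> float:
    """Return the fraction of action items that produced durable change."""
    items = [i for pm in postmortems for i in pm.get("action_items", [])]
    if not items:
        return 0.0
    real = [i for i in items
            if i.get("shipped_guardrail") or i.get("changed_decision_rule")]
    return len(real) / len(items)

sample = [
    {"action_items": [{"shipped_guardrail": True},
                      {"title": "improve resiliency"}]},
    {"action_items": [{"changed_decision_rule": True},
                      {"title": "be more careful"}]},
]
print(compounding_ratio(sample))  # 0.5
```

Run it over the last five postmortems. A ratio near zero is the failure signature in one number.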
If leadership asks for “more rigor,” what usually breaks first is morale. Here’s why. Engineers will write better documents until they realize documents are not the constraint. Authority and incentives are.
How a senior should explain this to a peer
Lessons learned are only real when they change behavior. In practice that means a guardrail, a constraint, or a decision rule that survives the next deadline. If we cannot verify the change without another outage, we did not learn yet. We just wrote.
The unresolved part is governance. A team can write perfect postmortems and still repeat failures if it cannot force the system to change.
Related operator notes
- Customer Reliability Engineering: make customer pain operational
- Blameless culture in SRE: accountability without scapegoats
- KISS for SRE: shrink the state space
- Feedback loops in SRE: where systems lie to you first
Sanity check questions
- For your last incident, what did you ship that changes detection, containment, or recovery?
- What is the verification signal that proves the change worked before the next outage?
- Which lesson would survive an executive deadline, and which ones are really just tasks waiting to be deprioritized?