The outage was not caused by a complex system. It was caused by a simple system with complex failure modes.
The service had one job. It still failed in ways that were hard to predict because the implementation kept accumulating special cases. Every exception was justified. Every exception also multiplied the number of ways the system could surprise the operator.
KISS is not a style preference. In SRE it is a reliability policy: reduce the number of states you can be in and the number of ways you can transition between them.
The misconception: simplicity is about elegance
The tempting belief is that KISS is an aesthetic. Keep the code clean, keep the design tidy, feel good about the architecture.
It fails because reliability is not about elegance. Reliability is about fewer failure paths.
If requirements change, what usually breaks first is your exception handling. Here’s why. Exceptions get added under pressure and rarely get removed later. They become permanent complexity debt.
What KISS means in production
Simple means the system has fewer operational states and fewer hidden couplings.
You can ship a complex system that is reliable if it is constrained and observable. Most teams do not. They ship complexity without constraints, then rely on heroics to operate it.
If your system depends on institutional memory, what usually breaks first is onboarding. Here’s why. New on-call engineers cannot reason about the system’s edges, so they revert to guessing and escalation.
The contrast pair: simple interface versus simple behavior
Teams often mistake a simple interface for simple behavior.
A service can have a clean API and still behave like a set of loosely related workflows behind the scenes. That is where incidents hide. The interface looks stable. The behavior is not.
Prediction prompt: when you simplify only the interface, what breaks first?
Diagnosis breaks first. Operators can no longer infer internal state from external behavior because the behavior depends on hidden branches.
A concrete trace: the configuration matrix from hell
This is common in platform services and managed offerings.
You add a feature flag. Then another. Then environment specific overrides. Then customer specific overrides. Then migration modes. None of them are wrong in isolation. Together they create a configuration matrix that nobody can fully enumerate.
The fastest confirmation that you are in trouble is not a code review comment. It is an outage in which two environments behave differently under the same inputs and nobody can explain why in the first ten minutes.
If the number of configurations grows, what usually breaks first is your test coverage. Here’s why. You cannot test the matrix. You can only sample it. Incidents happen in the untested corners.
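To make the growth concrete, here is a minimal sketch that counts the configurations a handful of independent axes can produce. The axis names and values (`feature_flag_a`, `customer_x`, `dual_write`, and so on) are hypothetical, chosen only to mirror the kinds of knobs listed above, and the sketch assumes every axis can vary independently of the others.

```python
from itertools import product
from math import prod

# Hypothetical configuration axes; names and values are illustrative only.
CONFIG_AXES = {
    "feature_flag_a": [False, True],
    "feature_flag_b": [False, True],
    "environment": ["dev", "staging", "prod"],
    "customer_override": ["none", "customer_x", "customer_y"],
    "migration_mode": ["off", "dual_write", "cutover"],
}


def matrix_size(axes):
    """Number of distinct configurations, assuming the axes are independent."""
    return prod(len(values) for values in axes.values())


def enumerate_configs(axes):
    """Yield every configuration as a dict. Feasible only for small matrices."""
    names = list(axes)
    for combo in product(*(axes[name] for name in names)):
        yield dict(zip(names, combo))


def sampled_coverage(axes, tested_count):
    """Fraction of the matrix covered when only tested_count configs are tested."""
    return tested_count / matrix_size(axes)


if __name__ == "__main__":
    print("configurations:", matrix_size(CONFIG_AXES))
    print("coverage with 20 tested configs:", round(sampled_coverage(CONFIG_AXES, 20), 3))
```

Five small axes already multiply out to more than a hundred configurations, and each additional boolean flag doubles the total. That is why a test suite can only sample the matrix rather than cover it.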
The operator move: shrink the state space
The default I would ship is ruthless: delete configuration degrees of freedom until you can reason about the system again.
That means:
- Prefer one mechanism over many.
- Prefer one mode of operation over hidden modes.
- Prefer one migration path you can observe over several you cannot.
Alternatives exist, but they work only if you have strong testing and release discipline. Most teams do not have that discipline at the moment they are adding complexity.
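One way to hold that line in code is to make the configuration loader refuse anything outside a small, explicit set of modes. The following is a sketch under stated assumptions, not a prescribed implementation: `Mode`, `ServiceConfig`, `ALLOWED_KEYS`, and `load_config` are hypothetical names, and the service is assumed to need exactly one normal mode and one observable migration mode.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """The only operating modes the (hypothetical) service supports."""
    NORMAL = "normal"
    MIGRATING = "migrating"


# Assumption for this sketch: "mode" is the single supported degree of freedom.
ALLOWED_KEYS = {"mode"}


@dataclass(frozen=True)
class ServiceConfig:
    mode: Mode = Mode.NORMAL  # the default is explicit and enforced here


def load_config(raw: dict) -> ServiceConfig:
    """Build a config, rejecting unknown keys instead of silently honoring them."""
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported configuration keys: {sorted(unknown)}")
    # Mode(...) raises ValueError for any value that is not a declared mode.
    return ServiceConfig(mode=Mode(raw.get("mode", Mode.NORMAL.value)))


if __name__ == "__main__":
    print(load_config({}))
    print(load_config({"mode": "migrating"}))
```

Rejecting unknown keys means a new override cannot slip in as a quiet special case. Adding one requires changing `ALLOWED_KEYS`, which puts the decision in front of a reviewer.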
If you cannot roll back safely, what usually breaks first is your appetite for change. Here’s why. Complex systems make rollback risky, so teams stop rolling back. Then incidents become longer because the safest move is gone.
The operational artifact: complexity budget checklist
Use this before you accept a new feature that increases branching or configuration.
- New states: what new system states does this add?
- Transitions: what new transitions are possible between states?
- Observability: what signals tell an operator which state the system is in?
- Rollback: what is the rollback path, and what state does rollback return you to?
- Default: what is the default mode, and is it enforced?
If you cannot answer rollback, what usually breaks first is your incident tempo. Here’s why. The fastest safe move becomes unavailable, and every incident turns into a manual repair.
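The checklist can also be encoded as a small review gate, so that an unanswered question blocks the change rather than being skipped. This is a sketch only: `ComplexityReview`, `unanswered`, and `accept` are hypothetical names, and the fields simply mirror the five questions above.

```python
from dataclasses import dataclass, fields


@dataclass
class ComplexityReview:
    """One free-text answer per checklist question; empty means unanswered."""
    new_states: str = ""
    transitions: str = ""
    observability: str = ""
    rollback: str = ""
    default: str = ""


def unanswered(review: ComplexityReview) -> list:
    """Return the names of checklist questions that have no answer yet."""
    return [f.name for f in fields(review) if not getattr(review, f.name).strip()]


def accept(review: ComplexityReview) -> bool:
    """Accept the change only when every question has an answer."""
    return not unanswered(review)


if __name__ == "__main__":
    review = ComplexityReview(
        new_states="adds a 'migrating' state",
        transitions="normal -> migrating -> normal",
        observability="mode is exported as a gauge",
        default="normal, enforced at config load",
    )
    print("missing:", unanswered(review))
    print("accepted:", accept(review))
```

In this usage example the rollback answer is missing, so the gate reports `rollback` and refuses the change. The gate checks only that an answer exists, so judging whether the answer is good still falls to the reviewer.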
How a senior engineer should explain this to a peer
KISS in SRE is a constraint on state space. We limit how many modes the system can run in, we keep the defaults enforceable, and we ensure an operator can infer state from signals quickly. Complexity is not free, even when it is justified. It taxes diagnosis and rollback first.
The unresolved part is product pressure. The easiest time to keep a system simple is before the matrix exists. After it exists, simplification means saying no to legitimate requests, and that is where teams often fold.
Sanity check questions
- What is the current state space of your service, and can an on-call engineer infer its state in minutes?
- Which configuration degrees of freedom could you delete without harming the core value?
- What is your rollback path, and what state does it return you to under pressure?


