A production system rarely fails all at once. It fails by shifting constraints. On-call fails the same way. People do not suddenly burn out. The system quietly moves from sustainable to brittle, and you only notice when performance drops, mistakes increase, or the team starts bleeding experienced operators.
If you treat on-call as a personal stamina problem, you will keep hiring tougher humans for a system that keeps getting worse. If you treat it as a system, you can measure it, model it, and change its inputs until it behaves.
The central idea is simple. On-call load is demand arriving into a finite-capacity service. Pages, tickets, interruptions, escalations, and follow-up work all compete for the same operator attention. When demand approaches or exceeds capacity, the queue grows. When the queue grows, stress rises. When stress rises, decision quality drops. When decision quality drops, incidents last longer and demand increases. That loop is how teams get trapped.
The constraint you are actually managing
Most teams fixate on the most visible symptom, which is the page. The constraint is usually elsewhere.
Sometimes the constraint is signal quality. The team is flooded with alerts that are not actionable, not attributable, or not time-sensitive. Sometimes the constraint is tool friction, where every mitigation requires four dashboards, three jump boxes, and a dozen manual steps. Sometimes the constraint is change velocity without guardrails, where deployments create a steady stream of recoverable failures that still cost human sleep. Sometimes the constraint is follow-up debt, where every incident creates a backlog of work that lands on the same people who are already carrying the pager.
You cannot fix what you do not name. Start by asking a blunt question: what is the first thing that becomes scarce when we are on call? Is it uninterrupted time, sleep, decision focus, escalation bandwidth, or the ability to complete follow-up work during business hours? The answer tells you what to measure.
Measure demand, not heroics
On-call measurement fails when it turns into performance evaluation. The goal is not to rank responders. The goal is to quantify the shape of demand and the cost of handling it, then redesign the system.
A useful measurement set has three properties. It is leading, not lagging. It is tied to user harm or operational risk, not internal discomfort. It is actionable, meaning you can point to a lever that changes it.
Start with demand rate and demand quality.
Demand rate is the arrival of work into on-call. Count pages, incidents, and escalations per shift, but separate business hours from after-hours because sleep disruption is a different class of cost. Look at the distribution, not the average. One operator absorbing three nights of clustered pages is what breaks people, even if the monthly mean looks fine.
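As a minimal sketch of that split, the following assumes pages arrive as `(timestamp, responder)` pairs and that "after hours" means 22:00–06:59 local time; both the record shape and the window are assumptions you would adjust to your own rotation.

```python
from datetime import datetime

def is_after_hours(ts: datetime) -> bool:
    """Assumed after-hours window: 22:00-06:59 local. Weekends, holidays,
    and follow-the-sun handoffs would need their own rules."""
    return ts.hour >= 22 or ts.hour < 7

def demand_rate(pages, responders):
    """Per-responder page counts, split into business vs after-hours,
    because sleep disruption is a different class of cost."""
    counts = {r: {"business": 0, "after_hours": 0} for r in responders}
    for ts, responder in pages:
        bucket = "after_hours" if is_after_hours(ts) else "business"
        counts[responder][bucket] += 1
    return counts

def p95(values):
    """Nearest-rank 95th percentile. Report the distribution, not the
    average: the mean hides the operator who absorbed the bad nights."""
    ordered = sorted(values)
    return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]
```

The point of `p95` is the argument in the paragraph above: a monthly mean of pages per shift can look fine while the tail of the distribution is destroying one person.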
Demand quality asks whether each interruption was worth the cost. Track the actionable rate. A page that cannot be acted on is a tax on attention. Track attribution quality, which is whether the alert points to an owning service and an obvious first step. Track duplicate demand, where one failure produces a cascade of pages across dependent services.
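These two quality measures can be computed from triage records. The sketch below assumes each alert carries a responder-assigned `actionable` flag and a timestamp; real duplicate detection would also key on root cause, not just time proximity.

```python
def actionable_rate(alerts):
    """Fraction of alerts a responder could actually act on. The
    'actionable' boolean is an assumed label from triage notes."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if a["actionable"]) / len(alerts)

def duplicate_groups(alerts, window_s=300):
    """Group alerts whose timestamps fall within window_s seconds of the
    previous alert in the group. Every member beyond a group's first is
    duplicate demand: one failure paged more than once."""
    groups, current = [], []
    for a in sorted(alerts, key=lambda a: a["ts"]):
        if current and a["ts"] - current[-1]["ts"] > window_s:
            groups.append(current)
            current = []
        current.append(a)
    if current:
        groups.append(current)
    return groups
```

A falling actionable rate or a rising duplicate count is a direct argument for the alert-design lever discussed later, with numbers attached.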
Then measure handling cost, because two teams with the same page volume can have completely different experiences.
Handling cost shows up as time to acknowledge, time to mitigate, and time to recover, but the most revealing metric is often time spent per interrupt, including the hidden tail. A two-minute page that produces forty minutes of cleanup and coordination is not a two-minute page. It is a context switch with a long shadow.
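One way to make the long shadow visible is to charge every interrupt its full cost. The field names below are illustrative, and the fixed context-switch penalty is an explicit assumption (refocus estimates in the tens of minutes are commonly cited; treat the constant as a knob, not a fact).

```python
def interrupt_cost_minutes(interrupt, context_switch_min=23):
    """Total cost of one interrupt in minutes: active handling time, plus
    the hidden tail of cleanup and coordination, plus an assumed
    context-switch penalty for the focus lost around the page."""
    active = interrupt["mitigated_at"] - interrupt["acked_at"]
    tail = interrupt.get("followup_minutes", 0)
    return active + tail + context_switch_min
```

Under this accounting, the two-minute page with forty minutes of cleanup from the paragraph above books at over an hour, which matches how it actually felt.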
Finally, measure follow-up load, because burnout often comes from the second shift no one budgets for. Incidents create tickets, postmortems, action items, and design work. If the same on-call pool is expected to clear that backlog while continuing to take pages, the queue never drains. You need a view of open operational debt, its age, and how often on-call rotations end with unfinished follow-up work.
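A minimal view of that debt needs only three numbers: how much is open, how old the oldest item is, and how much has gone stale. The ticket shape and the 30-day staleness threshold below are assumptions.

```python
def operational_debt_view(tickets, today):
    """Summarize open operational debt. Assumed ticket shape:
    {'opened_day': int, 'closed': bool}, with days as integer ordinals
    so ages are simple subtraction."""
    open_items = [t for t in tickets if not t["closed"]]
    ages = [today - t["opened_day"] for t in open_items]
    return {
        "open": len(open_items),
        "oldest_age_days": max(ages, default=0),
        "stale_over_30d": sum(1 for a in ages if a > 30),
    }
```

If `stale_over_30d` grows rotation after rotation, the queue is not draining, whatever the incident counts say.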
The human cost that predicts failure
Some of the most predictive measures are not engineering metrics, but they still map cleanly to reliability outcomes.
Sleep disruption is one. Track the count of after-hours pages that require waking up, not just those that fire. Track how many nights per month are interrupted per person, and how clustered those nights are. A system that spreads pain evenly is survivable. A system that randomly concentrates pain on a few people creates churn.
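Clustering is the part averages hide, so it helps to compute it directly. The sketch below assumes you log, per person, the days of the month on which a page actually woke them.

```python
def night_disruption(wakeup_nights):
    """wakeup_nights maps person -> list of day-of-month integers on which
    a page woke them. Returns interrupted-night count and the longest
    consecutive run of bad nights, which captures clustering."""
    report = {}
    for person, nights in wakeup_nights.items():
        ordered = sorted(set(nights))
        longest = run = 1 if ordered else 0
        for prev, cur in zip(ordered, ordered[1:]):
            run = run + 1 if cur == prev + 1 else 1
            longest = max(longest, run)
        report[person] = {"nights": len(ordered), "longest_streak": longest}
    return report
```

Two people can both show four interrupted nights a month, but a `longest_streak` of four means one of them went most of a week without real sleep.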
Interrupt density is another. Track how many distinct interruptions occur per hour during a shift, because the experience of being continuously preempted is worse than handling the same total work in fewer blocks.
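A rolling-window peak captures this better than a shift total. The sketch assumes interrupt timestamps recorded as minutes from shift start.

```python
def peak_interrupt_density(timestamps_min, window_min=60):
    """Max number of interrupts in any rolling window_min-minute window.
    Continuous preemption shows up here even when the shift total
    looks tame."""
    ts = sorted(timestamps_min)
    peak = 0
    start = 0
    for end, t in enumerate(ts):
        # Slide the window's left edge forward until it fits.
        while t - ts[start] >= window_min:
            start += 1
        peak = max(peak, end - start + 1)
    return peak
```

Five interrupts spread over a shift and five interrupts inside one hour produce the same count but very different operators at the end of it; only the peak distinguishes them.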
Escalation pressure matters as well. If on-call frequently requires pulling in subject matter experts, you have a design signal: ownership boundaries are unclear, systems are too coupled, or runbooks are too thin. It also tells you that the true load is not captured by the primary responder’s pager metrics.
You should also track rotation sustainability signals that show up in staffing behavior. How often do people try to swap out of shifts? How often do you rely on volunteers? How often do you violate rest expectations? Those are not cultural problems. They are system alarms.
Tie every metric to a lever
Measuring is pointless if it does not change policy.
If actionable rate is low, your lever is alert design. That means suppression, deduplication, ownership routing, better thresholds, and shifting from symptom alerts to user-harm signals. If handling cost is high, your lever is runbooks, tooling, and standard mitigations that reduce cognitive overhead. If follow-up backlog is growing, your lever is capacity allocation, which usually means a protected reliability queue and explicit time carved out of the week for operational debt.
If after-hours load is too high, your lever is not "try harder." It is a combination of change management and safety rails. Reduce risky deploy windows. Add canaries that roll back automatically. Introduce error-budget based release gating that actually slows change when reliability is degraded. If you do not have an enforcement mechanism, you do not have a policy. You have a suggestion.
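An enforcement mechanism can be as small as a gate function the deploy pipeline must call. The burn-rate thresholds and change-type names below are example policy, not a standard; the non-negotiable part is that they are decided in advance.

```python
def deploy_gate(burn_rate, change_type):
    """Pre-committed release gate. burn_rate is error-budget burn relative
    to plan (1.0 = on plan). Thresholds and change-type names are
    illustrative policy choices, agreed before any incident."""
    if burn_rate < 1.0:
        # Within budget: normal operations.
        return True
    if burn_rate < 2.0:
        # Elevated burn: only low-risk, reversible changes.
        return change_type in {"rollback", "config_revert", "canary"}
    # Severe burn: only actions that reduce risk are permitted.
    return change_type == "rollback"
```

The function is trivial on purpose. What makes it a policy rather than a suggestion is that the pipeline refuses to proceed when it returns False, and no one can override it mid-incident.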
This is the place where senior operators earn their keep. They insist that the system will not renegotiate reliability under pressure.
A simple operating model that prevents renegotiation
Pick one decision you will make faster next time, and bind it to a signal that represents user harm. Then pre-commit to what you will do when that signal moves.
If you want an example that is small enough to run next week, use an error-budget burn or a high-confidence user-impact signal, decide what actions are allowed while burn is elevated, and decide what actions are forbidden. The forbidden list is the part that prevents burnout, because it removes the expectation that on-call will compensate for risk the organization chose to take.
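The allowed and forbidden lists work best written down as data, so they can be reviewed and versioned like any other config. Everything named below is a placeholder; the structure is the point.

```python
# Policy as data, agreed before any incident. All names here are
# hypothetical placeholders for your own signals and actions.
ELEVATED_BURN_POLICY = {
    "signal": "error_budget_burn_elevated",
    "allowed": ["rollback", "disable_noncritical_feature", "page_owner_team"],
    "forbidden": ["new_feature_deploy", "schema_migration", "capacity_drain"],
}

def is_permitted(action, policy=ELEVATED_BURN_POLICY):
    """Deny by default: anything not explicitly allowed is refused.
    That removes the mid-incident renegotiation this section warns about."""
    return action in policy["allowed"]
```

Deny-by-default is the design choice that protects on-call: the forbidden list does not have to enumerate every risky idea someone might have at 3 a.m., because unlisted actions are refused too.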
A senior should be able to explain this to a peer without jargon. You pick a signal that maps to user harm. You define what you do when it moves. You refuse to renegotiate in the middle of the incident. You do the uncomfortable follow-up work to make the decision easier next time.
Sanity checks before you declare the system healthy
You can validate whether your measurements are pointing at the right reality with three questions.
What decision are you trying to make faster, and why does it matter? What signal proves user harm rather than internal discomfort? What will you stop doing when risk is unaffordable, and who has the authority to enforce that stop?
If you cannot answer those cleanly, the on-call system is already drifting. You are just early enough that it still feels normal.