Mean time to detect sounds simple: the average time between when a problem starts and when your team knows about it. The formula is short. The concept is obvious. And yet most SRE teams that track MTTD are measuring it in a way that obscures the actual problem.
This page covers how MTTD works, where the common measurement mistakes are, and what actually moves the number.
What MTTD Measures (and What It Doesn’t)
MTTD (Mean Time to Detect) is the average elapsed time between incident start and incident detection — the moment your monitoring system, a user report, or an engineer first registers that something is wrong.
The Formula
       Total detection time across all incidents
MTTD = ─────────────────────────────────────────
            Number of incidents measured

Example:
- Incident 1: Problem started 14:00, alert fired 14:04 → 4 minutes
- Incident 2: Problem started 09:15, engineer noticed 09:47 → 32 minutes
- Incident 3: Problem started 22:30, customer report 22:41 → 11 minutes
MTTD = (4 + 32 + 11) / 3 = 15.7 minutes
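The arithmetic above can be sketched in a few lines of Python. The dates are invented to match the three incidents; only the minute gaps matter.

```python
from datetime import datetime

# Hypothetical (problem_started, detected) pairs matching the example above.
incidents = [
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 14, 4)),    # alert fired
    (datetime(2024, 1, 6, 9, 15), datetime(2024, 1, 6, 9, 47)),    # engineer noticed
    (datetime(2024, 1, 7, 22, 30), datetime(2024, 1, 7, 22, 41)),  # customer report
]

def mttd_minutes(incidents):
    """Average detection time in minutes across all incidents."""
    total_seconds = sum(
        (detected - started).total_seconds() for started, detected in incidents
    )
    return total_seconds / len(incidents) / 60

print(round(mttd_minutes(incidents), 1))  # → 15.7
```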
What MTTD Does Not Measure
MTTD only covers the detection phase. It does not include acknowledgment time (how long before an engineer picks up the alert), triage time (how long to understand what’s actually happening), or MTTR (total time to resolution).
It also does not tell you how the incident was detected. An alert that fired in 4 minutes is very different from a customer complaint that surfaced in 4 minutes, even if the MTTD number looks identical.
How MTTD Is Calculated in Practice
Most teams calculate MTTD from incident tracking data. The key inputs are:
| Field | Source | Notes |
|---|---|---|
| Incident start time | First anomalous metric timestamp | Often estimated, not exact |
| Detection timestamp | Alert fire time OR first human awareness | Varies by detection method |
| Detection method | Alert / user report / engineer observation | Track this separately |
The detection method matters more than the number. A 10-minute MTTD driven by customer complaints is a fundamentally different problem than a 10-minute MTTD from automated monitoring. Track both.
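Tracking both means segmenting, not just averaging. A minimal sketch, assuming a flat incident log with an invented (method, minutes) shape:

```python
from collections import defaultdict

# Hypothetical incident log: (detection_method, detection_minutes).
incidents = [
    ("automated", 4), ("automated", 6), ("user_report", 12),
    ("automated", 3), ("user_report", 9), ("self_discovered", 32),
]

by_method = defaultdict(list)
for method, minutes in incidents:
    by_method[method].append(minutes)

for method, times in sorted(by_method.items()):
    print(f"{method}: MTTD {sum(times) / len(times):.1f} min ({len(times)} incidents)")
```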
Why Most Teams Measure MTTD Wrong
Problem 1: Using alert fire time as incident start time
The most common mistake. Teams log “incident started when the alert fired” because that’s the data they have. But the incident started when the system first broke — which may be minutes or hours before any alert.
The gap between actual failure onset and alert fire time is called detection lag. It’s often the biggest component of your real MTTD, and it’s invisible if you’re using alert time as your start timestamp.
Fix: Use the first anomalous metric timestamp as start time, not the alert timestamp. Most monitoring platforms surface this in the incident timeline.
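A concrete illustration of detection lag, with invented timestamps: the clock should start at the first anomalous sample, not at alert fire time.

```python
from datetime import datetime

# Hypothetical timestamps for one incident. The metric went anomalous
# 22 minutes before the static-threshold alert fired.
first_anomaly = datetime(2024, 3, 1, 13, 42)
alert_fired = datetime(2024, 3, 1, 14, 4)

detection_lag_min = (alert_fired - first_anomaly).total_seconds() / 60
# If incident start is logged as `alert_fired`, these 22 minutes vanish
# from MTTD entirely.
print(f"detection lag: {detection_lag_min:.0f} min")  # → detection lag: 22 min
```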
Problem 2: Excluding user-reported incidents
Teams often track MTTD only for alerted incidents. User-reported incidents — where a customer contacted support before any alert fired — are logged separately or not counted at all.
This systematically understates MTTD and hides monitoring blind spots.
Fix: Include all incident types. Tag each with detection method (automated / user-reported / self-discovered). The distribution tells you where your coverage gaps are.
Problem 3: Averaging across incident severities
A P1 database outage and a P3 slow endpoint have very different detection requirements and very different consequences for delayed detection. Averaging them together produces a number that’s not actionable for either.
Fix: Track MTTD by severity tier. P1 MTTD should be your headline metric.
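Per-tier MTTD is the same aggregation keyed by severity instead of detection method. A sketch with invented numbers:

```python
# Hypothetical per-incident records: (severity, detection_minutes).
incidents = [("P1", 4), ("P1", 7), ("P2", 12), ("P2", 25), ("P3", 40)]

def mttd_by_severity(incidents):
    """Average detection minutes per severity tier."""
    tiers = {}
    for severity, minutes in incidents:
        tiers.setdefault(severity, []).append(minutes)
    return {sev: sum(t) / len(t) for sev, t in tiers.items()}

print(mttd_by_severity(incidents))  # → {'P1': 5.5, 'P2': 18.5, 'P3': 40.0}
```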
Problem 4: Not tracking detection method over time
MTTD as a single number tells you the average. What you actually want to know is whether your monitoring is getting better at catching things before customers do. That requires tracking detection method distribution over time — specifically, the ratio of automated detections to user-reported ones.
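One way to track that ratio, assuming a hypothetical monthly tally of incidents by detection method:

```python
# Hypothetical monthly incident counts by detection method. The healthy
# trend is the automated share rising month over month.
monthly = {
    "2024-01": {"automated": 6, "user_report": 5, "self_discovered": 1},
    "2024-02": {"automated": 9, "user_report": 3, "self_discovered": 2},
    "2024-03": {"automated": 12, "user_report": 1, "self_discovered": 1},
}

for month, counts in sorted(monthly.items()):
    automated_share = counts["automated"] / sum(counts.values())
    print(f"{month}: {automated_share:.0%} caught by automated monitoring")
```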
MTTD Benchmarks: What Good Looks Like
These are rough industry reference points, not targets. Your meaningful benchmark is your own trend over time.
| Severity | Weak | Acceptable | Strong |
|---|---|---|---|
| P1 | > 15 min | 5–15 min | < 5 min |
| P2 | > 30 min | 10–30 min | < 10 min |
| P3 | > 60 min | 20–60 min | < 20 min |
Teams with mature AIOps tooling and dense instrumentation often achieve sub-2-minute P1 MTTD for well-understood failure modes. Novel failure modes — new services, unexpected interaction effects — will always lag.
What Actually Reduces MTTD
1. Instrument leading indicators, not more trailing ones
The instinct when MTTD is high is to add more alerting. This usually makes things worse. More alerts means more noise, which means engineers start ignoring them, which increases effective MTTD.
The right approach is to identify the leading indicators for your most common P1 and P2 failure modes — the metrics that move before the service degrades visibly — and alert on those.
For a web service, “error rate > 1% for 2 minutes” is a trailing indicator. “95th percentile response time increasing 40% over baseline for 3 minutes” is a leading indicator that fires earlier with more diagnostic context.
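A sketch of that leading-indicator rule: fire when p95 latency has sat at least 40% above baseline for three consecutive one-minute samples. The baseline value and samples here are invented.

```python
baseline_p95_ms = 120.0
recent_p95_samples = [170.0, 175.0, 182.0]  # last three 1-minute samples

def leading_indicator_fired(samples, baseline_ms, ratio=1.4, window=3):
    """True when the last `window` samples all sit >= ratio * baseline."""
    if len(samples) < window:
        return False
    return all(s >= baseline_ms * ratio for s in samples[-window:])

print(leading_indicator_fired(recent_p95_samples, baseline_p95_ms))  # → True
```

Requiring all three samples to exceed the threshold trades a couple of minutes of latency for far fewer false positives from single noisy samples.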
2. Reduce detection lag with anomaly detection baselines
Static thresholds miss gradual degradation. A service that normally runs at 40ms p99 and slowly climbs to 400ms over two hours won’t cross most static alert thresholds until it’s already a P1.
ML-based AIOps anomaly detection establishes dynamic baselines and alerts when behavior deviates significantly from the expected pattern — catching the slow degradation that static thresholds miss. Tools that implement this well include Datadog Watchdog, Dynatrace Davis, and New Relic anomaly detection.
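The vendor implementations are proprietary, but the core idea can be illustrated with a toy rolling baseline: flag a sample that deviates more than a few standard deviations from recent history.

```python
import statistics

def is_anomalous(history, sample, window=60, threshold=3.0):
    """Toy dynamic baseline: flag samples more than `threshold` standard
    deviations from the mean of the last `window` observations. Real AIOps
    models are far more sophisticated (seasonality, trend, correlation)."""
    recent = history[-window:]
    if len(recent) < 2:
        return False
    mean = statistics.fmean(recent)
    stdev = statistics.stdev(recent)
    if stdev == 0:
        return sample != mean
    return abs(sample - mean) / stdev > threshold

# A p99 that normally sits near 40ms: 400ms is wildly outside the dynamic
# baseline even though it may never cross a static 500ms threshold.
history = [38.0, 41.0, 40.0, 39.5, 42.0, 40.5, 39.0, 41.5]
print(is_anomalous(history, 400.0))  # → True
print(is_anomalous(history, 43.0))   # → False
```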
3. Add synthetic monitors for user-facing transactions
Synthetic monitoring simulates real user transactions on a scheduled interval. It detects failures in critical user paths even when internal metrics look healthy.
A payment service can show healthy CPU, memory, and error rates while synthetic checkout transactions are silently failing. Without synthetic monitoring, that failure shows up in your MTTD as a customer complaint.
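A minimal synthetic monitor is just a scheduled loop around a scripted transaction. `run_transaction` and `page` here are hypothetical hooks you would wire to your checkout script and paging system.

```python
import time

def synthetic_monitor(run_transaction, page, interval_s=60, threshold=2,
                      max_checks=None):
    """Run the scripted transaction every `interval_s` seconds; page after
    `threshold` consecutive failures. `max_checks` bounds the loop for
    testing (None = run forever)."""
    failures = checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            run_transaction()  # e.g. a scripted checkout against production
            failures = 0
        except Exception as exc:
            failures += 1
            if failures >= threshold:
                page(f"synthetic transaction failing ({failures}x): {exc}")
        time.sleep(interval_s)
```

Paging only after two consecutive failures avoids waking someone for a single flaky run, at the cost of one extra interval of detection time.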
4. Reduce the alert-to-awareness gap
MTTD technically ends at detection (when the system knows), not at acknowledgment (when a human knows). But the practically important number is how long before an engineer is actively working the incident.
Reducing this requires clear on-call rotation ownership, tested escalation paths, and alerts that include enough context for an engineer to immediately understand what’s happening. An alert that says “CPU high” is not actionable. An alert that says “CPU > 90% on payment-api-prod-3, correlating with p99 latency spike, 2 recent deploys in last 6 hours” gives the on-call a starting point.
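The contrast between those two alerts, written out as payloads. Field names and values are illustrative, not any particular tool's schema.

```python
# Not actionable: no host, no correlated signals, no starting point.
bad_alert = {"summary": "CPU high"}

# Actionable: names the host, correlates signals, lists recent changes.
good_alert = {
    "summary": "CPU > 90% on payment-api-prod-3",
    "correlated_signals": ["p99 latency spike on /checkout"],
    "recent_changes": [
        "deploy payment-api v2.41 (3h ago)",
        "deploy payment-api v2.40 (5h ago)",
    ],
    "runbook": "https://wiki.example.com/runbooks/payment-api-cpu",  # hypothetical URL
}
```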
5. Close coverage gaps through user-reported incident reviews
Every user-reported incident is evidence of a monitoring blind spot. After any incident detected by customers before your alerting, run a specific analysis on the detection failure: what signal existed that an alert could have caught, why didn’t existing alerts cover it, and what would need to change to detect this in under 5 minutes next time.
Make monitoring improvement a permanent part of your postmortem process, not a one-off reaction.
MTTD vs. MTTR: Which to Optimize First
| | MTTD | MTTR |
|---|---|---|
| Measures | Speed of detection | Speed of resolution |
| Main lever | Monitoring quality | Runbook quality + tooling |
| Typical range | Minutes | Minutes to hours |
| Impact | Earlier response, less blast radius | Faster recovery, less downtime |
For most teams, MTTD is the higher-leverage optimization early in maturity. If incidents aren’t being caught quickly, even the best SRE runbook template only helps after significant damage has already occurred. Get P1 MTTD under 5 minutes before optimizing MTTR.
MTTD in the Full Incident Metrics Chain
MTTD doesn’t stand alone. It’s one phase in a sequence:
Failure onset
↓ [MTTD — mean time to detect]
Detection
↓ [MTTA — mean time to acknowledge]
Acknowledgment
↓ [Mean time to triage]
Triage complete
↓ [Mean time to resolve]
Resolution
Track each phase separately. When overall incident time is high, you need to know which phase is the bottleneck. Teams that only track MTTD + MTTR often can’t tell whether long resolution times are caused by slow detection, slow acknowledgment, or slow actual fixing. The full picture lives in SRE KPIs and connects back to your AIOps for SRE strategy.
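Tracking each phase separately is mechanical once every incident records the five milestone timestamps. A sketch with an invented single-incident timeline:

```python
from datetime import datetime

# Hypothetical single-incident timeline covering all five milestones.
timeline = {
    "failure_onset":   datetime(2024, 4, 2, 10, 0),
    "detected":        datetime(2024, 4, 2, 10, 6),
    "acknowledged":    datetime(2024, 4, 2, 10, 11),
    "triage_complete": datetime(2024, 4, 2, 10, 30),
    "resolved":        datetime(2024, 4, 2, 11, 15),
}

phases = [
    ("detect (MTTD)", "failure_onset", "detected"),
    ("acknowledge (MTTA)", "detected", "acknowledged"),
    ("triage", "acknowledged", "triage_complete"),
    ("resolve", "triage_complete", "resolved"),
]

for name, start, end in phases:
    minutes = (timeline[end] - timeline[start]).total_seconds() / 60
    print(f"{name}: {minutes:.0f} min")
```

In this invented incident, triage (19 min) and fixing (45 min) dominate detection (6 min) — a split that is invisible if you only log MTTD and MTTR.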
FAQ
What’s a good MTTD target?
For P1 incidents, under 5 minutes is a reasonable target for services with mature monitoring. Under 2 minutes is achievable for well-instrumented services with synthetic monitoring. Don’t compare your number to industry peers — compare your trend over time.
How do I track MTTD in PagerDuty?
PagerDuty logs alert time automatically. For accurate MTTD, you’ll need to manually record or import the failure onset time in the incident timeline. Some teams use custom fields or integrations with Datadog to pull first-anomaly timestamps automatically.
Does MTTD include time before the alert fires?
It should. Alert fire time minus incident start time (the detection lag) is often the largest and most actionable component of MTTD. If you’re only measuring from alert fire time, you’re measuring acknowledgment speed, not actual detection quality.
What’s the difference between MTTD and MTTA?
MTTD (Mean Time to Detect) = time from failure to first detection by any means. MTTA (Mean Time to Acknowledge) = time from alert fire to an engineer acknowledging the alert. They measure different parts of the incident lifecycle and should be tracked separately.
How does AIOps improve MTTD?
AIOps tools improve MTTD primarily through better anomaly detection (catching deviations before they become user-visible), alert correlation (reducing noise so engineers act on fewer, higher-quality signals), and predictive monitoring. The connection between AIOps and incident management is covered in depth separately.