Mean time to detect sounds simple: the average time between when a problem starts and when your team knows about it. The formula is short. The concept is obvious. And yet most SRE teams that track MTTD are measuring it in a way that obscures the actual problem.
This page covers how MTTD works, where the common measurement mistakes are, and what actually moves the number.
What MTTD Measures (and What It Doesn’t)
MTTD (Mean Time to Detect) is the average elapsed time between incident start and incident detection — the moment your monitoring system, a user report, or an engineer first registers that something is wrong.
The Formula
       Total detection time across all incidents
MTTD = ─────────────────────────────────────────
            Number of incidents measured

Example:
- Incident 1: Problem started 14:00, alert fired 14:04 → 4 minutes
- Incident 2: Problem started 09:15, engineer noticed 09:47 → 32 minutes
- Incident 3: Problem started 22:30, customer report 22:41 → 11 minutes
MTTD = (4 + 32 + 11) / 3 = 15.7 minutes
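The arithmetic above can be sketched in a few lines of Python. The dates are invented to match the three incidents; only the minute gaps matter.

```python
from datetime import datetime

# Hypothetical (problem_started, detected) pairs matching the example above.
incidents = [
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 14, 4)),    # alert fired
    (datetime(2024, 1, 6, 9, 15), datetime(2024, 1, 6, 9, 47)),    # engineer noticed
    (datetime(2024, 1, 7, 22, 30), datetime(2024, 1, 7, 22, 41)),  # customer report
]

def mttd_minutes(incidents):
    """Average detection time in minutes across all incidents."""
    total_seconds = sum(
        (detected - started).total_seconds() for started, detected in incidents
    )
    return total_seconds / len(incidents) / 60

print(round(mttd_minutes(incidents), 1))  # → 15.7
```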
What MTTD Does Not Measure
MTTD only covers the detection phase. It does not include acknowledgment time (how long before an engineer picks up the alert), triage time (how long to understand what’s actually happening), or MTTR (total time to resolution).
It also does not tell you how the incident was detected. An alert that fired in 4 minutes is very different from a customer complaint that surfaced in 4 minutes, even if the MTTD number looks identical.
How MTTD Is Calculated in Practice
Most teams calculate MTTD from incident tracking data. The key inputs are:
| Field | Source | Notes |
|---|---|---|
| Incident start time | First anomalous metric timestamp | Often estimated, not exact |
| Detection timestamp | Alert fire time OR first human awareness | Varies by detection method |
| Detection method | Alert / user report / engineer observation | Track this separately |
The detection method matters more than the number. A 10-minute MTTD driven by customer complaints is a fundamentally different problem than a 10-minute MTTD from automated monitoring. Track both.
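Tracking both means segmenting, not just averaging. A minimal sketch, assuming a flat incident log with an invented (method, minutes) shape:

```python
from collections import defaultdict

# Hypothetical incident log: (detection_method, detection_minutes).
incidents = [
    ("automated", 4), ("automated", 6), ("user_report", 12),
    ("automated", 3), ("user_report", 9), ("self_discovered", 32),
]

by_method = defaultdict(list)
for method, minutes in incidents:
    by_method[method].append(minutes)

for method, times in sorted(by_method.items()):
    print(f"{method}: MTTD {sum(times) / len(times):.1f} min ({len(times)} incidents)")
```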
Why Most Teams Measure MTTD Wrong
Problem 1: Using alert fire time as incident start time
The most common mistake. Teams log “incident started when the alert fired” because that’s the data they have. But the incident started when the system first broke — which may be minutes or hours before any alert.
The gap between actual failure onset and alert fire time is called detection lag. It’s often the biggest component of your real MTTD, and it’s invisible if you’re using alert time as your start timestamp.
Fix: Use the first anomalous metric timestamp as start time, not the alert timestamp. Most monitoring platforms surface this in the incident timeline.
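A concrete illustration of detection lag, with invented timestamps: the clock should start at the first anomalous sample, not at alert fire time.

```python
from datetime import datetime

# Hypothetical timestamps for one incident. The metric went anomalous
# 22 minutes before the static-threshold alert fired.
first_anomaly = datetime(2024, 3, 1, 13, 42)
alert_fired = datetime(2024, 3, 1, 14, 4)

detection_lag_min = (alert_fired - first_anomaly).total_seconds() / 60
# If incident start is logged as `alert_fired`, these 22 minutes vanish
# from MTTD entirely.
print(f"detection lag: {detection_lag_min:.0f} min")  # → detection lag: 22 min
```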
Problem 2: Excluding user-reported incidents
Teams often track MTTD only for alerted incidents. User-reported incidents — where a customer contacted support before any alert fired — are logged separately or not counted at all.
This systematically understates MTTD and hides monitoring blind spots.
Fix: Include all incident types. Tag each with detection method (automated / user-reported / self-discovered). The distribution tells you where your coverage gaps are.
Problem 3: Averaging across incident severities
A P1 database outage and a P3 slow endpoint have very different detection requirements and very different consequences for delayed detection. Averaging them together produces a number that’s not actionable for either.
Fix: Track MTTD by severity tier. P1 MTTD should be your headline metric.
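Per-tier MTTD is the same aggregation keyed by severity instead of detection method. A sketch with invented numbers:

```python
# Hypothetical per-incident records: (severity, detection_minutes).
incidents = [("P1", 4), ("P1", 7), ("P2", 12), ("P2", 25), ("P3", 40)]

def mttd_by_severity(incidents):
    """Average detection minutes per severity tier."""
    tiers = {}
    for severity, minutes in incidents:
        tiers.setdefault(severity, []).append(minutes)
    return {sev: sum(t) / len(t) for sev, t in tiers.items()}

print(mttd_by_severity(incidents))  # → {'P1': 5.5, 'P2': 18.5, 'P3': 40.0}
```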
Problem 4: Not tracking detection method over time
MTTD as a single number tells you the average. What you actually want to know is whether your monitoring is getting better at catching things before customers do. That requires tracking detection method distribution over time — specifically, the ratio of automated detections to user-reported ones.
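One way to track that ratio, assuming a hypothetical monthly tally of incidents by detection method:

```python
# Hypothetical monthly incident counts by detection method. The healthy
# trend is the automated share rising month over month.
monthly = {
    "2024-01": {"automated": 6, "user_report": 5, "self_discovered": 1},
    "2024-02": {"automated": 9, "user_report": 3, "self_discovered": 2},
    "2024-03": {"automated": 12, "user_report": 1, "self_discovered": 1},
}

for month, counts in sorted(monthly.items()):
    automated_share = counts["automated"] / sum(counts.values())
    print(f"{month}: {automated_share:.0%} caught by automated monitoring")
```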
MTTD Benchmarks: What Good Looks Like
These are rough industry reference points, not targets. Your meaningful benchmark is your own trend over time.
| Severity | Weak | Acceptable | Strong |
|---|---|---|---|
| P1 | > 15 min | 5–15 min | < 5 min |
| P2 | > 30 min | 10–30 min | < 10 min |
| P3 | > 60 min | 20–60 min | < 20 min |
Teams with mature AIOps tooling and dense instrumentation often achieve sub-2-minute P1 MTTD for well-understood failure modes. Novel failure modes — new services, unexpected interaction effects — will always lag.
What Actually Reduces MTTD
1. Instrument leading indicators, not more trailing ones
The instinct when MTTD is high is to add more alerting. This usually makes things worse. More alerts means more noise, which means engineers start ignoring them, which increases effective MTTD.
The right approach is to identify the leading indicators for your most common P1 and P2 failure modes — the metrics that move before the service degrades visibly — and alert on those.
For a web service, “error rate > 1% for 2 minutes” is a trailing indicator. “95th percentile response time increasing 40% over baseline for 3 minutes” is a leading indicator that fires earlier with more diagnostic context.
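A sketch of that leading-indicator rule: fire when p95 latency has sat at least 40% above baseline for three consecutive one-minute samples. The baseline value and samples here are invented.

```python
baseline_p95_ms = 120.0
recent_p95_samples = [170.0, 175.0, 182.0]  # last three 1-minute samples

def leading_indicator_fired(samples, baseline_ms, ratio=1.4, window=3):
    """True when the last `window` samples all sit >= ratio * baseline."""
    if len(samples) < window:
        return False
    return all(s >= baseline_ms * ratio for s in samples[-window:])

print(leading_indicator_fired(recent_p95_samples, baseline_p95_ms))  # → True
```

Requiring all three samples to exceed the threshold trades a couple of minutes of latency for far fewer false positives from single noisy samples.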
2. Reduce detection lag with anomaly detection baselines
Static thresholds miss gradual degradation. A service that normally runs at 40ms p99 and slowly climbs to 400ms over two hours won’t cross most static alert thresholds until it’s already a P1.
ML-based AIOps anomaly detection establishes dynamic baselines and alerts when behavior deviates significantly from the expected pattern — catching the slow degradation that static thresholds miss. Tools that implement this well include Datadog Watchdog, Dynatrace Davis, and New Relic anomaly detection.
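The vendor implementations are proprietary, but the core idea can be illustrated with a toy rolling baseline: flag a sample that deviates more than a few standard deviations from recent history.

```python
import statistics

def is_anomalous(history, sample, window=60, threshold=3.0):
    """Toy dynamic baseline: flag samples more than `threshold` standard
    deviations from the mean of the last `window` observations. Real AIOps
    models are far more sophisticated (seasonality, trend, correlation)."""
    recent = history[-window:]
    if len(recent) < 2:
        return False
    mean = statistics.fmean(recent)
    stdev = statistics.stdev(recent)
    if stdev == 0:
        return sample != mean
    return abs(sample - mean) / stdev > threshold

# A p99 that normally sits near 40ms: 400ms is wildly outside the dynamic
# baseline even though it may never cross a static 500ms threshold.
history = [38.0, 41.0, 40.0, 39.5, 42.0, 40.5, 39.0, 41.5]
print(is_anomalous(history, 400.0))  # → True
print(is_anomalous(history, 43.0))   # → False
```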
3. Add synthetic monitors for user-facing transactions
Synthetic monitoring simulates real user transactions on a scheduled interval. It detects failures in critical user paths even when internal metrics look healthy.
A payment service can show healthy CPU, memory, and error rates while synthetic checkout transactions are silently failing. Without synthetic monitoring, that failure shows up in your MTTD as a customer complaint.
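A minimal synthetic monitor is just a scheduled loop around a scripted transaction. `run_transaction` and `page` here are hypothetical hooks you would wire to your checkout script and paging system.

```python
import time

def synthetic_monitor(run_transaction, page, interval_s=60, threshold=2,
                      max_checks=None):
    """Run the scripted transaction every `interval_s` seconds; page after
    `threshold` consecutive failures. `max_checks` bounds the loop for
    testing (None = run forever)."""
    failures = checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            run_transaction()  # e.g. a scripted checkout against production
            failures = 0
        except Exception as exc:
            failures += 1
            if failures >= threshold:
                page(f"synthetic transaction failing ({failures}x): {exc}")
        time.sleep(interval_s)
```

Paging only after two consecutive failures avoids waking someone for a single flaky run, at the cost of one extra interval of detection time.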
4. Reduce the alert-to-awareness gap
MTTD technically ends at detection (when the system knows), not at acknowledgment (when a human knows). But the practically important number is how long before an engineer is actively working the incident.
Reducing this requires clear on-call rotation ownership, tested escalation paths, and alerts that include enough context for an engineer to immediately understand what’s happening. An alert that says “CPU high” is not actionable. An alert that says “CPU > 90% on payment-api-prod-3, correlating with p99 latency spike, 2 recent deploys in last 6 hours” gives the on-call a starting point.
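The contrast between those two alerts, written out as payloads. Field names and values are illustrative, not any particular tool's schema.

```python
# Not actionable: no host, no correlated signals, no starting point.
bad_alert = {"summary": "CPU high"}

# Actionable: names the host, correlates signals, lists recent changes.
good_alert = {
    "summary": "CPU > 90% on payment-api-prod-3",
    "correlated_signals": ["p99 latency spike on /checkout"],
    "recent_changes": [
        "deploy payment-api v2.41 (3h ago)",
        "deploy payment-api v2.40 (5h ago)",
    ],
    "runbook": "https://wiki.example.com/runbooks/payment-api-cpu",  # hypothetical URL
}
```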
5. Close coverage gaps through user-reported incident reviews
Every user-reported incident is evidence of a monitoring blind spot. After any incident detected by customers before your alerting, run a specific analysis on the detection failure: what signal existed that an alert could have caught, why didn’t existing alerts cover it, and what would need to change to detect this in under 5 minutes next time.
Make monitoring improvement a permanent part of your postmortem process, not a one-off reaction.
MTTD vs. MTTR: Which to Optimize First
| | MTTD | MTTR |
|---|---|---|
| Measures | Speed of detection | Speed of resolution |
| Main lever | Monitoring quality | Runbook quality + tooling |
| Typical range | Minutes | Minutes to hours |
| Impact | Earlier response, less blast radius | Faster recovery, less downtime |
For most teams, MTTD is the higher-leverage optimization early in maturity. If incidents aren’t being caught quickly, even the best SRE runbook template only helps after significant damage has already occurred. Get P1 MTTD under 5 minutes before optimizing MTTR.
MTTD in the Full Incident Metrics Chain
MTTD doesn’t stand alone. It’s one phase in a sequence:
Failure onset
↓ [MTTD — mean time to detect]
Detection
↓ [MTTA — mean time to acknowledge]
Acknowledgment
↓ [Mean time to triage]
Triage complete
↓ [Mean time to resolve]
Resolution
Track each phase separately. When overall incident time is high, you need to know which phase is the bottleneck. Teams that only track MTTD + MTTR often can’t tell whether long resolution times are caused by slow detection, slow acknowledgment, or slow actual fixing. The full picture lives in SRE KPIs and connects back to your AIOps for SRE strategy.
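Tracking each phase separately is mechanical once every incident records the five milestone timestamps. A sketch with an invented single-incident timeline:

```python
from datetime import datetime

# Hypothetical single-incident timeline covering all five milestones.
timeline = {
    "failure_onset":   datetime(2024, 4, 2, 10, 0),
    "detected":        datetime(2024, 4, 2, 10, 6),
    "acknowledged":    datetime(2024, 4, 2, 10, 11),
    "triage_complete": datetime(2024, 4, 2, 10, 30),
    "resolved":        datetime(2024, 4, 2, 11, 15),
}

phases = [
    ("detect (MTTD)", "failure_onset", "detected"),
    ("acknowledge (MTTA)", "detected", "acknowledged"),
    ("triage", "acknowledged", "triage_complete"),
    ("resolve", "triage_complete", "resolved"),
]

for name, start, end in phases:
    minutes = (timeline[end] - timeline[start]).total_seconds() / 60
    print(f"{name}: {minutes:.0f} min")
```

In this invented incident, triage (19 min) and fixing (45 min) dominate detection (6 min) — a split that is invisible if you only log MTTD and MTTR.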
FAQ
What’s a good MTTD target?
For P1 incidents, under 5 minutes is a reasonable target for services with mature monitoring. Under 2 minutes is achievable for well-instrumented services with synthetic monitoring. Don’t compare your number to industry peers — compare your trend over time.
How do I track MTTD in PagerDuty?
PagerDuty logs alert time automatically. For accurate MTTD, you’ll need to manually record or import the failure onset time in the incident timeline. Some teams use custom fields or integrations with Datadog to pull first-anomaly timestamps automatically.
Does MTTD include time before the alert fires?
It should. Alert fire time minus incident start time (the detection lag) is often the largest and most actionable component of MTTD. If you’re only measuring from alert fire time, you’re measuring acknowledgment speed, not actual detection quality.
What’s the difference between MTTD and MTTA?
MTTD (Mean Time to Detect) = time from failure to first detection by any means. MTTA (Mean Time to Acknowledge) = time from alert fire to an engineer acknowledging the alert. They measure different parts of the incident lifecycle and should be tracked separately.
How does AIOps improve MTTD?
AIOps tools improve MTTD primarily through better anomaly detection (catching deviations before they become user-visible), alert correlation (reducing noise so engineers act on fewer, higher-quality signals), and predictive monitoring. The connection between AIOps and incident management is covered in depth separately.