A Runbook is the difference between a 4-minute resolution and a 45-minute one at 2 AM. Not because it’s magic, but because it eliminates the cognitive load of figuring out what to do when you’re already stressed, paged, and half awake.
This page gives you a complete SRE runbook template, a real production example, a downloadable Markdown version, and answers to every common question about how to write one that holds up under pressure.
What Is a Runbook (and Why Most Are Too Vague to Use)
A runbook is a documented procedure for responding to a specific operational event — typically an incident, degraded service, or scheduled maintenance task. It tells an on-call engineer exactly what steps to take, in what order, and what to do when those steps don’t work.
The problem with most runbooks is that they’re written for the person who already knows how to handle the incident. “Check the database” is not a step. “Run SHOW PROCESSLIST on the read replica and look for queries running longer than 30 seconds” is a step.
Effective runbooks are written for the engineer who has never seen this failure before.
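As a hypothetical illustration of the difference, the "queries running longer than 30 seconds" check can be written as an executable step. The sample rows below stand in for real SHOW PROCESSLIST output (id, user, time, query), so the sketch runs without a database:

```shell
# Simulated `mysql -B -e "SHOW PROCESSLIST"` output — invented sample data,
# comma-separated here for readability.
sample_processlist() {
  cat <<'EOF'
101,app,2,SELECT 1
102,app,45,SELECT * FROM orders
103,app,120,UPDATE inventory
EOF
}

# The concrete step: list the IDs of queries running longer than 30 seconds.
long_queries() {
  sample_processlist | awk -F',' '$3 > 30 {print $1}'
}

long_queries   # prints 102 and 103
```

The point is the shape, not the data: a usable step names the command, the threshold, and what a bad result looks like.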
Runbook vs. Playbook: The Distinction That Actually Matters
These terms are used interchangeably, but in practice they describe different scopes:
| | Runbook | Playbook |
|---|---|---|
| Scope | Single service or failure type | Broader incident category |
| Audience | On-call engineer | Incident commander + team |
| Depth | Step-by-step commands | Decision trees and escalation flows |
| Example | “How to restart the payment service” | “How to manage a P1 data outage” |
A playbook tells you what kind of response to run. A runbook tells you how to execute the specific procedure. For SRE teams, you typically maintain both: a small set of playbooks covering major failure categories, and a larger library of runbooks for individual services and common failure modes.
The SRE Runbook Template: All 7 Sections
A complete runbook includes all seven sections below. The ones marked [required] are non-negotiable — without them the runbook fails under real incident conditions.
Section 1: Incident Metadata [required]
Runbook Name:
Service / Component:
Severity Level: P1 / P2 / P3
Last Updated:
Owner (team or individual):
Related Alerts: [alert name in PagerDuty/Datadog/etc.]
Related Runbooks: [links to related procedures]

Without metadata, runbooks go stale without anyone noticing. The “Last Updated” and “Owner” fields create accountability. “Related Alerts” ensures the runbook surfaces automatically when the right alert fires.
Section 2: Symptoms and Signals [required]
What the alert says:
What the user experiences:
Key metrics to check:
- Metric 1: [name, location, what abnormal looks like]
- Metric 2:
- Metric 3:
Dashboards: [direct links — not "go find the dashboard"]
Logs: [query or filter to run, location]

Write this section for the first five minutes of an incident. The engineer needs to quickly confirm they’re dealing with what the alert claims — not a false positive, not a different underlying issue.
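A symptom-confirmation check might look like the following sketch. The log lines, error string, and threshold are invented stand-ins for the real log query:

```shell
# Hypothetical stand-in for the real log file and filter — in practice this
# would be a grep over /var/log/... or a query in your log aggregator.
recent_errors() {
  cat <<'EOF'
02:00:01 ERROR upstream timeout
02:00:02 INFO request ok
02:00:03 ERROR upstream timeout
02:00:04 ERROR upstream timeout
EOF
}

# Confirm the symptom rather than trusting the alert blindly.
count=$(recent_errors | grep -c "ERROR upstream timeout")
if [ "$count" -ge 3 ]; then
  echo "symptom confirmed: $count matching errors"
else
  echo "possible false positive: only $count matching errors"
fi
```

Encoding the threshold in the runbook means the on-call engineer doesn’t have to decide what “a lot of errors” means at 2 AM.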
Section 3: Initial Response Steps [required]
1. Acknowledge the alert in [tool name]
2. Join the incident channel: #incidents-[service]
3. Assign incident commander if severity is P1
4. Confirm the symptom by running: [specific command or check]
5. Post initial status update to [Slack / status page] within 5 minutes

This section should take less than 5 minutes to execute. It’s not about fixing the problem — it’s about confirming the problem and making the right people aware.
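Step 5 can even be templated, so the first update is consistent under stress. A minimal sketch, with a hypothetical service name and impact:

```shell
# Sketch: fill in the initial status update template with the facts at hand.
# Service name, impact text, and ETA below are hypothetical examples.
status_update() {
  service="$1"; impact="$2"; eta_min="$3"
  printf 'We are investigating %s degradation. Impact: %s. ETA for update: %s minutes.\n' \
    "$service" "$impact" "$eta_min"
}

status_update "payment-api" "checkout 503s" 15
```

This matches the stakeholder-communication format in Section 6, so the same wording carries through the whole incident.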
Section 4: Diagnostic Procedures [required]
This is the core of the runbook. Structure it as a decision tree where possible.
Step 1: Check service health
Command: [exact command]
If healthy → go to Step 3
If degraded → go to Step 2
Step 2: Check upstream dependencies
Dependency A: [how to check]
Dependency B: [how to check]
If dependency is root cause → escalate to [team], link incident
Step 3: Check recent deployments
Command: [exact command or link to CI/CD system]
If deployment within last 2 hours → consider rollback (see Section 5)
Step 4: Review error logs
Log location: [path or query]
Filter: [exact filter string]
Look for: [specific error patterns]

Avoid vague steps. “Check the logs” is not a step. Include the exact query, the exact path, and what a bad result looks like.
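The decision tree above can be sketched as a script skeleton. The checks here are stubs — a real runbook would replace them with the exact curl, mysql, or kubectl commands:

```shell
# Stub: pretend the health check reported "degraded". In a real runbook this
# would be e.g. a curl against the service's health endpoint.
check_service_health() { echo degraded; }

# Route exactly as the decision tree says: healthy -> Step 3, degraded -> Step 2.
case "$(check_service_health)" in
  healthy)  step=check_recent_deployments ;;
  degraded) step=check_upstream_dependencies ;;
esac

echo "next: $step"
```

Even this much structure helps: the engineer never has to ask “which step do I run next?” — the branch conditions are written down.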
Section 5: Mitigation and Resolution
Option A – Restart the service (safe, under 30 seconds downtime):
Command: [exact command]
Verify with: [command to confirm service is healthy]
Expected outcome: [what you should see]
Option B – Rollback last deployment:
Command: [exact command or link to pipeline]
Time required: ~8 minutes
Verify with: [health check command]
Option C – Scale up replicas (for load-related issues):
Command: [exact command]
Max safe replicas: [number]
Verify with: [command]
Option D – Escalate (if none of the above work):
Escalate to: [name/team]
Contact method: [Slack handle / PagerDuty policy]
What to tell them: [context to provide]

Section 6: Escalation Paths
P1 escalation (within 15 minutes if not resolved):
Primary: [name / role]
Secondary: [name / role]
Contact: [Slack / PagerDuty]
Stakeholder communication:
Engineering lead: [how to notify]
Customer-facing update: [who posts to status page]
Format: "We are investigating [service] degradation.
Impact: [X]. ETA for update: [Y]."

Section 7: Post-Incident Actions
Immediately after resolution:
- Post all-clear to incident channel
- Update status page
- Record resolution time and method in incident tracker
Within 24 hours:
- Create postmortem ticket
- Capture: timeline, root cause hypothesis, contributing factors
Within 1 week:
- Postmortem review meeting
- Update this runbook with anything that was missing or wrong
- File action items for any gaps found

Real Runbook Example: Database Connection Pool Exhaustion
Here’s what the key sections look like filled in for a failure mode that most production SRE teams have encountered.
Failure: Application servers can’t acquire database connections. Requests time out. Error: connection pool exhausted or too many connections.
Section 2 (Symptoms) — filled in
- Alert fires: db-connection-pool-utilization > 90% for 5 minutes
- User impact: 503 errors on checkout and account pages
- Metrics to check: db_pool_active_connections, db_pool_waiting_requests in Datadog
- Logs: grep "connection pool" /var/log/app/error.log | tail -100
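As a sanity check on the alert’s math: pool utilization is just active connections over pool size. A sketch with hypothetical numbers:

```shell
# Hypothetical values — in practice these come from the db_pool_* metrics.
active=188
pool_size=200

# Integer-percentage utilization, the quantity the alert thresholds on.
util=$(( active * 100 / pool_size ))
echo "utilization: ${util}%"

[ "$util" -gt 90 ] && echo "pool exhaustion likely"
```

If the computed figure disagrees with the dashboard, suspect a stale metric or a misconfigured pool size before acting.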
Section 4 (Diagnostics) — filled in
Step 1: Confirm pool exhaustion
Run: SHOW STATUS LIKE 'Threads_connected';
If Threads_connected near max_connections → pool is the bottleneck
Step 2: Find long-running queries holding connections
Run: SELECT * FROM information_schema.processlist
WHERE TIME > 30 ORDER BY TIME DESC;
If results → these queries are holding connections open
Step 3: Check for recent traffic spike
Dashboard: [link to request-rate graph]
If spike → scaling may be needed (Option C below)
Step 4: Check for connection leak
Metric: db_pool_idle_connections should be > 0
If 0 idle with low traffic → likely a leak, escalate to app team

Section 5 (Resolution) — filled in
Option A – Kill long-running queries (under 2 min impact):
KILL QUERY [process_id]; -- run for each query from Step 2
Verify: SHOW STATUS LIKE 'Threads_connected'; should drop
Option B – Temporarily increase max connections:
SET GLOBAL max_connections = 300; -- default is usually 151
Note: Temporary fix. File a ticket to investigate root cause.
Option C – Scale app servers to distribute connection load:
kubectl scale deployment/api --replicas=6
Verify: watch kubectl get pods -l app=api
Option D – Escalate to DBA:
Slack: #oncall-database
Context: "Connection pool exhausted since [time].
Threads_connected=[X], long queries: [yes/no],
traffic spike: [yes/no]."

Downloadable Runbook Template (Markdown)
Copy the block below and save it as runbook-[service-name].md in your team’s runbook repo.
# Runbook: [Incident Type]
**Service:**
**Severity:** P1 / P2 / P3
**Owner:**
**Last Updated:**
**Related Alerts:**
---
## Symptoms
- Alert message:
- User impact:
- Key metrics:
- Dashboard links:
- Log query:
---
## Initial Response (first 5 minutes)
1. Acknowledge alert
2. Join #incidents-[service]
3. Assign IC if P1
4. Confirm symptom with: [command]
5. Post status update
---
## Diagnostics
Step 1: [what to check] → [command]
- Healthy: go to Step X
- Degraded: go to Step Y
---
## Mitigation Options
### Option A: [Name] — [time, impact]
Command:
Verify with:
### Escalation
Escalate to: [team]
Via: [Slack/PD]
Tell them: [context template]
---
## Post-Incident
- [ ] All-clear posted
- [ ] Status page updated
- [ ] Postmortem ticket created
- [ ] This runbook updated

How to Write a Runbook That People Actually Use Under Pressure
The runbooks that get ignored are the ones written after incidents are over, by the engineer who resolved it, for an audience of one. Here’s how to write for the engineer who gets paged at 3 AM and has never seen this failure:
Write commands, not descriptions. “Restart the service” is not a step. systemctl restart api-server && systemctl status api-server is a step.
Include what “success” looks like. After every resolution step, specify what healthy output looks like. Engineers under stress can’t always tell if their fix worked.
Test it with someone who wasn’t involved. Give the runbook to a junior engineer or someone from a different team. If they get stuck, the runbook has a gap.
Keep it under one screen per section. If diagnostic steps scroll off the page, they won’t be followed correctly under pressure. Summarize, link to detail.
Embed dashboard links, not instructions to find dashboards. Every second spent navigating monitoring tools is a second not spent fixing the problem.
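The “include what ‘success’ looks like” advice can be made mechanical: pair each fix command with a verification that compares actual output to the documented healthy value. A sketch, with systemctl stubbed out so it runs anywhere:

```shell
# Stub standing in for: systemctl status api-server (hypothetical service).
service_status() { echo "active (running)"; }

# The runbook documents the exact healthy output, so the check is a string
# comparison rather than a judgment call under stress.
expected="active (running)"
actual="$(service_status)"

if [ "$actual" = "$expected" ]; then
  echo "fix verified"
else
  echo "NOT healthy: $actual"
fi
```

Writing the expected output down once, calmly, saves the on-call engineer from having to recognize “healthy” by eye mid-incident.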
Common Runbook Anti-Patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| “Check the logs” | No location, no filter, no expected output | Specify log path, query, and what bad looks like |
| “Contact the team” | Who? How? When? | Name, Slack handle, PagerDuty policy |
| Runbook not linked to alert | On-call has to find it mid-incident | Map runbook URL to alert in your alerting tool |
| Never updated after incidents | Stale commands, dead links | Add runbook review to postmortem checklist |
| Assumes service familiarity | Fails for new hires and rotation engineers | Write for the person who’s never seen this service |
FAQ: Runbooks for SRE Teams
What is a runbook?
A runbook is a documented set of procedures for responding to a specific operational event. It tells on-call engineers what to check, what commands to run, and what to do if those commands don’t work — without requiring them to hold that knowledge in their head.
What’s the difference between a runbook and a playbook?
A runbook covers a specific service or failure type with exact steps. A playbook covers a broader incident category with decision trees, escalation flows, and cross-team coordination. In practice: playbooks tell you which runbooks to use.
How long should a runbook be?
As short as it can be while remaining complete. Most runbooks for a single failure mode fit in one to three pages. If it’s longer, it’s probably covering too many failure modes — split it.
Where should runbooks live?
Close to the alerts that trigger them. Most teams store runbooks in a shared wiki (Confluence, Notion, or GitHub) and link directly from PagerDuty, Datadog, or whatever alerting system fires the on-call page. The runbook that requires navigation to find is the runbook that won’t get used.
How often should runbooks be updated?
After every incident where the runbook was used. The postmortem methodology checklist should include “update runbook with anything that was wrong or missing.”
Can AI write runbooks?
AI can draft a starting point from a service description or alert configuration, but the result needs to be validated by someone who has actually responded to the incident. LLM-generated runbooks often include plausible-sounding but incorrect commands. Treat AI output as a first draft, not a finished runbook. Separate pages cover how AI fits into incident management and what to understand about mean time to detect (MTTD) before optimizing response workflows; the broader picture lives in AIOps for SRE.