A Runbook is the difference between a 4-minute resolution and a 45-minute one at 2 AM. Not because it’s magic, but because it eliminates the cognitive load of figuring out what to do when you’re already stressed, paged, and half awake.
This page gives you a complete SRE runbook template, a real production example, a downloadable Markdown version, and answers to every common question about how to write one that holds up under pressure.
What Is a Runbook (and Why Most Are Too Vague to Use)
A runbook is a documented procedure for responding to a specific operational event — typically an incident, degraded service, or scheduled maintenance task. It tells an on-call engineer exactly what steps to take, in what order, and what to do when those steps don’t work.
The problem with most runbooks is that they’re written for the person who already knows how to handle the incident. “Check the database” is not a step. “Run SHOW PROCESSLIST on the read replica and look for queries running longer than 30 seconds” is a step.
Effective runbooks are written for the engineer who has never seen this failure before.
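As a hypothetical illustration of the difference, the "queries running longer than 30 seconds" check can be written as an executable step. The sample rows below stand in for real SHOW PROCESSLIST output (id, user, time, query), so the sketch runs without a database:

```shell
# Simulated `mysql -B -e "SHOW PROCESSLIST"` output — invented sample data,
# comma-separated here for readability.
sample_processlist() {
  cat <<'EOF'
101,app,2,SELECT 1
102,app,45,SELECT * FROM orders
103,app,120,UPDATE inventory
EOF
}

# The concrete step: list the IDs of queries running longer than 30 seconds.
long_queries() {
  sample_processlist | awk -F',' '$3 > 30 {print $1}'
}

long_queries   # prints 102 and 103
```

The point is the shape, not the data: a usable step names the command, the threshold, and what a bad result looks like.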
Runbook vs. Playbook: The Distinction That Actually Matters
These terms are used interchangeably, but in practice they describe different scopes:
| | Runbook | Playbook |
|---|---|---|
| Scope | Single service or failure type | Broader incident category |
| Audience | On-call engineer | Incident commander + team |
| Depth | Step-by-step commands | Decision trees and escalation flows |
| Example | “How to restart the payment service” | “How to manage a P1 data outage” |
A playbook tells you what kind of response to run. A runbook tells you how to execute the specific procedure. For SRE teams, you typically maintain both: a small set of playbooks covering major failure categories, and a larger library of runbooks for individual services and common failure modes.
The SRE Runbook Template: All 7 Sections
A complete runbook includes all seven sections below. The ones marked [required] are non-negotiable — without them the runbook fails under real incident conditions.
Section 1: Incident Metadata [required]
Runbook Name:
Service / Component:
Severity Level: P1 / P2 / P3
Last Updated:
Owner (team or individual):
Related Alerts: [alert name in PagerDuty/Datadog/etc.]
Related Runbooks: [links to related procedures]

Without metadata, runbooks go stale without anyone noticing. The “Last Updated” and “Owner” fields create accountability. “Related Alerts” ensures the runbook surfaces automatically when the right alert fires.
Section 2: Symptoms and Signals [required]
What the alert says:
What the user experiences:
Key metrics to check:
- Metric 1: [name, location, what abnormal looks like]
- Metric 2:
- Metric 3:
Dashboards: [direct links — not "go find the dashboard"]
Logs: [query or filter to run, location]

Write this section for the first five minutes of an incident. The engineer needs to quickly confirm they’re dealing with what the alert claims — not a false positive, not a different underlying issue.
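A symptom-confirmation check might look like the following sketch. The log lines, error string, and threshold are invented stand-ins for the real log query:

```shell
# Hypothetical stand-in for the real log file and filter — in practice this
# would be a grep over /var/log/... or a query in your log aggregator.
recent_errors() {
  cat <<'EOF'
02:00:01 ERROR upstream timeout
02:00:02 INFO request ok
02:00:03 ERROR upstream timeout
02:00:04 ERROR upstream timeout
EOF
}

# Confirm the symptom rather than trusting the alert blindly.
count=$(recent_errors | grep -c "ERROR upstream timeout")
if [ "$count" -ge 3 ]; then
  echo "symptom confirmed: $count matching errors"
else
  echo "possible false positive: only $count matching errors"
fi
```

Encoding the threshold in the runbook means the on-call engineer doesn’t have to decide what “a lot of errors” means at 2 AM.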
Section 3: Initial Response Steps [required]
1. Acknowledge the alert in [tool name]
2. Join the incident channel: #incidents-[service]
3. Assign incident commander if severity is P1
4. Confirm the symptom by running: [specific command or check]
5. Post initial status update to [Slack / status page] within 5 minutes

This section should take less than 5 minutes to execute. It’s not about fixing the problem — it’s about confirming the problem and making the right people aware.
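Step 5 can even be templated, so the first update is consistent under stress. A minimal sketch, with a hypothetical service name and impact:

```shell
# Sketch: fill in the initial status update template with the facts at hand.
# Service name, impact text, and ETA below are hypothetical examples.
status_update() {
  service="$1"; impact="$2"; eta_min="$3"
  printf 'We are investigating %s degradation. Impact: %s. ETA for update: %s minutes.\n' \
    "$service" "$impact" "$eta_min"
}

status_update "payment-api" "checkout 503s" 15
```

This matches the stakeholder-communication format in Section 6, so the same wording carries through the whole incident.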
Section 4: Diagnostic Procedures [required]
This is the core of the runbook. Structure it as a decision tree where possible.
Step 1: Check service health
Command: [exact command]
If healthy → go to Step 3
If degraded → go to Step 2
Step 2: Check upstream dependencies
Dependency A: [how to check]
Dependency B: [how to check]
If dependency is root cause → escalate to [team], link incident
Step 3: Check recent deployments
Command: [exact command or link to CI/CD system]
If deployment within last 2 hours → consider rollback (see Section 5)
Step 4: Review error logs
Log location: [path or query]
Filter: [exact filter string]
Look for: [specific error patterns]

Avoid vague steps. “Check the logs” is not a step. Include the exact query, the exact path, and what a bad result looks like.
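The decision tree above can be sketched as a script skeleton. The checks here are stubs — a real runbook would replace them with the exact curl, mysql, or kubectl commands:

```shell
# Stub: pretend the health check reported "degraded". In a real runbook this
# would be e.g. a curl against the service's health endpoint.
check_service_health() { echo degraded; }

# Route exactly as the decision tree says: healthy -> Step 3, degraded -> Step 2.
case "$(check_service_health)" in
  healthy)  step=check_recent_deployments ;;
  degraded) step=check_upstream_dependencies ;;
esac

echo "next: $step"
```

Even this much structure helps: the engineer never has to ask “which step do I run next?” — the branch conditions are written down.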
Section 5: Mitigation and Resolution
Option A – Restart the service (safe, under 30 seconds downtime):
Command: [exact command]
Verify with: [command to confirm service is healthy]
Expected outcome: [what you should see]
Option B – Rollback last deployment:
Command: [exact command or link to pipeline]
Time required: ~8 minutes
Verify with: [health check command]
Option C – Scale up replicas (for load-related issues):
Command: [exact command]
Max safe replicas: [number]
Verify with: [command]
Option D – Escalate (if none of the above work):
Escalate to: [name/team]
Contact method: [Slack handle / PagerDuty policy]
What to tell them: [context to provide]

Section 6: Escalation Paths
P1 escalation (within 15 minutes if not resolved):
Primary: [name / role]
Secondary: [name / role]
Contact: [Slack / PagerDuty]
Stakeholder communication:
Engineering lead: [how to notify]
Customer-facing update: [who posts to status page]
Format: "We are investigating [service] degradation.
Impact: [X]. ETA for update: [Y]."

Section 7: Post-Incident Actions
Immediately after resolution:
- Post all-clear to incident channel
- Update status page
- Record resolution time and method in incident tracker
Within 24 hours:
- Create postmortem ticket
- Capture: timeline, root cause hypothesis, contributing factors
Within 1 week:
- Postmortem review meeting
- Update this runbook with anything that was missing or wrong
- File action items for any gaps found

Real Runbook Example: Database Connection Pool Exhaustion
Here’s what the key sections look like filled in for a failure mode that most production SRE teams have encountered.
Failure: Application servers can’t acquire database connections. Requests time out. Error: connection pool exhausted or too many connections.
Section 2 (Symptoms) — filled in
- Alert fires: db-connection-pool-utilization > 90% for 5 minutes
- User impact: 503 errors on checkout and account pages
- Metrics to check: db_pool_active_connections, db_pool_waiting_requests in Datadog
- Logs: grep "connection pool" /var/log/app/error.log | tail -100
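As a sanity check on the alert’s math: pool utilization is just active connections over pool size. A sketch with hypothetical numbers:

```shell
# Hypothetical values — in practice these come from the db_pool_* metrics.
active=188
pool_size=200

# Integer-percentage utilization, the quantity the alert thresholds on.
util=$(( active * 100 / pool_size ))
echo "utilization: ${util}%"

[ "$util" -gt 90 ] && echo "pool exhaustion likely"
```

If the computed figure disagrees with the dashboard, suspect a stale metric or a misconfigured pool size before acting.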
Section 4 (Diagnostics) — filled in
Step 1: Confirm pool exhaustion
Run: SHOW STATUS LIKE 'Threads_connected';
If Threads_connected near max_connections → pool is the bottleneck
Step 2: Find long-running queries holding connections
Run: SELECT * FROM information_schema.processlist
WHERE TIME > 30 ORDER BY TIME DESC;
If results → these queries are holding connections open
Step 3: Check for recent traffic spike
Dashboard: [link to request-rate graph]
If spike → scaling may be needed (Option C below)
Step 4: Check for connection leak
Metric: db_pool_idle_connections should be > 0
If 0 idle with low traffic → likely a leak, escalate to app team

Section 5 (Resolution) — filled in
Option A – Kill long-running queries (under 2 min impact):
KILL QUERY [process_id]; -- run for each query from Step 2
Verify: SHOW STATUS LIKE 'Threads_connected'; should drop
Option B – Temporarily increase max connections:
SET GLOBAL max_connections = 300; -- default is usually 151
Note: Temporary fix. File a ticket to investigate root cause.
Option C – Scale app servers to distribute connection load:
kubectl scale deployment/api --replicas=6
Verify: watch kubectl get pods -l app=api
Option D – Escalate to DBA:
Slack: #oncall-database
Context: "Connection pool exhausted since [time].
Threads_connected=[X], long queries: [yes/no],
traffic spike: [yes/no]."

Downloadable Runbook Template (Markdown)
Copy the block below and save it as runbook-[service-name].md in your team’s runbook repo.
# Runbook: [Incident Type]
**Service:**
**Severity:** P1 / P2 / P3
**Owner:**
**Last Updated:**
**Related Alerts:**
---
## Symptoms
- Alert message:
- User impact:
- Key metrics:
- Dashboard links:
- Log query:
---
## Initial Response (first 5 minutes)
1. Acknowledge alert
2. Join #incidents-[service]
3. Assign IC if P1
4. Confirm symptom with: [command]
5. Post status update
---
## Diagnostics
Step 1: [what to check] → [command]
- Healthy: go to Step X
- Degraded: go to Step Y
---
## Mitigation Options
### Option A: [Name] — [time, impact]
Command:
Verify with:
### Escalation
Escalate to: [team]
Via: [Slack/PD]
Tell them: [context template]
---
## Post-Incident
- [ ] All-clear posted
- [ ] Status page updated
- [ ] Postmortem ticket created
- [ ] This runbook updated

How to Write a Runbook That People Actually Use Under Pressure
The runbooks that get ignored are the ones written after incidents are over, by the engineer who resolved it, for an audience of one. Here’s how to write for the engineer who gets paged at 3 AM and has never seen this failure:
Write commands, not descriptions. “Restart the service” is not a step. systemctl restart api-server && systemctl status api-server is a step.
Include what “success” looks like. After every resolution step, specify what healthy output looks like. Engineers under stress can’t always tell if their fix worked.
Test it with someone who wasn’t involved. Give the runbook to a junior engineer or someone from a different team. If they get stuck, the runbook has a gap.
Keep it under one screen per section. If diagnostic steps scroll off the page, they won’t be followed correctly under pressure. Summarize, link to detail.
Embed dashboard links, not instructions to find dashboards. Every second spent navigating monitoring tools is a second not spent fixing the problem.
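The “include what ‘success’ looks like” advice can be made mechanical: pair each fix command with a verification that compares actual output to the documented healthy value. A sketch, with systemctl stubbed out so it runs anywhere:

```shell
# Stub standing in for: systemctl status api-server (hypothetical service).
service_status() { echo "active (running)"; }

# The runbook documents the exact healthy output, so the check is a string
# comparison rather than a judgment call under stress.
expected="active (running)"
actual="$(service_status)"

if [ "$actual" = "$expected" ]; then
  echo "fix verified"
else
  echo "NOT healthy: $actual"
fi
```

Writing the expected output down once, calmly, saves the on-call engineer from having to recognize “healthy” by eye mid-incident.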
Common Runbook Anti-Patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| “Check the logs” | No location, no filter, no expected output | Specify log path, query, and what bad looks like |
| “Contact the team” | Who? How? When? | Name, Slack handle, PagerDuty policy |
| Runbook not linked to alert | On-call has to find it mid-incident | Map runbook URL to alert in your alerting tool |
| Never updated after incidents | Stale commands, dead links | Add runbook review to postmortem checklist |
| Assumes service familiarity | Fails for new hires and rotation engineers | Write for the person who’s never seen this service |
FAQ: Runbooks for SRE Teams
What is a runbook?
A runbook is a documented set of procedures for responding to a specific operational event. It tells on-call engineers what to check, what commands to run, and what to do if those commands don’t work — without requiring them to hold that knowledge in their head.
What’s the difference between a runbook and a playbook?
A runbook covers a specific service or failure type with exact steps. A playbook covers a broader incident category with decision trees, escalation flows, and cross-team coordination. In practice: playbooks tell you which runbooks to use.
How long should a runbook be?
As short as it can be while remaining complete. Most runbooks for a single failure mode fit in one to three pages. If it’s longer, it’s probably covering too many failure modes — split it.
Where should runbooks live?
Close to the alerts that trigger them. Most teams store runbooks in a shared wiki (Confluence, Notion, or GitHub) and link directly from PagerDuty, Datadog, or whatever alerting system fires the on-call page. The runbook that requires navigation to find is the runbook that won’t get used.
How often should runbooks be updated?
After every incident where the runbook was used. The postmortem methodology checklist should include “update runbook with anything that was wrong or missing.”
Can AI write runbooks?
AI can draft a starting point from a service description or alert configuration, but the result needs to be validated by someone who has actually responded to the incident. LLM-generated runbooks often include plausible-sounding but incorrect commands. Treat AI output as a first draft, not a finished runbook. Separate pages cover how AI fits into incident management and what to understand about mean time to detect (MTTD) before optimizing response workflows; the broader picture lives in AIOps for SRE.