SRE Incident Assistant: A Complete Reference
Executive Summary: The SRE Incident Assistant centralizes incident response by integrating Slack, Jira, Confluence, PagerDuty, and optionally Robusta or Prometheus (for alerting). This guide covers setup, best practices (like ephemeral Slack war rooms, slash commands, runbook automation), and postmortems.
1. Prerequisites & Overview
- Slack: Bot token & signing secret for automation and ephemeral channels.
- PagerDuty: API key for reading/updating incidents.
- Jira: Service account & API token for ticket creation.
- Confluence: Service account & API token to create/edit runbooks.
- Robusta (Optional): Enhances Kubernetes alerts with automation/playbooks.
- Prometheus (Optional): Existing pipeline sending alerts to PagerDuty.
All these tools are orchestrated by the SRE Incident Assistant, which can run as a container or VM-based service, simplifying and centralizing incident workflows.
2. Architecture & Key Components
- Incident Assistant Service: Container or VM that ties Slack, Jira, Confluence, PagerDuty, and optional Robusta/Prometheus flows into a single system.
- Optional Database Layer: Stores metadata, logs, config if needed.
- Authentication & Access Control: Typically service accounts with API keys or corporate SSO.
              +---------------------+
              |    Robusta (K8s)    |
              +----------+----------+
                         |
                         v
+----------------------------------------------------+
| SRE INCIDENT ASSISTANT (Slack, Jira, PD, Confl)    |
| Docker/K8s or VM-based; orchestrates integrations  |
+--------+-------------------+-----------------------+
         |                   |
   Slack Bot API       PagerDuty API
         |                   |
         v                   v
     Jira API      Confluence (Runbooks)
3. Environment Requirements
Requirement | Description |
---|---|
Operating System | Linux (Ubuntu 20.04+ or CentOS/RHEL 7+) |
Container Platform | Docker or Kubernetes (recommended) |
Networking | Outbound HTTPS to Slack, Jira, PagerDuty, and Confluence; inbound access if using webhooks |
SSL/TLS Certificates | Required if hosting via HTTPS |
Credentials | Slack Bot Token, Jira API token, PD API key, Confluence token, etc. |
Resource Allocation | ~2 CPU cores, 4GB RAM for the Assistant; additional resources for Robusta |
4. Quick-Start Flow
+---------------------------------------+
| Source Code / Docker Registry         |
| (Assistant Image)                     |
+-----------+---------------------------+
            |
            v
1) Build or Pull the image
            |
            v
2) Create config.yaml
            |
            v
3) Deploy to Docker or K8s
            |
            v
4) Check logs for errors
            |
            v
5) Integrate & Test (Slack, Jira, PD, Confluence, Robusta)
            |
            v
6) Confirm end-to-end
5. Installation & Setup
5.1 Build or Pull the Assistant
Option A: Pull from Registry
docker pull your-registry.com/sre-incident-assistant:latest
Option B: Build from Source
git clone https://your-git-repo/sre-incident-assistant.git
cd sre-incident-assistant
docker build -t your-registry.com/sre-incident-assistant:latest .
docker push your-registry.com/sre-incident-assistant:latest
5.2 Create a Configuration File
# config.yaml
slack:
  bot_token: "xoxb-12345..."
  signing_secret: "abc123..."
jira:
  base_url: "https://jira.yourcompany.com"
  project_key: "OPS"
  username: "jira_bot"
  api_token: "JIRA_API_TOKEN"
pagerduty:
  api_key: "PD_API_KEY"
  service_integration_key: "PD_SERVICE_KEY"
confluence:
  base_url: "https://confluence.yourcompany.com"
  username: "confluence_bot"
  api_token: "CONF_API_TOKEN"
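Before deploying, it can help to fail fast on an incomplete configuration. The sketch below assumes config.yaml has already been parsed into a dictionary (for example with a YAML library) and checks only the keys shown above. `REQUIRED_KEYS` and `missing_keys` are hypothetical names for illustration, not part of the Assistant.

```python
# Hypothetical helper: the Assistant's real config loader is not shown in this guide.
REQUIRED_KEYS = {
    "slack": ["bot_token", "signing_secret"],
    "jira": ["base_url", "project_key", "username", "api_token"],
    "pagerduty": ["api_key", "service_integration_key"],
    "confluence": ["base_url", "username", "api_token"],
}


def missing_keys(config: dict) -> list:
    """Return dotted names of required settings that are absent or empty."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            if not values.get(key):
                missing.append(f"{section}.{key}")
    return missing
```

An empty result means every setting from the example file is present; anything else can be logged and treated as a startup error.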
5.3 Deploy the Assistant
Docker Example:
docker run -d \
  --name sre_incident_assistant \
  -v /path/to/config.yaml:/app/config.yaml \
  -p 8080:8080 \
  your-registry.com/sre-incident-assistant:latest
Kubernetes Example: Create a Deployment referencing the image, and use a ConfigMap/Secret for config.yaml.
5.4 Validate Logs & Health
docker logs sre_incident_assistant -f
# or
kubectl logs -f <pod>
6. Dev/Staging Environment Strategy
- Deploy to a minimal staging environment first.
- Test Slack, Jira, Confluence (sandbox if possible), and PagerDuty integration.
- Trigger sample incidents; confirm end-to-end flow.
- Roll out to production when ready.
7. Installing & Integrating Robusta (Optional)
- Install Robusta CLI:
pip install robusta-cli
- Initialize on Kubernetes:
kubectl config use-context <your-cluster-context>
robusta install
- Configure Alerts:
- Either post to a Slack channel monitored by the Assistant
- Or use a webhook:
globalConfig:
  custom_webhook_url: "https://<assistant-host>/api/robusta"
- Define Playbooks: (e.g., High CPU → Slack + Jira + PD + Confluence link)
- Validate: Force a test alert to ensure all steps work.
8. Integrating with Existing Tools
Tool | Requirements | Testing |
---|---|---|
Slack | Slack Bot Token & Signing Secret | Mention the bot, send a test alert |
Jira | Base URL, Project Key, Service Account API Token | Trigger an incident → see Jira ticket |
Confluence | Base URL, Service Account & Token | Create/edit runbook pages |
PagerDuty | API key & Integration key | Check existing incidents or create a new one |
Robusta | Slack or custom_webhook config | Force a test alert, watch Slack/Jira/PD notifications |
9. Slack: War Rooms & Slash Commands
9.1 Ephemeral War Rooms
- Create a dedicated Slack channel, e.g. #incident-12345, for high-severity alerts
- Invite on-call engineers automatically
- Archive the channel once resolved
- Post final summary and runbook/postmortem links
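The war-room steps above depend on a predictable channel name so that repeated alerts reuse one channel. This is a minimal sketch, assuming the `incident-<id>` naming shown above and Slack's channel-name rules (lowercase letters, digits, hyphens, underscores, at most 80 characters). `war_room_name` and `pick_war_room` are hypothetical helpers, and the set of existing channel names is assumed to come from a separate Slack API lookup.

```python
import re

MAX_CHANNEL_NAME = 80  # Slack's limit for channel names


def war_room_name(incident_id: str, prefix: str = "incident") -> str:
    """Build a Slack-safe channel name such as 'incident-12345'."""
    slug = re.sub(r"[^a-z0-9_-]+", "-", incident_id.lower()).strip("-")
    return f"{prefix}-{slug}"[:MAX_CHANNEL_NAME]


def pick_war_room(incident_id: str, existing_channels: set) -> tuple:
    """Return (channel_name, needs_creation); reuse a channel that already exists."""
    name = war_room_name(incident_id)
    return name, name not in existing_channels
```

Keeping the naming logic in a pure function makes it easy to test without calling Slack.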
9.2 Slash Commands
- /incident create-runbook: Creates or links a Confluence runbook
- /incident pd-note <message>: Adds a note to PagerDuty
- /incident summary: Shows incident severity, assigned SRE, runbook links, etc.
Register these commands in Slack App settings to point to the Assistant’s endpoints.
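Slack delivers a slash command as a form-encoded POST whose `text` field holds everything typed after the command name. The sketch below shows one way the three subcommands could be parsed; the function names and the error handling are assumptions, not the Assistant's actual handler.

```python
from urllib.parse import parse_qs

KNOWN_SUBCOMMANDS = {"create-runbook", "pd-note", "summary"}


def parse_incident_command(text: str) -> tuple:
    """Split the text after '/incident' into (subcommand, argument)."""
    subcommand, _, argument = text.strip().partition(" ")
    argument = argument.strip()
    if subcommand not in KNOWN_SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    if subcommand == "pd-note" and not argument:
        raise ValueError("pd-note requires a message")
    return subcommand, argument


def parse_slash_request(body: str) -> tuple:
    """Extract the 'text' field from Slack's form-encoded request body."""
    form = parse_qs(body)
    return parse_incident_command(form.get("text", [""])[0])
```

Rejecting unknown subcommands early lets the endpoint reply with a usage hint instead of failing later in the workflow.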
10. Jira Integration
- Create a Jira service account with create/update issue permissions.
- Generate a Jira API token.
- Update config.yaml with base URL, project key, etc.
- Trigger an incident from Slack or Robusta to confirm a new Jira issue is created.
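To illustrate what ticket creation might look like on the wire, the sketch below prepares (without sending) a request to Jira's REST create-issue endpoint, `/rest/api/2/issue`. Several details are assumptions: Basic authentication with username and API token (typical for Jira Cloud; Server or Data Center tokens often use a Bearer header instead), a "Task" issue type existing in the project, and the helper name `build_jira_issue_request`.

```python
import base64
import json
import urllib.request


def build_jira_issue_request(base_url, project_key, username, api_token,
                             summary, description):
    """Prepare (but do not send) a POST to Jira's create-issue endpoint."""
    payload = {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "description": description,
            "issuetype": {"name": "Task"},  # assumption: project has a "Task" type
        }
    }
    credentials = base64.b64encode(f"{username}:{api_token}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rest/api/2/issue",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it; building the request separately keeps the payload easy to inspect in a staging test.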
11. Confluence Integration
- Service account with permission to create/edit pages.
- Configure in config.yaml:
confluence:
  base_url: "https://confluence.yourcompany.com"
  username: "confluence_bot"
  api_token: "CONF_API_TOKEN"
- Enable runbook automation or linking via slash commands or the Assistant's workflow.
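Runbook creation comes down to posting a page to Confluence's content API. The sketch below builds the JSON body for a POST to `{base_url}/rest/api/content` (Cloud sites typically add a `/wiki` prefix). The space key is not part of config.yaml as shown, so it appears here as a hypothetical parameter, and `build_runbook_page` is an illustrative name.

```python
import json


def build_runbook_page(space_key: str, title: str, body_html: str) -> bytes:
    """Return the JSON body for creating a Confluence page in storage format."""
    page = {
        "type": "page",
        "title": title,
        "space": {"key": space_key},  # assumption: caller supplies the space key
        "body": {
            "storage": {"value": body_html, "representation": "storage"},
        },
    }
    return json.dumps(page).encode("utf-8")
```

The body would be sent with a `Content-Type: application/json` header and the service account's credentials.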
12. PagerDuty Integration
If Prometheus is already sending alerts to PD, the Assistant can:
- Read existing PD incidents, attach notes or Confluence links
- Create new incidents if a Slack or Robusta-triggered alert requires escalation
Use a PD API key with read/write access. Optionally configure PD webhooks for real-time updates on ack/resolves.
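When the Assistant needs to open a new incident, one option is PagerDuty's Events API v2. The sketch below builds a trigger event; it assumes the `service_integration_key` from config.yaml is an Events API v2 routing key, and `build_pd_trigger` is a hypothetical helper. Passing a `dedup_key` lets repeated alerts for the same problem collapse into one incident.

```python
PD_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
PD_SEVERITIES = {"critical", "error", "warning", "info"}


def build_pd_trigger(routing_key, summary, source, severity, dedup_key=None):
    """Build an Events API v2 'trigger' event as a dictionary."""
    if severity not in PD_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        event["dedup_key"] = dedup_key
    return event
```

The event would be JSON-encoded and posted to `PD_EVENTS_URL`; adding notes to an existing incident goes through PagerDuty's REST API instead and uses the read/write API key.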
13. Testing & Validation
- Unit Tests: If your code has them, run them locally.
- Integration Tests: Force a test alert (via Robusta), confirm Slack war room or message, Jira ticket creation, PD incident updates, Confluence page linking.
- User Acceptance Tests (UAT): Have an SRE do a full incident scenario from detection to resolution, verifying ephemeral channels, slash commands, runbooks, etc.
14. Logging & Observability
- Local Logs:
docker logs sre_incident_assistant -f
or kubectl logs -f <pod>
- PagerDuty Incident Feed: All triggered, acknowledged, resolved states remain in PD.
- Short-Term Retention: Store logs locally if needed for historical review.
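For short-term local retention, one possible approach is a size-capped rotating log file with one JSON object per line. The sketch below uses only the Python standard library; the logger name, file path, and size limits are assumptions and not settings the Assistant is known to expose.

```python
import json
import logging
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(path: str, max_bytes: int = 5_000_000, backups: int = 3) -> logging.Logger:
    """Create a logger that keeps at most `backups` rotated files of `max_bytes` each."""
    logger = logging.getLogger("incident_assistant")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

JSON lines are straightforward to ship to a log aggregator later if longer retention becomes necessary.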
15. Incident Lifecycle & Postmortems
1) Monitoring (Prometheus or other) triggers a PD incident
2) Robusta or direct Slack alerts feed into the Assistant
3) The Assistant:
   - Creates ephemeral Slack war room if critical
   - Opens/updates a Jira ticket
   - Links/creates runbooks in Confluence
4) On-call SREs respond and mitigate
5) PD incident resolved/closed
6) Postmortem documented in Confluence
Use a standard postmortem template to ensure consistent retrospectives and continuous improvement.
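As an illustration of what a standard template could look like, the sketch below renders a plain-text postmortem skeleton. The section names (Summary, Timeline, Action Items) and the `render_postmortem` helper are assumptions; substitute the headings your team already uses.

```python
def _bullets(items):
    """Render items as '- ' bullets, with a placeholder when the list is empty."""
    return [f"- {item}" for item in items] or ["- (none recorded)"]


def render_postmortem(incident_id, summary, timeline=(), action_items=()):
    """Return a plain-text postmortem with one section per heading."""
    lines = [f"Postmortem: {incident_id}", "", "Summary", summary, ""]
    lines += ["Timeline", *_bullets(timeline), ""]
    lines += ["Action Items", *_bullets(action_items)]
    return "\n".join(lines)
```

The rendered text could be pasted into a Confluence page or converted to storage format before upload.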
16. Troubleshooting
Symptom | Possible Cause | Resolution |
---|---|---|
Slack bot not responding | Missing OAuth scopes or signing secret | Re-check Slack App config & config.yaml |
Jira ticket fails | Service account lacks permissions | Update project roles |
PagerDuty updates not visible | No PD webhooks or insufficient read scope | Configure PD webhooks or verify PD API key |
Robusta alerts missing | Wrong webhook URL or Slack channel ID | Check robusta-cli config, Slack channel |
Confluence page creation fails | Invalid token or insufficient permissions | Check token & space permissions |
17. Best Practices & Strategy
Roles & Responsibilities
Role | Description | Typical Actions |
---|---|---|
Incident Commander | Oversees incident, ensures resolution | Acknowledge PD alerts, manage Slack war room |
Secondary On-Call | Supports Commander | Triages, uses slash commands, updates Jira |
Comms Lead | Handles stakeholder updates | Posts Slack announcements, references runbooks |
Assistant/Bot | Automation layer | Creates ephemeral channels, updates PD, triggers runbooks |
Automation vs. Manual Remediation
- Keep critical/high-risk actions manual (e.g., major redeploys).
- Automate routine tasks (e.g., small pod restarts).
- Set thresholds (e.g., escalate to PD only for certain severities).
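The split between automated and manual actions can be expressed as a small decision function. This is a sketch under stated assumptions: the severity names, the default PagerDuty threshold, the action labels, and the contents of `HIGH_RISK` are all hypothetical and would need to match your own policy.

```python
SEVERITY_ORDER = ["info", "warning", "error", "critical"]
HIGH_RISK = {"redeploy"}  # assumption: remediations that always need human approval


def plan_actions(severity, remediation=None, pd_threshold="error"):
    """Return the ordered list of actions for an alert of the given severity."""
    rank = SEVERITY_ORDER.index(severity)  # raises ValueError for unknown severities
    actions = ["post_to_slack"]
    if rank >= SEVERITY_ORDER.index(pd_threshold):
        actions.append("escalate_to_pagerduty")
    if severity == "critical":
        actions.append("create_war_room")
    if remediation in HIGH_RISK:
        actions.append(f"request_approval:{remediation}")
    elif remediation:
        actions.append(f"auto_run:{remediation}")
    return actions
```

For example, a warning with a routine pod restart runs automatically and stays out of PagerDuty, while a critical alert that calls for a redeploy is escalated and waits for approval.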
17. Example Scripts & YAML
Kubernetes Deployment + ConfigMap
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sre-incident-assistant
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sre-incident-assistant
  template:
    metadata:
      labels:
        app: sre-incident-assistant
    spec:
      containers:
        - name: incident-assistant
          image: your-registry.com/sre-incident-assistant:latest
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: config-volume
              mountPath: /app/config.yaml
              subPath: config.yaml
      volumes:
        - name: config-volume
          configMap:
            name: assistant-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: assistant-config
data:
  config.yaml: |
    slack:
      bot_token: "xoxb-123456"
      signing_secret: "abc123xyz"
    pagerduty:
      api_key: "PD_API_KEY"
      service_integration_key: "PD_SERVICE_KEY"
    ...
18. FAQ
Question | Answer |
---|---|
How do I filter unwanted Prometheus alerts? | Configure Robusta or PD to only forward “critical” severity, or use custom filters. |
Will ephemeral Slack channels spam if repeated triggers occur? | Check if #incident-XYZ already exists; reuse or rename if truly new. |
Can the Assistant auto-attach runbooks to PD incidents? | Yes, if PD event JSON includes a documentation.content field, the Assistant can parse and reference it. |
What about partial vs. full automation? | You can configure the Assistant to prompt SREs for manual approval before major remediation tasks. |
Recurring incidents in Jira? | The Assistant can check if there’s an open ticket for that PD incident ID; if closed, create a new one. |
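The recurring-incident answer above amounts to a lookup before ticket creation. A minimal sketch, assuming tickets have already been fetched from Jira into dictionaries with `key`, `pd_incident_id`, and `status` fields (hypothetical field names), and that the listed statuses count as closed in your workflow:

```python
CLOSED_STATUSES = {"done", "closed", "resolved"}  # assumption: depends on the Jira workflow


def ticket_decision(pd_incident_id, tickets):
    """Return ('update', key) if an open ticket exists for the incident, else ('create', None)."""
    for ticket in tickets:
        same_incident = ticket["pd_incident_id"] == pd_incident_id
        is_open = ticket["status"].lower() not in CLOSED_STATUSES
        if same_incident and is_open:
            return "update", ticket["key"]
    return "create", None
```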
19. Security Hardening
- Rotate Secrets: Slack, Jira, PD, Confluence tokens, etc.
- RBAC in K8s: Limit privileges if deployed in Kubernetes.
- HTTPS/TLS: Use valid certificates if externally exposed.
- Access Control: Restrict who can modify slash commands or ephemeral channel creation logic.
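One concrete hardening step for the slash-command endpoints is verifying Slack's request signature with the signing secret. Slack signs each request as an HMAC-SHA256 over `v0:<timestamp>:<raw body>` and sends the result in the `X-Slack-Signature` header, with the timestamp in `X-Slack-Request-Timestamp`. The sketch below follows that scheme; the function name and the five-minute replay window are choices made for illustration.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_age=300):
    """Return True if the signature matches and the request is recent enough."""
    now = time.time() if now is None else now
    try:
        if abs(now - int(timestamp)) > max_age:
            return False  # too old or too far in the future: possible replay
    except ValueError:
        return False  # timestamp header was not an integer
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    expected = f"v0={digest}"
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The check must run against the raw request body, before any form parsing, or the computed digest will not match.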
20. Conclusion
By unifying Slack, Jira, Confluence, PagerDuty, and optional monitoring tools like Robusta or Prometheus, the SRE Incident Assistant provides a centralized, automated approach to incident management. It is intended to reduce mean time to resolution (MTTR), improve collaboration through ephemeral Slack war rooms, automate ticket and runbook creation, and support consistent postmortems.
Implementing this system allows SRE teams to focus on strategic improvements rather than manual incident coordination, ultimately driving higher reliability and better business outcomes.