SRE Incident Assistant: A Complete Reference

Executive Summary: The SRE Incident Assistant centralizes incident response by integrating Slack, Jira, Confluence, and PagerDuty, with optional alerting input from Robusta or Prometheus. This guide covers setup, best practices (such as ephemeral Slack war rooms, slash commands, and runbook automation), and postmortems.

1. Prerequisites & Overview

Slack: Bot token & signing secret for automation and ephemeral channels.
PagerDuty: API key for reading/updating incidents.
Jira: Service account & API token for ticket creation.
Confluence: Service account & API token to create/edit runbooks.
Robusta (Optional): Enhances Kubernetes alerts with automation/playbooks.
Prometheus (Optional): Existing pipeline sending alerts to PagerDuty.

All these tools are orchestrated by the SRE Incident Assistant, which can run as a container or VM-based service, simplifying and centralizing incident workflows.

2. Architecture & Key Components

Incident Assistant Service: Container or VM that ties Slack, Jira, Confluence, PagerDuty, and optional Robusta/Prometheus flows into a single system.
Optional Database Layer: Stores metadata, logs, config if needed.
Authentication & Access Control: Typically service accounts with API keys or corporate SSO.

               +--------------------+
               |  Robusta (K8s)     |
               +----------+---------+
                          |
                          v
+----------------------------------------------------+
|  SRE INCIDENT ASSISTANT (Slack, Jira, PD, Confl)   |
|  Docker/K8s or VM-based; orchestrates integrations |
+--------+-------------------+------------------------+
         |                   |
   Slack Bot API        PagerDuty API
         |                   |
         v                   v
     Jira API         Confluence (Runbooks)

3. Environment Requirements

Requirement	Description
Operating System	Linux (Ubuntu 20.04+ or CentOS/RHEL 7+)
Container Platform	Docker or Kubernetes (recommended)
Networking	Outbound to Slack, Jira, PagerDuty, and Confluence; inbound if using webhooks
SSL/TLS Certificates	Required if hosting via HTTPS
Credentials	Slack Bot Token, Jira API token, PD API key, Confluence token, etc.
Resource Allocation	~2 CPU cores and 4 GB RAM for the Assistant; additional resources for Robusta

4. Quick-Start Flow

   +---------------------------------------+
   | Source Code / Docker Registry         |
   |   (Assistant Image)                   |
   +-----------+---------------------------+
               |
               v
      1) Build or Pull the image
               |
               v
      2) Create config.yaml
               |
               v
      3) Deploy to Docker or K8s
               |
               v
      4) Check logs for errors
               |
               v
      5) Integrate & Test 
         (Slack, Jira, PD, Confluence, Robusta)
               |
               v
      6) Confirm end-to-end

5. Installation & Setup

5.1 Build or Pull the Assistant

Option A: Pull from Registry

docker pull your-registry.com/sre-incident-assistant:latest

Option B: Build from Source

git clone https://your-git-repo/sre-incident-assistant.git
cd sre-incident-assistant
docker build -t your-registry.com/sre-incident-assistant:latest .
docker push your-registry.com/sre-incident-assistant:latest

5.2 Create a Configuration File

config.yaml:

slack:
  bot_token: "xoxb-12345..."
  signing_secret: "abc123..."

jira:
  base_url: "https://jira.yourcompany.com"
  project_key: "OPS"
  username: "jira_bot"
  api_token: "JIRA_API_TOKEN"

pagerduty:
  api_key: "PD_API_KEY"
  service_integration_key: "PD_SERVICE_KEY"

confluence:
  base_url: "https://confluence.yourcompany.com"
  username: "confluence_bot"
  api_token: "CONF_API_TOKEN"
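
A missing or empty key in this file usually only shows up later as a failed API call. The check below is a minimal sketch, assuming the file has already been parsed into a dictionary (for example with a YAML library). The `REQUIRED_KEYS` table and the `find_missing_keys` helper are hypothetical names that mirror the layout above; they are not part of the Assistant itself.

```python
from typing import Any

# Required keys per section, mirroring the config.yaml layout shown above.
# This table is an illustrative assumption, not the Assistant's own schema.
REQUIRED_KEYS: dict[str, tuple[str, ...]] = {
    "slack": ("bot_token", "signing_secret"),
    "jira": ("base_url", "project_key", "username", "api_token"),
    "pagerduty": ("api_key", "service_integration_key"),
    "confluence": ("base_url", "username", "api_token"),
}


def find_missing_keys(config: dict[str, Any]) -> list[str]:
    """Return dotted names of required settings that are absent or empty."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            if not values.get(key):
                missing.append(f"{section}.{key}")
    return missing


# Example: an incomplete Jira section and no Confluence section at all.
example = {
    "slack": {"bot_token": "xoxb-12345...", "signing_secret": "abc123..."},
    "jira": {"base_url": "https://jira.yourcompany.com", "project_key": "OPS"},
    "pagerduty": {"api_key": "PD_API_KEY", "service_integration_key": "PD_SERVICE_KEY"},
}
print(find_missing_keys(example))
```

Running a check like this at startup lets the service fail fast with a readable message instead of failing on the first incident.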

5.3 Deploy the Assistant

Docker Example:

docker run -d \
 --name sre_incident_assistant \
 -v /path/to/config.yaml:/app/config.yaml \
 -p 8080:8080 \
 your-registry.com/sre-incident-assistant:latest

Kubernetes Example: Create a Deployment referencing the image, and use a ConfigMap/Secret for config.yaml.

5.4 Validate Logs & Health

docker logs sre_incident_assistant -f
# or
kubectl logs -f deployment/sre-incident-assistant

6. Dev/Staging Environment Strategy

Deploy to a minimal staging environment first.
Test Slack, Jira, Confluence (sandbox if possible), and PagerDuty integration.
Trigger sample incidents; confirm end-to-end flow.
Roll out to production when ready.

7. Installing & Integrating Robusta (Optional)

Install Robusta CLI: pip install robusta-cli
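Initialize on Kubernetes: run kubectl config use-context <your-cluster-context>, then robusta install (the exact install command depends on your Robusta version; current releases install via Helm)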
Configure Alerts:
- Either post to a Slack channel monitored by the Assistant
- Or use a webhook in Robusta's globalConfig:
    globalConfig:
      custom_webhook_url: "https://<assistant-host>/api/robusta"
Define Playbooks: (e.g., High CPU → Slack + Jira + PD + Confluence link)
Validate: Force a test alert to ensure all steps work.

8. Integrating with Existing Tools

Tool	Requirements	Testing
Slack	Slack Bot Token & Signing Secret	Mention the bot, send a test alert
Jira	Base URL, Project Key, Service Account API Token	Trigger an incident → see Jira ticket
Confluence	Base URL, Service Account & Token	Create/edit runbook pages
PagerDuty	API key & Integration key	Check existing incidents or create a new one
Robusta	Slack or custom_webhook config	Force a test alert, watch Slack/Jira/PD notifications

9. Slack: War Rooms & Slash Commands

9.1 Ephemeral War Rooms

Create a dedicated Slack channel, e.g. #incident-12345, for high-severity alerts
Invite on-call engineers automatically
Archive the channel once resolved
Post final summary and runbook/postmortem links
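
The war-room steps above map onto a short sequence of Slack Web API methods (`conversations.create`, `conversations.invite`, `chat.postMessage`, and later `conversations.archive`). The sketch below only builds the channel name and the ordered call plan; it sends nothing. The function names and the `incident-` prefix handling are assumptions for illustration, and the channel ID placeholder has to be filled from the real `conversations.create` response.

```python
import re


def war_room_channel_name(incident_id: str) -> str:
    """Build a Slack-safe channel name such as 'incident-12345'."""
    # Slack channel names are lowercase and limited to letters, digits,
    # hyphens, and underscores, with a maximum length of 80 characters.
    slug = re.sub(r"[^a-z0-9_-]+", "-", incident_id.lower()).strip("-")
    return f"incident-{slug}"[:80]


def war_room_plan(incident_id: str, on_call_user_ids: list[str]) -> list[tuple[str, dict]]:
    """Return the ordered Slack Web API calls for opening a war room."""
    name = war_room_channel_name(incident_id)
    channel = "<id from conversations.create response>"
    return [
        ("conversations.create", {"name": name}),
        # conversations.invite takes a comma-separated list of user IDs.
        ("conversations.invite", {"channel": channel, "users": ",".join(on_call_user_ids)}),
        ("chat.postMessage", {"channel": channel, "text": f"War room opened for incident {incident_id}"}),
    ]


for method, params in war_room_plan("12345", ["U111", "U222"]):
    print(method, params)
```

On resolution, the same channel ID would be passed to `conversations.archive` after the final summary is posted.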

9.2 Slash Commands

/incident create-runbook: Creates or links a Confluence runbook
/incident pd-note <message>: Adds a note to PagerDuty
/incident summary: Shows incident severity, assigned SRE, runbook links, etc.
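
Slack delivers a slash command as a form-encoded request in which everything typed after `/incident` arrives in the `text` field. A minimal parsing sketch for the three subcommands listed above follows; `IncidentCommand` and `parse_incident_command` are hypothetical names, and a real handler would go on to call PagerDuty or Confluence.

```python
from dataclasses import dataclass

# Subcommands taken from the list above.
KNOWN_SUBCOMMANDS = {"create-runbook", "pd-note", "summary"}


@dataclass
class IncidentCommand:
    subcommand: str
    argument: str  # free text after the subcommand; empty if none


def parse_incident_command(text: str) -> IncidentCommand:
    """Parse the `text` field Slack sends for `/incident <subcommand> [args]`."""
    parts = text.strip().split(maxsplit=1)
    if not parts or parts[0] not in KNOWN_SUBCOMMANDS:
        raise ValueError(f"unknown subcommand; expected one of {sorted(KNOWN_SUBCOMMANDS)}")
    argument = parts[1] if len(parts) > 1 else ""
    if parts[0] == "pd-note" and not argument:
        raise ValueError("pd-note requires a message")
    return IncidentCommand(parts[0], argument)


print(parse_incident_command("pd-note restarted api pods"))
```

Rejecting unknown subcommands up front makes it easy to reply to the user with a usage hint instead of failing silently.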

10. Jira Integration

Create a Jira service account with create/update issue permissions.
Generate a Jira API token.
Update config.yaml with base URL, project key, etc.
Trigger an incident from Slack or Robusta to confirm a new Jira issue is created.
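
For reference, creating an issue through Jira's REST API (version 2) is a POST to `/rest/api/2/issue` with a `fields` object. The sketch below builds that request without sending it. It assumes basic authentication with the service account and API token from config.yaml and an issue type named "Task"; both are assumptions, since Jira Cloud and Server/Data Center differ in how tokens are presented and your project may use a dedicated incident type. `build_jira_issue_request` is a hypothetical helper.

```python
import base64
import json
import urllib.request


def build_jira_issue_request(
    base_url: str,
    project_key: str,
    username: str,
    api_token: str,
    summary: str,
    description: str,
    issue_type: str = "Task",  # assumption: adjust to your project's issue types
) -> urllib.request.Request:
    """Build (but do not send) a Jira REST v2 create-issue request."""
    payload = {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "description": description,
            "issuetype": {"name": issue_type},
        }
    }
    credentials = base64.b64encode(f"{username}:{api_token}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rest/api/2/issue",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


request = build_jira_issue_request(
    "https://jira.yourcompany.com", "OPS", "jira_bot", "JIRA_API_TOKEN",
    "High CPU on api service", "Triggered from Slack war room",
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call; keeping construction separate makes the payload easy to inspect in staging.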

11. Confluence Integration

Service account with permission to create/edit pages.
Configure in config.yaml (use the same values as in section 5.2):
  confluence:
    base_url: "https://confluence.yourcompany.com"
    username: "confluence_bot"
    api_token: "CONF_API_TOKEN"
Enable runbook automation or linking via slash commands or the Assistant’s workflow.
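
Runbook automation ultimately means creating a page through Confluence's REST content endpoint (`/rest/api/content`), which expects the body in Confluence storage format. The following sketch builds such a payload from a list of runbook steps. The space key, title, and `build_runbook_page` helper are illustrative assumptions.

```python
import html
import json


def build_runbook_page(space_key: str, title: str, steps: list[str]) -> str:
    """Return the JSON body for creating a runbook page via POST /rest/api/content."""
    # Escape each step so characters such as '<' or '&' stay valid storage-format XHTML.
    items = "".join(f"<li>{html.escape(step)}</li>" for step in steps)
    page = {
        "type": "page",
        "title": title,
        "space": {"key": space_key},
        "body": {
            "storage": {
                "value": f"<ol>{items}</ol>",
                "representation": "storage",
            }
        },
    }
    return json.dumps(page)


# "SRE" is a hypothetical space key.
print(build_runbook_page("SRE", "Runbook: High CPU", ["Check pod status", "Restart if CPU > 90%"]))
```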

12. PagerDuty Integration

If Prometheus is already sending alerts to PD, the Assistant can:

Read existing PD incidents, attach notes or Confluence links
Create new incidents if a Slack or Robusta-triggered alert requires escalation

Use a PD API key with read/write access. Optionally configure PD webhooks for real-time updates on ack/resolves.
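
When the Assistant needs to open a new incident, one common route is the PagerDuty Events API v2, which accepts a JSON event at `https://events.pagerduty.com/v2/enqueue`. The sketch below builds a trigger event; it assumes that `service_integration_key` from config.yaml is an Events API v2 integration (routing) key, which the guide does not state explicitly. `build_pd_trigger_event` is a hypothetical helper.

```python
from typing import Optional

# Severity values accepted by the PagerDuty Events API v2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_pd_trigger_event(
    routing_key: str,
    summary: str,
    source: str,
    severity: str,
    dedup_key: Optional[str] = None,
) -> dict:
    """Build an Events API v2 'trigger' event body."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        # Events sharing a dedup_key are grouped into the same PagerDuty alert.
        event["dedup_key"] = dedup_key
    return event


print(build_pd_trigger_event("PD_SERVICE_KEY", "High CPU on api", "robusta", "critical", "incident-12345"))
```

Reusing the war-room identifier as the `dedup_key` is one way to keep repeated triggers attached to a single PagerDuty alert.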

13. Testing & Validation

Unit Tests: Run any unit tests in the codebase locally before deploying.
Integration Tests: Force a test alert (via Robusta), confirm Slack war room or message, Jira ticket creation, PD incident updates, Confluence page linking.
User Acceptance Tests (UAT): Have an SRE do a full incident scenario from detection to resolution, verifying ephemeral channels, slash commands, runbooks, etc.

14. Logging & Observability

Local Logs: docker logs sre_incident_assistant -f or kubectl logs -f <pod>
PagerDuty Incident Feed: PD retains the history of triggered, acknowledged, and resolved states for every incident.
Short-Term Retention: Store logs locally if needed for historical review.
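
Container logs are easier to search later if each line is structured. The sketch below shows one way to emit JSON log lines to stdout (where `docker logs` and `kubectl logs` read them) using only Python's standard `logging` module. The logger name `incident_assistant` and the `incident_id` field are assumptions for illustration; the guide does not specify the Assistant's log format.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Include the incident ID when the caller passes extra={"incident_id": ...}.
        incident_id = getattr(record, "incident_id", None)
        if incident_id is not None:
            entry["incident_id"] = incident_id
        return json.dumps(entry)


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Send JSON-formatted logs to stdout so the container runtime captures them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("incident_assistant")
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False
    return logger


log = configure_logging()
log.info("Jira ticket created", extra={"incident_id": "12345"})
```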

15. Incident Lifecycle & Postmortems

1) Monitoring (Prometheus or other) triggers a PD incident
2) Robusta or direct Slack alerts feed into the Assistant
3) The Assistant:
   - Creates ephemeral Slack war room if critical
   - Opens/updates a Jira ticket
   - Links/creates runbooks in Confluence
4) On-call SREs respond and mitigate
5) PD incident resolved/closed
6) Postmortem documented in Confluence

Use a standard postmortem template to ensure consistent retrospectives and continuous improvement.
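
As an illustration of what "standard template" can mean in practice, the sketch below renders a plain-text postmortem skeleton that could be pasted into (or posted to) Confluence. The section headings and the sample incident values are hypothetical; substitute your team's own template.

```python
from string import Template

# Section headings here are illustrative, not a prescribed format.
POSTMORTEM_TEMPLATE = Template("""\
Postmortem: $title
Incident ID: $incident_id
Severity: $severity

Summary
$summary

Timeline
$timeline

Root Cause
TBD

Action Items
TBD
""")


def render_postmortem(
    incident_id: str,
    title: str,
    severity: str,
    summary: str,
    timeline: list[tuple[str, str]],
) -> str:
    """Fill the template; `timeline` is a list of (time, event) pairs."""
    lines = "\n".join(f"- {when}: {what}" for when, what in timeline)
    return POSTMORTEM_TEMPLATE.substitute(
        title=title,
        incident_id=incident_id,
        severity=severity,
        summary=summary,
        timeline=lines,
    )


# Made-up sample values, for demonstration only.
print(render_postmortem(
    "12345", "API latency spike", "critical",
    "Elevated latency on the API tier.",
    [("10:02 UTC", "PD incident triggered"), ("10:05 UTC", "War room opened")],
))
```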

16. Troubleshooting

Symptom	Possible Cause	Resolution
Slack bot not responding	Missing OAuth scopes or signing secret	Re-check Slack App config & `config.yaml`
Jira ticket fails	Service account lacks permissions	Update project roles
PagerDuty updates not visible	No PD webhooks or insufficient read scope	Configure PD webhooks or verify PD API key
Robusta alerts missing	Wrong webhook URL or Slack channel ID	Check robusta-cli config, Slack channel
Confluence page creation fails	Invalid token or insufficient permissions	Check token & space permissions

17. Best Practices & Strategy

Roles & Responsibilities

Role	Description	Typical Actions
Incident Commander	Oversees incident, ensures resolution	Acknowledge PD alerts, manage Slack war room
Secondary On-Call	Supports Commander	Triages, uses slash commands, updates Jira
Comms Lead	Handles stakeholder updates	Posts Slack announcements, references runbooks
Assistant/Bot	Automation layer	Creates ephemeral channels, updates PD, triggers runbooks

Automation vs. Manual Remediation

Keep critical/high-risk actions manual (e.g., major redeploys).
Automate routine tasks (e.g., small pod restarts).
Set thresholds (e.g., escalate to PD only at or above a chosen severity).
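
These three rules can be expressed as a small policy function. The sketch below is one possible encoding, assuming a four-level severity scale and a hand-maintained set of high-risk actions; `SEVERITY_RANK`, `HIGH_RISK_ACTIONS`, and `decide` are hypothetical names, and the default threshold is an arbitrary example.

```python
from dataclasses import dataclass

# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "critical": 3}

# Assumed examples of actions that should stay behind a manual approval.
HIGH_RISK_ACTIONS = {"redeploy", "rollback", "failover"}


@dataclass(frozen=True)
class Decision:
    escalate_to_pd: bool
    requires_approval: bool


def decide(severity: str, action: str, pd_threshold: str = "error") -> Decision:
    """Apply the escalation threshold and the manual-approval rule."""
    escalate = SEVERITY_RANK[severity] >= SEVERITY_RANK[pd_threshold]
    return Decision(escalate_to_pd=escalate, requires_approval=action in HIGH_RISK_ACTIONS)


print(decide("warning", "restart_pod"))
print(decide("critical", "redeploy"))
```

Keeping the policy in one function makes the boundary between automated and manual remediation reviewable in a single place.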

17. Example Scripts & YAML

Kubernetes Deployment + ConfigMap

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sre-incident-assistant
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sre-incident-assistant
  template:
    metadata:
      labels:
        app: sre-incident-assistant
    spec:
      containers:
      - name: incident-assistant
        image: your-registry.com/sre-incident-assistant:latest
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: config-volume
          mountPath: /app/config.yaml
          subPath: config.yaml
      volumes:
      - name: config-volume
        configMap:
          name: assistant-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: assistant-config
data:
  config.yaml: |
    slack:
      bot_token: "xoxb-123456"
      signing_secret: "abc123xyz"
    pagerduty:
      api_key: "PD_API_KEY"
      service_integration_key: "PD_SERVICE_KEY"
    ...

18. FAQ

Question	Answer
How do I filter unwanted Prometheus alerts?	Configure Robusta or PD to only forward “critical” severity, or use custom filters.
Will repeated triggers create duplicate ephemeral Slack channels?	Not if the Assistant first checks whether #incident-XYZ already exists and reuses it; create a new channel only when the incident is genuinely new.
Can the Assistant auto-attach runbooks to PD incidents?	Yes, if PD event JSON includes a `documentation.content` field, the Assistant can parse and reference it.
What about partial vs. full automation?	You can configure the Assistant to prompt SREs for manual approval before major remediation tasks.
How are recurring incidents handled in Jira?	The Assistant can check whether an open ticket already exists for that PD incident ID and reuse it; if the earlier ticket is closed, it creates a new one.
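
The reuse-or-create behavior described in the last answer can be sketched with an in-memory index. This is only a model of the logic: in a real deployment the lookup would be a Jira search rather than a Python list, and `Ticket`, `TicketIndex`, and the generated `OPS-n` keys are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ticket:
    key: str
    pd_incident_id: str
    status: str = "Open"


@dataclass
class TicketIndex:
    """In-memory stand-in for a Jira lookup keyed by PagerDuty incident ID."""

    tickets: list[Ticket] = field(default_factory=list)

    def find_open(self, pd_incident_id: str) -> Optional[Ticket]:
        for ticket in self.tickets:
            if ticket.pd_incident_id == pd_incident_id and ticket.status != "Closed":
                return ticket
        return None

    def find_or_create(self, pd_incident_id: str) -> tuple[Ticket, bool]:
        """Return (ticket, created); reuse an open ticket if one exists."""
        existing = self.find_open(pd_incident_id)
        if existing is not None:
            return existing, False
        ticket = Ticket(key=f"OPS-{len(self.tickets) + 1}", pd_incident_id=pd_incident_id)
        self.tickets.append(ticket)
        return ticket, True


index = TicketIndex()
first, created = index.find_or_create("PD123")
again, created_again = index.find_or_create("PD123")
print(first.key, created, again.key, created_again)
```

The same check-then-create pattern applies to war-room channels: look up the existing channel by name before calling the create API.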

19. Security Hardening

Rotate Secrets: Slack, Jira, PD, Confluence tokens, etc.
RBAC in K8s: Limit privileges if deployed in Kubernetes.
HTTPS/TLS: Use valid certificates if externally exposed.
Access Control: Restrict who can modify slash commands or ephemeral channel creation logic.
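
One further control on inbound traffic is verifying that slash-command and event requests really come from Slack, using the signing secret from config.yaml. Slack signs each request with HMAC-SHA256 over `v0:<timestamp>:<raw body>` and sends the result in the `X-Slack-Signature` header, with the timestamp in `X-Slack-Request-Timestamp`. The function below is a minimal sketch of that check; the function name and the `now` parameter (added to make the check testable) are assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,   # value of the X-Slack-Request-Timestamp header
    body: bytes,      # raw, unparsed request body
    signature: str,   # value of the X-Slack-Signature header
    now: Optional[float] = None,
    max_age_seconds: int = 300,
) -> bool:
    """Return True only if the request is fresh and its signature matches."""
    current = time.time() if now is None else now
    try:
        age = abs(current - int(timestamp))
    except ValueError:
        return False
    # Reject stale requests to limit replay attacks.
    if age > max_age_seconds:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

A request that fails this check should be rejected before any slash-command logic runs.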

20. Conclusion

By unifying Slack, Jira, Confluence, PagerDuty, and optional monitoring tools like Robusta or Prometheus, the SRE Incident Assistant provides a centralized, highly automated approach to incident management. It reduces mean time to resolution (MTTR), fosters better collaboration through ephemeral Slack war rooms, automates ticket and runbook creation, and supports robust postmortems.

Implementing this system allows SRE teams to focus on strategic improvements rather than manual incident coordination, ultimately driving higher reliability and better business outcomes.
