Every practitioner needs a trusted stack. These are the tools we actually use and recommend for AIOps, SRE, observability, and incident management: no fluff, no affiliate noise.

How this list works: Tools are grouped by function. Each entry includes what it does, who it’s for, and when to reach for it. Last updated March 2026.

Observability & Monitoring

Prometheus

What it is: Open-source metrics collection and alerting system. The de facto standard for Kubernetes and cloud-native environments.

Best for: Teams running containerized workloads who need pull-based metrics, flexible PromQL queries, and tight Grafana integration.

Open Source | Metrics
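To make "pull-based" concrete: Prometheus scrapes an HTTP endpoint (conventionally /metrics) that returns plain text in its exposition format. The sketch below renders that format using only the standard library. The metric name, labels, values, and port are made up for illustration, and in practice you would most likely use an official client library rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample data: metric name -> (type, help text, [(labels, value)]).
SAMPLES = {
    "http_requests_total": (
        "counter",
        "Total HTTP requests handled.",
        [({"method": "get", "code": "200"}, 1027),
         ({"method": "post", "code": "500"}, 3)],
    ),
}

def escape_label(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def render_metrics(samples):
    """Render samples in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text, series) in samples.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in series:
            if labels:
                pairs = ",".join(
                    f'{key}="{escape_label(val)}"'
                    for key, val in sorted(labels.items())
                )
                lines.append(f"{name}{{{pairs}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current samples at /metrics for a scraper to pull."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Block and serve metrics; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape job at the host would let Prometheus pull these samples on its own schedule, which is the key difference from push-based agents.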

Grafana

What it is: The visualization layer for your observability stack. Connects to Prometheus, Loki, Tempo, and dozens of other data sources.

Best for: Building dashboards that actually surface reliability signals. Grafana Alerting brings alerting and visualization together in one pane.

Open Source / Cloud | Dashboards

OpenTelemetry

What it is: The CNCF standard for instrumenting applications to emit traces, metrics, and logs in a vendor-neutral format.

Best for: Teams who want to avoid vendor lock-in and instrument once, then route to any backend (Jaeger, Datadog, Honeycomb, etc.).

Open Source | Traces / Metrics / Logs

Datadog

What it is: Full-stack observability platform covering APM, infrastructure monitoring, logs, synthetics, and AI observability (LLM Observability).

Best for: Teams that want everything in one place and have the budget for it. The LLM Observability module is genuinely useful for AI-powered systems.

Commercial | APM / AI Observability

Incident Management

PagerDuty

What it is: One of the longest-established on-call alerting and incident response platforms, with integrations for most widely used monitoring tools.

Best for: Organizations that need enterprise-grade on-call scheduling, escalation policies, and audit trails. PagerDuty’s AIOps features auto-correlate alerts to reduce noise.

Commercial | On-Call / Incident Response
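Alert correlation is easier to reason about with a toy version in hand. The sketch below groups alerts for the same service that arrive close together in time, so a burst of related pages collapses into one incident candidate. This is not PagerDuty's algorithm; the Alert fields, the grouping key, and the five-minute window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    service: str      # hypothetical grouping key
    summary: str
    timestamp: float  # seconds since epoch

def correlate(alerts, window_seconds=300):
    """Group alerts per service when each arrives within window_seconds
    of the previous alert in that service's most recent group."""
    groups = []
    latest_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest_group.get(alert.service)
        if group and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            # Start a new group: first alert for this service, or the gap
            # since the last one exceeded the window.
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups
```

Real correlation engines typically also use topology, alert content, and learned patterns; a time-and-key window like this is only the simplest baseline to compare them against.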

Opsgenie

What it is: Atlassian’s on-call and alert management tool, tightly integrated with Jira and the broader Atlassian ecosystem.

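Best for: Teams already in the Atlassian ecosystem who want native Jira incident linking and shared on-call visibility. Atlassian has announced that Opsgenie is being retired, with its on-call features moving into Jira Service Management, so new adopters should check the current migration guidance first.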

Commercial | On-Call / Jira Integration

FireHydrant

What it is: Incident management platform focused on structured Runbooks, retrospectives, and MTTR reduction through process automation.

Best for: Teams that want to operationalize incident response with consistent runbooks, automated stakeholder updates, and built-in retrospective tooling.

Commercial | Runbooks / Retrospectives

AI & AIOps

Claude (Anthropic)

What it is: Large language model with a 200K token context window, strong reasoning, and API access for building AI-powered operations workflows.

Best for: Drafting runbooks, summarizing incident timelines, generating SLO reviews, and building LLM-powered alert triage pipelines. Claude’s large context window is particularly valuable for feeding full log dumps or postmortems.

Commercial / API | LLM / Runbooks

NotebookLM (Google)

What it is: AI-powered research notebook that grounds answers in your own uploaded documents — postmortems, runbooks, architecture docs.

Best for: Making your team’s institutional knowledge searchable. Upload 6 months of postmortems, ask “what patterns keep causing P1s?” and get grounded answers.

Free / Workspace | Knowledge Management / LLM

Cursor

What it is: AI-native code editor built on VS Code with LLM-assisted editing and codebase-aware chat.

Best for: SREs writing automation scripts, Terraform, or custom exporters. Ask “what does this Helm chart do?” and get a grounded answer from your own codebase.

Commercial | Code Editor / AI-Assisted

Reliability Engineering

Chaos Mesh / Chaos Monkey

What it is: Chaos engineering tools for injecting controlled failures into your systems. Chaos Monkey targets instance-level failures; Chaos Mesh targets Kubernetes workloads.

Best for: Teams running game days and proactive reliability testing. Run in staging first, always with a kill switch and a monitoring dashboard open.

Open Source | Chaos Engineering / Kubernetes
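The "kill switch" advice can be sketched as a small experiment loop: inject the failure, watch health, and always roll back, whether the run finishes, is aborted, or crashes. Every callable here (inject, rollback, is_healthy, kill_switch) is a hypothetical hook you would wire to your own tooling; neither Chaos Mesh nor Chaos Monkey exposes this exact interface.

```python
import time

def run_experiment(inject, rollback, is_healthy, kill_switch,
                   checks=10, interval=1.0, sleep=time.sleep):
    """Run a fault injection with a health gate and a manual kill switch.

    Returns "completed" if every check passed, or "aborted" if the kill
    switch was flipped or the health check failed. rollback() always runs.
    """
    inject()
    try:
        for _ in range(checks):
            if kill_switch() or not is_healthy():
                return "aborted"
            sleep(interval)
        return "completed"
    finally:
        # Runs on normal completion, abort, and unexpected exceptions alike.
        rollback()
```

Putting the rollback in a finally block is the design choice that matters: the failure is removed even if the health check itself raises.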

Gremlin

What it is: Commercial chaos engineering platform with a GUI, scenario library, and enterprise safety controls (auto-halt, blast radius limits).

Best for: Organizations that want structured chaos engineering with guardrails, compliance audit logs, and pre-built failure scenarios. Easier to get exec buy-in than open source alternatives.

Commercial | Chaos Engineering / Enterprise

Deployment & Release

ArgoCD

What it is: GitOps continuous delivery tool for Kubernetes. It continuously compares live cluster state with the state declared in Git, flags any drift, and can correct it automatically when automated sync with self-heal is enabled.

Best for: Kubernetes teams that want GitOps workflows, automatic drift detection, and a clear audit trail of every deployment.

Open Source | GitOps / Kubernetes
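Drift detection boils down to comparing what Git declares with what the cluster is running. The sketch below walks two nested dictionaries and reports every path where they disagree. It is a simplified illustration, not the diff logic the tool itself uses, and the manifest fields in the usage note are made up.

```python
def find_drift(desired, live, path=""):
    """Return (path, description) pairs for every difference between the
    declared state and the live state, both given as nested dicts."""
    drift = []
    for key in sorted(set(desired) | set(live)):
        here = f"{path}.{key}" if path else str(key)
        if key not in live:
            drift.append((here, "missing in cluster"))
        elif key not in desired:
            drift.append((here, "not declared in Git"))
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            # Recurse into nested sections such as spec or metadata.
            drift.extend(find_drift(desired[key], live[key], here))
        elif desired[key] != live[key]:
            drift.append((here, f"declared {desired[key]!r}, found {live[key]!r}"))
    return drift
```

For example, if Git declares {"spec": {"replicas": 3}} and someone has manually scaled the live object to 5 replicas, find_drift reports the spec.replicas path along with both values.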

LaunchDarkly

What it is: Feature flag and experimentation platform. Decouple deployment from release — ship code to production without exposing it to users until you’re ready.

Best for: Teams that want to reduce deployment risk by controlling rollout percentage, targeting specific user segments, and toggling off broken features instantly without a rollback.

Commercial | Feature Flags / Progressive Delivery
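Percentage rollouts usually rely on deterministic bucketing: hash the user into a stable bucket, then compare the bucket with the rollout percentage. The sketch below shows that general technique with the standard library. It is not LaunchDarkly's implementation; the function name, the hash choice, and the flag key are assumptions for illustration.

```python
import hashlib

def in_rollout(flag_key, user_id, percentage):
    """Return True if this user falls inside the rollout percentage.

    The same flag and user always map to the same bucket (0 to 99), so a
    user who sees the feature at 20% still sees it at 50%.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percentage
```

Setting the percentage to 0 acts as the instant off switch: no bucket is below 0, so every user stops seeing the feature without a redeploy. Including the flag key in the hash keeps the same users from always being first in line for every flag.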

Missing a tool you rely on?

This list grows as the community does. If you’re using something that belongs here, drop a note in the comments below.

