Every practitioner needs a trusted stack. These are the tools we actually use and recommend for AIOps, SRE, observability, and incident management: no fluff, no affiliate noise.
Observability & Monitoring
Prometheus
What it is: Open-source metrics collection and alerting system. The de facto standard for Kubernetes and cloud-native environments.
Best for: Teams running containerized workloads who need pull-based metrics, flexible PromQL queries, and tight Grafana integration.
Open Source | Metrics
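To make "pull-based" concrete: a service publishes its current metric values as plain text over HTTP, and Prometheus scrapes that endpoint on a schedule. The sketch below renders a counter in the Prometheus text exposition format using only the Python standard library. The metric name `http_requests_total`, the labels, and the `render_counter` helper are hypothetical, chosen for illustration; a real service would normally use an official client library instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter's value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical in-memory counters a service might keep between scrapes.
requests = {
    (("method", "GET"), ("code", "200")): 1027,
    (("method", "POST"), ("code", "500")): 3,
}
body = render_counter("http_requests_total", "Total HTTP requests.", requests)
print(body)
```

Once scraped, a PromQL expression such as `rate(http_requests_total[5m])` turns the raw counter into a per-second request rate.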
Grafana
What it is: The visualization layer for your observability stack. Connects to Prometheus, Loki, Tempo, and dozens of other data sources.
Best for: Building dashboards that actually surface reliability signals. Grafana Alerting brings alerting and visualization together in one pane.
Open Source / Cloud | Dashboards
OpenTelemetry
What it is: The CNCF standard for instrumenting applications to emit traces, metrics, and logs in a vendor-neutral format.
Best for: Teams who want to avoid vendor lock-in and instrument once, then route to any backend (Jaeger, Datadog, Honeycomb, etc.).
Open Source | Traces / Metrics / Logs
Datadog
What it is: Full-stack observability platform covering APM, infrastructure monitoring, logs, synthetics, and AI observability (LLM Observability).
Best for: Teams that want everything in one place and have the budget for it. The LLM Observability module is genuinely useful for AI-powered systems.
Commercial | APM / AI Observability
Incident Management
PagerDuty
What it is: One of the longest-established on-call alerting and incident response platforms, with integrations for most major monitoring tools.
Best for: Organizations that need enterprise-grade on-call scheduling, escalation policies, and audit trails. PagerDuty’s AIOps features auto-correlate alerts to reduce noise.
Commercial | On-Call / Incident Response
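Alert correlation can take many forms, and PagerDuty's own logic is not described here. As a rough illustration of the noise-reduction idea, the sketch below folds alerts for the same service into one incident when they fire within a short gap of each other. The `Alert` fields, the 300-second gap, and the `correlate` helper are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    summary: str
    timestamp: float  # seconds since epoch


def correlate(alerts, window_seconds=300):
    """Group alerts per service; start a new group after a quiet gap."""
    groups = []
    open_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = open_group.get(alert.service)
        if group and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            open_group[alert.service] = group
    return groups


alerts = [
    Alert("checkout", "High latency", 0),
    Alert("checkout", "Error rate above 5%", 60),
    Alert("search", "Pod restarts", 90),
    Alert("checkout", "High latency", 1000),
]
incidents = correlate(alerts)
print(len(alerts), "alerts ->", len(incidents), "incidents")
```

Here the two early checkout alerts collapse into a single incident, while the later one opens a new incident because the service had been quiet for longer than the gap.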
Opsgenie
What it is: Atlassian’s on-call and alert management tool, tightly integrated with Jira and the broader Atlassian ecosystem.
Best for: Teams already in the Atlassian ecosystem who want native Jira incident linking and shared on-call visibility.
Commercial | On-Call / Jira Integration
FireHydrant
What it is: Incident management platform focused on structured runbooks, retrospectives, and MTTR reduction through process automation.
Best for: Teams that want to operationalize incident response with consistent runbooks, automated stakeholder updates, and built-in retrospective tooling.
Commercial | Runbooks / Retrospectives
AI & AIOps
Claude (Anthropic)
What it is: Large language model with a 200K-token context window, strong reasoning, and API access for building AI-powered operations workflows.
Best for: Drafting runbooks, summarizing incident timelines, generating SLO reviews, and building LLM-powered alert triage pipelines. The large context window is particularly valuable when you want to feed in full log dumps or entire postmortems at once.
Commercial / API | LLM / Runbooks
NotebookLM (Google)
What it is: AI-powered research notebook that grounds answers in your own uploaded documents — postmortems, runbooks, architecture docs.
Best for: Making your team’s institutional knowledge searchable. Upload 6 months of postmortems, ask “what patterns keep causing P1s?” and get grounded answers.
Free / Workspace | Knowledge Management / LLM
Cursor
What it is: AI-native code editor built on VS Code with LLM-assisted editing and codebase-aware chat.
Best for: SREs writing automation scripts, Terraform, or custom exporters. Ask “what does this Helm chart do?” and get a grounded answer from your own codebase.
Commercial | Code Editor / AI-Assisted
Reliability Engineering
Chaos Mesh / Chaos Monkey
What it is: Chaos engineering tools for injecting controlled failures into your systems. Chaos Monkey targets instance-level failures; Chaos Mesh targets Kubernetes workloads.
Best for: Teams running game days and proactive reliability testing. Run in staging first, always with a kill switch and a monitoring dashboard open.
Open Source | Chaos Engineering / Kubernetes
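The "kill switch" advice can be sketched as a guard loop: inject one fault at a time, check a health metric after each step, and halt and roll back as soon as it crosses a threshold. This is a simulation only and does not use the Chaos Mesh or Chaos Monkey APIs; the `run_experiment` helper, the step names, and the 5% error-rate threshold are hypothetical.

```python
def run_experiment(steps, read_error_rate, abort_threshold=0.05):
    """Apply fault-injection steps one at a time; halt if errors spike.

    steps: list of (name, inject, rollback), where inject/rollback are callables.
    read_error_rate: callable returning the current error rate (0.0 to 1.0).
    """
    applied = []
    aborted = False
    for name, inject, rollback in steps:
        inject()
        applied.append((name, rollback))
        if read_error_rate() > abort_threshold:
            aborted = True  # kill switch: stop injecting further faults
            break
    # Always roll back in reverse order, whether or not we aborted.
    for _, rollback in reversed(applied):
        rollback()
    return {"aborted": aborted, "steps_run": [name for name, _ in applied]}


# Simulated system: each active fault adds 3% to the error rate.
state = {"faults": 0}


def make_step(name):
    def inject():
        state["faults"] += 1

    def rollback():
        state["faults"] -= 1

    return (name, inject, rollback)


result = run_experiment(
    [make_step("kill-pod"), make_step("add-latency"), make_step("drop-network")],
    read_error_rate=lambda: 0.03 * state["faults"],
)
print(result, state)
```

In this simulated run the second fault pushes the error rate past the threshold, so the third is never injected and both applied faults are rolled back.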
Gremlin
What it is: Commercial chaos engineering platform with a GUI, scenario library, and enterprise safety controls (auto-halt, blast radius limits).
Best for: Organizations that want structured chaos engineering with guardrails, compliance audit logs, and pre-built failure scenarios. Often an easier sell to executives than open-source alternatives.
Commercial | Chaos Engineering / Enterprise
Deployment & Release
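Argo CD
What it is: GitOps continuous delivery tool for Kubernetes. It continuously compares your cluster state with what’s declared in Git and can sync the cluster back automatically, so configuration drift is caught instead of quietly accumulating.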
Best for: Kubernetes teams that want GitOps workflows, automatic drift detection, and a clear audit trail of every deployment.
Open Source | GitOps / Kubernetes
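The core GitOps idea, comparing the state declared in Git against what is actually running, can be sketched in a few lines. This is not how the tool itself is implemented; the flat dictionaries and the `detect_drift` helper are hypothetical stand-ins for real Kubernetes manifests and a real diff engine.

```python
ABSENT = "<absent>"  # marker for a key present on only one side


def detect_drift(declared, live):
    """Return {key: (declared_value, live_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | live.keys()):
        want = declared.get(key, ABSENT)
        have = live.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical desired state (from Git) and observed state (from the cluster).
declared = {"replicas": 3, "image": "web:1.4.2", "cpu_limit": "500m"}
live = {"replicas": 5, "image": "web:1.4.2", "debug": "true"}

drift = detect_drift(declared, live)
for key, (want, have) in drift.items():
    print(f"{key}: declared={want} live={have}")
```

An empty result means the cluster matches Git; anything else is drift that a sync would need to reconcile.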
LaunchDarkly
What it is: Feature flag and experimentation platform. Decouple deployment from release — ship code to production without exposing it to users until you’re ready.
Best for: Teams that want to reduce deployment risk by controlling rollout percentage, targeting specific user segments, and toggling off broken features instantly without a rollback.
Commercial | Feature Flags / Progressive Delivery
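Percentage rollouts generally rely on assigning each user to a stable bucket, so the same user gets the same answer on every request. The sketch below shows one common way to do that with a hash. It is not LaunchDarkly's algorithm or SDK; the `in_rollout` helper, the `new-checkout` flag key, and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(flag_key, user_id, percentage):
    """Return True if user_id falls inside the rollout percentage (0 to 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return bucket < percentage


users = [f"user-{n}" for n in range(1000)]
enabled = [u for u in users if in_rollout("new-checkout", u, 10)]
print(f"{len(enabled)} of {len(users)} users see the flag")
```

Because the bucket never changes for a given user and flag, raising the percentage only adds users, and setting it to 0 switches the feature off for everyone without a redeploy.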
Missing a tool you rely on?
This list grows as the community does. If you’re using something that belongs here, drop a note in the comments below.

