TOPIC HUB
Observability for SRE
Observability is the foundation of reliability engineering. This hub covers the three pillars of observability (metrics, traces, logs), how to build SLOs and Error Budgets on top of them, and what changes when you’re observing AI systems as well as traditional infrastructure.
Observability vs. Monitoring: The Distinction That Matters
Monitoring is about tracking known failure modes. You define what you care about, set thresholds, and get alerted when those thresholds are crossed. It works well for systems you understand thoroughly.
Observability is about understanding systems you don’t fully understand yet — or systems complex enough that you can’t anticipate every failure mode. An observable system lets you answer arbitrary questions about its behavior from the outside, without having to redeploy code to add new instrumentation.
The practical difference: with monitoring, you know in advance what questions you'll ask. With observability, you can ask questions mid-incident at 3am that you never anticipated when you instrumented the system.
The Three Pillars
Metrics are numeric measurements aggregated over time: request rates, error counts, latency percentiles. They're cheap to store and fast to query, which makes them the backbone of alerting.
Traces follow a single request across service boundaries, showing where time was spent and which dependency failed.
Logs capture discrete events with full context, the detail you reach for once metrics and traces have narrowed down where to look.
SLOs and Error Budgets
Observability data is only useful if it’s connected to reliability targets. Service Level Objectives (SLOs) define what “good enough” looks like — the percentage of requests that must succeed, or the latency threshold that must be met. Error budgets convert that target into a running balance of how much more failure you can afford before breaching your SLO.
The Error Budget Math
SLO target: 99.9% availability over 30 days
Total minutes in 30 days: 43,200
Error budget: 0.1% × 43,200 = 43.2 minutes of allowable downtime
Once that 43.2 minutes is consumed, all discretionary deployments pause until the budget resets.
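The budget math above is simple enough to encode directly. A minimal sketch (the function names are illustrative, not from any particular SLO library):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowable downtime for a given SLO over a window.

    Example from above: 99.9% over 30 days -> 0.1% of 43,200 minutes.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Minutes of budget left after observed downtime (negative = breached)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


budget = error_budget_minutes(0.999, 30)        # ~43.2 minutes
left = budget_remaining(0.999, 30, 30.0)        # ~13.2 minutes left
```

Once `budget_remaining` crosses zero, the policy in the next paragraph kicks in: discretionary deployments pause until the window rolls over.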
Error budgets create a shared language between product and engineering: when the budget is healthy, teams can move fast. When it’s burning, reliability work takes priority. This eliminates the perpetual negotiation about “is it safe to deploy?” — the budget answers the question objectively.
Observability for AI Systems
Traditional observability assumes your system behaves deterministically — given the same input, you get the same output. AI systems break this assumption. LLM calls are probabilistic, context-dependent, and expensive in ways that CPU cycles aren’t.
Observing AI systems requires extending your existing stack with AI-specific signals:
Token usage metrics
Input tokens, output tokens, and total tokens per request. Correlate with cost. Alert when token usage per call exceeds budget thresholds.
LLM latency percentiles
Time-to-first-token and total generation time. P95 and P99 matter most — LLM latency tails are long and directly impact user experience.
Context window utilization
Percentage of context window used per request. Overflow causes truncation, which silently degrades output quality without throwing errors.
Output quality signals
Harder to measure but critical: refusal rates, user feedback signals, downstream task success rates. Quality can degrade without any error metrics changing.
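The first three signals above can be captured per LLM call with a small record type. A sketch, assuming a generic metrics pipeline; the field names here are illustrative and don't correspond to any specific LLM SDK's response schema:

```python
from dataclasses import dataclass


@dataclass
class LLMCallRecord:
    """Per-call observability record for the AI-specific signals above."""
    input_tokens: int
    output_tokens: int
    time_to_first_token_s: float   # feed latency histograms (track P95/P99)
    total_latency_s: float
    context_window: int            # model's context limit, in tokens

    @property
    def total_tokens(self) -> int:
        # Correlate with cost; alert when this exceeds budget thresholds.
        return self.input_tokens + self.output_tokens

    @property
    def context_utilization(self) -> float:
        # Fraction of the context window consumed by the prompt.
        # Values near 1.0 mean truncation risk and silent quality loss.
        return self.input_tokens / self.context_window


record = LLMCallRecord(
    input_tokens=6000, output_tokens=500,
    time_to_first_token_s=0.4, total_latency_s=2.1,
    context_window=8192,
)
```

Emitting these fields as histograms (rather than averages) is what makes the P95/P99 tail latencies and context-overflow cases visible; output-quality signals still need their own feedback channel.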
Articles in This Topic Cluster
The Invisible Meter Running Behind Every AI System
Token usage, cost tracking, and latency monitoring for LLM-powered systems — the metrics practitioners actually need.
AI reliability is constrained by physics, not software
Why infrastructure fundamentals set the reliability ceiling for AI systems — and what that means for your SLOs.
On-call load is a system: what to measure before burnout shows up
Treating on-call health as an observability problem — the metrics that matter before burnout becomes visible.
Google NotebookLM for AIOps and SRE
Using AI to make your observability knowledge base — runbooks, dashboards, postmortems — queryable by your team.