TOPIC HUB
Observability for SRE
Observability is the foundation of reliability engineering. This hub covers the three pillars of observability (metrics, traces, logs), how to build SLOs and Error Budgets on top of them, and what changes when you’re observing AI systems as well as traditional infrastructure.
Observability vs. Monitoring: The Distinction That Matters
Monitoring is about tracking known failure modes. You define what you care about, set thresholds, and get alerted when those thresholds are crossed. It works well for systems you understand thoroughly.
Observability is about understanding systems you don’t fully understand yet — or systems complex enough that you can’t anticipate every failure mode. An observable system lets you answer arbitrary questions about its behavior from the outside, without having to redeploy code to add new instrumentation.
The practical difference: with monitoring, you know in advance what questions you'll ask. With observability, you can ask questions mid-incident at 3am that you never anticipated when you instrumented the system.
The Three Pillars
Metrics are numeric measurements aggregated over time: request rates, error counts, latency percentiles. They're cheap to store and fast to query, which makes them the backbone of alerting.
Traces follow a single request across service boundaries, showing where time was spent and which dependency failed.
Logs capture discrete events with full context, the detail you reach for once metrics and traces have narrowed down where to look.
SLOs and Error Budgets
Observability data is only useful if it’s connected to reliability targets. Service Level Objectives (SLOs) define what “good enough” looks like — the percentage of requests that must succeed, or the latency threshold that must be met. Error budgets convert that target into a running balance of how much more failure you can afford before breaching your SLO.
The Error Budget Math
SLO target: 99.9% availability over 30 days
Total minutes in 30 days: 43,200
Error budget: 0.1% × 43,200 = 43.2 minutes of allowable downtime
Once that 43.2 minutes is consumed, all discretionary deployments pause until the budget resets.
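The budget math above is simple enough to encode directly. A minimal sketch (the function names are illustrative, not from any particular SLO library):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowable downtime for a given SLO over a window.

    Example from above: 99.9% over 30 days -> 0.1% of 43,200 minutes.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Minutes of budget left after observed downtime (negative = breached)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


budget = error_budget_minutes(0.999, 30)        # ~43.2 minutes
left = budget_remaining(0.999, 30, 30.0)        # ~13.2 minutes left
```

Once `budget_remaining` crosses zero, the policy in the next paragraph kicks in: discretionary deployments pause until the window rolls over.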
Error budgets create a shared language between product and engineering: when the budget is healthy, teams can move fast. When it’s burning, reliability work takes priority. This eliminates the perpetual negotiation about “is it safe to deploy?” — the budget answers the question objectively.
Observability for AI Systems
Traditional observability assumes your system behaves deterministically — given the same input, you get the same output. AI systems break this assumption. LLM calls are probabilistic, context-dependent, and expensive in ways that CPU cycles aren’t.
Observing AI systems requires extending your existing stack with AI-specific signals:
Token usage metrics
Input tokens, output tokens, and total tokens per request. Correlate with cost. Alert when token usage per call exceeds budget thresholds.
LLM latency percentiles
Time-to-first-token and total generation time. P95 and P99 matter most — LLM latency tails are long and directly impact user experience.
Context window utilization
Percentage of context window used per request. Overflow causes truncation, which silently degrades output quality without throwing errors.
Output quality signals
Harder to measure but critical: refusal rates, user feedback signals, downstream task success rates. Quality can degrade without any error metrics changing.
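The first three signals above can be captured per LLM call with a small record type. A sketch, assuming a generic metrics pipeline; the field names here are illustrative and don't correspond to any specific LLM SDK's response schema:

```python
from dataclasses import dataclass


@dataclass
class LLMCallRecord:
    """Per-call observability record for the AI-specific signals above."""
    input_tokens: int
    output_tokens: int
    time_to_first_token_s: float   # feed latency histograms (track P95/P99)
    total_latency_s: float
    context_window: int            # model's context limit, in tokens

    @property
    def total_tokens(self) -> int:
        # Correlate with cost; alert when this exceeds budget thresholds.
        return self.input_tokens + self.output_tokens

    @property
    def context_utilization(self) -> float:
        # Fraction of the context window consumed by the prompt.
        # Values near 1.0 mean truncation risk and silent quality loss.
        return self.input_tokens / self.context_window


record = LLMCallRecord(
    input_tokens=6000, output_tokens=500,
    time_to_first_token_s=0.4, total_latency_s=2.1,
    context_window=8192,
)
```

Emitting these fields as histograms (rather than averages) is what makes the P95/P99 tail latencies and context-overflow cases visible; output-quality signals still need their own feedback channel.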
Articles in This Topic Cluster
The Invisible Meter Running Behind Every AI System
Token usage, cost tracking, and latency monitoring for LLM-powered systems — the metrics practitioners actually need.
AI reliability is constrained by physics, not software
Why infrastructure fundamentals set the reliability ceiling for AI systems — and what that means for your SLOs.
On-call load is a system: what to measure before burnout shows up
Treating on-call health as an observability problem — the metrics that matter before burnout becomes visible.
Google NotebookLM for AIOps and SRE
Using AI to make your observability knowledge base — runbooks, dashboards, postmortems — queryable by your team.