Every field has jargon. Ours happens to be useful. This glossary is written by practitioners for practitioners — no textbook definitions, no vendor marketing speak. Each term here is something you’ll actually encounter on the job.
A
Agent Loop — The cycle where an AI agent observes a system state, takes an action, and evaluates the outcome before deciding on the next action. In SRE, this might mean an AI system detecting a performance anomaly, triggering a diagnostic check, reviewing results, and escalating to a human. Good agent design means the system knows when to stop and hand off to a person.
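A minimal sketch of that cycle, assuming three hypothetical callables (observe, act, evaluate) and a hard step limit. None of these names come from a real framework; the point is the explicit hand-off when the loop runs out of attempts.

```python
from typing import Any, Callable


def agent_loop(
    observe: Callable[[], Any],
    act: Callable[[Any], Any],
    evaluate: Callable[[Any], bool],
    max_steps: int = 3,
) -> dict:
    """Run observe -> act -> evaluate until resolved or the step limit is hit."""
    history = []
    for step in range(1, max_steps + 1):
        state = observe()            # read the current system state
        result = act(state)          # take one action based on it
        history.append((state, result))
        if evaluate(result):         # did the action resolve the problem?
            return {"status": "resolved", "steps": step, "history": history}
    # Step limit reached without success: stop and hand off to a person.
    return {"status": "escalate_to_human", "steps": max_steps, "history": history}


# Illustrative run: the observed error rate drops below 0.5 on the third pass.
error_rates = iter([0.9, 0.7, 0.2])
outcome = agent_loop(
    observe=lambda: next(error_rates),
    act=lambda rate: {"error_rate": rate, "healthy": rate < 0.5},
    evaluate=lambda result: result["healthy"],
)

# Illustrative run: the system never recovers, so the loop escalates.
stuck = agent_loop(
    observe=lambda: 0.9,
    act=lambda rate: {"error_rate": rate, "healthy": rate < 0.5},
    evaluate=lambda result: result["healthy"],
    max_steps=2,
)
```

The step limit is the design choice that matters here: without it, an agent that cannot fix the problem keeps trying instead of paging someone.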
Alert Fatigue — What happens when your alerting system has more false positives than real incidents. Your team starts ignoring alerts. Real problems get missed. The fix isn’t turning off alerts — it’s tuning them ruthlessly so every alert means something actionable. If an alert doesn’t warrant waking someone up at 3 AM, it’s noise.
AIOps — Using machine learning and AI to automate and augment operations — specifically incident detection, diagnosis, and response. Not AI replacing engineers. AI making engineers faster at finding the signal in the noise. Think: ML-powered anomaly detection that actually reduces toil, or an LLM helping you parse logs during a 2 AM page.
B
Blameless Postmortem — A postmortem where you focus on what happened and why, not who messed up. The goal is to fix systems, not people. Works only if leadership actually means it — if someone gets fired after a “blameless” postmortem, your culture is broken and you won’t get honest answers next time.
C
Cardinality — The number of unique values a metric or label can take. High cardinality (like per-user-ID metrics) will crush your monitoring system’s storage and query performance. Know your cardinality limits or your bill and query latency will both explode.
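A rough way to see why one label can blow things up: the worst-case number of time series for a metric is the product of the unique values of each label. The label names and counts below are made up for illustration.

```python
from math import prod
from typing import Mapping


def worst_case_series(label_value_counts: Mapping[str, int]) -> int:
    """Upper bound on time series: every combination of label values occurs."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single request-latency metric.
base_labels = {"endpoint": 50, "status_code": 5, "region": 4}
with_user_id = {**base_labels, "user_id": 100_000}

base_series = worst_case_series(base_labels)    # 50 * 5 * 4
user_series = worst_case_series(with_user_id)   # the same, times 100,000
```

This is an upper bound, since not every combination shows up in practice, but it is the number to check before adding a label.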
Change Failure Rate — One of the DORA metrics. The percentage of deployments that result in a rollback, hotfix, or degradation. Lower is better. High CFR usually means weak testing, poor observability, or deployment processes that don’t catch problems before production.
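As a sketch, CFR is failed deployments divided by total deployments. The record format below (a "failed" flag per deployment) is an assumption; what counts as failed depends on how your team records rollbacks, hotfixes, and degradations.

```python
def change_failure_rate(deployments: list) -> float:
    """Percentage of deployments flagged as failed."""
    if not deployments:
        raise ValueError("no deployments to measure")
    failed = sum(1 for d in deployments if d["failed"])
    return 100 * failed / len(deployments)


# Hypothetical month: 20 deployments, 3 needed a rollback or hotfix.
deployments = [{"id": n, "failed": n in (4, 11, 17)} for n in range(20)]
cfr_percent = change_failure_rate(deployments)
```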
Chaos Engineering — Deliberately breaking things in controlled conditions to find weaknesses before real incidents do. Not reckless — planned, measured, with a kill switch. Run a chaos test, learn what breaks, fix it. Beats being surprised at 2 AM.
Context Window — The amount of text an LLM can process in a single request, measured in tokens. Modern models support hundreds of thousands of tokens. In ops, a bigger context window means you can feed a model your runbook, relevant logs, and a question, and it may actually give useful answers without losing track of the context partway through.
D
Deployment Frequency — How often you ship code to production (DORA metric). Higher frequency generally means smaller, safer changes and faster feedback. Teams deploying daily learn faster than teams deploying quarterly.
DORA Metrics — Four metrics that correlate with software delivery performance: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery. Measuring these tells you if you’re actually improving. Ignore them and you’re guessing.
E
Error Budget — How much unavailability you can tolerate while still meeting your SLO. If your SLO is 99.9% uptime, you have roughly 43 minutes of downtime per month to spend. When it’s burned, you stop risky deployments. It’s the contract between product and engineering that keeps reliability from being an afterthought.
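The arithmetic behind the 43-minute figure, as a small sketch. It assumes a 30-day window and a time-based availability SLO; the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the budget is burned."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


three_nines = round(error_budget_minutes(99.9), 1)     # 43.2 minutes per 30 days
four_nines = round(error_budget_minutes(99.99), 2)     # 4.32 minutes per 30 days
after_outage = round(budget_remaining(99.9, 50), 1)    # negative: budget is burned
```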
Escalation — Moving an incident up the chain — from engineer to senior engineer, from one team to another. Good escalation paths are documented, clear, and actually used. Bad ones mean critical incidents get stuck or ignored at the wrong level.
F
Failure Injection — Deliberately introducing a specific failure under controlled conditions to test a specific assumption about how your system responds. See: Chaos Engineering. The intentionality is what separates it from just breaking things.
Foundation Model — A large model trained on broad data before any task-specific tuning. In this glossary that almost always means a large language model. GPT-4, Claude, and Llama are foundation models. They’re generalists. In ops, you add your context (runbooks, logs, configs) on top to make them useful for your specific environment.
I
Incident — An unplanned interruption to your service or a degradation of performance that impacts users or the business. Don’t call something an incident if nobody noticed and nothing broke. That’s a bug or a maintenance task.
L
LLM (Large Language Model) — A neural network trained on massive amounts of text to generate language. In operations, LLMs help parse logs, generate runbooks, explain incidents, and draft documentation. They’re good at pattern matching and synthesis. They’re unreliable for precise math or authoritative facts — always verify output that matters.
M
MLOps — The discipline of building, deploying, and maintaining ML systems in production. Borrows from DevOps: model versioning, CI/CD pipelines for models, drift monitoring, and observability. If you run ML in production without MLOps discipline, your models will silently degrade.
MTTA (Mean Time to Acknowledge) — How long from an alert firing to someone acknowledging it. Long MTTA means on-call coverage is weak, alerting is too noisy, or both.
MTTD (Mean Time to Detect) — How long for your monitoring to notice a problem. Faster detection gives you more time to fix things before customers feel it.
MTTF (Mean Time to Failure) — Average time between a system starting and its next failure. Useful to track whether your reliability improvements are actually extending the gap between incidents.
MTTR (Mean Time to Recovery) — How long from detection to resolution. This is your north star. Shorter MTTR means faster response, better runbooks, better automation, better on-call training. Everything else flows from it.
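A sketch of how MTTA and MTTR fall out of incident timestamps. The incident records and field names are invented, and it treats the alert time as the detection time, which is an assumption that only holds if alerting is your detection mechanism.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log with ISO 8601 timestamps.
incidents = [
    {"alerted": "2024-01-01T02:00", "acknowledged": "2024-01-01T02:05", "resolved": "2024-01-01T02:45"},
    {"alerted": "2024-01-09T14:00", "acknowledged": "2024-01-09T14:03", "resolved": "2024-01-09T14:33"},
]


def mean_minutes(records: list, start_key: str, end_key: str) -> float:
    """Average gap in minutes between two timestamps across all records."""
    gaps = [
        (datetime.fromisoformat(r[end_key]) - datetime.fromisoformat(r[start_key])).total_seconds() / 60
        for r in records
    ]
    return mean(gaps)


mtta_minutes = mean_minutes(incidents, "alerted", "acknowledged")  # (5 + 3) / 2
mttr_minutes = mean_minutes(incidents, "alerted", "resolved")      # (45 + 33) / 2
```

Means hide outliers, so one six-hour incident can dominate a quiet month; it is worth looking at the individual gaps too.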
O
Observability — The ability to understand what’s happening inside your system from its outputs (logs, metrics, traces). Not monitoring — monitoring tells you something is wrong. Observability tells you why. Good observability means you can answer novel questions about your system without having predicted what you’d need to know.
On-Call — Being responsible for responding to incidents outside normal hours. On-call rotations should be sustainable, documented, and compensated. Bad on-call schedules burn people out. Good ones are boring — alerts are rare and fast to resolve.
P
P50 / P95 / P99 — Percentiles describing performance distribution. P50 is what typical users see. P95 and P99 are what the unluckiest 5% and 1% of requests see. Optimizing only for P50 while P99 is broken is how you silently lose customers. Track the tail, not just the median.
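A small sketch of why the median can look healthy while the tail is on fire. It uses the nearest-rank percentile definition, which is one of several in use (monitoring systems often interpolate or estimate from histograms), and the latencies are made up.

```python
import math


def percentile(values: list, p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [100] * 98 + [2000, 5000]

p50 = percentile(latencies_ms, 50)   # 100 ms
p95 = percentile(latencies_ms, 95)   # 100 ms
p99 = percentile(latencies_ms, 99)   # 2000 ms
```

In this sample P50 and P95 both report 100 ms, and only P99 shows that some requests take twenty times longer.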
Playbook — A high-level guide for responding to a class of incidents — the overall strategy. A runbook is step-by-step. Playbooks guide decisions; runbooks execute them.
Postmortem — A structured review after an incident: what happened, why, what gets fixed. A good postmortem is blameless and produces concrete action items with owners. A bad one is a blame session that produces nothing actionable.
Prompt Engineering — Constructing inputs to an LLM to get useful outputs. In ops, this means giving the model the right context (logs, metrics, runbook), being specific about what you need, and iterating. Good prompting is the difference between an AI tool that helps and one that hallucinates confidently.
R
RAG (Retrieval Augmented Generation) — A technique where an AI retrieves relevant documents before generating a response, instead of relying purely on training data. In SRE: feeding an LLM your runbooks and incident history so it answers based on your environment, not generic knowledge.
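A toy sketch of the retrieve-then-generate shape. It swaps in plain word overlap for the embedding search a real RAG setup would typically use, stops at building the prompt (no model call), and the runbook names and contents are invented.

```python
import re


def words(text: str) -> set:
    """Lowercased alphanumeric words, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: dict, top_k: int = 1) -> list:
    """Rank documents by word overlap with the question (stand-in for embedding search)."""
    q = words(question)
    ranked = sorted(documents, key=lambda name: len(q & words(documents[name])), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, documents: dict, top_k: int = 1) -> str:
    """Put the retrieved documents ahead of the question so the model answers from them."""
    names = retrieve(question, documents, top_k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in names)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical runbook snippets.
runbooks = {
    "disk-full-runbook": "When disk usage exceeds 90 percent, rotate logs and expand the volume.",
    "db-failover-runbook": "If the primary database is unreachable, promote the replica and update DNS.",
}

question = "The primary database is unreachable, what do I do?"
top_match = retrieve(question, runbooks)
prompt = build_prompt(question, runbooks)
```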
Runbook — Step-by-step instructions for a specific incident or operational task. Write it assuming the person reading is panicking at 2 AM with half your context. A good runbook is tested, versioned, and updated after every incident. A bad one is documentation nobody reads.
S
Signal-to-Noise Ratio — The proportion of actionable alerts to junk. If you’re silencing 90% of alerts to get work done, your alerting strategy is broken. High noise kills the ability to detect real problems.
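One way to put a number on it, sketched under the assumption that someone marks each alert as actionable or not during review. The rule names and counts are invented; the per-rule breakdown is what tells you which alerts to tune first.

```python
from collections import Counter


def actionable_ratio_by_rule(alerts: list) -> dict:
    """Fraction of alerts per rule that were marked actionable."""
    total = Counter()
    actionable = Counter()
    for rule, was_actionable in alerts:
        total[rule] += 1
        if was_actionable:
            actionable[rule] += 1
    return {rule: actionable[rule] / total[rule] for rule in total}


# Hypothetical week of alerts: (rule name, was it actionable?)
alerts = [
    ("HighCPU", False), ("HighCPU", False), ("HighCPU", True), ("HighCPU", False),
    ("DiskFull", True), ("DiskFull", True),
]

ratios = actionable_ratio_by_rule(alerts)
noisiest_rule = min(ratios, key=ratios.get)
```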
SLA (Service Level Agreement) — A contract with users about availability. Miss it and you owe something: refunds, credits, apologies. SLA targets are usually looser than internal SLOs, so you miss your own objective (and react) before you breach the contract and pay for it.
SLI (Service Level Indicator) — The actual metric you measure, like “percent of requests completing under 200ms.” SLIs are the truth. You measure them. You track them. Your SLO is set relative to them.
SLO (Service Level Objective) — Your internal reliability target: “99.9% of requests complete under 200ms.” SLOs are the contract between engineering and product. They define what “good enough” means and when stability should take priority over features.
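To make the SLI and SLO relationship concrete: the SLI is a measured percentage, and the SLO is the threshold you compare it against. A sketch using the 200ms and 99.9% figures from the entries above, with invented request latencies.

```python
def latency_sli(latencies_ms: list, threshold_ms: float = 200) -> float:
    """SLI: percentage of requests that completed under the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100 * good / len(latencies_ms)


SLO_PERCENT = 99.9  # target: 99.9% of requests complete under 200ms

# Hypothetical window: 997 fast requests, 3 slow ones.
latencies_ms = [120] * 997 + [450] * 3

sli_percent = latency_sli(latencies_ms)
meets_slo = sli_percent >= SLO_PERCENT
```

Here the measured SLI is 99.7%, so the 99.9% objective is missed and the difference comes out of the error budget.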
T
Telemetry — The raw data collected from your system: logs, metrics, traces, events. Telemetry is the raw material. Observability is what you build from it through correlation and analysis.
Token — The unit an LLM processes. Usually a word or word-fragment. “Hello world” is roughly two tokens. LLM pricing and context limits are measured in tokens, and latency grows with token count. Know your token count or you’ll be surprised by what fits (and what doesn’t).
Token Budget — The maximum tokens available for a single LLM request. Exceed it and the request fails or truncates. In ops, this means you can’t always throw your entire log file and runbook at an LLM — you have to be selective about what context you include.
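A sketch of being selective: keep the most recent log lines that fit inside a budget. The four-characters-per-token estimate is a crude assumption for English text; real counts come from the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def fit_to_budget(log_lines: list, budget_tokens: int) -> tuple:
    """Keep the newest lines that fit in the budget, returned in original order."""
    kept = []
    used = 0
    for line in reversed(log_lines):   # walk from newest to oldest
        cost = estimate_tokens(line)
        if used + cost > budget_tokens:
            break                      # older lines are dropped
        kept.append(line)
        used += cost
    return list(reversed(kept)), used


# Hypothetical log: three 40-character lines, about 10 tokens each.
log_lines = ["a" * 40, "b" * 40, "c" * 40]
selected, tokens_used = fit_to_budget(log_lines, budget_tokens=25)
```

Keeping the newest lines is only one policy. For incidents you might instead keep lines around the first error, which is a judgment call this sketch does not make.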
Toil — Manual, repetitive operational work that scales linearly with system size: restarting services, applying patches, running the same troubleshooting steps over and over. Toil is the enemy of SRE. Your job is to eliminate it through automation.
Toil Budget — The maximum fraction of time acceptable for toil work. Google caps operational work (toil included) at roughly 50% of an SRE’s time, leaving the rest for engineering. If you’re at 90% toil, you’re not doing SRE; you’re firefighting.

