AIOps & SRE Glossary

Question 1

What is Agent Loop?

Accepted Answer

The cycle in which an AI agent observes a system's state, takes an action, and evaluates the outcome before deciding on the next action. In SRE this might mean an AI system detecting a performance anomaly, triggering a diagnostic check, reviewing the results, and escalating to a human if the problem persists.
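
The observe, act, evaluate cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (agent_loop, read_latency_ms, run_diagnostic), the 500 ms threshold, and the three-step limit are all hypothetical choices made for the example.

```python
def agent_loop(read_latency_ms, run_diagnostic, threshold_ms=500.0, max_steps=3):
    """Observe, evaluate, act; escalate when the anomaly outlasts max_steps.

    read_latency_ms and run_diagnostic are hypothetical callables supplied
    by the caller; they stand in for real telemetry and tooling.
    """
    actions = []
    for _ in range(max_steps):
        latency = read_latency_ms()      # observe the current state
        if latency <= threshold_ms:      # evaluate the outcome so far
            actions.append("healthy")
            return actions
        run_diagnostic()                 # act on the anomaly
        actions.append("diagnostic")
    actions.append("escalate_to_human")  # the loop could not resolve it
    return actions


# Example: latency recovers on the third observation.
recovering = iter([900.0, 850.0, 300.0])
recovered_run = agent_loop(lambda: next(recovering), lambda: None)

# Example: latency never recovers, so the agent hands off to a human.
stuck = iter([900.0, 900.0, 900.0])
stuck_run = agent_loop(lambda: next(stuck), lambda: None)
```

Capping the number of iterations is the design choice that matters here: an agent that can loop forever on a problem it cannot fix is itself an operational risk.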

Question 2

What is Alert Fatigue?

Accepted Answer

The desensitization that sets in when your alerting system fires too often, especially when false positives outnumber real incidents. Your team starts ignoring alerts. Real problems get missed. The fix is tuning alerts ruthlessly so every alert means something actionable.
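
One way to start tuning is to measure how often each alert actually led to action. The sketch below assumes you can export alert history as (alert name, was actionable) pairs; the function name noisy_alerts, the data shape, and the 50% cutoff are assumptions made for illustration.

```python
from collections import Counter


def noisy_alerts(history, min_actionable=0.5):
    """Return alert names whose actionable fraction is below min_actionable.

    history is a list of (alert_name, was_actionable) tuples, a
    hypothetical export format chosen for this example.
    """
    fired = Counter(name for name, _ in history)
    actionable = Counter(name for name, acted in history if acted)
    return sorted(
        name for name in fired
        if actionable[name] / fired[name] < min_actionable
    )


history = [
    ("disk_full", True),
    ("cpu_spike", False),
    ("cpu_spike", False),
    ("cpu_spike", True),
    ("disk_full", True),
]
candidates = noisy_alerts(history)  # alerts worth tuning or deleting
```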

Question 3

What is AIOps?

Accepted Answer

Using machine learning and AI to automate and augment operations, specifically incident detection, diagnosis, and response. Not AI replacing engineers — AI making engineers faster at finding the signal in the noise.

Question 4

What is Blameless Postmortem?

Accepted Answer

A post-incident review that focuses on systems, processes, and contributing factors rather than individual fault. The goal is learning, not accountability theater. If your postmortems result in blame, they stop producing honest timelines.

Question 5

What is Cardinality?

Accepted Answer

The number of unique values a label or dimension can take in a metrics system. High cardinality (e.g., a label with a unique value per user request) is the fastest way to blow up your Prometheus storage and query performance.
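
To see why a single high-cardinality label is so expensive, note that the worst-case number of time series for one metric is the product of the cardinalities of its labels. The sketch below works through that arithmetic; the label names and counts are made-up examples, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def series_upper_bound(label_cardinalities):
    """Worst-case number of time series for one metric name."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a request-counter metric.
bounded = {"method": 5, "status": 6, "instance": 20}
with_request_id = {**bounded, "request_id": 1_000_000}

bounded_series = series_upper_bound(bounded)            # 5 * 6 * 20
unbounded_series = series_upper_bound(with_request_id)  # a million times more
```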

Question 6

What is Error Budget?

Accepted Answer

The amount of unreliability you are allowed before breaching your SLO. If your SLO is 99.9% availability, your error budget is 0.1%, or about 43 minutes over a 30-day month. When the budget is healthy, ship fast. When it is burning, reliability work takes priority.
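
The 43-minute figure follows directly from the arithmetic: 0.1% of the minutes in a 30-day window. A small sketch, assuming a time-based availability SLO and a 30-day window (the function name is hypothetical):

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


budget = error_budget_minutes(0.999)           # roughly 43.2 minutes
remaining = budget - 30                        # after a 30-minute outage
tighter_budget = error_budget_minutes(0.9999)  # roughly 4.3 minutes
```

Each additional nine cuts the budget by a factor of ten, which is why the SLO target deserves negotiation rather than a default.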

Question 7

What is MTTR (Mean Time to Recovery)?

Accepted Answer

The average time from when an incident is detected to when the system is restored to normal. DORA tracks a closely related measure as one of its four key metrics, originally named time to restore service. Improving MTTR means faster detection, better runbooks, and cleaner rollback paths.
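
As a worked example, MTTR under this definition is the mean of (restored minus detected) across incidents. The sketch below assumes incidents are recorded as (detected, restored) timestamp pairs; the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time from detection to restoration.

    incidents is a non-empty list of (detected_at, restored_at) datetimes.
    """
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),  # 30 min
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),  # 90 min
]
average = mttr(incidents)
```

Because a mean is sensitive to outliers, one very long incident can dominate the number, so it is worth looking at the individual durations as well.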

Question 8

What is Observability?

Accepted Answer

The ability to understand the internal state of a system from its external outputs. An observable system lets you ask arbitrary questions about its behavior from outside — without shipping new code to get the answers. Built on metrics, traces, and logs.

Question 9

What is an SLO (Service Level Objective)?

Accepted Answer

A target reliability level expressed as a percentage over a time window, e.g., 99.9% of requests succeed over 30 days. SLOs are the contract between product and engineering. They convert reliability into something measurable and negotiable.
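
For a request-based SLO like the example above, compliance is the ratio of successful requests to total requests over the window, compared against the target. A minimal sketch, assuming you already have both counts for the window (the function name and the sample counts are hypothetical):

```python
def slo_met(good_requests, total_requests, target=0.999):
    """True when the success ratio over the window meets the target."""
    if total_requests == 0:
        return True  # no traffic means nothing was violated
    return good_requests / total_requests >= target


within_target = slo_met(999_100, 1_000_000)  # 99.91% success
below_target = slo_met(998_000, 1_000_000)   # 99.80% success
```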

Question 10

What is a Runbook?

Accepted Answer

A documented procedure for responding to a specific operational event or alert. A good runbook is opinionated, step-by-step, and tested under pressure. A bad runbook is a wall of context with no clear actions.

Question 11

What is Toil?

Accepted Answer

Manual, repetitive, automatable work that scales with service growth and produces no lasting value. Google SRE formalized this concept to help teams justify investing in automation. If doing more of it does not make things better, it is toil.

Question 12

What is RAG (Retrieval Augmented Generation)?

Accepted Answer

A pattern where an LLM is given relevant retrieved documents alongside the user prompt, grounding its response in specific sources rather than relying solely on training data. Critical for reducing hallucinations in ops workflows.
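
The retrieve-then-prompt shape of RAG can be shown without calling any model. The sketch below uses naive word overlap as the retriever, which is a stand-in chosen for brevity; production systems typically use embedding-based search. The function names, the sample documents, and the prompt wording are all assumptions.

```python
def retrieve(query, documents, k=2):
    """Rank documents by how many query words they share; keep the top k."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Assemble a prompt that grounds the model in the retrieved sources."""
    sources = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"


runbooks = [
    "Restart the payments pod when latency exceeds 500 ms.",
    "The cafeteria opens at 9.",
    "Payments latency alerts page the on-call engineer.",
]
prompt = build_prompt("payments latency high", runbooks)
```

The resulting prompt string is what would be sent to the LLM; the irrelevant document never reaches the model.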

Question 13

What is a Token?

Accepted Answer

The unit of text an LLM processes. Roughly 0.75 words per token in English. Every LLM API call costs tokens — both input and output — and has a maximum context window measured in tokens. Token usage is your AI compute bill.
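
The 0.75 words-per-token rule of thumb gives a quick way to budget before making a call. The sketch below is an estimate only: real tokenizers vary by model, and the per-1,000-token prices are hypothetical placeholders, not any provider's actual rates.

```python
def estimate_tokens(text, words_per_token=0.75):
    """Rough token count from a whitespace word count."""
    return round(len(text.split()) / words_per_token)


def estimate_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost of one call given hypothetical per-1,000-token prices."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


incident_summary = " ".join(["word"] * 75)   # a 75-word input
tokens = estimate_tokens(incident_summary)   # about 100 tokens
cost = estimate_cost(1000, 500, input_price_per_1k=0.01, output_price_per_1k=0.03)
```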

Question 14

What is Chaos Engineering?

Accepted Answer

The practice of deliberately injecting failures into a system to test its resilience. Game days, fault injection, and controlled experiments that find weaknesses before production does. The output is confidence, not breakage.

Question 15

What is an Incident?

Accepted Answer

An unplanned interruption or degradation of a service. What counts as an incident depends on your SLOs. If you are burning error budget, you have an incident. Incidents are not failures of people — they are signals from systems.
