Most teams meet AI agents as a UI trick first: a chat box that can run commands, open tickets, or change a dashboard state. It looks like magic until the first time it touches production and leaves you asking a new question: did the system change because the incident evolved, or because the agent did something plausible?
SREs should treat agents differently. An agent is not a feature. It is a control loop with permission to change the world.
AIOps has historically been about perception. It helps you notice, cluster, rank, and summarize. It shortens the search for a hypothesis. Agents are about execution. They decide, act, verify, and repeat. That shift, from explaining to doing, is the bridge, and reliability discipline has to move across it.
From signals to commitments
A clean way to cut through the marketing is to separate three categories that get blended together in practice.
AIOps produces signals. It tells you what is unusual, what is correlated, what might be risky, and what evidence supports a hypothesis. Copilots propose actions. They draft the query, outline the runbook step, and prepare the artifact so a human can execute quickly. Agents execute actions. They call tools, mutate systems, verify outcomes, and sometimes roll back.
That sounds incremental until you internalize the operational difference. AIOps makes recommendations. Agents make commitments. The moment something commits, it creates blast radius, audit requirements, failure handling, and an accountability chain. You are not adopting a smarter dashboard. You are adopting a new kind of automation.
What changes when the system can act
If an agent can page, silence, roll back, scale, or change configuration, you now own four things that AIOps did not have to own.
You need a decision policy that is explicit about when the agent acts versus asks, and what it does when uncertainty increases. You need a tool model that is explicit about what calls are allowed, what success looks like, what is idempotent, and what must never be retried blindly. You need a verification model that is machine-checkable, time-bound, and paired with rollback triggers. You need an accountability model that answers the uncomfortable questions, including who is responsible when it goes wrong and where the full action trail lives.
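As a rough illustration of what an explicit decision policy can look like, here is a minimal sketch, assuming the runtime can hand the policy a calibrated confidence score and a blast radius estimate. The thresholds and names are placeholders, not a recommendation:

```python
def decide(confidence: float, blast_radius: int, reversible: bool) -> str:
    """Illustrative decision policy: act, ask, or stay advisory, with uncertainty pushing toward asking."""
    if not reversible or blast_radius > 1:
        return "ask"          # irreversible or multi-target actions always require human intent
    if confidence >= 0.9:
        return "act"
    if confidence >= 0.6:
        return "ask"
    return "advise_only"      # low confidence: present evidence, take no action
```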
This is why the intersection with SRE is not philosophical. It is operational. An agent is automation with a probabilistic planner in the middle, and the planner will occasionally choose an action that looks reasonable but is wrong for the moment you are in.
The failure modes that matter in production
Agents usually do not fail by saying something incorrect. They fail by doing something plausible.
One common failure mode is confident action on ambiguous signals. Latency rises, the agent takes the obvious mitigation, and the action is not catastrophic, but it changes evidence and consumes time. The on-call now has to untangle what the system did from what the system was doing. The practical outcome is slower time to first decision, even if the agent’s action was defensible.
Another is partial failure across tool boundaries. Tool A succeeds, tool B fails, the agent retries, and tool A was not idempotent. You get double writes, duplicated tickets, or conflicting changes. You have effectively built a distributed transaction without a transaction manager, and the agent will happily keep pushing until it hits rate limits or a human stops it.
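One way to keep that boundary visible is to make idempotency a declared property of every tool rather than an assumption the planner gets to make. A minimal sketch, with `ToolSpec` and its fields as illustrative names:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolSpec:
    """Illustrative tool declaration: the runtime, not the planner, owns the retry rules."""
    name: str
    call: Callable[..., Any]
    idempotent: bool        # safe to replay on an ambiguous failure?
    max_retries: int = 0    # non-idempotent tools never retry blindly

def execute(spec: ToolSpec, **kwargs):
    """Retry only when the tool has declared that replaying it is safe."""
    attempts = 1 + (spec.max_retries if spec.idempotent else 0)
    last_error = None
    for _ in range(attempts):
        try:
            return spec.call(**kwargs)
        except Exception as err:    # in practice, catch the tool's transport errors only
            last_error = err
    # Surface the failure to a human instead of pushing on.
    raise RuntimeError(f"{spec.name} failed; not retrying blindly") from last_error
```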
Permission drift is quieter and more dangerous. The agent starts read-only. Someone adds a scope because approvals are annoying. A few weeks later it can silently mutate the control plane, not through malice, but through incremental convenience. The reliability story here is privilege escalation via workflow friction.
Retries can also amplify load. Humans see an error and pause. Agents see an error and try again. Ten retries later you have a thundering herd against an API, or a self-induced incident caused by helpful persistence. The reliability signature is load amplification and cascading failure.
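Here is a sketch of how to keep helpful persistence bounded, assuming the whole plan draws from one shared retry budget; the class and parameters are illustrative:

```python
import random
import time

class RetryBudget:
    """A shared cap on retries across the whole plan, so persistence cannot become a thundering herd."""
    def __init__(self, total: int = 5):
        self.remaining = total

    def spend(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True

def call_with_backoff(call, budget: RetryBudget, max_attempts: int = 3, base: float = 0.5, cap: float = 10.0):
    """Exponential backoff with full jitter; stop when either the attempt cap or the shared budget is spent."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as err:    # in practice, only transient transport errors
            last_error = err
            if attempt == max_attempts - 1 or not budget.spend():
                break
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
    raise RuntimeError("retries exhausted; escalate instead of pushing harder") from last_error
```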
Finally, tool injection is the modern version of command injection. Anything the agent reads is an input: ticket text, emails, log lines, alert payloads. If those inputs can steer tool calls, you must treat them as untrusted, even if the source is internal. The attack surface is not new. The interface is just friendlier.
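In practice that means validating every proposed tool call against a strict allowlist and schema, regardless of how convincing the surrounding text was. A hypothetical sketch, with the tool names and validators invented for illustration:

```python
ALLOWED_CALLS = {
    # Hypothetical allowlist: tool name -> parameters the planner may set, each with a validator.
    "create_ticket": {
        "queue": lambda v: v in {"sre", "platform"},
        "summary": lambda v: isinstance(v, str) and len(v) < 200,
    },
    "fetch_logs": {
        "service": lambda v: isinstance(v, str) and v.isidentifier(),
        "minutes": lambda v: isinstance(v, int) and 0 < v <= 60,
    },
}

def validate_call(name: str, params: dict) -> dict:
    """Planner output shaped by untrusted text is itself untrusted: check the schema, not the story."""
    if name not in ALLOWED_CALLS:
        raise PermissionError(f"tool {name!r} is not on the allowlist")
    schema = ALLOWED_CALLS[name]
    unknown = set(params) - set(schema)
    if unknown:
        raise ValueError(f"unexpected parameters {unknown} for {name!r}")
    for key, check in schema.items():
        if key not in params or not check(params[key]):
            raise ValueError(f"parameter {key!r} missing or invalid for {name!r}")
    return params
```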
Where agents and SRE actually meet
To make this real, anchor agents to workflows you already run, then constrain them to actions that are both useful and reversible.
In incident response, AIOps can cluster alerts and summarize evidence, but it still leaves the operator with the hard part: picking the next safe action. An agent can help if it stays bounded. It can open the incident artifact with the right owners and links, pull a consistent evidence bundle into one place, and execute a pre-approved mitigation that has tight guardrails, such as scaling a canary one step or rolling back a single deployment to a known good build. What it should not do is silence alerts globally, roll back multiple services at once, or change routing policy without an approval gate. Those are high-blast-radius actions that require human intent, not probabilistic confidence.
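One way to encode that boundary is a small allowlist of pre-approved mitigations with explicit scope and approval rules. A hypothetical sketch; the action names and fields are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mitigation:
    """Hypothetical pre-approved mitigation: one step, bounded scope, explicitly reversible."""
    max_targets: int
    reversible: bool
    requires_approval: bool

PREAPPROVED = {
    "scale_canary_one_step":   Mitigation(max_targets=1, reversible=True, requires_approval=False),
    "rollback_one_deployment": Mitigation(max_targets=1, reversible=True, requires_approval=False),
    "silence_alerts_globally": Mitigation(max_targets=10_000, reversible=True, requires_approval=True),
}

def may_execute(action: str, targets: list[str], approved_by: str | None) -> bool:
    """High-blast-radius actions need recorded human intent, not probabilistic confidence."""
    rule = PREAPPROVED.get(action)
    if rule is None or len(targets) > rule.max_targets or not rule.reversible:
        return False
    return bool(approved_by) if rule.requires_approval else True
```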
In change risk management, AIOps can predict elevated risk and detect anomalies after deployment. An agent can translate that into disciplined routing rather than absolute blocking. When the risk model flags a change, the agent moves it into a slower lane with explicit review criteria and an operator decision point. The win is not stopping change. The win is making risk visible and enforceable without turning every deploy into a debate.
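A minimal sketch of routing rather than blocking, with the threshold and lane names as placeholders:

```python
def route_change(change_id: str, risk_score: float, threshold: float = 0.7) -> dict:
    """Route, don't block: flagged changes move to a slower lane with an explicit operator decision point."""
    if risk_score < threshold:
        return {"change": change_id, "lane": "standard", "review": "automated checks only"}
    return {
        "change": change_id,
        "lane": "slow",
        "review": "explicit criteria plus operator sign-off",
        "decision_point": "pre-deploy",
    }
```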
In operational hygiene, the value is loop closure. Most SRE toil is not difficult. It is repetitive, cross-tool, and heavy on context switching. A well-designed agent can run a periodic drift sweep, identify candidates based on inventory and signals, propose a remediation plan with explicit blast radius, and then execute only the safe subset while queuing the rest for humans. This is where autonomy pays off, because the work is low creativity and high interruption cost.
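A sketch of that loop closure, where `is_safe`, `remediate`, and `queue_for_human` stand in for the team's own inventory, remediation, and ticketing integrations:

```python
def drift_sweep(candidates, is_safe, remediate, queue_for_human) -> dict:
    """Execute only the safe subset of remediations; queue everything else for a human with context."""
    executed, queued = [], []
    for item in candidates:
        if is_safe(item):            # e.g., reversible, single target, no dependent systems
            remediate(item)
            executed.append(item)
        else:
            queue_for_human(item)
            queued.append(item)
    return {"executed": executed, "queued": queued}
```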
The control loop contract that keeps you safe
If you want one artifact that creates leverage, it is a short contract for every agent that can change production. It should be readable by an on-call under stress.
Start with purpose and success. State what the agent exists to do and what good means as a metric. Declare inputs, including which sources are allowed and which are explicitly untrusted. For each tool the agent can call, define allowed operations, required parameters, idempotency expectations, rate limits, and failure behavior, including when it retries and when it stops. Then define decision policy in plain language: the explicit conditions for acting, the explicit conditions for requiring approval, and the explicit conditions for never acting.
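A compact sketch of what that contract might look like as a data structure; the field names are illustrative, and the point is that every entry stays short enough to review on-call:

```python
from dataclasses import dataclass

@dataclass
class ToolRule:
    """Per-tool rules the runtime enforces, independent of what the planner wants."""
    allowed_operations: list[str]
    idempotent: bool
    rate_limit_per_minute: int
    max_retries: int            # 0 means fail fast and stop

@dataclass
class AgentContract:
    purpose: str
    success_metric: str
    trusted_inputs: list[str]
    untrusted_inputs: list[str]
    tools: dict[str, ToolRule]
    act_when: list[str]         # plain-language conditions for autonomous action
    ask_when: list[str]         # conditions that require approval
    never: list[str]            # conditions under which the agent must not act
```

Keeping the conditions as plain-language strings is deliberate: the contract is meant to be read and challenged by a human under stress, and enforced by the code around it.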
Verification needs to be machine-checkable. Define what signals prove success, how long the agent waits, and what triggers rollback. Add stop rules that force the agent into advisory-only mode when conditions are unstable, plus a clear paging path when those stop rules fire. Finally, require audit completeness: every tool call logged with who, what, when, why, inputs, and outputs, along with a human-readable timeline.
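A minimal sketch of the verification and audit half, assuming the team supplies its own health probe and rollback tooling behind `check` and `roll_back`:

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    """One logged tool call: enough to reconstruct who did what, when, and why."""
    who: str
    what: str
    when: float
    why: str
    inputs: dict
    outputs: dict

def verify_or_roll_back(check, roll_back, audit_log, timeout_s: int = 300, interval_s: int = 30) -> bool:
    """Wait a bounded time for a machine-checkable success signal; past the deadline, roll back and log it."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    result = roll_back()
    audit_log.append(asdict(AuditRecord(
        who="agent", what="rollback", when=time.time(),
        why="verification window expired without a success signal",
        inputs={"timeout_s": timeout_s}, outputs={"rollback_result": result},
    )))
    return False
```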
This contract forces the right kind of clarity. If it cannot be audited, it cannot be automated.
How to measure whether the agent is helping
For AIOps, teams often report alert reduction. That is not the right success metric for agents.
Measure time to first decision, because the goal is faster commitment to a hypothesis and next action. Measure false action rate, because plausible actions that do not move recovery forward still burn time and trust. Measure rollback rate and rollback time, because they tell you how quickly the agent recognizes that it is wrong. Measure human interruption rate, because frequent manual stops indicate poor boundaries or weak stop rules. Measure audit completeness, because if you cannot reconstruct causality, you are one incident away from turning the agent off.
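As a rough illustration, all of these can be computed from the agent's own action log; the per-record fields below are assumptions about what that log contains:

```python
def agent_scorecard(actions: list[dict]) -> dict:
    """Hypothetical scorecard over the agent's action log; the field names are assumptions."""
    total = len(actions) or 1
    return {
        "mean_time_to_first_decision_s": sum(a["decision_latency_s"] for a in actions) / total,
        "false_action_rate": sum(not a["advanced_recovery"] for a in actions) / total,
        "rollback_rate": sum(a["rolled_back"] for a in actions) / total,
        "human_interruption_rate": sum(a["stopped_by_human"] for a in actions) / total,
        "audit_completeness": sum(a["fully_logged"] for a in actions) / total,
    }
```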
The goal is not more automation. The goal is a shorter path from signal to safe action.
Where to start without creating a new incident class
Sequence this the way most teams learned automation the hard way.
Start with advisory agents that draft the next few actions, generate queries, assemble evidence, and create tickets, but perform no production writes. Then allow guarded execution in low-risk domains, using reversible actions only, verified outcomes, and conservative stop rules. Only after that should you expand into controlled production actions, with gradual permissions, break-glass controls, and an explicit reliability SLO for the agent itself.
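A hypothetical sketch of that sequencing as configuration; the tier names and keys are illustrative:

```python
MATURITY_TIERS = {
    # Illustrative sequencing: each tier only unlocks after the previous one has earned trust.
    "advisory": {
        "production_writes": False,
        "allowed": ["draft_actions", "generate_queries", "assemble_evidence", "create_tickets"],
    },
    "guarded": {
        "production_writes": True,
        "allowed": ["reversible_low_risk_actions"],
        "requires": ["verified_outcomes", "conservative_stop_rules"],
    },
    "controlled": {
        "production_writes": True,
        "allowed": ["scoped_production_actions"],
        "requires": ["gradual_permissions", "break_glass_controls", "agent_reliability_slo"],
    },
}
```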
The intersection, stated plainly
AIOps improved perception. SRE improved response. Agents can bridge the gap to execution, but only if you treat them like production automation with a probabilistic brain and a governed blast radius.
Before you give an agent authority, answer one question: what is the smallest action it can take that is both useful and reversible?
Start there. Everything else is marketing.