ABOUT THE SITE
Hi, I’m Nate. I built this because no one else was writing what I needed to read.
I’m an SRE and platform engineering practitioner. I’ve been on call, written runbooks at 2am, and watched AI systems fail in ways that traditional reliability tooling wasn’t built to catch. This site is where I document what actually works.
Why AIOps SRE Exists
Most content about AIOps is written by vendors selling platforms, or analysts describing the category from the outside. Very little of it comes from people actually running these systems in production and dealing with what breaks.
When I started working on AI-augmented operations, building pipelines that use LLMs for alert triage, anomaly detection, and runbook automation, I kept running into the same problem. The SRE playbook doesn’t fully cover AI systems, and the AI/ML literature doesn’t cover reliability engineering. There was a gap.
AIOps SRE is my attempt to fill it. Every article comes from something I’ve actually worked through: a production incident, a tool evaluation, an architecture decision, a failed experiment I learned from.
What I Write About
AI Observability
How to monitor LLM calls, track token costs, detect quality drift, and build dashboards that actually tell you something useful.
Incident Management
On-call practices, runbook design, postmortem culture, and how AI is changing what incident response looks like.
SRE Fundamentals
SLOs, error budgets, toil reduction, and the organizational patterns that make reliability engineering actually work.
Platform Engineering
Building internal developer platforms, golden paths, and the infrastructure that lets teams move fast without breaking things.
AIOps in Practice
Real evaluations of AIOps tools, honest takes on what works versus the hype, and patterns for integrating AI into operations workflows.
The Human Side
On-call burnout, building sustainable operations culture, and what it actually feels like to do this work over the long haul.
What This Site Is Not
This is not a vendor review site. I don’t accept sponsored content or write promotional articles. When I say something works, it’s because I’ve used it. And I’ll tell you when something doesn’t work too.
This is not a news aggregator. I’m not trying to cover everything happening in AI or DevOps. I want to write things that are still useful in six months, not just today.
This is not an academic blog. I care about being accurate, but I care more about being useful. When I explain something, I explain it the way you’d explain it to someone on your team who needs to understand it to do their job.
Get in Touch
The best way to reach me is through the comments on any article. I read them all and try to respond to substantive questions.
If you’re working on something interesting in AIOps or SRE and want to share your experience, I’m open to hearing about it. Guest posts and practitioner perspectives are welcome.
Ready to dive in?
Start with the fundamentals, browse by topic, or jump straight into the latest articles.

