
Agentic AIOps: How to Build the Guardrails for Autonomous Infrastructure

By Pavan Madduri
AI agents can remediate outages in seconds. The real challenge is making sure they don’t turn small incidents into P0 disasters.

AIOps spent its first decade as a very expensive dashboard. Ingest telemetry, correlate logs, surface the alert that an engineer would have found anyway — just faster. Useful, but not transformative.

That era is ending. The industry is now building Agentic AIOps: systems with write-access to production, designed not just to identify a failing service at 3 AM but to fix it before your on-call engineer's phone ever rings.

The engineering case is straightforward. An AI agent that detects a latency spike, identifies the degraded node pool, scales the replacement capacity and rebalances the workloads — all within 90 seconds — eliminates an entire category of operational pain. For SRE teams managing sprawling multi-cluster environments, that is not a nice-to-have. It is a competitive necessity.

But giving a non-deterministic system write-access to production infrastructure is a different category of risk than anything AIOps has introduced before. If an LLM hallucinates a code snippet, a developer deletes it. If an Agentic AIOps system hallucinates an infrastructure command, it can de-provision a payments database or misconfigure a load balancer at 3 AM on a Friday. The blast radius is not a bad pull request. It is a P0 incident.

Velocity without governance is just a faster way to break things. Here is the architecture that makes Agentic AIOps safe enough to actually run in production.


You Cannot Automate What You Haven't Declared

The first hard truth: Agentic AI cannot operate safely in imperative, undocumented environments.

Dropping an autonomous agent into infrastructure built on manual click-ops and undocumented bash scripts is like installing a self-driving system in a 1990s manual transmission car. The AI knows what turn to make. It has no steering column to execute it.

The mandatory prerequisite is a declarative GitOps control plane. Tools like ArgoCD and Crossplane enforce a model where the desired state of infrastructure is version-controlled, machine-readable and immutable outside of approved change processes. This gives the agent a predictable operating boundary — and critically, it gives your team an auditable record of every automated change. Branch protection rules and signed commits (via Sigstore) close the remaining gap: even if the agent has Git write-access, it cannot bypass required reviewers or push unsigned commits to protected branches.
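As a concrete sketch of that control plane, a minimal ArgoCD Application declares the desired state in Git and lets the controller reconcile the cluster toward it. The repository URL, paths and names below are hypothetical:

```yaml
# Illustrative ArgoCD Application: the desired state lives in Git,
# and ArgoCD continuously reconciles the cluster toward it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend-web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-config  # hypothetical repo
    targetRevision: main
    path: apps/frontend-web
  destination:
    server: https://kubernetes.default.svc
    namespace: frontend-web
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the declared state
```

The `selfHeal` flag is what makes the boundary real: even a well-intentioned manual hotfix gets reverted unless it lands in Git through the approved change process.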

No GitOps foundation, no Agentic AIOps. That sequencing is non-negotiable.


Shadow Mode: The AI as Operator, the Human as Approver

The most common mistake organizations make when deploying AI agents is over-provisioning permissions at the start.

Assigning cluster-admin to an AI service account because it is convenient is how you end up with an autonomous system that can delete your production namespace.

The correct model starts with Shadow Mode. When an incident fires, the agent does the diagnostic work: it queries Prometheus, isolates the degraded pod, traces the dependency chain and drafts the remediation manifest. But its RBAC permissions explicitly exclude the ability to apply that manifest directly to the cluster. The agent's write boundary stops at the Git repository. Its only permitted action is opening a Pull Request.
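A Shadow Mode permission set might look like the following sketch: the ClusterRole grants only read verbs, so the service account (a hypothetical `aiops-agent`) can diagnose but never mutate cluster state. Its write path exists only in Git, not in Kubernetes RBAC:

```yaml
# Illustrative read-only ClusterRole for a Shadow Mode agent:
# it can inspect workloads but holds no create/patch/delete verbs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aiops-agent-shadow
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "events", "deployments", "replicasets"]
    verbs: ["get", "list", "watch"]   # diagnostics only; no write verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aiops-agent-shadow
subjects:
  - kind: ServiceAccount
    name: aiops-agent        # hypothetical agent service account
    namespace: aiops
roleRef:
  kind: ClusterRole
  name: aiops-agent-shadow
  apiGroup: rbac.authorization.k8s.io
```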

This reframes the on-call experience entirely. The engineer waking up at 3 AM is not hunting through logs trying to understand what broke. They are reviewing a PR with the fix already written. If the logic holds, they approve it and the GitOps engine handles the rest. The cognitive load drops from an hour of investigation to a five-minute review.

Once an agent demonstrates consistent accuracy across hundreds or thousands of incidents in Shadow Mode — and once your team has defined what "accurate" actually means for your specific remediation patterns, which is itself a non-trivial engineering problem — you can selectively grant it RBAC permissions to auto-merge low-risk changes. Start narrow: permission to scale a specific deployment in a specific namespace, nothing more.
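That narrow escalation can be expressed as a namespace-scoped Role pinned to a single named deployment's scale subresource. The deployment and namespace names here are placeholders:

```yaml
# A deliberately narrow escalation: the agent may scale one named
# deployment in one namespace, and nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aiops-agent-scale-frontend
  namespace: frontend-web
rules:
  - apiGroups: ["apps"]
    resources: ["deployments/scale"]
    resourceNames: ["frontend-web"]   # hypothetical deployment name
    verbs: ["get", "update", "patch"]
```

Because `resourceNames` pins the rule to one object, a hallucinated manifest targeting any other deployment fails at the API server, not in a postmortem.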

Blast Radius Containment: Namespace Isolation and Ephemeral Testing

RBAC prevents the agent from acting outside its designated scope. But RBAC alone does not protect you from the agent doing something wrong within that scope.

ResourceQuotas and namespace isolation form the second containment layer — an agent authorized to remediate the frontend-web namespace should have zero visibility into the payments-gateway namespace, enforced through both RBAC policy and network policy (Cilium or Calico for true traffic-level isolation).
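One way to express both layers, with hypothetical names and limits, is a quota that caps what the agent can provision plus a NetworkPolicy that confines egress to the agent's own namespace:

```yaml
# Caps what any remediation in this namespace can provision,
# bounding the cost of a runaway scaling decision.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: frontend-web-quota
  namespace: frontend-web
spec:
  hard:
    requests.cpu: "20"       # illustrative limits
    requests.memory: 40Gi
    pods: "50"
---
# Confines egress to pods within the same namespace, so traffic
# from frontend-web cannot reach payments-gateway at all.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
  namespace: frontend-web
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - podSelector: {}    # allow only intra-namespace traffic
```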

Before any AI-generated change reaches production, route it through an ephemeral sandboxed environment that mirrors your production topology. Stateless services can be mirrored cleanly; stateful dependencies like databases require traffic replay or synthetic load to approximate real conditions. This is operationally complex, but skipping it means your first real test of the agent's logic happens in production.

Policy-as-Code: The Automated Gatekeeper

Human PR review does not scale indefinitely. Once your agent is generating hundreds of remediation proposals per day, requiring a human to approve every one defeats the purpose. Policy-as-Code is how you automate the gatekeeping without removing the guardrails.

Kyverno is the right tool for Kubernetes-native policy enforcement. Every remediation PR the agent generates passes through Kyverno admission webhooks before it can merge. Policies are explicit: no containers running as root, all new workloads require resource limits and billing labels, images must come from approved registries. If the agent's proposed fix violates any of these constraints, the PR is automatically rejected and the agent is forced to generate a compliant alternative.
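A Kyverno ClusterPolicy covering two of those constraints might look like this sketch; policy and rule names are illustrative:

```yaml
# Enforces two of the guardrails from the text: non-root containers
# and mandatory resource limits. Violating manifests are rejected
# at admission, forcing the agent to regenerate a compliant fix.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: agent-remediation-guardrails
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-run-as-nonroot
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Containers must not run as root."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "All containers must declare CPU and memory limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"   # wildcard: any non-empty value
                    cpu: "?*"
```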

OPA/Gatekeeper is the better choice when your policy scope extends beyond Kubernetes into broader infrastructure — cloud IAM, Terraform plans, cross-service dependencies. For most teams starting with Agentic AIOps, Kyverno is the faster path to enforcement.

The rules of your infrastructure become self-enforcing. Compliance stops being a quarterly audit and becomes a continuous, automated check on every single change the agent proposes.


The Engineer's Role Does Not Disappear — It Elevates

Agentic AIOps does not replace the platform engineer. It eliminates the observability tax: the hours spent querying logs, chasing alerts and manually tracing failures in systems too complex for any individual to hold in their head.

The AI is fast and tireless at pattern recognition across thousands of signals simultaneously. It is not good at judgment calls that require business context, architectural intuition or understanding of what a "normal" traffic pattern looks like after a major product launch.


Anchor your agents inside a GitOps framework. Enforce zero-trust RBAC from day one. Let Policy-as-Code handle the compliance enforcement automatically. The AI raises your operational floor. The platform engineer still sets the ceiling.


About the Author
Pavan Madduri

Pavan Madduri is a Senior Platform Engineer at W.W. Grainger and a CNCF Golden Kubestronaut, a designation held by fewer than 400 practitioners globally. An IEEE Senior Member and active open-source contributor to CNCF Dragonfly, Pavan specializes in hyperscale AI infrastructure, NUMA-aware scheduling and Zero-Trust autonomous incident response.
