toyin

Agentic Operations: Where LLMs Meet the Ops Workflow

How I use LLM agents to close the gap between intent and execution in an ops-heavy environment.

There is a category of operational work that sits in an uncomfortable middle ground: the task is too structured to require deep human judgment on every step, but too context-dependent to be handled by a conventional script. Writing a runbook entry for a deployment failure. Drafting a post-incident summary. Generating a first-pass Kubernetes manifest from a verbal description of what a service needs. These tasks used to just live on the pile of “things I will do when I have time,” which meant they often did not get done well.

LLM agents are changing that calculus, and I want to describe concretely what that looks like in an ops workflow rather than speak in abstractions.

The Gap Between Runbooks and Reality

Every ops team has runbooks, and every ops team knows that runbooks are only as good as the last person who updated them. The moment you finish writing a runbook, the system it describes starts to drift. Dependencies change, endpoints move, a flag gets renamed. The runbook stays the same.

The problem is not that runbooks are a bad idea. The problem is that maintaining them requires consistent effort that competes with every other priority. When an incident happens, you are not updating the runbook — you are putting out the fire. The update happens afterward, sometimes, if you have the energy.

An LLM agent with access to your git history, your ArgoCD application state, and your recent deployment logs can draft a runbook update automatically as part of a post-incident workflow. It will not be perfect. But a draft that is roughly 80% correct and already in a pull request is far more useful than a blank page you intend to fill in later. The human review step becomes editing rather than authoring, and editing is faster.

What Agentic Loops Look Like in Practice

The pattern I find most useful is what I think of as a bounded loop: the agent operates within a defined scope, generates output for human review, and only proceeds after explicit approval. This is not a fully autonomous system. It is a system where the tedious work — gathering context, formatting output, cross-referencing documentation — is handled by the agent, and the judgment calls remain with the engineer.
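A minimal sketch of that bounded loop, under stated assumptions: `BoundedLoop`, `Proposal`, `allowed_steps`, `draft`, and `approve` are hypothetical names invented for illustration. `draft` stands in for whatever agent call you use, and `approve` stands in for the human review step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Proposal:
    """A concrete draft the agent produced for one step."""
    step: str
    content: str


@dataclass
class BoundedLoop:
    # All names here are hypothetical; this is not a real library's API.
    allowed_steps: frozenset          # the defined scope
    draft: Callable[[str], str]       # stand-in for the agent call
    approve: Callable[[Proposal], bool]  # stand-in for human review
    accepted: List[Proposal] = field(default_factory=list)

    def run(self, steps):
        for step in steps:
            # Refuse anything outside the agreed scope.
            if step not in self.allowed_steps:
                raise PermissionError(f"step {step!r} is outside the agent's scope")
            proposal = Proposal(step, self.draft(step))
            # Stop at the first rejection; later steps never run.
            if not self.approve(proposal):
                break
            self.accepted.append(proposal)
        return self.accepted
```

The design choice worth noting is that the loop has no path to the next step that skips `approve`, and a step outside the scope fails loudly instead of being silently skipped.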

Concretely, this looks like:

  • An agent scans recent ArgoCD sync events and open Kubernetes events, then produces a structured “what is currently unhealthy and why” summary to start an on-call shift
  • An agent takes a description of a new microservice (“needs to pull from a specific queue, write to Postgres, expose a health endpoint, run on arm64”) and generates a Kubernetes deployment manifest with sane defaults — resource limits, liveness probes, non-root security context — for an engineer to review and adjust
  • An agent reads the diff of a recent infrastructure PR, cross-references the relevant ArgoCD app definition, and writes a plain-language description of what the change will actually do at runtime
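The first item above, reduced to its deterministic core, might look like the sketch below. It assumes the input is a list of application objects shaped like the Argo CD Application resource (for example, parsed from JSON output of the CLI), with `status.sync.status` and `status.health.status` fields. The function name `unhealthy_summary` is hypothetical, and the plain-language "why" would still come from the agent.

```python
def unhealthy_summary(apps):
    """List applications that are not both Synced and Healthy.

    `apps` is assumed to be a list of dicts following the Argo CD
    Application layout; missing fields are reported as "Unknown".
    """
    lines = []
    for app in apps:
        name = app.get("metadata", {}).get("name", "<unknown>")
        status = app.get("status", {})
        sync = status.get("sync", {}).get("status", "Unknown")
        health = status.get("health", {}).get("status", "Unknown")
        if sync != "Synced" or health != "Healthy":
            lines.append(f"{name}: sync={sync}, health={health}")
    return lines
```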

None of these replace engineering judgment. All of them compress the time between “something needs to happen” and “a human is looking at a concrete proposal.”
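To make the manifest example concrete, here is a sketch of the "sane defaults" an agent might start from. The function name `draft_deployment`, the default CPU and memory values, the `/healthz` path, and port 8080 are all assumptions for illustration, not recommendations. The output is JSON, which is also valid YAML.

```python
import json


def draft_deployment(name, image, health_path="/healthz", port=8080, arch="arm64"):
    """Return a Deployment manifest as a plain dict, intended for human review."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Placeholder numbers; the reviewer should size these for the real workload.
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
        "livenessProbe": {
            "httpGet": {"path": health_path, "port": port},
            "initialDelaySeconds": 10,
            "periodSeconds": 10,
        },
        "securityContext": {
            "runAsNonRoot": True,
            "allowPrivilegeEscalation": False,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Schedule only onto nodes with the requested CPU architecture.
                    "nodeSelector": {"kubernetes.io/arch": arch},
                    "containers": [container],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical service name and image reference.
    manifest = draft_deployment("queue-worker", "registry.example.com/queue-worker:0.1.0")
    print(json.dumps(manifest, indent=2))
```

The queue and Postgres wiring from the description is deliberately left out: connection details and secrets are exactly the context-dependent part an engineer should supply during review.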

Being Honest About Current Limitations

LLM agents are not reliable enough to be autonomous decision-makers for anything that changes production state. The failure modes (confidently wrong context, outdated knowledge about API versions, hallucinated configuration options) are real, and in an ops context a wrong configuration change can cause an outage.

The appropriate role right now is augmentation, not replacement. The agent is a very fast research assistant and first-draft author. It can surface the right documentation section when you describe a symptom. It can generate candidate solutions that you then evaluate. It can structure an incident timeline from log fragments you paste in.

The discipline required is making the human review step non-negotiable. When an agent produces a Kubernetes manifest, it goes through the same review process as a human-authored one — ideally through a pull request, ideally with a second set of eyes. The agent does not get a shorter review path because it worked quickly.

Integrating Agents Into a GitOps Workflow

The natural integration point for agents in a GitOps shop is the pull request. An agent that produces infrastructure changes as pull requests, rather than applying them directly, fits cleanly into an existing review and audit process. The ArgoCD controller only sees what merges to the target branch. Everything before that — however it was generated — goes through the standard gate.

This means the agent’s output is always auditable. You can look at the git history and see exactly what was generated, when, and what changes the reviewing engineer made before merge. The chain of custody is intact. This matters for regulated environments and it matters for post-incident analysis: when something goes wrong, you want to know whether the configuration came from a human decision or an agent suggestion, and whether it was reviewed.
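One way to keep that question answerable is to record provenance in commit message trailers. The sketch below assumes a team convention, not a standard: a `Generated-by:` trailer marks agent-authored commits and a `Reviewed-by:` trailer records the reviewer. The function names are hypothetical, and the parser is a simplified stand-in for git's own trailer handling.

```python
import re

# A trailer line looks like "Key: value"; keys are letters, digits, and hyphens.
TRAILER = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.+)$")


def parse_trailers(message):
    """Return trailers from the final paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # subject line only, so no trailer block
    trailers = {}
    for line in paragraphs[-1].splitlines():
        match = TRAILER.match(line.strip())
        if not match:
            return {}  # the last paragraph is prose, not trailers
        trailers.setdefault(match.group(1).lower(), []).append(match.group(2))
    return trailers


def provenance(message):
    """Return (origin, reviewed) under the assumed trailer convention."""
    trailers = parse_trailers(message)
    origin = "agent" if "generated-by" in trailers else "human"
    reviewed = bool(trailers.get("reviewed-by"))
    return origin, reviewed
```

A convention like this only helps if the merge strategy preserves the commit messages that carry the trailers.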

Where This Is Going

The trajectory here is toward agents that can handle progressively more of the routine operations surface — not because they are more trustworthy, but because the tooling for verifiable, reversible, auditable agent actions is maturing. When an agent’s output is always a pull request, always goes through a linter and policy check before merge, and always lands in a system that continuously reconciles and reports drift, the blast radius of a wrong answer shrinks considerably.
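As an illustration of the policy check step, here is a hand-rolled sketch that flags a Deployment manifest missing the defaults discussed earlier. It is a stand-in for a real policy engine, not a replacement for one. The function name `policy_violations` and the three rules are assumptions chosen for the example.

```python
def policy_violations(manifest):
    """Return human-readable violations for a Deployment manifest dict."""
    violations = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    # A pod-level securityContext applies to containers that do not override it.
    pod_non_root = pod_spec.get("securityContext", {}).get("runAsNonRoot") is True
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        if not container.get("resources", {}).get("limits"):
            violations.append(f"{name}: no resource limits")
        if "livenessProbe" not in container:
            violations.append(f"{name}: no liveness probe")
        non_root = container.get("securityContext", {}).get("runAsNonRoot")
        if non_root is False or (non_root is None and not pod_non_root):
            violations.append(f"{name}: may run as root")
    return violations
```

Run in CI against every pull request, a check like this fails the same way whether the manifest came from an agent or a person, which is the point.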

The goal is not to automate judgment out of ops. It is to automate the parts that do not require judgment, so the parts that do get the attention they deserve.