Can Four LLM‑Powered Agents Build a Real Kubernetes Cluster Without Human Help?
An experiment in which four LLM‑driven autonomous agents (Architect, Builder, Security Sentinel, and QA Tester) attempted to provision a Proxmox‑based HA Kubernetes cluster on real hardware, revealing costly context drift, emergent coordination failures, and stark differences between Gemini and Claude at debugging the agents' own code.
Experiment Overview
Goal: build an autonomous SRE team of four LLM‑driven agents that can turn two idle mini‑PCs running Proxmox VE 9 into a production‑ready HA Kubernetes cluster with an ingress controller and a hello‑world app, while keeping LLM usage costs low.
Agent Architecture
Architect_Zero – plans and coordinates, can read/write documents but cannot execute commands; delegates all tasks.
DevOps_Builder – executes Terraform, Ansible and SSH commands, writes files, and directly manipulates the infrastructure.
Security_Sentinel – audits plans for security issues; read‑only audit permissions.
QA_Tester – runs health checks and SSH probes to verify deployments.
Each agent runs a LangChain ReAct loop (max 100 reasoning steps per turn) and communicates via a shared Redis pub/sub channel using @AgentName tags.
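For orientation, here is a minimal sketch of the message‑routing layer, assuming redis-py and hypothetical channel and function names (the actual repository may structure this differently):

```python
import redis

CHANNEL = "agents"   # hypothetical shared channel name
MAX_STEPS = 100      # per-turn ReAct step budget from the experiment

def run_react_turn(agent_name: str, message: str) -> str:
    """Placeholder for the LangChain ReAct loop (max 100 reasoning steps).
    In the real system this invokes the agent's tools and the LLM."""
    return f"@Architect_Zero ack from {agent_name}"

def agent_loop(agent_name: str) -> None:
    r = redis.Redis()
    sub = r.pubsub()
    sub.subscribe(CHANNEL)
    for msg in sub.listen():
        if msg["type"] != "message":
            continue
        text = msg["data"].decode()
        # Only react to messages explicitly addressed to this agent.
        if f"@{agent_name}" not in text:
            continue
        reply = run_react_turn(agent_name, text)
        r.publish(CHANNEL, reply)

if __name__ == "__main__":
    agent_loop("DevOps_Builder")
```

With @‑tag matching as the only filter, every agent still receives every message on the channel, which is exactly what makes the think‑message leakage described later possible.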
Cost constraints led to the selection of GPT‑5‑mini as the LLM backbone; all 56 sessions together cost about $15.
Key Failures
Day 1 – Missing Terraform provider
DevOps_Builder repeatedly referenced a non‑existent Terraform provider, hashicorp/proxmox, generated invalid data/proxmox and local.binary resources, and wrote variable assignments to proxmox.vars.tf instead of a .tfvars file, causing HCL parsing errors (a .tf file must contain configuration blocks, not bare value assignments). The agent looped through more than 20 save_file and run_terraform calls without converging. It eventually discovered the .auto.tfvars convention but never cleaned up the corrupted file, and at one point it tried to call request_security_audit, a tool that exists nowhere in its toolset; the model simply invented it.
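For contrast, a minimal sketch of a valid layout, assuming the community bpg/proxmox provider (a real registry provider for Proxmox, though the experiment's repository may use a different one; the version constraint is illustrative):

```hcl
# --- main.tf ---
terraform {
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"   # real registry provider; hashicorp/proxmox does not exist
      version = "~> 0.60"       # illustrative constraint
    }
  }
}

variable "pm_api_url" {
  type = string
}

# --- cluster.auto.tfvars ---
# Value assignments belong in *.tfvars files, which Terraform auto-loads;
# placing them in a .tf file is an HCL parse error.
pm_api_url = "https://192.168.1.10:8006/api2/json"  # placeholder endpoint
```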
Day 2 – Security audit loop
Security_Sentinel detected sensitive data in terraform.tfstate, issued a STOP command, and demanded state file deletion, credential rotation, Git history purge, remote backend setup, and pre‑commit hooks. DevOps_Builder complied, but Terraform recreated the state file on the next run, triggering another STOP. This loop repeated 13 times, consuming about seven hours on compliance documentation rather than VM creation.
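The remote‑backend remediation Security_Sentinel kept demanding is itself a single block of configuration; a minimal sketch, assuming an S3‑compatible backend (bucket and key names are placeholders, and the experiment never reached this step):

```hcl
# backend.tf -- keeps terraform.tfstate off the local disk and out of Git,
# which would have broken the delete/recreate loop.
terraform {
  backend "s3" {
    bucket = "example-tfstate-bucket"            # placeholder
    key    = "proxmox-cluster/terraform.tfstate" # placeholder
    region = "us-east-1"
  }
}
```

Remotely stored state still contains the same sensitive values, so encryption and access controls on the backend remain necessary; the point is that terraform apply no longer recreates a local file for the auditor to flag.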
Days 3‑5 – Progress and new bugs
After the provider name was corrected, risk statements added, and step limits raised, DevOps_Builder successfully SSHed into both nodes, enumerated their devices, ran terraform init, and generated Ansible playbooks. A Jinja2 bug then iterated over each character of an IP address string, emitting a separate nftables rule per character (e.g., ip saddr 1 tcp dport 22 accept). The agent correctly diagnosed the string‑vs‑list iteration problem and proposed a fix, but a human still had to apply it.
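The bug is a classic Jinja2 footgun: iterating over a string yields its characters. A minimal sketch of the failure and one possible fix, with placeholder variable names:

```jinja
{# Buggy: allowed_ssh is the string "192.168.1.5", so this loop emits
   one nftables rule per character ("1", "9", "2", ...). #}
{% for ip in allowed_ssh %}
ip saddr {{ ip }} tcp dport 22 accept
{% endfor %}

{# Fixed: normalize the variable to a list before iterating. #}
{% for ip in ([allowed_ssh] if allowed_ssh is string else allowed_ssh) %}
ip saddr {{ ip }} tcp dport 22 accept
{% endfor %}
```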
Context Drift
Architect_Zero rewrote user_requirements.md to a minimal statement – “User has a 2‑node Proxmox cluster” – dropping all Kubernetes‑related requirements (HA, ingress, credential rotation). Subsequent plans focused on 900+ lines of nftables rules, ignoring the primary goal. When the conversation exceeded the 8‑message memory window, the agents lost the original context and optimized sub‑tasks instead of the final deliverable.
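One mitigation is to pin the original requirements outside the rolling window, so that only the conversational tail is truncated. A minimal sketch, assuming a chat‑style message list (names hypothetical):

```python
# Hypothetical context assembly: the requirements document is pinned in the
# system prompt and never falls out of scope; only the chat tail is windowed.
WINDOW = 8

def build_context(system_prompt: str, requirements: str,
                  history: list[dict]) -> list[dict]:
    pinned = {
        "role": "system",
        "content": f"{system_prompt}\n\nNON-NEGOTIABLE REQUIREMENTS:\n{requirements}",
    }
    return [pinned] + history[-WINDOW:]
```

Pinning costs a fixed token overhead per call, but it means a rewritten user_requirements.md can never silently become the only surviving record of the goal.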
Emergent Dysfunction Patterns
Approval loop: the architect proposes a plan, security approves, the architect resubmits the same plan, security approves again; no actual work happens.
Escalation cascade: DevOps reports an error, the architect drafts a new plan, security reviews it, and DevOps receives the same erroneous instruction and repeats the failure.
Option‑menu disease: agents present 3‑4 protocol tokens (e.g., FIX_TEMPLATE_AND_RETRY, HOLD, PLEASE_PROVIDE_RENDERED_FOR_REVIEW) for the user to choose from, turning the user into a manual message router.
Think‑message leakage: internal reasoning is published to Redis, so DevOps reacts to the architect's "thoughts" rather than to concrete decisions (see the sketch below).
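The leakage pattern in particular has a mechanical fix: tag every published message with a sender and a kind, and have agents act only on concrete decisions. A minimal sketch with hypothetical field and channel names:

```python
import json
import redis

r = redis.Redis()

def publish(sender: str, kind: str, content: str) -> None:
    # Tag every message with its kind; "think" messages are for logging only.
    r.publish("agents", json.dumps(
        {"sender": sender, "kind": kind, "content": content}))

def should_act_on(me: str, raw: bytes) -> bool:
    msg = json.loads(raw)
    # Ignore our own messages and anything that is not a concrete decision.
    return msg["sender"] != me and msg["kind"] == "decision"
```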
Gemini vs Claude in Agent Code Generation
Gemini 3 Pro/3.1 Pro produced a runnable framework (Redis pub/sub, LangChain ReAct loop, Dockerfiles, tool definitions) but missed 13 structural bugs; its remediation was to tweak prompts rather than address root causes.
Claude identified the same 13 bugs in the first prompt, including a read_env("ALL") call that leaked the OpenAI API key. Security_Sentinel flagged this as a SECURITY ALERT, and Claude traced the leak to the _think() function that published to Redis without a sender filter.
Both assistants wrote correct Python code; Claude provided deeper diagnostic insight, while Gemini only patched surface symptoms.
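For illustration, a minimal sketch of the kind of guard that would have prevented the read_env("ALL") leak; the tool name matches the report, everything else is an assumption:

```python
import os

SENSITIVE_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD")

def read_env(name: str) -> str:
    """Environment-reading tool that refuses bulk dumps and redacts secrets."""
    if name.upper() == "ALL":
        raise ValueError("Bulk environment dumps are not allowed")
    value = os.environ.get(name, "")
    # Redact anything whose name suggests credential material.
    if any(marker in name.upper() for marker in SENSITIVE_MARKERS):
        return "<redacted>"
    return value
```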
Practical Recommendations
Use the planning output (HA topology, technology selection) as a solid baseline for a human SRE.
Recognize that syntactically correct Terraform does not guarantee a successful terraform apply against real APIs.
Security agents need a calibrated threat model rather than static rule checks to avoid rubber‑stamp or endless blocker behavior.
Be aware of token limits: the implementation plan grew to ~27k tokens, exhausting each agent's context window and leaving little room for reasoning (a budget‑check sketch follows this list).
Recovering from erroneous code is the hardest part; human iterative debugging outperforms exhaustive variant generation by agents.
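On the token‑limit point, a cheap guard is to measure artifacts before injecting them into prompts. A minimal sketch using tiktoken; the encoding choice and the budget are assumptions:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumption: suitable for the model in use
MAX_PLAN_TOKENS = 4000                     # hypothetical budget

def check_plan_budget(plan: str) -> None:
    # Fail fast instead of silently starving the agent of reasoning room.
    n = len(enc.encode(plan))
    if n > MAX_PLAN_TOKENS:
        raise ValueError(
            f"Plan is {n} tokens; summarize before injecting into prompts")
```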
Open‑Source Repository
GitHub: https://github.com/beniamin/sre-ai-team-experiment
