What 5 Hard‑Earned Lessons Reveal About Running Multi‑Agent AI Systems

A four‑day experiment with a six‑agent AI team shows how fragile monitoring, hidden glue code, and unrealistic cost assumptions can cripple automation. It distills five concrete lessons, plus a three‑step OVA debugging method, for building more reliable AI‑driven workflows.

Tencent Cloud Developer

Background

I had a TypeScript project that required continuous iteration—adding tests, fixing bugs, and refactoring architecture. Manual work was slow and error‑prone, so I decided to try a multi‑agent approach inspired by recent research papers.

Agent Architecture

Lead Agent: designs the solution, breaks tasks into subtasks, assigns them to workers, and validates output quality.

Worker Agents (×3): each runs in an independent Git worktree, executes the assigned task, and reports back to the Lead Agent.

Gatekeeper: reviews code before it is merged into the main branch and provides feedback.

Watchdog: polls every 10 minutes, detects crashes or abnormal logs, diagnoses issues, attempts fixes, and restarts the system.
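For concreteness, here is a minimal TypeScript sketch of how this topology could be wired. The roles match the setup above, but the `agent.js` entry point, the worktree layout, and the restart logic are illustrative assumptions, not the actual runner from this project.

```ts
import { spawn, execFileSync, ChildProcess } from "node:child_process";

type Role = "lead" | "worker" | "gatekeeper" | "watchdog";

interface AgentProc {
  role: Role;
  proc: ChildProcess;
  worktree?: string; // workers run in isolated checkouts
}

// Give each worker its own Git worktree so parallel edits cannot collide.
function createWorktree(repo: string, name: string): string {
  const path = `${repo}-${name}`;
  execFileSync("git", ["worktree", "add", path, "-b", `agent/${name}`], { cwd: repo });
  return path;
}

function launchWorker(repo: string, id: number): AgentProc {
  const worktree = createWorktree(repo, `worker-${id}`);
  // "agent.js" is a placeholder for whatever runs a single agent loop.
  const proc = spawn("node", ["agent.js", "--role", "worker"], { cwd: worktree });
  return { role: "worker", proc, worktree };
}

const workers = [1, 2, 3].map((i) => launchWorker("/path/to/repo", i));

// Watchdog: poll every 10 minutes and restart anything that has exited.
setInterval(() => {
  for (const w of workers) {
    if (w.proc.exitCode !== null) {
      console.error(`worker in ${w.worktree} exited (${w.proc.exitCode}); restarting`);
      // restart and re-diagnosis logic elided
    }
  }
}, 10 * 60 * 1000);
```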

Four‑Day Journey & Key Failures

I launched the system after writing six agent prompts, a runner script, and a watchdog script. Over the next four days I observed a series of failures:

Day 1: Eight versions crashed due to a Bash compatibility issue; monitoring reported “normal” while the process was stuck, leading to an 11.6‑hour outage.

Day 2: Watchdog logged “normal” 68 times while an agent silently hung; the log file remained 0 bytes because output was buffered until process exit (see the streaming‑log sketch after this list).

Day 3: A worker modified many files, but the heartbeat monitor killed it after detecting no log growth.

Day 4: The Bash script grew to 3600 lines and became unreadable, prompting a rewrite in TypeScript.
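The Day 2 failure is worth a sketch. When a child's output is shell‑redirected with `>`, many programs block‑buffer stdout because it is not a TTY, so the file stays at 0 bytes until the process exits. One way to keep the log growing in real time, assuming a Node.js runner (the file names are placeholders):

```ts
import { spawn } from "node:child_process";
import { createWriteStream } from "node:fs";

// Pipe the agent's output through the parent and write it to disk as it
// arrives, instead of shell-redirecting with `>`.
const log = createWriteStream("agent.log", { flags: "a" });
const child = spawn("node", ["agent.js"], { stdio: ["ignore", "pipe", "pipe"] });

child.stdout!.on("data", (chunk) => log.write(chunk)); // log now grows live
child.stderr!.on("data", (chunk) => log.write(chunk));
child.on("exit", (code) => log.end(`\n[exit ${code}]\n`));

// For non-Node children that still block-buffer even when piped, wrapping the
// command with GNU coreutils' `stdbuf -oL` forces line buffering.
```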

Five Lessons Learned

1. Useless monitoring is dangerous. A false sense of safety caused me to ignore obvious problems.

2. Systems evolve from real failures. None of the protective measures anticipated Bash incompatibilities, buffered output, or environment‑variable leakage.

3. Problems hide where you least suspect. The 0‑byte log bug was caused by OS‑level buffering, not a code bug.

4. Never trust tool‑generated numbers blindly. The CLI reported a cost of $196.60, but raw logs showed 92 % of tokens were processed by cheap models (GLM‑4.7, DeepSeek‑V3.1), while Claude Opus was used only 2.5 % of the time.

5. Glue code is the real challenge. Process management, heartbeat, timeout handling, and cost tracking made up >80 % of the codebase and were the source of almost every crash.
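Lesson 1 and the Day 3 incident suggest a liveness check keyed to actual progress rather than log growth: a worker that edits files without logging should not be killed. A rough sketch, with the stall threshold and the skipped directories as assumptions:

```ts
import { statSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Treat any recent write under the worktree as progress, not just log output.
function lastActivity(dir: string): number {
  let latest = 0;
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    if (entry.name === ".git" || entry.name === "node_modules") continue;
    const p = join(dir, entry.name);
    latest = Math.max(latest, entry.isDirectory() ? lastActivity(p) : statSync(p).mtimeMs);
  }
  return latest;
}

const STALL_MS = 30 * 60 * 1000; // assumed threshold; tune per workload

function isStalled(worktree: string, logFile: string): boolean {
  const newest = Math.max(lastActivity(worktree), statSync(logFile).mtimeMs);
  return Date.now() - newest > STALL_MS;
}
```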

OVA Debugging Method

I now start troubleshooting with a three‑step approach:

Observe – verify that the monitoring/measurement tool itself is reliable.

Verify – cross‑check the collected data against raw logs or independent sources.

Analyze – finally examine the business logic.

This reduced my mean time to repair (MTTR) from 11.6 hours to under 30 minutes.
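The Observe step can itself be automated: inject a synthetic fault and require the monitor to flag it before trusting anything else it reports. A sketch, where `checkLogs` is a stand‑in for whatever detection routine the watchdog actually uses:

```ts
import { appendFileSync } from "node:fs";

// OVA step 1 (Observe): before trusting the watchdog, prove it can see at all.
async function selfTestMonitor(checkLogs: (path: string) => Promise<boolean>) {
  const canary = "/tmp/watchdog-canary.log";
  appendFileSync(canary, `FATAL synthetic fault ${Date.now()}\n`);
  const detected = await checkLogs(canary);
  if (!detected) {
    throw new Error("watchdog failed self-test: a monitor that reports 'normal' here is lying");
  }
}
```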

Cost Insight

Analyzing 56 log files revealed that the majority of token consumption came from inexpensive models, while the CLI’s internal pricing inflated the perceived cost. The lesson: always trace numbers back to raw data and validate with an independent source.
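A sketch of that tracing step. The JSONL record shape and the per‑million‑token prices below are assumptions; substitute the fields and rates your CLI actually emits:

```ts
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Reconstruct token share per model from raw logs instead of trusting the
// CLI summary. Assumed format: {"model": "...", "input_tokens": n, "output_tokens": n}
const totals = new Map<string, number>();

for (const f of readdirSync("logs").filter((f) => f.endsWith(".jsonl"))) {
  for (const line of readFileSync(join("logs", f), "utf8").split("\n")) {
    if (!line.trim()) continue;
    const e = JSON.parse(line);
    const t = (e.input_tokens ?? 0) + (e.output_tokens ?? 0);
    totals.set(e.model, (totals.get(e.model) ?? 0) + t);
  }
}

const grand = [...totals.values()].reduce((a, b) => a + b, 0);
for (const [model, t] of totals) {
  console.log(`${model}: ${t} tokens (${((100 * t) / grand).toFixed(1)} %)`);
}
```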

Glue Code Overhead

Across the project, “glue” – process orchestration, error detection, heartbeat, environment isolation, timeout control, cost tracking, and log archiving – accounted for more than 80 % of the code, and each of these seemingly trivial components proved to be a single point of failure.
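As one example of such glue, a timeout wrapper that turns a silent hang into an explicit error (a generic sketch, not this project's actual helper):

```ts
// Race a task against a deadline so a hang surfaces as an explicit failure.
function withTimeout<T>(task: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  return Promise.race([task, deadline]).finally(() => clearTimeout(timer));
}

// Usage: await withTimeout(runWorkerTask(), 15 * 60_000, "worker-1 task");
```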

Takeaway

When building AI‑driven automation, focus on robust infrastructure and glue code, treat monitoring as a first‑class citizen, and validate every metric you rely on. The real path from a demo to a production‑ready system lies in the invisible plumbing, not just the headline‑grabbing AI prompts.
