Stop Treating LLMs as 'All‑Purpose Tools': Practical Spring AI Multi‑Agent Architecture for Production
This article analyses why a single‑agent LLM approach quickly hits scalability, context, and governance limits, and presents a production‑ready Spring AI Multi‑Agent design—including layered architecture, agent metadata, skill engineering, routing strategies, orchestration, resilience, A2A service discovery, Kubernetes deployment, observability, security, and cost‑control—backed by concrete Java code examples.
Why a Single Agent Fails in Production
Many teams start with a single LLM agent that bundles intent recognition, order lookup, inventory checks, discount calculation, refund approval, logistics tracking, and human escalation into one prompt. While this works for prototypes, real‑world scenarios such as e‑commerce support, financial advice, or enterprise knowledge bases expose four critical problems:
Cognitive overload: the model must choose among many tools, leading to unstable decisions and noisy reasoning.
Context pollution: mixed intents (order, product, after‑sale) share the same context window, causing the model to reuse irrelevant history.
Poor scalability: a traffic spike for order queries forces the whole service to scale, inflating costs.
Governance blind spots: different business teams cannot audit or gray‑release a monolithic prompt.
Multi‑Agent as the Engineering Solution
Instead of “multiple bots chatting”, a production Multi‑Agent system splits capabilities into dedicated agents with clear responsibilities, tool permissions, context boundaries, and governance policies. A typical e‑commerce chatbot can be decomposed into:
Intent Agent – intent detection and confidence scoring.
Order Agent – order query, cancellation, refund pre‑check.
Product Agent – product info, inventory, recommendation.
AfterSale Agent – logistics, dispute handling.
Risk Agent – risk validation and policy checks.
Human Handoff Agent – escalation to human operators.
Each agent is a Spring bean implementing a common Agent interface and described by an AgentDescriptor record:
public record AgentDescriptor(
String name,
String version,
String description,
Set<String> capabilities,
Set<String> allowedToolNames,
Duration timeout,
int maxConcurrency,
boolean streamingSupported,
AgentStatus status) {}

The descriptor is the contract for registration, discovery, rate‑limiting, and health checks.
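The common Agent interface itself is not shown in the source. A minimal, dependency-free sketch of the contract might look like the following; it is simplified to a bare capabilities() set rather than the full AgentDescriptor, and all names here are hypothetical:

```java
import java.util.Set;

// Hypothetical sketch of the common Agent contract referenced above,
// simplified to a capabilities set instead of the full AgentDescriptor.
interface Agent {
    Set<String> capabilities();                   // what this agent can do
    String handle(String input);                  // simplified: real code would take an AgentContext

    default boolean supports(String capability) { // capability check a router could use
        return capabilities().contains(capability);
    }
}

// Example implementation: a trivial order agent.
class OrderAgent implements Agent {
    public Set<String> capabilities() { return Set.of("order-query", "order-cancel"); }
    public String handle(String input) { return "order handled: " + input; }
}
```

In the real system each implementation would be a Spring bean so the orchestrator can discover all agents by injecting a `List<Agent>`.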
Skill Engineering and Tool Governance
Spring AI’s @Tool annotation exposes Java methods to the model, but production‑grade skills must also provide metadata, permission checks, resilience, metrics, and idempotency. Example – a production‑ready inventory skill:
@Component
public class InventorySkill {
private static final Logger log = LoggerFactory.getLogger(InventorySkill.class);
private final InventoryClient inventoryClient;
private final MeterRegistry meterRegistry;
public InventorySkill(InventoryClient inventoryClient, MeterRegistry meterRegistry) {
this.inventoryClient = inventoryClient;
this.meterRegistry = meterRegistry;
}
public SkillMetadata metadata() {
return new SkillMetadata(
"inventory_query",
"1.0.0",
"inventory-team",
"product",
SkillRiskLevel.LOW,
Set.of("product-agent", "orchestrator-agent"),
Duration.ofSeconds(2),
300,
false);
}
@Tool(name = "inventory_query", description = "Queries the sellable inventory of a given SKU in a given warehouse. For inventory enquiries only; it does not reserve stock.")
@CircuitBreaker(name = "inventoryClient", fallbackMethod = "fallback")
@RateLimiter(name = "inventoryClient")
public InventoryResponse queryInventory(@ToolParam(description = "Product SKU code, e.g. SKU-10086") String sku,
@ToolParam(description = "Warehouse code; may be empty, in which case nationwide aggregated inventory is queried") String warehouseCode) {
Timer.Sample sample = Timer.start(meterRegistry);
try {
InventoryResponse response = inventoryClient.query(sku, warehouseCode);
meterRegistry.counter("agent.skill.success", "skill", "inventory_query").increment();
return response;
} catch (RuntimeException e) {
meterRegistry.counter("agent.skill.error", "skill", "inventory_query").increment();
log.warn("Inventory query failed, sku={}, warehouse={}", sku, warehouseCode, e);
throw e;
} finally {
sample.stop(meterRegistry.timer("agent.skill.duration", "skill", "inventory_query"));
}
}
public InventoryResponse fallback(String sku, String warehouseCode, Throwable error) {
return new InventoryResponse(sku, warehouseCode, -1, "INVENTORY_TEMPORARILY_UNAVAILABLE");
}
}

Key governance points are explicitly listed in the metadata: owner, risk level, allowed agents, timeout, QPS limit, and idempotency flag.
Context Management
Four context layers are maintained:
Request Context – requestId, traceId, deadline (single request).
Session Context – recent messages, user preferences (one session).
Task Context – current task state, visited agents (one task).
Long‑Term Memory – persisted user profile, historical tickets (cross‑session).
Context is stored in Redis with a 12‑hour TTL and compressed after a configurable window (6‑12 turns). Example of the AgentContext class:
public class AgentContext {
private final String tenantId;
private final String sessionId;
private final String threadId;
private final List<Message> recentMessages = new ArrayList<>();
private final Map<String, Object> attributes = new ConcurrentHashMap<>();
private final ExecutionTrace trace = new ExecutionTrace();
private final Instant createdAt = Instant.now();
// getters, put(), get() omitted for brevity
}

Routing Strategy – Rules First, Model Second
A three‑stage strategy chain is recommended:
Rule‑based routing (high confidence, fast, deterministic).
Embedding / classification model for ambiguous intents.
LLM fallback that returns a strict JSON payload (e.g., {"targetAgent":"order-agent","reason":"rule matched"}).
In practice, a rule‑based DecisionStrategy handles stage 1, and an LLM‑based strategy serves as the final fallback when neither rules nor the classifier produce a confident match.
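As a dependency‑free illustration of stage 1, a rule‑based strategy can map keywords to target agents and return nothing when no rule matches, letting the chain fall through to the classifier or LLM. The class and rule contents here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of stage 1, rule-based routing: fast keyword rules that
// short-circuit before any model call. Unmatched inputs fall through to the
// embedding classifier or LLM fallback (stages 2 and 3).
class RuleBasedDecisionStrategy {
    // Ordered keyword -> agent rules; the first match wins.
    private final Map<String, String> rules = new LinkedHashMap<>();

    RuleBasedDecisionStrategy() {
        rules.put("refund", "order-agent");
        rules.put("order", "order-agent");
        rules.put("stock", "product-agent");
        rules.put("delivery", "aftersale-agent");
    }

    Optional<String> decide(String input) {
        String normalized = input.toLowerCase();
        return rules.entrySet().stream()
                .filter(e -> normalized.contains(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst(); // empty => hand off to the next strategy in the chain
    }
}
```

Because this stage is deterministic and costs no tokens, it should absorb the bulk of high‑frequency intents before any model is consulted.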
Orchestrator – The Core Execution Engine
The orchestrator receives an OrchestrationCommand, recognises intent, selects a decision strategy, and then either executes a single agent or runs parallel agents using a virtual‑thread executor. All calls are wrapped with Resilience4j circuit‑breakers and retries, and a global request deadline (12 s) prevents hanging.
public AgentResult orchestrate(OrchestrationCommand command, AgentContext context) {
Instant deadline = Instant.now().plus(REQUEST_BUDGET);
String requestId = UUID.randomUUID().toString();
UserIntent intent = intentService.recognize(command.input(), context);
context.put("intent", intent);
AgentDecision decision = decide(command, intent);
AgentResult result = decision.parallel()
? executeParallel(requestId, command, context, decision, deadline)
: executeSingle(requestId, command, context, decision.targetAgent(), deadline);
meterRegistry.counter("agent.orchestration.success", "target", decision.targetAgent()).increment();
return result;
}

Guarded execution applies per‑agent retry and circuit‑breaker instances retrieved from RetryRegistry and CircuitBreakerRegistry.
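The Resilience4j wiring itself is not reproduced here, but the core idea of guarded execution, retry a bounded number of times without ever exceeding the global request deadline, can be sketched without any library dependency (names hypothetical):

```java
import java.time.Instant;
import java.util.function.Supplier;

// Dependency-free illustration of guarded execution: retry transient failures,
// but never past the global request deadline. In the article this role is
// played by Resilience4j Retry and CircuitBreaker instances.
class GuardedExecutor {
    static <T> T executeWithRetry(Supplier<T> call, int maxAttempts, Instant deadline) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (Instant.now().isAfter(deadline)) {
                throw new IllegalStateException("request deadline exceeded", last);
            }
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // transient failure: retry until attempts or budget run out
            }
        }
        throw last; // all attempts exhausted
    }
}
```

Checking the deadline before each attempt is what prevents a retry loop from silently eating the whole 12‑second request budget.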
High Concurrency with Virtual Threads
Java 21 virtual threads dramatically reduce thread‑per‑request cost for I/O‑heavy LLM calls, tool invocations, and downstream services. The executor bean is defined as:
@Bean(destroyMethod = "close")
public ExecutorService agentExecutor() {
return Executors.newVirtualThreadPerTaskExecutor();
}

Even with virtual threads, you must still enforce per‑LLM‑provider rate limits, per‑agent concurrency caps, and per‑tenant quotas via Resilience4j rate‑limiters.
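The reason such caps remain necessary is that virtual threads make it cheap to *start* thousands of concurrent calls, not safe to *complete* them against a rate‑limited LLM provider. A minimal semaphore‑based cap (an illustration, not the Resilience4j configuration) makes the idea concrete:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustration only: a hard per-agent concurrency cap. Even with cheap virtual
// threads, each agent needs a bound so a traffic spike cannot flood a
// downstream LLM provider. Resilience4j bulkheads/rate-limiters do this in production.
class AgentConcurrencyGuard {
    private final Semaphore permits;

    AgentConcurrencyGuard(int maxConcurrency) {
        this.permits = new Semaphore(maxConcurrency);
    }

    <T> T run(Supplier<T> task) {
        if (!permits.tryAcquire()) {
            // Fail fast instead of queueing: the orchestrator can degrade or retry.
            throw new IllegalStateException("agent concurrency cap reached");
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }
}
```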
A2A Service Discovery with Nacos
When the number of agents grows beyond a single process, each agent is packaged as an independent Spring Boot service and registers an AgentCard in Nacos. The card contains name, description, provider, and version, enabling other services to discover and invoke remote agents via the A2A client.
Remote agents are wrapped in an adapter that implements the local Agent interface, translating calls to the A2A protocol.
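A stripped‑down version of that adapter pattern might look like the following; RemoteInvoker stands in for the real A2A client, which is not shown in the source, and all names are hypothetical:

```java
// Hypothetical sketch: a remote agent discovered via Nacos is wrapped so the
// orchestrator can treat it exactly like a local agent. RemoteInvoker stands
// in for the real A2A client.
interface RemoteInvoker {
    String invoke(String agentName, String input);
}

class RemoteAgentAdapter {
    private final String agentName;
    private final RemoteInvoker invoker;

    RemoteAgentAdapter(String agentName, RemoteInvoker invoker) {
        this.agentName = agentName;
        this.invoker = invoker;
    }

    // Same handle() shape as a local agent; the transport is hidden behind the adapter.
    String handle(String input) {
        return invoker.invoke(agentName, input);
    }
}
```

Because the orchestrator only sees the local interface, agents can be moved from in‑process beans to remote services without touching routing or orchestration code.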
Kubernetes Deployment and Autoscaling
Each agent runs in its own Deployment with readiness/liveness probes, resource requests/limits, and a HorizontalPodAutoscaler that scales on CPU and a custom metric agent_inflight_requests. Observability is enriched with OpenTelemetry, Prometheus, and custom metrics such as agent.request.latency, agent.tool.error, and agent.llm.tokens for cost accounting.
Security, Risk, and Human‑in‑the‑Loop
High‑risk operations (order cancellation, refund, address change, coupon issuance, personal data access, ops changes) are routed through a RiskAgent that validates policies and may require explicit user confirmation or manual review. All write‑type tools are required to be idempotent, accepting an idempotencyKey generated from requestId + businessKey.
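The idempotency rule above implies that the same requestId and businessKey must always produce the same key, so a retried write deduplicates server‑side. A deterministic hash is one straightforward way to derive it (class name hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch of the idempotency-key rule described above: deriving the key
// deterministically from requestId + businessKey lets a retried write
// be recognized and deduplicated by the downstream service.
class IdempotencyKeys {
    static String keyFor(String requestId, String businessKey) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(
                    (requestId + ":" + businessKey).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is guaranteed on the JVM
        }
    }
}
```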
Cost‑Budget Guard
A TokenBudgetGuard checks the estimated token consumption against a tenant‑specific quota before invoking any LLM. If the budget is exhausted, the system falls back to rule‑only responses, smaller models for intent detection, or hands the conversation to a human operator.
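The TokenBudgetGuard implementation is not shown in the source; a minimal in‑memory sketch of the check‑and‑deduct logic, with hypothetical method names, could look like this (production code would back the quota with Redis or a billing service):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory sketch of the TokenBudgetGuard described above:
// check an estimated token cost against a per-tenant quota before any LLM call.
class TokenBudgetGuard {
    private final Map<String, AtomicLong> remainingByTenant = new ConcurrentHashMap<>();

    void grant(String tenantId, long tokens) {
        remainingByTenant.computeIfAbsent(tenantId, t -> new AtomicLong()).addAndGet(tokens);
    }

    // Returns true and deducts if the estimate fits; false tells the caller to
    // degrade (rules-only response, smaller model, or human handoff).
    boolean tryConsume(String tenantId, long estimatedTokens) {
        AtomicLong remaining = remainingByTenant.get(tenantId);
        if (remaining == null) return false;
        long updated = remaining.addAndGet(-estimatedTokens);
        if (updated < 0) {
            remaining.addAndGet(estimatedTokens); // roll back: quota exhausted
            return false;
        }
        return true;
    }
}
```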
Common Production Issues & Remedies
Token explosion: limit tool exposure per agent, enable dynamic tool discovery, cap per‑request token usage, and trim RAG chunks.
Agent call loops: track visited agents in AgentRequest, enforce a maximum depth (e.g., 3), and emit trace warnings.
Wrong tool selection: give tools explicit names, write detailed descriptions, add rule pre‑checks for risky tools, and validate tool results.
Streaming failures: send SSE event: error and event: done frames, enable client retry, and persist partial results.
Redis hot‑spots: namespace keys per tenant, separate summary and recent‑message storage, use short‑TTL local caches, and checkpoint only at logical boundaries.
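The loop‑prevention remedy above, tracking visited agents and capping delegation depth, can be sketched in a few lines (class name hypothetical; the real system would carry this state inside AgentRequest):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the agent-call-loop remedy: remember which agents this request has
// visited and refuse re-entry or delegation beyond a maximum depth.
class CallLoopGuard {
    private static final int MAX_DEPTH = 3;
    private final Set<String> visited = new LinkedHashSet<>();

    // Returns false when the hop must be rejected (loop detected or depth exceeded);
    // the orchestrator should then emit a trace warning and degrade.
    boolean enter(String agentName) {
        if (visited.contains(agentName)) return false; // loop detected
        if (visited.size() >= MAX_DEPTH) return false; // delegation too deep
        visited.add(agentName);
        return true;
    }
}
```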
Adoption Roadmap
Phase 1 – In‑process modularisation : single Spring Boot app with local agents, Redis context, Micrometer metrics.
Phase 2 – Service‑level agents : separate high‑traffic agents, Nacos registration, Kafka for async tasks, Kubernetes scaling.
Phase 3 – Enterprise‑grade platform : full A2A mesh, MCP for external tool onboarding, automated evaluation suites, cost‑budget dashboards, gray‑release pipelines.
Launch Checklist
Agent descriptors include owner, version, capabilities, and health status.
All tools have permission policies, timeouts, rate limits, circuit breakers, and audit logging.
Write‑type tools are idempotent and require explicit user confirmation.
Context isolation per tenant, compression strategy, and PII redaction.
Orchestrator enforces a global deadline and graceful degradation.
LLM JSON parsing includes schema validation and fallback handling.
Trace IDs propagate through every agent, tool, and LLM request.
Token usage is tracked per tenant, agent, and model.
Critical agents have evaluation suites and can be gray‑released.
Kubernetes probes, HPA, custom metrics, and alerts are in place.
Conclusion
Treating a large model as a universal tool works for demos but cannot sustain production workloads. The real power of a Multi‑Agent system lies in turning business boundaries into independent, observable, and governable services. Spring AI provides the glue to expose LLM capabilities as Java beans; Spring AI Alibaba, A2A, MCP, and Nacos extend this to a distributed, cloud‑native ecosystem. By engineering agents as micro‑services with robust routing, resilience, observability, and cost controls, Java teams can safely bring AI into mission‑critical applications.