Self‑Healing Agents: Rebuilding a High‑Concurrency Travel System with Spring AI ReAct
This article details how a legacy travel‑booking service was transformed into a production‑grade, self‑healing agent system using Spring AI ReAct and multi‑tool coordination, covering architectural redesign, tool governance, error semantics, high‑concurrency safeguards, observability, security, and real‑world performance gains.
Background and Problem Statement
When enterprises first integrate large language models (LLMs), the focus is often on model intelligence, but production stability depends on the system’s ability to complete tasks reliably under high load, timeouts, and external service volatility.
Why Traditional Workflow Engines Fail
Hard‑coded state machines or BPMN workflows handle deterministic paths well but cannot cope with the dynamic decision‑making required in travel booking, such as reacting to sold‑out flights, price fluctuations, or budget constraints.
ReAct as a Decision Engine
ReAct (Reasoning + Acting) turns the model into a task solver that iterates through
understand → reason → call tool → observe → adjust → repeat. The model decides *what* to query or book, while the platform enforces *how* and *when* side‑effects may occur.
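The loop above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the Spring AI implementation: `decide` stands in for the LLM call, `tools` for platform-side execution, and the step guard bounds runaway reasoning. All names are illustrative.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a bounded ReAct loop: the model proposes the next
// action, the platform executes the matching tool and feeds the observation
// back, and a step guard aborts runaway reasoning.
class ReActLoop {

    public record Step(String action, String observation) {}

    // 'decide' stands in for the LLM call; 'tools' maps action names to executions.
    public static List<Step> run(Function<List<Step>, String> decide,
                                 Function<String, String> tools,
                                 int maxSteps) {
        var history = new java.util.ArrayList<Step>();
        for (int i = 0; i < maxSteps; i++) {
            String action = decide.apply(history);   // reason: pick next action
            if ("FINISH".equals(action)) break;      // model declares the task done
            String obs = tools.apply(action);        // act: platform-side tool call
            history.add(new Step(action, obs));      // observe, then iterate
        }
        return history;
    }
}
```

The key property is that the platform, not the model, owns the loop: the step cap and the tool dispatch sit outside the model's control.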
Architectural Refactor
The legacy travel‑processor monolith was replaced by a layered travel‑agent‑service:
Client / Gateway → Agent Orchestrator (Prompt, ReAct Loop, Step Guard) → Tool Governance (Registry, Timeout, Retry, Rate‑limit, Idempotency) → Domain Services (Budget, Inventory, Approval) → Async Coordination (Kafka, Saga) → Data Stores (MySQL, Redis, ES) → Observability (Metrics, Traces, Structured Logs)

This separation isolates dynamic decision‑making from strong‑constraint execution, improves testability, and prevents cascading failures.
Tool Design Principles
Structured input parameters; no free‑form text.
Structured output with success, code, retryable, suggestions to guide the model’s next step.
All side‑effect tools require an idemKey for idempotency.
Explicit timeout, retry, and fallback policies.
Example of a unified result type:
```java
public record ToolResult<T>(
        boolean success,
        String code,
        String message,
        boolean retryable,
        boolean compensated,
        T data,
        List<String> suggestions) {}
```

Budget, inventory, and approval tools are annotated with @Tool and return ToolResult. Failure cases carry rich error semantics; for example, a sold‑out flight returns code="FLIGHT_SOLD_OUT" with suggestions for alternative queries.
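A query tool built on this result type might look as follows. This is a self-contained sketch: in the real service the method would carry Spring AI's @Tool annotation (shown as a comment to keep the sketch dependency-free), and the route data, codes, and suggestion strings are all illustrative.

```java
import java.util.List;

// Sketch of a side-effect-free query tool using the article's ToolResult shape.
class FlightTools {

    public record ToolResult<T>(boolean success, String code, String message,
                                boolean retryable, boolean compensated,
                                T data, List<String> suggestions) {}

    // @Tool(description = "Query bookable flights for a route")  // Spring AI annotation in the real service
    public static ToolResult<List<String>> queryFlights(String from, String to) {
        List<String> flights = findFlights(from, to);   // domain-service lookup
        if (flights.isEmpty()) {
            // Rich error semantics guide the model's next step instead of a bare failure.
            return new ToolResult<>(false, "FLIGHT_SOLD_OUT",
                    "No seats left on " + from + " to " + to, false, false, null,
                    List.of("try an earlier date", "query trains on the same route"));
        }
        return new ToolResult<>(true, "OK", "found", false, false, flights, List.of());
    }

    private static List<String> findFlights(String from, String to) {
        // Stand-in for the inventory service; only one route has seats here.
        return "SHA".equals(from) && "PEK".equals(to) ? List.of("MU5101") : List.of();
    }
}
```

Because the failure result names the cause and suggests next moves, the model can recover without a human translating a stack trace for it.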
Self‑Healing Capability
The system defines a four‑layer self‑healing model:
Failure identification (timeout, sold‑out, rate‑limit).
Local recovery within the current candidate set.
Strategy fallback (switch transport mode, supplier, price tier).
Compensation loop (release budget, cancel bookings, close approvals).
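Layers 3 and 4 can be condensed into a fallback chain with a terminal compensation step. This is a hedged sketch, not the production Saga: strategy names are illustrative and the compensation is a single callback standing in for budget release, booking cancellation, and approval closure.

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of strategy fallback plus the compensation loop: attempt each booking
// strategy in order; if every tier fails, run the compensation actions so no
// budget or inventory is left dangling.
class FallbackChain {

    public record Attempt(String strategy, boolean ok) {}

    public static Attempt book(List<Supplier<Attempt>> strategies, Runnable compensate) {
        for (Supplier<Attempt> s : strategies) {
            Attempt a = s.get();
            if (a.ok()) return a;             // recovery succeeded at this tier
        }
        compensate.run();                     // compensation loop: release budget, cancel, close
        return new Attempt("NONE", false);    // escalate to a human after compensation
    }
}
```

Keeping compensation inside the chain guarantees it runs exactly when the last fallback is exhausted, rather than relying on the model to remember to clean up.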
A concrete scenario shows the agent handling a request for a Shanghai‑to‑Beijing trip, automatically switching to a cheaper flight or an alternative hotel when the first choice fails, and releasing the budget if approval times out.
High‑Concurrency Engineering
Key bottlenecks are model latency, tool‑call amplification, and external supplier spikes. Governance strategies include:
Gateway rate‑limiting and bulkhead isolation.
Model call quotas with timeout enforcement.
Resilience4j circuit‑breaker, rate‑limiter, and retry for each tool.
Result caching for hot queries.
Async event‑driven handling for long‑running steps such as approvals.
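The per-tool timeout and retry policy can be illustrated without the Resilience4j dependency the article uses in production. This is a simplified stand-in for its decorators; the limits and the retry-only-on-timeout policy are illustrative assumptions.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Dependency-free sketch of per-tool governance: each call gets a hard timeout
// and a bounded retry, and the retry budget is enforced by the platform.
class ToolGuard {

    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);                    // don't block JVM exit in this sketch
        return t;
    });

    public static <T> T callWithGuard(Supplier<T> tool, Duration timeout, int maxRetries)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Future<T> f = POOL.submit(tool::get);
            try {
                return f.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true);               // enforce the per-call time budget
                last = e;                     // timeouts are treated as retryable here
            }
        }
        throw last;                           // retry budget exhausted
    }
}
```

In the real service, Resilience4j's circuit breaker would additionally stop calling a supplier that keeps timing out, instead of retrying it on every request.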
Splitting the workflow into synchronous (real‑time decision) and asynchronous (post‑approval) phases reduces front‑end latency from ~8 s to < 4 s.
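The sync/async split can be sketched with an in-memory stand-in for the Kafka topic: the synchronous phase reserves and responds at once, while approval is consumed out of band. Event and method names are illustrative, not the article's actual schema.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the sync/async split: reserve synchronously, emit an event, and let
// an asynchronous consumer handle the long-running approval step later.
class AsyncApproval {

    public record ApprovalRequested(String orderId) {}

    private static final BlockingQueue<ApprovalRequested> EVENTS = new LinkedBlockingQueue<>();

    // Synchronous phase: reserve, publish the event, respond without waiting.
    public static String reserve(String orderId) {
        EVENTS.add(new ApprovalRequested(orderId));   // Kafka publish in the real service
        return "RESERVED:" + orderId;
    }

    // Asynchronous phase: a consumer drains approval events on its own schedule.
    public static ApprovalRequested nextApproval() throws InterruptedException {
        return EVENTS.take();
    }
}
```

The front-end latency win comes entirely from the fact that `reserve` never blocks on the approval workflow.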
Scalable Agent Composition
Instead of a single “super‑brain”, the system uses three agent types:
Router Agent – determines the business domain.
Domain Agent – runs ReAct within a bounded tool set (e.g., travel).
Commit Agent/Workflow – enforces strong constraints and compensation.
A tool registry dynamically assembles the appropriate tool bundle per domain, keeping prompts short and relevant.
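A minimal version of that registry is just a map from domain to tool bundle. The domain and tool names below are illustrative placeholders, not the service's actual registry contents.

```java
import java.util.List;
import java.util.Map;

// Sketch of a tool registry that assembles a bounded tool bundle per domain,
// so each Domain Agent's prompt only lists the tools it can actually use.
class ToolRegistry {

    private static final Map<String, List<String>> BUNDLES = Map.of(
            "travel", List.of("queryFlights", "reserveHotel", "checkBudget"),
            "approval", List.of("openApproval", "closeApproval"));

    public static List<String> toolsFor(String domain) {
        List<String> tools = BUNDLES.get(domain);
        if (tools == null) throw new IllegalArgumentException("unknown domain: " + domain);
        return tools;                         // bundle handed to the Domain Agent
    }
}
```

Bounding the bundle per domain keeps prompts short and also narrows the blast radius of a misbehaving agent.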
Observability and Security
Metrics collected include request count/latency, step count, tool call count/latency, failure counts, self‑heal successes, and human hand‑offs. Structured logs capture traceId, requestId, step number, tool name, arguments, outcome, and latency, enabling full replay of any task.
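The structured log record might look as follows; the field set mirrors the text, while the pipe-delimited serialization is an illustrative choice (the real service would likely emit JSON).

```java
// Sketch of the per-tool-call structured log record: one line per call with
// these fields is what makes full replay of a task possible.
class StepLog {

    public record ToolCallLog(String traceId, String requestId, int step,
                              String tool, String args, String outcome, long latencyMs) {
        public String toLine() {
            return String.join("|", traceId, requestId, String.valueOf(step),
                    tool, args, outcome, String.valueOf(latencyMs));
        }
    }
}
```

Replaying a task then reduces to filtering the log by traceId and stepping through the records in order.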
Security is enforced at the tool layer with read‑only, recoverable‑write, and high‑risk write permissions, plus role‑based checks and optional human confirmation switches. Prompt engineering is not relied upon for safety.
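The three-tier permission check can be expressed as a small guard at the tool layer. The tier names follow the text; the ordinal-based role cap and the confirmation flag are illustrative simplifications of role-based checks.

```java
// Sketch of the tool-layer permission model: the agent's role caps which tier
// it may invoke, and high-risk writes additionally demand human confirmation.
class ToolPermissions {

    public enum Tier { READ_ONLY, RECOVERABLE_WRITE, HIGH_RISK_WRITE }

    public static boolean allowed(Tier required, Tier roleCap, boolean humanConfirmed) {
        if (required.ordinal() > roleCap.ordinal()) return false;              // role-based check
        if (required == Tier.HIGH_RISK_WRITE && !humanConfirmed) return false; // human-in-the-loop switch
        return true;
    }
}
```

Because the guard runs inside tool governance rather than in the prompt, a jailbroken model still cannot reach a write it is not entitled to.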
Performance Impact
After the refactor, peak concurrency rose from ~600 req/min to ~2 200 req/min, average end‑to‑end latency dropped from 8.6 s to 3.9 s, manual intervention fell from 23 % to 7 %, and the budget‑dirty‑data rate fell from 1.9 % to 0.2 %.
Applicability
Suitable for task‑oriented, multi‑service workflows with dynamic decision points (travel, order fulfillment, complex approvals). Not recommended for strictly deterministic, low‑risk financial transactions where every step must be formally verified.
Adoption Checklist
Separate query, reservation, and submission tools.
Idempotent keys and compensation actions for every side‑effect.
Rich error semantics with retryable flags and suggestions.
Bounded step count and execution timeout.
Async handling for long‑running tasks.
Observability stack (metrics, traces, structured logs).
Permission layers and human‑in‑the‑loop safeguards.
Ray's Galactic Tech
Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!