Building a Production‑Ready High‑Concurrency Story Generation System with Spring AI Alibaba

This article explains how to design and implement a scalable multi‑agent architecture for AI‑driven story creation using Spring AI Alibaba, covering core design principles, engineering optimizations, orchestration, high‑concurrency handling, observability, and deployment best practices.


Many teams stop at demo‑level multi‑agent implementations, but real production requires solving latency, cost, context consistency, orchestration, and stability challenges.

Business Scenario and Requirements

The goal is an online AI story‑creation platform that accepts user inputs (theme, genre, style, audience, chapters, keywords) and generates a complete story with planning, character design, scene setting, chapter writing, style unification, and quality review. A sample JSON request is provided.
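A request of this shape fits the description; the field names follow the StoryRequest model used later in the article, and the values are purely illustrative:

```json
{
  "requestId": "demo-0001",
  "theme": "a generation ship losing its memory",
  "genre": "science fiction",
  "style": "lyrical",
  "targetAudience": "adult",
  "chapters": 5,
  "keywords": ["memory", "voyage", "identity"],
  "language": "en"
}
```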

Why Multi‑Agent?

Complex story creation involves multiple sub‑tasks (world building, conflict design, character arcs, scene progression, style polishing, quality checks). A single monolithic prompt leads to unstable length, loose structure, style drift, and contradictory outputs. The solution is to decompose the task into specialized agents coordinated by an orchestration layer.

Core System Design

Access Layer – handles authentication, rate limiting, and request validation.

Orchestration Layer – splits tasks, controls parallelism, timeouts, retries, and review loops.

Agent Layer – implements domain‑specific capabilities (plot, character, scene, chapter, style, review).

Infrastructure Layer – model gateway, caching, database, message queue, monitoring, and tracing.

Agent Collaboration Modes

Serial chain: plot → character → chapter (ensures context order).

Parallel: character and scene generation can run concurrently after the outline.

Review loop: a review agent validates output and triggers a rewrite agent if needed.

Domain Modeling

Shared StoryContext carries request ID, request data, outline, characters, scenes, chapters, review results, and timestamps, preventing context loss across agents.

```java
public record StoryRequest(String requestId, String theme, String genre, String style,
                           String targetAudience, int chapters, List<String> keywords,
                           String language) {}
```
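The shared context can be sketched alongside that request type. This definition is an assumption reconstructed from the field list above, not the article's exact code; the request record is repeated (without `public`) so the sketch compiles as one file:

```java
import java.time.Instant;
import java.util.List;

// The request record from the article, repeated so this sketch compiles in one file.
record StoryRequest(String requestId, String theme, String genre, String style,
                    String targetAudience, int chapters, List<String> keywords,
                    String language) {}

// Hypothetical shared context carried across agents; the field list follows the
// description above, but the names and types here are assumptions.
record StoryContext(String requestId,
                    StoryRequest request,
                    String outline,
                    List<String> characters,
                    List<String> scenes,
                    List<String> chapters,
                    List<String> reviewResults,
                    Instant createdAt) {}
```

Because records are immutable, each agent produces a new StoryContext rather than mutating a shared one, which keeps concurrent reads safe.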

Agents use a unified StoryModelGateway to call LLMs with consistent prompt templates, parameters, timeout handling, and metric reporting.
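The gateway's name comes from the article, but the signature below is an assumption. A real implementation would render the prompt template, call the LLM with shared parameters and timeouts, and report metrics; the static stub here only substitutes `{key}` placeholders, which is enough for local tests:

```java
import java.util.Map;

// Hypothetical unified gateway; signature and stub are assumptions.
interface StoryModelGateway {
    String complete(String agentName, String promptTemplate, Map<String, Object> variables);

    // Test/local stub: substitutes {key} placeholders instead of calling a model.
    static StoryModelGateway stub() {
        return (agent, template, vars) -> {
            String out = template;
            for (var e : vars.entrySet())
                out = out.replace("{" + e.getKey() + "}", String.valueOf(e.getValue()));
            return out;
        };
    }
}
```

Routing every agent through one interface is what makes the per-model metrics and timeout policies mentioned above enforceable in a single place.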

Orchestration Implementation

The orchestrator runs the plot agent synchronously, then launches character and scene agents in parallel using a dedicated executor, followed by chapter, style, and review agents. If the review fails, the style agent rewrites and the review repeats.

```java
public StoryResult execute(StoryRequest request) { /* orchestration logic */ }
```
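The control flow described above can be sketched with CompletableFuture. The agent bodies here are stand-in lambdas that tag their input, not the article's implementations, and the review loop is bounded so a persistently failing draft cannot retry forever:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the orchestration flow; agent internals are stand-ins.
class StoryOrchestrator {
    private final ExecutorService agentPool = Executors.newFixedThreadPool(4);

    String execute(String request) {
        String outline = plotAgent(request);                          // 1. serial: plot first
        CompletableFuture<String> characters =
                CompletableFuture.supplyAsync(() -> characterAgent(outline), agentPool);
        CompletableFuture<String> scenes =
                CompletableFuture.supplyAsync(() -> sceneAgent(outline), agentPool);
        String chapters = chapterAgent(outline, characters.join(), scenes.join()); // 2. parallel fan-in
        String styled = styleAgent(chapters);                         // 3. style unification
        int attempts = 0;
        while (!reviewAgent(styled) && attempts++ < 2) {              // 4. bounded review loop
            styled = styleAgent(styled);                              //    rewrite, then re-review
        }
        return styled;
    }

    void shutdown() { agentPool.shutdown(); }

    // Stand-in agents that tag their input so the data flow is visible:
    private String plotAgent(String req)      { return "outline(" + req + ")"; }
    private String characterAgent(String o)   { return "characters(" + o + ")"; }
    private String sceneAgent(String o)       { return "scenes(" + o + ")"; }
    private String chapterAgent(String o, String c, String s) {
        return "chapters(" + o + "," + c + "," + s + ")";
    }
    private String styleAgent(String ch)      { return "styled(" + ch + ")"; }
    private boolean reviewAgent(String draft) { return draft.startsWith("styled("); }
}
```

The `join()` calls are the fan-in point: chapter writing blocks until both parallel branches complete, preserving the context order the serial chain requires.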

High‑Concurrency Engineering

For low‑latency user experience, the system separates synchronous and asynchronous APIs. Synchronous endpoints handle quick tasks (outline, character), while long‑running story generation runs as an asynchronous job submitted to Kafka and processed by worker pods.
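The sync/async split can be illustrated with an in-memory stand-in: the API accepts the request, returns a task ID immediately, and a worker consumes the job later. In the real system the queue would be a Kafka topic and the worker a separate pod; this minimal sketch only shows the contract:

```java
import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.LinkedBlockingQueue;

// In-memory stand-in for the async job flow (queue instead of Kafka).
class StoryJobService {
    private final BlockingQueue<String> jobs = new LinkedBlockingQueue<>();
    private final ConcurrentMap<String, String> results = new ConcurrentHashMap<>();

    // Synchronous endpoint: enqueue the job and return a task ID immediately.
    String submit(String request) {
        String taskId = UUID.randomUUID().toString();
        jobs.add(taskId + "|" + request);
        return taskId;
    }

    // One iteration of a worker loop: take a job, generate, store the result.
    void workOnce() {
        try {
            String job = jobs.take();
            int sep = job.indexOf('|');
            results.put(job.substring(0, sep), "story for: " + job.substring(sep + 1));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    String result(String taskId) { return results.get(taskId); }
}
```

Clients poll (or subscribe) with the task ID, so a multi-minute generation never holds an HTTP connection open.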

Key engineering measures include:

Explicit thread‑pool configuration (core 16, max 64, queue 200) to avoid uncontrolled thread growth.

Rate limiting and circuit breaking (Sentinel or Resilience4j) per tenant, per API, and per model call.

Redis caching for hot outlines, character templates, prompt results, and idempotent request handling.

Comprehensive metrics (CPU, memory, thread pool, Kafka lag, per‑agent latency, token usage, success rates) collected via Micrometer, Prometheus, Grafana, and Zipkin/Tempo.

Structured audit logs containing requestId, taskId, tenantId, agentName, modelName, latency, token usage, and result status.
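The explicit pool from the first measure above can be sketched directly with ThreadPoolExecutor. The core/max/queue sizes match the stated limits; the 60-second keep-alive and the CallerRunsPolicy rejection handler are assumptions, chosen because CallerRunsPolicy applies back-pressure to the submitter instead of silently dropping tasks:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Explicit pool matching the stated limits: core 16, max 64, queue 200.
class AgentPoolConfig {
    static ThreadPoolExecutor agentPool() {
        return new ThreadPoolExecutor(
                16, 64,                               // core / max threads
                60, TimeUnit.SECONDS,                 // idle non-core threads retire after 60s
                new ArrayBlockingQueue<>(200),        // bounded queue, no unbounded growth
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure on saturation
    }
}
```

A bounded queue is the key detail: with an unbounded one, the max-pool-size setting never takes effect and memory grows until the process dies.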

Deployment and Operations

Configuration uses Spring Boot 3.x, Spring Cloud Alibaba for service governance, Docker/Kubernetes for containerization, and HPA based on CPU utilization. Health checks, readiness probes, and resource limits ensure reliable scaling.
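A CPU-based HPA of the kind described might look like the following; the deployment name, replica bounds, and 70% threshold are all assumptions for illustration:

```yaml
# Sketch of the CPU-based HPA; names and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: story-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: story-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```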

Quality Assurance

Agents output structured JSON to simplify parsing and validation. A review agent enforces rules (character completeness, chapter progression, consistency, no duplication, no sensitive content). A hybrid approach combines rule‑engine checks with LLM‑based semantic evaluation.
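The rule-engine half of that hybrid can be cheap structural checks that run before any LLM-based semantic evaluation. The rule set below is illustrative, covering two of the listed rules (chapter progression and no duplication):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative rule-engine checks; a real review agent would add
// consistency and sensitive-content rules, then an LLM semantic pass.
class ReviewRules {
    static List<String> check(List<String> chapters, int expectedChapters) {
        List<String> issues = new ArrayList<>();
        if (chapters.size() != expectedChapters)
            issues.add("chapter count " + chapters.size() + " != expected " + expectedChapters);
        if (chapters.stream().distinct().count() != chapters.size())
            issues.add("duplicate chapters detected");
        for (int i = 0; i < chapters.size(); i++)
            if (chapters.get(i).isBlank())
                issues.add("chapter " + (i + 1) + " is empty");
        return issues;
    }
}
```

Running these deterministic checks first means the expensive LLM reviewer only sees drafts that are at least structurally sound.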

Evolution Roadmap

Four stages guide the transition from a single‑node demo to a full platform: initial verification, service‑oriented refactor, high‑concurrency upgrade with async processing, and platformization with centralized prompt management, model routing, multi‑tenant isolation, and cost governance.

Conclusion

Building a production‑grade multi‑agent story generation system requires a unified context object, a robust orchestration layer, async processing with caching and rate limiting, and full observability. Treat the system as a distributed application rather than a collection of prompts to achieve scalability, reliability, and business readiness.

Written by Ray's Galactic Tech

Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!