From Demo to Production: Building a Scalable AI Agent Web App with LangChain4j

Learn how to transform a simple LangChain4j demo into a production‑ready AI agent web application by designing a robust architecture, implementing multi‑agent orchestration, RAG, tool integration, session management, observability, security, and scalable deployment with Spring Boot, PostgreSQL, Redis, Kafka, Docker and Kubernetes.

Ray's Galactic Tech

Why a Demo Is Easy and Production Is Hard

The article starts by showing a typical four‑step demo flow: the front‑end sends a user query, the back‑end forwards it directly to a large language model (LLM), the model decides whether to call a tool, and the response is returned. This works for a proof of concept because there is no real business logic, no concurrency, and no error handling.

When the same pattern is moved to a real e‑commerce customer‑service scenario, four structural problems appear: (1) a single Agent becomes overloaded as more capabilities are added; (2) tool calls lack engineering constraints such as parameter validation, permission checks, and idempotency; (3) conversation history grows linearly, causing token cost explosion and context dilution; (4) treating the LLM call as a normal HTTP request ignores its high latency, variable cost, and need for fallback strategies.

Business Scenario and Requirements

The target use case is an intelligent e‑commerce customer‑service system that must handle order queries, product inquiries, and complaint escalations. The requirements include read‑heavy workloads, occasional strongly transactional operations, sub‑second response times, peak‑shaving during sales events, and strict business rule compliance.

Core Design Principles

Agent responsibility vs. business truth : the Agent only decides *what* to do; the actual state (order status, inventory) must be fetched from authoritative services.

Tool boundaries : each tool must define clear input constraints, permission scope, timeout, idempotency key, audit logging, and circuit‑breaker protection.

Separate RAG and tool execution : knowledge‑base queries use vector search, while real‑time data (price, stock) are obtained via tools.

Prompt is not a dumping ground : prompts contain only role, goal, and style; business constraints are enforced in code.

Failure paths first : every critical step has timeout, retry, circuit‑breaker, fallback, and manual‑intervention options.
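The failure‑first idea can be sketched with plain JDK concurrency primitives. This is an illustrative helper (the class and method names are hypothetical), not the article's actual resilience layer, which Resilience4j provides:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class GuardedCall {

    // Runs a model or tool call with a hard timeout; any failure or timeout
    // degrades to the supplied fallback answer instead of propagating upward.
    public static String callWithFallback(Supplier<String> call, long timeoutMs, String fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(call::get);
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            return fallback; // timeout, interruption, or call failure
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A production system would replace this with Resilience4j's TimeLimiter and CircuitBreaker, but the contract is the same: every critical step has a bounded worst case.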

Production‑Level Architecture

The architecture is layered as follows (illustrated in the original ASCII diagram):

┌──────────────────────┐
│     Web Frontend     │
│  React/Vue + SSE/WS  │
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│     API Gateway      │
│ Auth / RateLimit /   │
│ Trace / Routing      │
└──────────┬───────────┘
           │
┌──────────▼──────────────────────┐
│ Agent Application (Spring Boot) │
│ 1. Session & Memory             │
│ 2. Intent Router / Supervisor   │
│ 3. Tool Invocation Layer        │
│ 4. RAG Orchestrator             │
│ 5. Stream Response Layer        │
│ 6. Audit / Metrics / Trace      │
└────────┬───────────────┬────────┘
         │               │
    ┌────▼─────┐   ┌─────▼──────┐
    │ Redis    │   │ Kafka      │
    │ Session/ │   │ Async Queue│
    │ Cache    │   │ Peak‑shave │
    └────┬─────┘   └─────┬──────┘
         │               │
    ┌────▼───────────────▼──────────┐
    │ PostgreSQL + pgvector         │
    │ Orders / Products / Knowledge │
    └────┬──────────────────────────┘
         │
    ┌────▼──────────────┐
    │ Model Service     │
    │ OpenAI API / vLLM │
    └───────────────────┘

This layout isolates the LLM to the "Model Service" layer, ensuring that business state never lives inside the model.

Step‑by‑Step Implementation Details

1. Project Structure

ai-agent-webapp/
├── backend/
│   └── src/main/java/com/example/agentapp
│       ├── AgentApplication.java
│       ├── api/ChatController.java
│       ├── application/AgentGatewayService.java
│       ├── domain/ (order, product, ticket, session)
│       ├── agent/ (OrderAgent, ProductAgent, EscalationAgent)
│       ├── tool/ (OrderTools, ProductTools, TicketTools)
│       ├── rag/ (KnowledgeRepository, KnowledgeChunk)
│       └── infrastructure/ (Redis, Kafka, Resilience4j clients)
├── frontend/ (React app)
├── deploy/ (Dockerfiles, docker‑compose.yml)
└── docs/

The separation guarantees that Agent‑related code does not pollute domain services, making unit testing straightforward.

2. Multi‑Agent Design

Three specialized Agents are defined:

OrderAgent : answers order status, logistics, and refund queries. Its SystemMessage forces the model to call queryOrder before responding.

ProductAgent : handles product specs, stock, price, and policy questions. It can call a product‑info tool or fall back to the knowledge base.

EscalationAgent : detects complaints or requests for human assistance and creates a ticket via a dedicated tool.

Each Agent is an interface annotated with @SystemMessage, keeping the prompt concise and the business rules in code.
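An Agent interface in this style might look like the following sketch (the exact prompt wording is illustrative, not from the article):

```java
package com.example.agentapp.agent;

import dev.langchain4j.service.SystemMessage;

// The prompt carries only role, goal, and style; hard business rules
// are enforced in the tool and domain-service layers, not here.
public interface OrderAgent {

    @SystemMessage("""
            You are an order-service assistant for an e-commerce site.
            Always call the queryOrder tool before answering any order question.
            Never invent order data; if the tool fails, say so and offer escalation.
            """)
    String chat(String userMessage);
}
```

LangChain4j's AiServices then wires the interface to a model and the tool beans, e.g. `AiServices.builder(OrderAgent.class).chatLanguageModel(model).tools(orderTools).build()` (builder details vary by LangChain4j version).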

3. Tool Definition and Safety

Example of an order‑query tool:

package com.example.agentapp.tool;

import dev.langchain4j.agent.tool.Tool;
import jakarta.validation.constraints.NotBlank;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

@Slf4j
@Component
@RequiredArgsConstructor
public class OrderTools {
    private final OrderQueryService orderQueryService;

    @Tool("""
        Queries an order's status, logistics info, estimated delivery date, and refund eligibility.
        Only call this when the user has provided an order ID, or one is already confirmed in context.
        """)
    public OrderInfo queryOrder(@NotBlank String orderId, @NotBlank String userId) {
        log.info("tool=queryOrder orderId={} userId={}", orderId, userId);
        return orderQueryService.queryReadableOrder(orderId, userId);
    }
}

The tool description is part of the prompt, so the model knows *when* and *how* to invoke it. Parameter validation, logging, and a dedicated service layer guarantee that business rules (permission, data correctness) are enforced outside the model.

4. Domain Service Layer

Business logic such as permission checks and error handling lives in services like OrderQueryService. This ensures that the Agent never makes decisions based on unreliable model output.

package com.example.agentapp.domain.order;

import org.springframework.stereotype.Service;
import lombok.RequiredArgsConstructor;

@Service
@RequiredArgsConstructor
public class OrderQueryService {
    private final OrderCenterClient orderCenterClient;

    public OrderInfo queryReadableOrder(String orderId, String userId) {
        var detail = orderCenterClient.queryOrder(orderId);
        if (detail == null) {
            throw new BusinessException("Order not found");
        }
        if (!userId.equals(detail.userId())) {
            throw new BusinessException("Not authorized to view this order");
        }
        return OrderInfo.builder()
                .orderId(detail.orderId())
                .status(detail.status())
                .logisticsNo(detail.logisticsNo())
                .logisticsCompany(detail.logisticsCompany())
                .estimatedDeliveryDate(detail.estimatedDeliveryDate())
                .refundable(detail.refundable())
                .build();
    }
}

5. Intent Routing and Supervisor

A lightweight IntentRouter performs keyword‑based routing for high‑frequency intents, while ambiguous cases fall back to the LLM. The AgentGatewayService then dispatches the request to the appropriate Agent.

public AgentResult process(String message) {
    AgentRoute route = intentRouter.route(message);
    String answer = switch (route) {
        case ORDER -> orderAgent.chat(message);
        case PRODUCT -> productAgent.chat(message);
        case ESCALATION -> escalationAgent.chat(message);
        default -> generalAgent.chat(message);
    };
    return new AgentResult(route.name(), answer);
}

This explicit routing prevents the model from becoming a monolithic "super‑prompt" and reduces latency for common queries.
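A minimal keyword-first router might look like this (plain Java; the keywords are hypothetical, and the real system would resolve the GENERAL bucket with an LLM classifier):

```java
import java.util.Locale;

public class IntentRouter {

    public enum AgentRoute { ORDER, PRODUCT, ESCALATION, GENERAL }

    // Cheap keyword matching handles high-frequency intents without a model call;
    // anything ambiguous falls through to GENERAL for LLM-based classification.
    public AgentRoute route(String message) {
        String m = message.toLowerCase(Locale.ROOT);
        if (m.contains("order") || m.contains("refund") || m.contains("delivery")) {
            return AgentRoute.ORDER;
        }
        if (m.contains("price") || m.contains("stock") || m.contains("spec")) {
            return AgentRoute.PRODUCT;
        }
        if (m.contains("complaint") || m.contains("human")) {
            return AgentRoute.ESCALATION;
        }
        return AgentRoute.GENERAL;
    }
}
```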

6. Retrieval‑Augmented Generation (RAG)

Knowledge is stored in a knowledge_document table with a vector column for embeddings. The KnowledgeRepository runs a SELECT … ORDER BY embedding <-> ? LIMIT ? query to fetch the top‑k relevant chunks.

SELECT id, title, content, source
FROM knowledge_document
WHERE tenant_id = ? AND biz_type = ?
ORDER BY embedding <-> cast(? as vector)
LIMIT ?;

Only policy‑type information is placed in the vector store; real‑time data (stock, price) are always fetched via tools, keeping the vector index small and the retrieval latency low.
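The `<->` operator orders rows by vector distance. The same top‑k ranking idea, shown in plain Java as an in‑memory stand‑in for the pgvector query (using cosine similarity here, whereas pgvector's `<->` is Euclidean distance by default):

```java
import java.util.ArrayList;
import java.util.List;

public class TopKRetriever {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns the indices of the k chunks most similar to the query embedding.
    public static List<Integer> topK(double[] query, List<double[]> chunks, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) idx.add(i);
        idx.sort((x, y) -> Double.compare(cosine(query, chunks.get(y)), cosine(query, chunks.get(x))));
        return idx.subList(0, Math.min(k, idx.size()));
    }
}
```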

7. Session Memory Management

Three‑tier memory is used:

Short‑term memory : the last N messages are included directly in the prompt.

Summary memory : a periodic summarizer condenses older history into a short paragraph.

Business memory : key entities such as the current order ID are stored in Redis and referenced by the Agent instead of raw text.

Redis is chosen for fast read/write, TTL, and idempotency support, while PostgreSQL holds the long‑term audit trail.
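The short‑term tier reduces to a bounded sliding window over recent messages; a minimal sketch (class name hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class ShortTermMemory {

    private final int maxMessages;
    private final Deque<String> window = new ArrayDeque<>();

    public ShortTermMemory(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    // Keep only the last N messages; evicted messages would be handed
    // to the summarizer that maintains the summary-memory tier.
    public void add(String message) {
        window.addLast(message);
        if (window.size() > maxMessages) {
            window.removeFirst();
        }
    }

    public List<String> snapshot() {
        return List.copyOf(window);
    }
}
```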

8. Streaming Responses with SSE

The back‑end exposes /api/chat/stream that returns text/event-stream. The StreamingChatApplicationService forwards tokens from the StreamingChatModel to the client via SseEmitter, providing sub‑second perceived latency.

@PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public SseEmitter stream(@Valid @RequestBody ChatRequest request) {
    SseEmitter emitter = new SseEmitter(60_000L);
    streamingChatModel.generate(request.message(), new StreamingResponseHandler<AiMessage>() {
        @Override public void onNext(String token) {
            try {
                emitter.send(SseEmitter.event().name("token").data(token));
            } catch (IOException e) {          // send() throws a checked IOException
                emitter.completeWithError(e);
            }
        }
        @Override public void onComplete(Response<AiMessage> response) {
            try {
                emitter.send(SseEmitter.event().name("done").data("[DONE]"));
                emitter.complete();
            } catch (IOException e) {
                emitter.completeWithError(e);
            }
        }
        @Override public void onError(Throwable error) {
            emitter.completeWithError(error);
        }
    });
    return emitter;
}

9. High‑Concurrency Safeguards

Because an Agent call may involve multiple model invocations and tool calls, the system adopts three processing modes: synchronous direct response, synchronous streaming, and asynchronous queue (Kafka). Resilience4j provides timeout, retry, circuit‑breaker, and rate‑limit policies for every external call.
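In practice Resilience4j's RateLimiter covers the rate‑limit policy; the underlying idea can be illustrated with a minimal token bucket (a sketch, not the Resilience4j API):

```java
public class TokenBucket {

    private final long capacity;
    private final double refillPerMs;
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefill = System.currentTimeMillis();
    }

    // Returns true if a request may proceed; false means reject it
    // or push it onto the async (Kafka) queue for peak-shaving.
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerMs);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```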

10. Security and Governance

Prompt‑injection protection via input length limits and explicit tool permission checks.

Tool permission tiers (read‑only, low‑risk write, high‑risk write) with optional manual approval for the latter.

Comprehensive audit logs record user input, selected Agent, invoked tool, parameters, model output, TraceId, session ID, user ID, and tenant ID.
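The input length limit is the simplest of these guards; a naive sketch (the class name and limit are illustrative, and a real deployment layers this with tool permission checks and output filtering):

```java
public class InputGuard {

    private static final int MAX_CHARS = 2000;

    // First line of defense against prompt injection by sheer volume:
    // bound the input size before it ever reaches the prompt.
    public static String checkLength(String input) {
        if (input == null || input.isBlank()) {
            throw new IllegalArgumentException("empty input");
        }
        if (input.length() > MAX_CHARS) {
            throw new IllegalArgumentException("input exceeds " + MAX_CHARS + " characters");
        }
        return input.trim();
    }
}
```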

11. Observability Stack

Metrics (request count, latency percentiles, model/tool call counts, token usage, cache hit rate, queue depth) are exported via Micrometer to Prometheus. Distributed tracing (Spring Cloud Sleuth + OpenTelemetry) captures the full request flow from API gateway through intent router, model call, retrieval, tool call, and final response. Prompt and tool‑result versions are stored for regression analysis.

12. Deployment Strategies

For development, a Docker‑Compose file spins up PostgreSQL+pgvector, Redis, Kafka, the Spring Boot back‑end, and an Nginx‑served React front‑end. Production uses a Kubernetes Deployment with resource requests/limits, liveness/readiness probes, and an HPA that scales the back‑end based on CPU utilization. Secrets (LLM endpoint, API key) are injected via Kubernetes secrets.
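An HPA of the kind described might look like the following fragment (names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-backend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```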

13. Testing Pyramid

Domain service unit tests (order permission, product stock logic).

Tool integration tests (parameter validation, fallback behavior).

API contract tests using Testcontainers for PostgreSQL and Redis.

Prompt regression suite that verifies routing, tool invocation, and forbidden content.

14. Evolution Path

The article outlines a four‑stage roadmap: (1) minimal MVP with a single Agent and one read‑only tool; (2) stability layer (session store, streaming, Resilience4j, Redis cache); (3) scalability layer (multiple Agents, async queue, multi‑tenant support, Prompt versioning); (4) production governance (full observability, audit, gray‑release, load‑testing).

By following this systematic analysis—starting from the problem definition, evaluating trade‑offs, selecting concrete technologies, and incrementally adding safety, observability, and scalability—the reader gains a reproducible blueprint for turning an AI Agent prototype into a production‑grade web service.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
