Spring AI Agent Demo: Architecture, RAG, Tools & Sub‑Agents Explained
An in‑depth walkthrough of a Spring AI‑based AI Agent demo and its core modules: AgentCore orchestration, multi‑layer conversation memory compression, function‑calling tool registration, RAG retrieval pipelines, markdown‑driven Commands and Skills, Sub‑Agent isolation, and MCP integration, complete with code snippets, design rationale, and runtime configuration details.
Overview
This article provides a complete technical walkthrough of an AI Agent demo built with Spring AI. The demo showcases a full‑stack agent that integrates intent recognition, layered conversation memory compression, large‑model calls, function calling, Retrieval‑Augmented Generation (RAG), Sub‑Agents, and Model Context Protocol (MCP) support.
Quick Start
Source code: https://github.com/q644266189/aiagentdemo. The project requires Java 21+ and Maven 3.9+. Configuration lives in src/main/resources/application.properties where the base URL, API key, and model names for the chat and embedding models are defined. After cloning the repository, run the Spring Boot application and open http://localhost:8080 to use the built‑in web chat UI.
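Assuming the standard Spring Boot Maven plugin, the typical command sequence is:
git clone https://github.com/q644266189/aiagentdemo.git
cd aiagentdemo
mvn spring-boot:run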
Core Modules
AgentCore – central orchestrator that performs intent recognition, optional RAG injection, memory management, model invocation, and tool execution.
ChatMemory – per‑session conversation store with three‑layer context compression (summary, assistant trimming, sliding window).
Tool (Function Calling) – pluggable tool registration via the InnerTool interface and ToolCallback objects.
RAG – end‑to‑end retrieval‑augmented generation pipeline: document loading → chunking → embedding → vector store → multi‑recall (semantic + BM25 + query rewrite) → Reciprocal Rank Fusion (RRF) → rerank → LLM generation.
Command & Skill – two markdown‑driven prompt mechanisms. Commands are user‑initiated shortcuts; Skills are LLM‑decided tools.
SubAgent – independent agents with their own ChatMemory, enabling isolated multi‑turn tasks.
MCP – Model Context Protocol server and client for standardized external tool access.
AgentCore Orchestration
The main workflow can be expressed as:
User Input → IntentRecognizer → (if RAG) RagService → ChatMemory (compression) → Prompt Construction → ChatClient Call → (if tool call) ToolExecution → ReAct Loop → Final Reply
The core method implementing this flow is AgentCore.chat(String sessionId, String userInput):
public String chat(String sessionId, String userInput) {
    ChatMemory memory = getOrCreateMemory(sessionId);
    // 1. Intent recognition
    Intent intent = intentRecognizer.recognize(userInput);
    // 2. RAG injection if needed
    if (intent == Intent.RAG && ragService.isKnowledgeLoaded()) {
        String ragContext = ragService.query(userInput);
        if (ragContext != null && !ragContext.isBlank()) {
            String enrichedInput = "The following reference material was retrieved from the knowledge base.\n"
                    + "Please use it to answer the user's question:\n"
                    + ragContext + "\nUser question: " + userInput;
            memory.addMessage(new UserMessage(enrichedInput));
        } else {
            memory.addMessage(new UserMessage(userInput));
        }
    } else {
        memory.addMessage(new UserMessage(userInput));
    }
    // 3. Build prompt and invoke model (tool callbacks attached if any)
    List<Message> messages = memory.getMessages();
    Prompt prompt = new Prompt(messages, buildChatOptions());
    ChatClient.ChatClientRequestSpec requestSpec = chatClient.prompt(prompt);
    if (!toolCallbacks.isEmpty()) {
        requestSpec.toolCallbacks(toolCallbacks.toArray(new ToolCallback[0]));
    }
    String response = requestSpec.call().content();
    memory.addMessage(new AssistantMessage(response != null ? response : ""));
    return response != null ? response : "";
}
Conversation Memory Compression
Each session is stored in a ConcurrentHashMap<String, ChatMemory>, allowing concurrent users. Compression works in three layers:
Layer 1 – Summary Compression: When the history exceeds 16 messages, the oldest messages are summarized by the LLM into a ≤300‑character summary and injected into the system prompt.
Layer 2 – Assistant Trimming: Only the most recent three assistant replies are kept to reduce token consumption.
Layer 3 – Sliding Window: If the total message count exceeds maxRounds × 4, the earliest messages are dropped as a hard safeguard.
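Returning to the session map above: the getOrCreateMemory lookup used in AgentCore.chat can be a single computeIfAbsent; a minimal sketch, assuming ChatMemory takes the shared ChatClient:
private final Map<String, ChatMemory> sessions = new ConcurrentHashMap<>();

private ChatMemory getOrCreateMemory(String sessionId) {
    // Atomically create the memory on first access for this session
    return sessions.computeIfAbsent(sessionId, id -> new ChatMemory(chatClient));
}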
Key methods:
public List<Message> getMessages() {
    compressIfNeeded();
    List<Message> messages = new ArrayList<>();
    if (systemMessage != null || (summaryText != null && !summaryText.isBlank())) {
        String systemContent = systemMessage != null ? systemMessage.getText() : "";
        if (summaryText != null && !summaryText.isBlank()) {
            systemContent += "\n[Summary of the previous conversation, for reference]\n" + summaryText;
        }
        messages.add(new SystemMessage(systemContent));
    }
    // ... add remaining history (assistant trimming applied) ...
    return Collections.unmodifiableList(messages);
}
private void compressIfNeeded() {
    if (chatClient == null || history.size() <= COMPRESS_THRESHOLD_MESSAGES) {
        return;
    }
    int compressEndIndex = history.size() - PRESERVE_RECENT_MESSAGES;
    // Back up past TOOL messages so a tool-call/result pair is never split
    while (compressEndIndex > 0 && history.get(compressEndIndex).getMessageType() == MessageType.TOOL) {
        compressEndIndex--;
    }
    if (compressEndIndex <= 0) return;
    List<Message> messagesToCompress = new ArrayList<>(history.subList(0, compressEndIndex));
    String newSummary = SummaryCompressor.compress(chatClient, messagesToCompress, summaryText);
    if (newSummary != null && !newSummary.isBlank()) {
        this.summaryText = newSummary;
        history.subList(0, compressEndIndex).clear();
    }
}
Tool (Function Calling) Mechanism
All tools implement InnerTool:
public interface InnerTool {
List<ToolCallback> loadToolCallbacks();
}
During Spring Boot startup, the framework scans beans of type InnerTool, collects their callbacks via loadToolCallbacks(), and registers them with AgentCore. Adding a new tool only requires a new InnerTool implementation; no other code changes are needed.
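This scan is ordinary Spring dependency injection; one plausible wiring, assuming AgentCore consumes the collected callbacks as a bean (the class and bean names here are illustrative):
@Configuration
public class InnerToolRegistrar {

    // Spring injects every bean that implements InnerTool
    @Bean
    public List<ToolCallback> innerToolCallbacks(List<InnerTool> tools) {
        return tools.stream()
                .flatMap(tool -> tool.loadToolCallbacks().stream())
                .toList();
    }
}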
Example registration:
public class WeatherTool implements InnerTool {
    @Override
    public List<ToolCallback> loadToolCallbacks() {
        return List.of(ToolCallback.builder()
                .name("get_weather")
                .description("Get the weather for a given city")
                .parameters(JsonSchema.builder()
                        .addProperty("city", "string", "City name")
                        .required("city")
                        .build())
                .function(args -> {
                    String city = args.get("city");
                    // Simulated weather service
                    return city + ": sunny, 22°C";
                })
                .build());
    }
}
Execution flow (weather query example):
用户:"杭州今天天气怎么样?"
→ LLM decides to call <code>get_weather</code> with {"city":"杭州"}
→ Spring AI executes <code>get_weather</code> and returns "杭州,晴,22°C"
→ LLM generates final answer: "杭州今天天气晴朗,气温 22°C,适合出行。"The ReAct loop repeatedly invokes the model until no further tool calls are detected, enabling multi‑tool chains without developer intervention.
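Spring AI drives this loop internally when tool callbacks are attached to the request. For reference, a sketch of the equivalent user‑controlled loop using Spring AI 1.x's ToolCallingManager (the surrounding fields such as chatModel and toolCallingManager are assumed):
ChatOptions options = ToolCallingChatOptions.builder()
        .toolCallbacks(toolCallbacks)
        .internalToolExecutionEnabled(false) // drive the ReAct loop manually
        .build();
Prompt prompt = new Prompt(messages, options);
ChatResponse response = chatModel.call(prompt);
while (response.hasToolCalls()) {
    // Execute the requested tools and append their results to the history
    ToolExecutionResult result = toolCallingManager.executeToolCalls(prompt, response);
    prompt = new Prompt(result.conversationHistory(), options);
    response = chatModel.call(prompt);
}
String finalReply = response.getResult().getOutput().getText();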
RAG Retrieval Pipeline
The default TextSplitter performs recursive semantic splitting (title → paragraph → sentence → fixed length) with a chunk size of 500 characters and a 50‑character overlap. Other splitters are available:
FixedSizeSplitter – fixed character count.
ParagraphSplitter – split by blank lines.
SentenceSplitter – split by punctuation.
SlidingWindowSplitter – overlapping windows for continuity.
SemanticChunkSplitter – LLM‑driven semantic boundaries.
PropositionSplitter – split into independent propositions.
AgenticSplitter – LLM decides the best splitting strategy.
The retrieval process uses a multi‑recall strategy followed by RRF fusion and a rerank model:
public String query(String question) {
    // 1. Multi-recall (semantic + BM25 + query rewrite, 9 candidates total)
    List<Document> candidates = multiRecaller.recall(question, RECALL_CANDIDATE_COUNT);
    // 2. Rerank to top 3
    List<Document> relevantDocuments = llmReranker.rerank(question, candidates, TOP_K);
    // 3. Build context
    StringBuilder contextBuilder = new StringBuilder();
    for (int i = 0; i < relevantDocuments.size(); i++) {
        contextBuilder.append("[Reference ").append(i + 1).append("]\n");
        contextBuilder.append(relevantDocuments.get(i).getContent()).append("\n\n");
    }
    return contextBuilder.toString().trim();
}
Multi‑recaller with RRF:
public List<Document> retrieve(String query, int topK) {
    Map<String, Double> rrfScores = new HashMap<>();
    Map<String, Document> keyToDocument = new LinkedHashMap<>();
    for (Recaller retriever : retrievers) {
        List<Document> results = retriever.retrieve(query, PER_ROUTE_CANDIDATE_COUNT);
        // RRF formula: score(d) = Σ 1 / (k + rank), with k = 60
        accumulateRrfScores(results, rrfScores, keyToDocument);
    }
    return rrfScores.entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .limit(topK)
            .map(entry -> keyToDocument.get(entry.getKey()))
            .toList();
}
The rerank step uses a dedicated LLM to reorder the fused candidates and keeps the top three for injection into the prompt.
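The accumulateRrfScores helper is not shown in the article; a minimal sketch of what it likely does, assuming each Document exposes a stable id:
private static final int RRF_K = 60;

private void accumulateRrfScores(List<Document> results,
                                 Map<String, Double> rrfScores,
                                 Map<String, Document> keyToDocument) {
    for (int rank = 0; rank < results.size(); rank++) {
        Document doc = results.get(rank);
        keyToDocument.putIfAbsent(doc.getId(), doc);
        // Rank is 1-based in the RRF formula: 1 / (k + rank)
        rrfScores.merge(doc.getId(), 1.0 / (RRF_K + rank + 1), Double::sum);
    }
}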
Commands vs. Skills
Design philosophy: Commands are user‑initiated shortcuts; Skills are LLM‑decided capabilities.
File format: Commands are plain markdown files where the filename is the command name. Skills are markdown files with YAML front matter containing name and description, followed by the prompt template.
Tool registration: Commands are not registered as tools; Skills are registered as ToolCallback objects.
Invocation trigger: Commands are called explicitly via a REST endpoint (e.g., POST /api/command/execute). Skills are invoked automatically when the LLM decides the description matches the current context.
Execution path: Command → Controller → AgentCore. Skill → AgentCore → LLM decision → SkillTool.
Suitable scenario: Use Commands when the user knows the exact function needed; use Skills when the LLM should infer the appropriate capability.
In short, Commands provide deterministic entry points, while Skills enable intelligent, context‑aware extensions.
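To make the Skill file format concrete, here is a hypothetical skill file; the name, description, template text, and the {{input}} placeholder are all illustrative, not taken from the demo:
---
name: summarize_text
description: Summarize a long passage into a few concise bullet points. Use when the user provides lengthy text and asks for the gist.
---
You are a summarization assistant. Condense the following text into at most five bullet points, preserving key facts and figures:

{{input}}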
Sub‑Agent Design
Sub‑Agents address tasks that require isolated context (e.g., drafting a technical article). Each Sub‑Agent owns its own ChatMemory instance but shares the same ChatClient (model connection) with the main agent.
public SubAgent(String id, String name, String systemPrompt, ChatClient chatClient) {
    this.memory = ChatMemory.forSubAgent(); // independent memory!
    this.memory.setSystemPrompt(systemPrompt);
    // other initialization …
}
Three tool definitions expose the Sub‑Agent lifecycle to the LLM:
create_sub_agent – parameters: name, system_prompt, task. Creates a Sub‑Agent and runs the first task.
chat_with_sub_agent – parameters: agent_id, message. Sends a message to the specified Sub‑Agent.
destroy_sub_agent – parameter: agent_id. Releases the Sub‑Agent's resources.
The main LLM decides when to invoke these tools, allowing automatic creation, usage, and destruction of isolated agents.
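The registry behind these three tools is not shown in the article; a minimal sketch of what such a manager might look like, assuming a SubAgent.chat(String) method alongside the constructor above:
public class SubAgentManager {
    private final Map<String, SubAgent> agents = new ConcurrentHashMap<>();
    private final ChatClient chatClient;

    public SubAgentManager(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    // Backs create_sub_agent: create the agent and run its first task
    public String create(String name, String systemPrompt, String task) {
        String agentId = UUID.randomUUID().toString();
        SubAgent agent = new SubAgent(agentId, name, systemPrompt, chatClient);
        agents.put(agentId, agent);
        return agent.chat(task);
    }

    // Backs chat_with_sub_agent: continue the agent's isolated conversation
    public String chat(String agentId, String message) {
        SubAgent agent = agents.get(agentId);
        return agent != null ? agent.chat(message) : "Unknown agent: " + agentId;
    }

    // Backs destroy_sub_agent: release the agent and its ChatMemory
    public void destroy(String agentId) {
        agents.remove(agentId);
    }
}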
MCP Integration
MCP Server
The demo implements an MCP server that registers a knowledge_query tool. The tool accepts keyword, category, and maxResults and returns RAG search results, making the demo’s retrieval capability reusable by any MCP‑compatible AI.
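A knowledge_query call from an MCP client would carry a payload along these lines (the field values are illustrative):
{
  "keyword": "Spring AI function calling",
  "category": "docs",
  "maxResults": 3
}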
MCP Client
The client discovers remote tools, converts them into ToolCallback instances, and registers them with the local agent. Connection logic prefers Streamable HTTP (per the 2025 spec) and falls back to Server‑Sent Events (SSE) if needed.
public ToolCallback[] connect(String serverUrl) {
    McpSyncClient mcpClient;
    McpSchema.InitializeResult initResult;
    try {
        // Prefer Streamable HTTP (2025 MCP spec)
        mcpClient = connectWithStreamableHttp(serverUrl);
        initResult = mcpClient.initialize();
    } catch (Exception e) {
        // Fall back to Server-Sent Events
        mcpClient = connectWithSse(serverUrl);
        initResult = mcpClient.initialize();
    }
    SyncMcpToolCallbackProvider provider = SyncMcpToolCallbackProvider.builder()
            .mcpClients(mcpClient)
            .build();
    ToolCallback[] toolCallbacks = provider.getToolCallbacks();
    store.add(serverUrl); // persist for later restart
    return toolCallbacks;
}
Runtime management APIs allow adding, removing, and listing MCP connections without restarting the application.
Runtime Configuration & Parameters
Key properties (example values):
spring.ai.openai.base-url=https://open.bigmodel.cn/api/paas/v4
spring.ai.openai.api-key=YOUR_API_KEY
spring.ai.openai.chat.options.model=glm-4
spring.ai.openai.embedding.options.model=embedding-3
These values are placed in src/main/resources/application.properties. The demo also supports dynamic model switching and runtime adjustment of generation parameters such as temperature, maxTokens, and topP via the same properties file or environment variables.
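For instance, the standard Spring AI properties for those generation parameters look like this (the values shown are illustrative, not the demo's defaults):
spring.ai.openai.chat.options.temperature=0.7
spring.ai.openai.chat.options.max-tokens=2048
spring.ai.openai.chat.options.top-p=0.9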
API Usage
Non‑streaming request:
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"Hello, tell me about your capabilities","sessionId":"test-001"}'
Streaming (Server‑Sent Events) request:
curl -X POST http://localhost:8080/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message":"Please help me write a technical blog post about Spring AI","sessionId":"blog-001"}'
Key Design Trade‑offs
Memory compression strategy: Summary compression preserves long‑term context at minimal token cost; assistant trimming reduces token waste from verbose LLM replies; the sliding window guarantees a hard token ceiling.
Multi‑recall + RRF: Combining semantic, BM25, and query‑rewrite recall mitigates the blind spots of any single method. RRF fuses rankings without needing absolute scores, making it robust across heterogeneous retrievers.
Tool registration: Using InnerTool decouples tool implementation from the core agent, enabling zero‑touch addition of new capabilities.
Command vs. Skill: Commands give users explicit control; Skills let the LLM decide, reducing user friction for complex workflows.
Sub‑Agent isolation: A separate ChatMemory per Sub‑Agent prevents cross‑task contamination while sharing the same LLM client for efficiency.
MCP flexibility: Automatic protocol fallback (Streamable HTTP → SSE) and a persistent server list (mcp-servers.json) ensure reliable integration with external AI services.
Conclusion
The demo demonstrates how a Spring AI‑based agent can be assembled from modular components: intent detection, three‑layer memory compression, pluggable function‑calling tools, a robust RAG pipeline with multi‑recall, RRF fusion, and reranking, markdown‑driven Commands and Skills, isolated Sub‑Agents, and standardized MCP integration. By exposing concrete code, design rationales, and performance‑oriented choices, the article provides a reproducible blueprint for building production‑ready AI agents.