Mastering Retrieval‑Augmented Generation with Spring AI: A Complete Guide

This article explains the Retrieval‑Augmented Generation (RAG) paradigm, walks through its four core steps, and provides a detailed Spring AI implementation—including configuration, vector storage, REST controller, multi‑query expansion, query rewriting, document joining, and error handling—plus best‑practice recommendations for production deployments.

Alibaba Cloud Native

What is Retrieval‑Augmented Generation (RAG)?

RAG combines a traditional information‑retrieval step with large‑language‑model (LLM) generation. The system first searches a knowledge base for relevant fragments, then feeds those fragments to the LLM so the answer is grounded in up‑to‑date documents rather than the model’s static parameters.

Core RAG workflow

1. Document splitting → knowledge base construction

Break large source files (PDFs, HTML, plain text, etc.) into smaller, semantically coherent chunks.

Assign metadata tags (e.g., type=specification, section=FAQ) to each chunk to support later filtering.
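Splitting and tagging can be sketched in plain Java (a hypothetical `ChunkingSketch`, not a Spring AI API): fixed-size chunks with a small overlap so context at chunk boundaries is not lost, each chunk carrying metadata tags for later filtering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ChunkingSketch {
    // A chunk of source text plus metadata tags for later filtering.
    record Chunk(String text, Map<String, String> metadata) {}

    // Split text into fixed-size chunks with a small overlap.
    static List<Chunk> split(String text, int size, int overlap, Map<String, String> metadata) {
        List<Chunk> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += size - overlap) {
            int end = Math.min(start + size, text.length());
            chunks.add(new Chunk(text.substring(start, end), metadata));
            if (end == text.length()) break;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = split("a".repeat(250), 100, 20, Map.of("type", "specification"));
        System.out.println(chunks.size()); // prints 3 (chunks start at 0, 80, 160)
    }
}
```

Production splitters (e.g., Spring AI's `TokenTextSplitter`) split on token counts and semantic boundaries rather than raw character offsets, but the shape of the output, text plus metadata, is the same.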

2. Vector encoding → semantic map

Pass each chunk through an embedding model (e.g., SentenceTransformer, Spring AI's OpenAiEmbeddingModel) to obtain a dense vector.

Store the vectors in a vector store (in‑memory, Redis, MongoDB, etc.) and build a similarity index (FAISS, HNSW, etc.).

3. Similarity search → intelligent retriever

Encode the user query into a query vector.

Search the vector store using cosine similarity (or other distance) and optional metadata filters (recency, type, etc.).

Return the top‑N most relevant chunks together with their source citations.
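This is what the vector store does internally. A minimal plain-Java sketch (a hypothetical `SimilaritySearchSketch`, using toy 3-dimensional vectors in place of real embeddings) of cosine-similarity top-N retrieval:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SimilaritySearchSketch {
    // Cosine similarity between two dense vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Return the ids of the topN chunks most similar to the query vector.
    static List<String> topN(Map<String, double[]> store, double[] query, int topN) {
        return store.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> -cosine(e.getValue(), query)))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, double[]> store = Map.of(
                "navigation", new double[]{1, 0, 0},
                "grasping",   new double[]{0, 1, 0},
                "placement",  new double[]{0.9, 0.1, 0});
        System.out.println(topN(store, new double[]{1, 0, 0}, 2)); // prints [navigation, placement]
    }
}
```

Real stores replace the linear scan with an approximate index (FAISS, HNSW) so search stays fast at millions of vectors.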

4. Generation augmentation → answer synthesis

Provide the retrieved chunks to the LLM as part of the prompt (e.g., using a system or context section).

The model generates a response that cites the retrieved material.

The final output can include natural‑language text and explicit citation paths.
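The "stuffing" step can be sketched in plain Java (a hypothetical `PromptAssemblySketch`; Spring AI's advisors do this templating for you): retrieved chunks go into a context section, each prefixed with its source so the model can cite it.

```java
import java.util.List;

public class PromptAssemblySketch {
    // A retrieved chunk together with its citation path.
    record Retrieved(String source, String text) {}

    // Stuff the retrieved chunks into a context section of the prompt so the
    // model answers from the documents and can cite its sources.
    static String buildPrompt(String question, List<Retrieved> chunks) {
        StringBuilder prompt = new StringBuilder("Answer using only the context below.\n\nContext:\n");
        for (Retrieved chunk : chunks) {
            prompt.append("[").append(chunk.source()).append("] ").append(chunk.text()).append("\n");
        }
        return prompt.append("\nQuestion: ").append(question).toString();
    }

    public static void main(String[] args) {
        String prompt = buildPrompt("What can the robot do?",
                List.of(new Retrieved("manual.md#features",
                        "The robot can navigate, grasp, and place items.")));
        System.out.println(prompt);
    }
}
```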

Spring AI reference implementation

Configuration class

@Configuration
public class RagConfig {

    @Bean
    ChatClient chatClient(ChatClient.Builder builder) {
        // System prompt: "You will act as an expert on a robot product and
        // answer users' questions about how to use it."
        return builder.defaultSystem("你将作为一名机器人产品的专家,对于用户的使用需求作出解答")
                      .build();
    }

    @Bean
    VectorStore vectorStore(EmbeddingModel embeddingModel) {
        SimpleVectorStore store = SimpleVectorStore.builder(embeddingModel).build();
        // Product manual for the "smart robot": it can automatically
        // navigate to a location, grasp items, and place items.
        List<Document> documents = List.of(
            new Document("产品说明书:产品名称:智能机器人\n"
                + "产品描述:智能机器人是一个智能设备,能够自动完成各种任务。\n"
                + "功能:\n"
                + "1. 自动导航:机器人能够自动导航到指定位置。\n"
                + "2. 自动抓取:机器人能够自动抓取物品。\n"
                + "3. 自动放置:机器人能够自动放置物品。\n"));
        store.add(documents);
        return store;
    }
}

REST controller (retrieval‑augmentation service)

@RestController
@RequestMapping("/ai")
public class RagController {
    @Autowired private ChatClient chatClient;
    @Autowired private VectorStore vectorStore;

    @PostMapping(value = "/chat", produces = "text/plain; charset=UTF-8")
    public String generation(@RequestParam String userInput) {
        return chatClient.prompt()
                .user(userInput)
                .advisors(new QuestionAnswerAdvisor(vectorStore))
                .call()
                .content();
    }
}

Run the Spring Boot application and call:

POST http://localhost:8080/ai/chat?userInput=机器人有哪些功能?

(userInput: "What features does the robot have?") The response lists the three capabilities with citations from the product manual.
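The ChatClient.Builder and EmbeddingModel beans above are auto-configured by whichever model starter is on the classpath. Assuming the OpenAI starter, a minimal application.properties might look like the sketch below (the model name is an illustrative choice, not a requirement):

```properties
# API key read from the environment; required by the OpenAI starter
spring.ai.openai.api-key=${OPENAI_API_KEY}
# Chat model used by ChatClient (illustrative choice)
spring.ai.openai.chat.options.model=gpt-4o-mini
```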

Advanced query‑enhancement features in Spring AI

Multi‑query expansion

// Build an expander that generates three alternative queries
MultiQueryExpander expander = MultiQueryExpander.builder()
        .chatClientBuilder(builder)
        .includeOriginal(false)
        .numberOfQueries(3)
        .build();
// Input: "Please recommend a few interior-design styles?"
List<Query> queries = expander.expand(new Query("请提供几种推荐的装修风格?"));

Query rewrite

QueryTransformer rewrite = RewriteQueryTransformer.builder()
        .chatClientBuilder(builder)
        .build();
// Input: "I am studying artificial intelligence; what is a large language model?"
Query transformed = rewrite.transform(new Query("我正在学习人工智能,什么是大语言模型?"));
System.out.println(transformed.text()); // Output: 什么是大语言模型? ("What is a large language model?")

Query translation (cross‑language retrieval)

QueryTransformer translator = TranslationQueryTransformer.builder()
        .chatClientBuilder(builder)
        .targetLanguage("chinese")
        .build();
Query zh = translator.transform(new Query("What is LLM?"));
System.out.println(zh.text()); // Output: 什么是大语言模型? ("What is a large language model?")

Context‑aware queries

// Follow-up question: "So what is the average price of second-hand homes in
// this neighborhood?" It is only resolvable with the conversation history below,
// which establishes that "this neighborhood" is Bihaiwan in Shenzhen's Nanshan district.
Query query = Query.builder()
        .text("那这个小区的二手房均价是多少?")
        .history(new UserMessage("深圳市南山区的碧海湾小区在哪里?"),
                 new AssistantMessage("碧海湾小区位于深圳市南山区后海中心区,临近后海地铁站。"))
        .build();
QueryTransformer ct = CompressionQueryTransformer.builder()
        .chatClientBuilder(builder)
        .build();
Query resolved = ct.transform(query);
// Output: 深圳市南山区碧海湾小区的二手房均价是多少? ("What is the average price of
// second-hand homes in the Bihaiwan neighborhood, Nanshan district, Shenzhen?")
System.out.println(resolved.text());

Document joiner (de‑duplication & score preservation)

// ConcatenationDocumentJoiner merges the per-query result lists into one
// de-duplicated list; documentsForQuery is a Map<Query, List<List<Document>>>.
DocumentJoiner joiner = new ConcatenationDocumentJoiner();
List<Document> merged = joiner.join(documentsForQuery);
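The merge behaviour described above (keep each document once, preserve the best score seen for it) can be sketched in plain Java (a hypothetical `JoinerSketch`, not the Spring AI implementation):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JoinerSketch {
    // A retrieved document id with its similarity score.
    record ScoredDoc(String id, double score) {}

    // Merge result lists from several expanded queries: keep each document
    // once, preserving the highest score seen for it.
    static List<ScoredDoc> join(List<List<ScoredDoc>> resultsPerQuery) {
        Map<String, ScoredDoc> byId = new LinkedHashMap<>();
        for (List<ScoredDoc> results : resultsPerQuery) {
            for (ScoredDoc doc : results) {
                byId.merge(doc.id(), doc, (a, b) -> a.score() >= b.score() ? a : b);
            }
        }
        return List.copyOf(byId.values());
    }

    public static void main(String[] args) {
        List<ScoredDoc> merged = join(List.of(
                List.of(new ScoredDoc("d1", 0.8), new ScoredDoc("d2", 0.5)),
                List.of(new ScoredDoc("d1", 0.9))));
        System.out.println(merged); // d1 kept once, with its higher score 0.9
    }
}
```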

RetrievalAugmentationAdvisor usage

Basic usage

SimpleVectorStore store = SimpleVectorStore.builder(embeddingModel).build();
store.add(documents);
Advisor advisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(VectorStoreDocumentRetriever.builder()
                .vectorStore(store)
                .similarityThreshold(0.5)
                .topK(3)
                .build())
        .build();
String answer = chatClient.prompt()
        .user("机器人有哪些功能?") // "What features does the robot have?"
        .advisors(advisor)
        .call()
        .content();

Advanced configuration

Advisor advisor = RetrievalAugmentationAdvisor.builder()
        // Proceed with a fallback prompt when retrieval returns no
        // documents, instead of failing the request.
        .queryAugmenter(ContextualQueryAugmenter.builder()
                .allowEmptyContext(true)
                .build())
        .documentRetriever(VectorStoreDocumentRetriever.builder()
                .vectorStore(store)
                .similarityThreshold(0.5)
                .topK(3)
                // Metadata filter: only consider chunks tagged type=specification.
                .filterExpression(new FilterExpressionBuilder()
                        .eq("type", "specification").build())
                .build())
        .build();

Error handling & edge cases

If no relevant documents are found, enabling allowEmptyContext lets the advisor fall back to a friendly prompt asking the user for more details, rather than throwing a null-pointer exception on the empty context.

return chatClient.prompt()
        .user(query)
        .advisors(advisor)
        .call()
        .content();
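Outside the advisor pipeline, the same guard can be sketched in plain Java (a hypothetical `FallbackSketch`, not a Spring AI API):

```java
import java.util.List;

public class FallbackSketch {
    // If retrieval returns nothing, answer with a friendly fallback instead
    // of sending an empty context to the model.
    static String answer(String query, List<String> retrieved) {
        if (retrieved == null || retrieved.isEmpty()) {
            return "I could not find relevant documents. Could you share more details?";
        }
        return "Answering \"" + query + "\" from " + retrieved.size() + " document(s).";
    }

    public static void main(String[] args) {
        System.out.println(answer("What is the warranty period?", List.of()));
    }
}
```

Logging these empty-retrieval events, as the checklist below recommends, is what turns missing-document cases into knowledge-base improvements.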

Best‑practice checklist for production RAG systems

Document design: store each fragment with clear identifiers and rich metadata (type, year, tags).

Chunking strategy: use semantic splitters that keep context, avoid overly short or long chunks, and tag each chunk.

Vector store selection: choose in-memory for prototypes, Redis/MongoDB/FAISS for larger corpora; configure persistence if needed.

Retriever tuning: set similarity thresholds, topK, and metadata filters to balance relevance and latency.

Query enhancement: enable multi-query expansion, rewrite, and translation to improve recall and support multilingual use cases.

Advisor integration: use RetrievalAugmentationAdvisor to inject retrieved context automatically into LLM prompts.

Error handling: allow empty context, provide clear fallback messages, and log missing-document events for continuous improvement.

Performance optimisation: limit document loading size, monitor memory usage, and enable caching of vector indexes.

Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
