Mastering RAG with Spring AI: Build a Retrieval‑Augmented Generation System from Scratch
This article explains the background, principles, and step‑by‑step implementation of Retrieval‑Augmented Generation (RAG) using Spring AI, covering embedding models, vector databases, chunking strategies, indexing algorithms, similarity metrics, re‑ranking, prompt templates, and a complete Java code example.
Retrieval‑Augmented Generation (RAG) Overview
RAG enhances the context of a large language model (LLM) by retrieving relevant knowledge from an external store before generation. The typical flow is:
1. User submits a query.
2. A retrieval component searches a vector‑based knowledge base.
3. The top‑k matching text chunks are returned.
4. The query and retrieved chunks are combined into a prompt.
5. The LLM generates the final answer.
Embedding Models
An embedding model maps a piece of text to a fixed‑length floating‑point vector (e.g., OpenAI text‑embedding‑3‑small → 1536 dimensions). Vectors that are close in the high‑dimensional space correspond to semantically similar texts, enabling similarity search.
Vector Databases
Traditional relational databases excel at exact matches but are ill‑suited to semantic similarity search, so RAG stores embeddings in a dedicated vector database (e.g., Pinecone, Chroma, PostgreSQL + PGVector). Each record keeps the original text, optional metadata, and the embedding vector, allowing fast approximate nearest‑neighbor (ANN) queries.
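As a preview of the API used later in this article, a top‑k similarity query through Spring AI looks roughly like this (a minimal sketch, assuming a configured VectorStore bean; topK(5) is an arbitrary choice):
import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;

public class SimilaritySearchDemo {
    // Embeds the query, runs an ANN search, and returns the five nearest chunks.
    public static List<Document> topMatches(VectorStore vectorStore, String query) {
        return vectorStore.similaritySearch(SearchRequest.builder()
                .query(query)
                .topK(5)
                .build());
    }
}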
Chunking Strategies
Documents are split into manageable pieces before embedding. Common strategies:
Fixed‑size split: divide by a fixed number of characters or tokens. Simple but may cut sentences.
Structural split: use markup (HTML, Markdown) to keep logical sections.
Semantic split: detect semantic boundaries with NLP or clustering for highest coherence (computationally heavy).
Recursive split: hierarchical splitting (paragraph → sentence) until length limits are satisfied; balances length control and semantics.
In practice a hybrid approach is often chosen based on document type.
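As a concrete illustration of the first strategy, here is a minimal fixed‑size splitter with overlap (character‑based for simplicity; production splitters usually count tokens, and the size and overlap values are arbitrary):
import java.util.ArrayList;
import java.util.List;

public class FixedSizeSplitter {
    // Splits text into chunks of chunkSize characters; consecutive chunks
    // share `overlap` characters so sentences cut at a boundary keep context.
    public static List<String> split(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) {
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize - overlap) {
            chunks.add(text.substring(start, Math.min(start + chunkSize, text.length())));
        }
        return chunks;
    }
}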
Indexing Algorithms (ANN)
To accelerate similarity search, vectors are indexed with Approximate Nearest Neighbor (ANN) structures. Popular methods:
LSH (Locality Sensitive Hashing) – hashes similar vectors into the same bucket.
Annoy – builds random‑projection trees; queries traverse the trees.
HNSW (Hierarchical Navigable Small World) – constructs a multi‑layer graph for fast navigation.
IVF (Inverted File Index) – clusters vectors with k‑means and searches only the nearest clusters.
ANN trades a small loss in recall for large gains in latency; the choice depends on the required speed‑accuracy balance.
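To make the idea tangible, below is a toy sketch of random‑hyperplane LSH for cosine similarity (real libraries add multiple hash tables and multi‑probe search; the bit count and seed here are arbitrary):
import java.util.Random;

public class RandomHyperplaneLsh {
    private final float[][] hyperplanes; // one random hyperplane per hash bit

    public RandomHyperplaneLsh(int numBits, int dimensions, long seed) {
        Random random = new Random(seed);
        hyperplanes = new float[numBits][dimensions];
        for (int b = 0; b < numBits; b++) {
            for (int d = 0; d < dimensions; d++) {
                hyperplanes[b][d] = (float) random.nextGaussian();
            }
        }
    }

    // Vectors separated by a small angle fall on the same side of most
    // hyperplanes, so they tend to land in the same bucket.
    public int hash(float[] vector) {
        int bucket = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            float dot = 0f;
            for (int d = 0; d < vector.length; d++) {
                dot += hyperplanes[b][d] * vector[d];
            }
            if (dot >= 0) {
                bucket |= (1 << b);
            }
        }
        return bucket;
    }
}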
Similarity Metrics
RAG typically uses one of the following distance measures:
Euclidean distance – geometric distance; sensitive to vector length.
Dot product – works well when embedding magnitude encodes importance; fast to compute.
Cosine similarity – measures the angle between vectors; robust to length differences and widely used in NLP.
For vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, the corresponding formulas are:
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i, \qquad \cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$
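The same three measures in plain Java (a dependency‑free sketch):
public class SimilarityMetrics {
    public static double dot(float[] x, float[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    public static double euclidean(float[] x, float[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: the dot product normalized by both vector lengths,
    // which makes it robust to differences in magnitude.
    public static double cosine(float[] x, float[] y) {
        return dot(x, y) / (Math.sqrt(dot(x, x)) * Math.sqrt(dot(y, y)));
    }
}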
Re‑ranking
ANN returns a set of top‑k candidates based on vector similarity, but similarity does not guarantee relevance to the query. A second‑stage re‑ranking model (often a cross‑encoder) evaluates actual relevance and selects the final top‑n documents for the LLM.
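The two‑stage pattern in a sketch (CrossEncoder is a hypothetical interface standing in for whatever re‑ranking model or hosted API you actually use):
import java.util.Comparator;
import java.util.List;
import org.springframework.ai.document.Document;

public class Reranker {
    // Hypothetical relevance model: higher score means more relevant.
    public interface CrossEncoder {
        double score(String query, String document);
    }

    // Scores every ANN candidate against the query and keeps the best topN.
    public static List<Document> rerank(CrossEncoder encoder, String query,
                                        List<Document> candidates, int topN) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (Document d) -> encoder.score(query, d.getText())).reversed())
                .limit(topN)
                .toList();
    }
}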
Prompt Template
The retrieved context and the user query are merged into a formatted prompt. A common template is:
You are a helpful assistant. Based on the following context, answer the question.
Context:
{retrieved_documents}
Question:
{user_query}
Answer:
This template forces the LLM to focus on the supplied knowledge and reduces hallucinations.
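Rendered with Spring AI's PromptTemplate, this looks as follows (a minimal sketch; the variable names match the placeholders above):
import java.util.Map;
import org.springframework.ai.chat.prompt.PromptTemplate;

public class RagPrompt {
    private static final String TEMPLATE = """
            You are a helpful assistant. Based on the following context, answer the question.
            Context:
            {retrieved_documents}
            Question:
            {user_query}
            Answer:""";

    // Fills both placeholders and returns the final prompt string.
    public static String render(String context, String query) {
        return new PromptTemplate(TEMPLATE)
                .render(Map.of("retrieved_documents", context, "user_query", query));
    }
}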
Full RAG Process
Offline phase – ingest documents, chunk them, embed each chunk, store embeddings in the vector database, and build an ANN index.
Online phase – embed the user query, perform similarity search, re‑rank the candidates, construct the prompt, and invoke the LLM.
Practical Implementation with Spring AI
The example uses the following stack:
JDK 17
Spring Boot 3.5.0
Spring AI 1.0.0
Maven
LLM: Qwen2.5‑72B‑Instruct
Embedding model: text‑embedding‑ada‑002
Vector store: PostgreSQL + PGVector
Dependencies
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>
Embedding Configuration
spring:
ai:
openai:
base-url: YOUR_URL
api-key: YOUR_KEY
embedding:
options:
model: text-embedding-ada-002
Inject EmbeddingModel as a Spring bean and call embed() to obtain a 1536‑dimensional vector.
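For example (a minimal sketch; the /embed endpoint exists only for illustration):
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class EmbeddingController {
    private final EmbeddingModel embeddingModel;

    public EmbeddingController(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    @GetMapping("/embed")
    public int embed(String text) {
        float[] vector = embeddingModel.embed(text);
        return vector.length; // 1536 for text-embedding-ada-002
    }
}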
Vector Store Configuration
spring:
ai:
vectorstore:
pgvector:
initialize-schema: true
index-type: HNSW
distance-type: COSINE_DISTANCE
dimensions: 1536
max-document-batch-size: 10000
datasource:
url: jdbc:postgresql://localhost/postgres
username: YOUR_USER
password: YOUR_PASSWORD
When initialize-schema is true, Spring AI creates the table and index automatically:
CREATE TABLE IF NOT EXISTS vector_store (
id uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
content text,
metadata json,
embedding vector(1536)
);
CREATE INDEX ON vector_store USING HNSW (embedding vector_cosine_ops);
ETL Pipeline Components
DocumentReader – extracts raw documents (e.g., TextReader for plain‑text files).
DocumentTransformer – processes documents; TokenTextSplitter performs chunking.
DocumentWriter – loads processed documents into a store (e.g., PgVectorStore).
Example endpoint that reads a file, splits it, and stores the chunks:
@Autowired
private VectorStore vectorStore;
@Value("classpath:/file.txt")
private Resource resource;
@GetMapping("/etl")
public void etl() {
TextReader reader = new TextReader(resource);
List<Document> extracted = reader.read();
TokenTextSplitter splitter = new TokenTextSplitter(200, 200, 5, 10000, true); // chunkSize, minChunkSizeChars, minChunkLengthToEmbed, maxNumChunks, keepSeparator
List<Document> transformed = splitter.apply(extracted);
vectorStore.add(transformed);
}
Key TokenTextSplitter parameters:
chunkSize: target token count per chunk (default 800).
minChunkSizeChars: minimum number of characters per chunk, to avoid overly short chunks (default 350).
minChunkLengthToEmbed: chunks shorter than this length are skipped for embedding (default 5).
maxNumChunks: maximum number of chunks per document (default 10000).
keepSeparator: retain original separators such as line breaks (default true).
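The positional constructor is easy to misread, so a builder with named options may be preferable (a sketch assuming the Spring AI 1.0.0 builder API):
// Equivalent to new TokenTextSplitter(200, 200, 5, 10000, true).
TokenTextSplitter splitter = TokenTextSplitter.builder()
        .withChunkSize(200)
        .withMinChunkSizeChars(200)
        .withMinChunkLengthToEmbed(5)
        .withMaxNumChunks(10000)
        .withKeepSeparator(true)
        .build();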
RAG Interaction
@Autowired
private ChatClient chatClient;
@Autowired
private VectorStore vectorStore;
@GetMapping("/rag")
public String chatWithRag(String input) {
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.documentRetriever(VectorStoreDocumentRetriever.builder()
.similarityThreshold(0.5)
.vectorStore(vectorStore)
.build())
.queryAugmenter(ContextualQueryAugmenter.builder()
.allowEmptyContext(true)
.build())
.build();
String result = chatClient.prompt()
.advisors(ragAdvisor)
.user(input)
.call()
.content();
return result;
}
The LLM is configured to use Qwen2.5‑72B‑Instruct (served through an OpenAI‑compatible endpoint). Setting the logger to DEBUG shows the full retrieval‑generation pipeline.
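The model selection and debug logging can be configured along these lines (a sketch; whether this model name is accepted depends on your OpenAI‑compatible provider):
spring:
  ai:
    openai:
      chat:
        options:
          model: Qwen2.5-72B-Instruct
logging:
  level:
    org.springframework.ai: DEBUG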
Outlook
RAG combines external knowledge with the generative power of LLMs, making it a cornerstone for enterprise AI applications. As vector‑store implementations and ANN algorithms mature, RAG is expected to become more reliable, scalable, and widely adopted.