Mastering RAG with Spring AI: Build a Retrieval‑Augmented Generation System from Scratch
This article explains the background, principles, and step‑by‑step implementation of Retrieval‑Augmented Generation (RAG) using Spring AI, covering embedding models, vector databases, chunking strategies, indexing algorithms, similarity metrics, re‑ranking, prompt templates, and a complete Java code example.
Retrieval‑Augmented Generation (RAG) Overview
RAG enhances the context of a large language model (LLM) by retrieving relevant knowledge from an external store before generation. The typical flow is:
1. User submits a query.
2. A retrieval component searches a vector‑based knowledge base.
3. The top‑k matching text chunks are returned.
4. The query and retrieved chunks are combined into a prompt.
5. The LLM generates the final answer.
Embedding Models
An embedding model maps a piece of text to a fixed‑length floating‑point vector (e.g., OpenAI text‑embedding‑3‑small → 1536 dimensions). Vectors that are close in the high‑dimensional space correspond to semantically similar texts, enabling similarity search.
Vector Databases
Traditional relational databases excel at exact matches but are ill‑suited to semantic similarity search, so RAG stores embeddings in a dedicated vector database (e.g., Pinecone, Chroma, PostgreSQL + PGVector). Each record keeps the original text, optional metadata, and the embedding vector, allowing fast approximate nearest‑neighbor (ANN) queries.
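As a preview of the API used later in this article, a top‑k similarity query through Spring AI looks roughly like this (a minimal sketch, assuming a configured VectorStore bean; topK(5) is an arbitrary choice):
import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;

public class SimilaritySearchDemo {
    // Embeds the query, runs an ANN search, and returns the five nearest chunks.
    public static List<Document> topMatches(VectorStore vectorStore, String query) {
        return vectorStore.similaritySearch(SearchRequest.builder()
                .query(query)
                .topK(5)
                .build());
    }
}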
Chunking Strategies
Documents are split into manageable pieces before embedding. Common strategies:
Fixed‑size split: divide by a fixed number of characters or tokens. Simple but may cut sentences.
Structural split: use markup (HTML, Markdown) to keep logical sections.
Semantic split: detect semantic boundaries with NLP or clustering for highest coherence (computationally heavy).
Recursive split: hierarchical splitting (paragraph → sentence) until length limits are satisfied; balances length control and semantics.
In practice a hybrid approach is often chosen based on document type.
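As a concrete illustration of the first strategy, here is a minimal fixed‑size splitter with overlap (character‑based for simplicity; production splitters usually count tokens, and the size and overlap values are arbitrary):
import java.util.ArrayList;
import java.util.List;

public class FixedSizeSplitter {
    // Splits text into chunks of chunkSize characters; consecutive chunks
    // share `overlap` characters so sentences cut at a boundary keep context.
    public static List<String> split(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) {
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize - overlap) {
            chunks.add(text.substring(start, Math.min(start + chunkSize, text.length())));
        }
        return chunks;
    }
}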
Indexing Algorithms (ANN)
To accelerate similarity search, vectors are indexed with Approximate Nearest Neighbor (ANN) structures. Popular methods:
LSH (Locality Sensitive Hashing) – hashes similar vectors into the same bucket.
Annoy – builds random‑projection trees; queries traverse the trees.
HNSW (Hierarchical Navigable Small World) – constructs a multi‑layer graph for fast navigation.
IVF (Inverted File Index) – clusters vectors with k‑means and searches only the nearest clusters.
ANN trades a small loss in recall for large gains in latency; the choice depends on the required speed‑accuracy balance.
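To make the idea tangible, below is a toy sketch of random‑hyperplane LSH for cosine similarity (real libraries add multiple hash tables and multi‑probe search; the bit count and seed here are arbitrary):
import java.util.Random;

public class RandomHyperplaneLsh {
    private final float[][] hyperplanes; // one random hyperplane per hash bit

    public RandomHyperplaneLsh(int numBits, int dimensions, long seed) {
        Random random = new Random(seed);
        hyperplanes = new float[numBits][dimensions];
        for (int b = 0; b < numBits; b++) {
            for (int d = 0; d < dimensions; d++) {
                hyperplanes[b][d] = (float) random.nextGaussian();
            }
        }
    }

    // Vectors separated by a small angle fall on the same side of most
    // hyperplanes, so they tend to land in the same bucket.
    public int hash(float[] vector) {
        int bucket = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            float dot = 0f;
            for (int d = 0; d < vector.length; d++) {
                dot += hyperplanes[b][d] * vector[d];
            }
            if (dot >= 0) {
                bucket |= (1 << b);
            }
        }
        return bucket;
    }
}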
Similarity Metrics
RAG typically uses one of the following distance measures:
Euclidean distance – geometric distance; sensitive to vector length.
Dot product – works well when embedding magnitude encodes importance; fast to compute.
Cosine similarity – measures the angle between vectors; robust to length differences and widely used in NLP.
For vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, the corresponding formulas are:
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i, \qquad \cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$
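The same three measures in plain Java (a dependency‑free sketch):
public class SimilarityMetrics {
    public static double dot(float[] x, float[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    public static double euclidean(float[] x, float[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: the dot product normalized by both vector lengths,
    // which makes it robust to differences in magnitude.
    public static double cosine(float[] x, float[] y) {
        return dot(x, y) / (Math.sqrt(dot(x, x)) * Math.sqrt(dot(y, y)));
    }
}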
Re‑ranking
ANN returns a set of top‑k candidates based on vector similarity, but similarity does not guarantee relevance to the query. A second‑stage re‑ranking model (often a cross‑encoder) evaluates actual relevance and selects the final top‑n documents for the LLM.
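The two‑stage pattern in a sketch (CrossEncoder is a hypothetical interface standing in for whatever re‑ranking model or hosted API you actually use):
import java.util.Comparator;
import java.util.List;
import org.springframework.ai.document.Document;

public class Reranker {
    // Hypothetical relevance model: higher score means more relevant.
    public interface CrossEncoder {
        double score(String query, String document);
    }

    // Scores every ANN candidate against the query and keeps the best topN.
    public static List<Document> rerank(CrossEncoder encoder, String query,
                                        List<Document> candidates, int topN) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (Document d) -> encoder.score(query, d.getText())).reversed())
                .limit(topN)
                .toList();
    }
}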
Prompt Template
The retrieved context and the user query are merged into a formatted prompt. A common template is:
You are a helpful assistant. Based on the following context, answer the question.
Context:
{retrieved_documents}
Question:
{user_query}
Answer:
This template forces the LLM to focus on the supplied knowledge and reduces hallucinations.
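Rendered with Spring AI's PromptTemplate, this looks as follows (a minimal sketch; the variable names match the placeholders above):
import java.util.Map;
import org.springframework.ai.chat.prompt.PromptTemplate;

public class RagPrompt {
    private static final String TEMPLATE = """
            You are a helpful assistant. Based on the following context, answer the question.
            Context:
            {retrieved_documents}
            Question:
            {user_query}
            Answer:""";

    // Fills both placeholders and returns the final prompt string.
    public static String render(String context, String query) {
        return new PromptTemplate(TEMPLATE)
                .render(Map.of("retrieved_documents", context, "user_query", query));
    }
}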
Full RAG Process
Offline phase – ingest documents, chunk them, embed each chunk, store embeddings in the vector database, and build an ANN index.
Online phase – embed the user query, perform similarity search, re‑rank the candidates, construct the prompt, and invoke the LLM.
Practical Implementation with Spring AI
The example uses the following stack:
JDK 17
Spring Boot 3.5.0
Spring AI 1.0.0
Maven
LLM: Qwen2.5‑72B‑Instruct
Embedding model: text‑embedding‑ada‑002
Vector store: PostgreSQL + PGVector
Dependencies
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>
Embedding Configuration
spring:
ai:
openai:
base-url: YOUR_URL
api-key: YOUR_KEY
embedding:
options:
model: text-embedding-ada-002
Inject EmbeddingModel as a Spring bean and call embed() to obtain a 1536‑dimensional vector.
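For example (a minimal sketch; the /embed endpoint exists only for illustration):
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class EmbeddingController {
    private final EmbeddingModel embeddingModel;

    public EmbeddingController(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    @GetMapping("/embed")
    public int embed(String text) {
        float[] vector = embeddingModel.embed(text);
        return vector.length; // 1536 for text-embedding-ada-002
    }
}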
Vector Store Configuration
spring:
ai:
vectorstore:
pgvector:
initialize-schema: true
index-type: HNSW
distance-type: COSINE_DISTANCE
dimensions: 1536
max-document-batch-size: 10000
datasource:
url: jdbc:postgresql://localhost/postgres
username: YOUR_USER
password: YOUR_PASSWORD
When initialize-schema is true, Spring AI creates the table and index automatically:
CREATE TABLE IF NOT EXISTS vector_store (
id uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
content text,
metadata json,
embedding vector(1536)
);
CREATE INDEX ON vector_store USING HNSW (embedding vector_cosine_ops);
ETL Pipeline Components
DocumentReader – extracts raw documents (e.g., TextReader for plain‑text files).
DocumentTransformer – processes documents; TokenTextSplitter performs chunking.
DocumentWriter – loads processed documents into a store (e.g., PgVectorStore).
Example endpoint that reads a file, splits it, and stores the chunks:
@Autowired
private VectorStore vectorStore;
@Value("classpath:/file.txt")
private Resource resource;
@GetMapping("/etl")
public void etl() {
TextReader reader = new TextReader(resource);
List<Document> extracted = reader.read();
TokenTextSplitter splitter = new TokenTextSplitter(200, 200, 5, 10000, true); // chunkSize, minChunkSizeChars, minChunkLengthToEmbed, maxNumChunks, keepSeparator
List<Document> transformed = splitter.apply(extracted);
vectorStore.add(transformed);
}
Key TokenTextSplitter parameters:
chunkSize: target token count per chunk (default 800).
minChunkSizeChars: minimum number of characters per chunk, to avoid overly short chunks (default 350).
minChunkLengthToEmbed: chunks shorter than this length are skipped for embedding (default 5).
maxNumChunks: maximum number of chunks per document (default 10000).
keepSeparator: retain original separators such as line breaks (default true).
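The positional constructor is easy to misread, so a builder with named options may be preferable (a sketch assuming the Spring AI 1.0.0 builder API):
// Equivalent to new TokenTextSplitter(200, 200, 5, 10000, true).
TokenTextSplitter splitter = TokenTextSplitter.builder()
        .withChunkSize(200)
        .withMinChunkSizeChars(200)
        .withMinChunkLengthToEmbed(5)
        .withMaxNumChunks(10000)
        .withKeepSeparator(true)
        .build();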
RAG Interaction
@Autowired
private ChatClient chatClient;
@Autowired
private VectorStore vectorStore;
@GetMapping("/rag")
public String chatWithRag(String input) {
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.documentRetriever(VectorStoreDocumentRetriever.builder()
.similarityThreshold(0.5)
.vectorStore(vectorStore)
.build())
.queryAugmenter(ContextualQueryAugmenter.builder()
.allowEmptyContext(true)
.build())
.build();
String result = chatClient.prompt()
.advisors(ragAdvisor)
.user(input)
.call()
.content();
return result;
}
The LLM is configured to use Qwen2.5‑72B‑Instruct (served through an OpenAI‑compatible endpoint). Setting the logger to DEBUG shows the full retrieval‑generation pipeline.
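The model selection and debug logging can be configured along these lines (a sketch; whether this model name is accepted depends on your OpenAI‑compatible provider):
spring:
  ai:
    openai:
      chat:
        options:
          model: Qwen2.5-72B-Instruct
logging:
  level:
    org.springframework.ai: DEBUG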
Outlook
RAG combines external knowledge with the generative power of LLMs, making it a cornerstone for enterprise AI applications. As vector‑store implementations and ANN algorithms mature, RAG is expected to become more reliable, scalable, and widely adopted.