From Demo to Production: Building an Enterprise‑Grade RAG System with Spring AI & PGVector
This comprehensive guide explains how to design, implement, and operate a production‑ready Retrieval‑Augmented Generation (RAG) platform using Spring AI and PostgreSQL PGVector, covering architecture, indexing, hybrid retrieval, prompt engineering, scaling, security, observability, deployment, and common pitfalls for enterprise knowledge‑base applications.
Why Enterprises Need a Full RAG System
Connecting a large language model (LLM) to a chat UI is only the first step. In real business scenarios three problems appear:
Private knowledge (policies, SOPs, product docs, logs) is not stored in the model.
LLMs hallucinate, especially in high‑risk domains.
Enterprises require traceability, governance, observability and strict performance guarantees.
Enterprise RAG Goals
Accuracy: minimise hallucinations and make every answer traceable.
Latency: keep retrieval and generation delay predictable.
Throughput: support high‑concurrency queries and bulk ingestion.
Cost: control embedding, inference, storage and cache expenses.
Scalability: grow with document volume, user count and tenant count.
Governance: permissions, audit, canary releases, evaluation and replay.
Operability: metrics, logs, tracing and alerts.
Overall Architecture
The system is built as a layered, closed‑loop pipeline, from the indexing path down to storage:

Offline/async indexing pipeline
    Document parsing, cleaning, chunking, embedding, metadata storage
Application layer (Spring Boot + Spring AI)
    Authentication & authorization
    Rate limiting & circuit breaking
    RAG orchestrator
        Query rewrite service
        Hybrid retrieval (vector + BM25)
        Reranker (cross‑encoder)
        Prompt builder
        LLM generation
    Conversation memory
    Cache (local Caffeine + Redis)
    Observability (Micrometer, Prometheus)
Data layer
    PostgreSQL + PGVector (vector store)
    Full‑text search (BM25)
    Object storage (MinIO / OSS)
Why Spring AI + PGVector
Spring AI integrates AI capabilities into the Spring ecosystem, offering unified access to ChatModel, EmbeddingModel and VectorStore, native Spring Boot configuration, lifecycle management, monitoring and transaction support. This makes the stack easy for Java teams.
PGVector extends PostgreSQL with a vector column, allowing documents, permissions, versions and embeddings to live in a single relational store. It provides ACID guarantees, powerful SQL joins for metadata filtering and lower operational complexity compared with dedicated vector databases.
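For orientation, a minimal application.yml wiring both pieces together might look like the sketch below. It assumes the Spring AI PGVector starter; exact property names vary between Spring AI versions, so treat this as a starting point rather than a definitive configuration.

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/rag
    username: rag
    password: rag
  ai:
    vectorstore:
      pgvector:
        initialize-schema: true   # create the extension/table/index on startup
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536          # must match the embedding model's output size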
Core Technical Principles
Embedding
Embedding maps text to dense vectors in a semantic space. Quality depends on model language support, chunking strategy, noise removal and query rewriting.
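With Spring AI the embedding step itself reduces to a single call. A minimal sketch, assuming an auto‑configured EmbeddingModel bean (in Spring AI 1.x, embed(String) returns a float[]):

import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;

@Service
public class EmbeddingService {

    private final EmbeddingModel embeddingModel;

    public EmbeddingService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    // Maps a cleaned chunk of text to a dense vector in the semantic space.
    public float[] embedChunk(String chunkText) {
        return embeddingModel.embed(chunkText);
    }
}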
Chunking
Long enterprise documents must be split into manageable chunks. Three common strategies:
Fixed‑length chunks (simple but may break semantics).
Recursive chunking based on headings, paragraphs and sentences (default for most cases).
Semantic chunking using embedding similarity (best accuracy, higher cost).
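A minimal sketch of the recursive strategy in plain Java, splitting on headings, then paragraphs, then sentences until each piece fits a size budget (characters are used here as a rough token proxy; a production splitter would also merge undersized pieces and add overlap):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RecursiveChunker {

    // Separators tried in order: markdown-style headings, paragraphs, sentences.
    private static final String[] SEPARATORS = {"\n## ", "\n\n", ". "};

    public List<String> chunk(String text, int maxChars) {
        List<String> out = new ArrayList<>();
        split(text, 0, maxChars, out);
        return out;
    }

    private void split(String text, int level, int maxChars, List<String> out) {
        // Emit the piece once it fits, or once no finer separator is left.
        if (text.length() <= maxChars || level >= SEPARATORS.length) {
            if (!text.isBlank()) out.add(text.strip());
            return;
        }
        for (String part : text.split(Pattern.quote(SEPARATORS[level]))) {
            split(part, level + 1, maxChars, out);
        }
    }
}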
Typical recommendations:
FAQ – small chunks (≈200 tokens).
Policy documents – medium chunks (≈500 tokens).
Technical manuals – hierarchical chunking.
Log files – event‑based chunks.
PGVector Index Types
IVFFlat: fast approximate search for very large collections; requires training on representative data and careful parameter tuning.
HNSW: high recall and stable performance at the cost of higher memory usage; the default choice for most enterprise RAG workloads.
Guideline: use HNSW for roughly 100k–5M chunks; consider IVFFlat only for extremely large, frequently updated datasets.
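In pgvector SQL the two index types look like this (table and column names are placeholders matching the schema sketched in the next section):

-- HNSW: good recall/latency trade-off for most enterprise workloads
CREATE INDEX idx_chunk_embedding_hnsw ON kb_document_chunk
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Per-session recall/latency knob for HNSW queries
SET hnsw.ef_search = 100;

-- IVFFlat alternative for extremely large, frequently rebuilt collections;
-- build it only after the table contains representative data
CREATE INDEX idx_chunk_embedding_ivf ON kb_document_chunk
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);
SET ivfflat.probes = 10;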
Data Model Design
Separate tables for raw documents, document chunks, indexing jobs, conversation sessions and evaluation feedback. This separation simplifies version switching, partial re‑indexing and A/B experiments.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE kb_document ( ... );
CREATE TABLE kb_document_chunk ( ... );
CREATE TABLE rag_query_log ( ... );
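The column lists are elided above and will depend on your domain. One plausible shape for the chunk table, with hypothetical columns kept consistent with the filters used later in this article:

CREATE TABLE kb_document_chunk (
    id                UUID PRIMARY KEY,
    document_id       UUID NOT NULL,          -- FK to kb_document
    tenant_id         TEXT NOT NULL,
    knowledge_base_id TEXT NOT NULL,
    doc_type          TEXT NOT NULL,
    chunk_index       INT  NOT NULL,
    content           TEXT NOT NULL,
    embedding         vector(1536),           -- dimension must match the embedding model
    enabled           BOOLEAN NOT NULL DEFAULT true,
    version           INT NOT NULL,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);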
Production‑Ready Retrieval Pipeline
Hybrid Retrieval Service
The service first performs a vector similarity search, optionally a BM25 keyword search, then fuses the results with Reciprocal Rank Fusion (RRF). The top‑K results are optionally re‑ranked.
public List<RetrievedChunk> retrieve(String tenantId, String knowledgeBaseId, String query, String docType) { ... }
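A sketch of how such a service might be wired. The Retriever abstraction and all names here are hypothetical, standing in for a PGVector similarity query and a BM25 full‑text query:

import java.util.List;

public class HybridRetrievalService {

    // Hypothetical retriever abstraction; back it with a PGVector similarity
    // query and a BM25 full-text query respectively.
    public interface Retriever {
        List<RetrievedChunk> search(String tenantId, String kbId,
                                    String query, String docType, int limit);
    }

    public record RetrievedChunk(String chunkId, String content) {}

    private static final int RECALL_K = 50; // candidates recalled per retriever
    private static final int TOP_K = 8;     // contexts handed to the LLM

    private final Retriever vectorRetriever;
    private final Retriever keywordRetriever;

    public HybridRetrievalService(Retriever vectorRetriever, Retriever keywordRetriever) {
        this.vectorRetriever = vectorRetriever;
        this.keywordRetriever = keywordRetriever;
    }

    public List<RetrievedChunk> retrieve(String tenantId, String knowledgeBaseId,
                                         String query, String docType) {
        var vec = vectorRetriever.search(tenantId, knowledgeBaseId, query, docType, RECALL_K);
        var kw  = keywordRetriever.search(tenantId, knowledgeBaseId, query, docType, RECALL_K);
        // Fuse the two ranked lists with RRF (implementation shown in the next
        // section), then keep only the top-K candidates for reranking/prompting.
        List<RetrievedChunk> fused = ReciprocalRankFusion.fuse(List.of(vec, kw), 60);
        return fused.subList(0, Math.min(TOP_K, fused.size()));
    }
}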
Reciprocal Rank Fusion (RRF)
RRF scores each document by summing 1 / (k + rank) across all result lists, allowing heterogeneous scores to be combined without normalisation.
score(d) = Σ 1 / (k + rank_i(d))
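The formula translates directly into Java. A minimal generic sketch (k is commonly set to 60; elements from different lists are matched by equals/hashCode, which record-based chunk types provide):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class ReciprocalRankFusion {

    private ReciprocalRankFusion() {}

    // score(d) = sum of 1 / (k + rank_i(d)) over every result list containing d.
    // Ranks are 1-based; k damps the dominance of top-ranked items.
    public static <T> List<T> fuse(List<List<T>> rankedLists, int k) {
        Map<T, Double> scores = new LinkedHashMap<>();
        for (List<T> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                scores.merge(list.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<T, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }
}

Because RRF consumes only ranks, BM25 scores and cosine similarities never need to be put on a common scale.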
Query Rewrite Service
A lightweight LLM prompt rewrites the user question into a retrieval‑friendly statement while preserving intent and adding missing context.
public String rewrite(String question, String historySummary) { ... }
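One possible implementation with Spring AI's fluent ChatClient (1.x API; the prompt wording is illustrative, not prescriptive):

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class QueryRewriteService {

    private final ChatClient chatClient;

    // Spring AI auto-configures a ChatClient.Builder bound to the configured model.
    public QueryRewriteService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String rewrite(String question, String historySummary) {
        return chatClient.prompt()
                .system("Rewrite the user question into a self-contained, "
                      + "retrieval-friendly query. Preserve intent, resolve "
                      + "pronouns from the history, add no new facts.")
                .user("History: " + historySummary + "\nQuestion: " + question)
                .call()
                .content();
    }
}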
Prompt Builder
The system prompt forces the LLM to answer only from the provided references, include citations and never fabricate policies or numbers.
public String build(String userQuestion, String rewrittenQuery, String history, List<RetrievedChunk> chunks) { ... }
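A fuller sketch of such a builder in plain Java; the template wording is illustrative:

import java.util.List;

public class PromptBuilder {

    public record RetrievedChunk(String chunkId, String content) {}

    // Builds a grounded prompt: the model may only use the numbered references
    // and must cite them; refusal is preferred over fabrication.
    public String build(String userQuestion, String rewrittenQuery,
                        String history, List<RetrievedChunk> chunks) {
        StringBuilder refs = new StringBuilder();
        for (int i = 0; i < chunks.size(); i++) {
            refs.append("[").append(i + 1).append("] ")
                .append(chunks.get(i).content()).append("\n");
        }
        return """
               Answer ONLY from the numbered references below.
               Cite references like [1]. If the references do not contain
               the answer, say you don't know. Never invent policies or numbers.

               References:
               %s
               Conversation so far: %s
               Search query: %s
               Question: %s
               """.formatted(refs, history, rewrittenQuery, userQuestion);
    }
}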
Answer Orchestration Service
Combines cache lookup, query rewrite, hybrid retrieval, prompt construction, LLM generation, answer caching and query logging into a single workflow.
public RagAnswer answer(String tenantId, String knowledgeBaseId, String sessionId, String userId, String question, String docType) { ... }

REST Controllers (example)
@PostMapping("/documents")
public ResponseEntity<Map<String,Object>> upload(@RequestParam("file") MultipartFile file,
@RequestParam String tenantId,
@RequestParam String knowledgeBaseId,
@RequestParam String docType,
@RequestParam String operator) throws IOException {
UUID docId = ingestionService.submit(tenantId, knowledgeBaseId, docType, operator, file);
return ResponseEntity.accepted().body(Map.of("documentId", docId,
"message", "Document received, indexing asynchronously"));
}
@PostMapping("/chat")
public ResponseEntity<RagAnswer> chat(@RequestBody ChatRequest request) {
RagAnswer answer = ragAnswerService.answer(request.tenantId(), request.knowledgeBaseId(),
request.sessionId(), request.userId(), request.question(), request.docType());
return ResponseEntity.ok(answer);
}Real‑World Business Scenario: After‑Sales Knowledge Assistant
A user asks about a returned product. The query is rewritten, hybrid‑retrieved, re‑ranked and answered with citations, demonstrating the end‑to‑end flow.
High Concurrency & High Availability Design
Key bottlenecks: vector search latency, LLM inference time, embedding throughput and prompt size. Optimisation checklist:
Reduce topK and recallK values.
Introduce two‑level caching (Caffeine + Redis).
Limit context length.
Offload heavy tasks (document upload, OCR, embedding) to asynchronous workers.
Apply rate limiting, bulkheads and graceful degradation.
Caching Strategy
Cache keys should contain tenantId, knowledgeBaseId, docType, a hash of the rewritten query and the knowledge‑base version to avoid stale answers after updates.
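A sketch of such a key builder; the rag:answer prefix and the version parameter's source are assumptions:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class AnswerCacheKeys {

    private AnswerCacheKeys() {}

    // The KB version is part of the key, so re-indexing invalidates stale answers.
    public static String build(String tenantId, String knowledgeBaseId, String docType,
                               String rewrittenQuery, long kbVersion) {
        return "rag:answer:%s:%s:%s:v%d:%s".formatted(
                tenantId, knowledgeBaseId, docType, kbVersion, sha256(rewrittenQuery));
    }

    private static String sha256(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Light normalisation before hashing improves the hit rate.
            byte[] digest = md.digest(s.strip().toLowerCase().getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}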
Rate Limiting, Circuit Breaking, Isolation
Three protection layers:
API‑gateway rate limiting per tenant/user.
Separate thread pools for query handling and ingestion.
Resilience4j circuit breakers with timeouts and retries for external LLM and embedding services.
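With Resilience4j's Spring Boot annotations, the third layer might look like the following sketch. The "llm" instance must be configured in application.yml, and all names are illustrative:

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import org.springframework.stereotype.Service;
import java.util.concurrent.CompletableFuture;

@Service
public class GuardedLlmClient {

    // "llm" refers to a resilience4j instance configured in application.yml.
    // TimeLimiter requires an async return type, hence CompletableFuture.
    @CircuitBreaker(name = "llm", fallbackMethod = "fallback")
    @TimeLimiter(name = "llm")
    @Retry(name = "llm")
    public CompletableFuture<String> generate(String prompt) {
        return CompletableFuture.supplyAsync(() -> callModel(prompt));
    }

    private String callModel(String prompt) {
        // Delegate to the actual ChatModel call here.
        return "...";
    }

    // Degraded answer when the LLM is unavailable or too slow.
    private CompletableFuture<String> fallback(String prompt, Throwable t) {
        return CompletableFuture.completedFuture(
                "The assistant is temporarily unavailable; please try again shortly.");
    }
}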
Database Optimisations for PGVector
Separate vector index from business indexes.
Add high‑frequency filter columns (tenant_id, knowledge_base_id, enabled, doc_type).
Consider logical partitioning by tenant or knowledge base.
Batch inserts for chunk data.
Regular VACUUM / ANALYZE to keep statistics fresh.
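For example, using the hypothetical schema from earlier:

-- B-tree index on the hot filter columns, kept separate from the vector index
CREATE INDEX idx_chunk_filters ON kb_document_chunk
    (tenant_id, knowledge_base_id, doc_type)
    WHERE enabled;

-- Keep planner statistics fresh after bulk ingestion
VACUUM (ANALYZE) kb_document_chunk;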
Deployment Strategies
Local Docker‑Compose
version: "3.9"
services:
postgres:
image: pgvector/pgvector:pg16
environment:
POSTGRES_DB: rag
POSTGRES_USER: rag
POSTGRES_PASSWORD: rag
ports: ["5432:5432"]
volumes:
- pg_data:/var/lib/postgresql/data
redis:
image: redis:7
ports: ["6379:6379"]
minio:
image: minio/minio
command: server /data --console-address ":9001"
environment:
MINIO_ROOT_USER: minio
MINIO_ROOT_PASSWORD: minio123
ports: ["9000:9000", "9001:9001"]
volumes:
pg_data:Kubernetes Deployment
Separate deployments for query service (low latency) and ingestion worker (high throughput). Example deployment for the query service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-query-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-query-service
  template:
    metadata:
      labels:
        app: rag-query-service
    spec:
      containers:
        - name: app
          image: example/enterprise-rag:1.0.0
          ports:
            - containerPort: 8080
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: prod
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080

Horizontal Pod Autoscaling
Scale on CPU, memory, request rate (QPS) and average response time because LLM calls often saturate threads before CPU spikes.
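CPU and memory scaling work out of the box with autoscaling/v2; request-rate and latency signals require a custom-metrics adapter such as Prometheus Adapter. A sketch, where the pods metric name is an assumption that must match your adapter configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-query-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-query-service
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Served by a custom-metrics adapter (e.g. Prometheus Adapter).
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"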
Observability & Evaluation
Collect the following metrics with Micrometer/Prometheus:
QPS, latency percentiles (P50/P95/P99).
Vector‑search latency, LLM generation latency.
Cache hit/miss rates.
Token usage and error rate.
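A sketch of registering these meters with Micrometer (metric names are illustrative):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.function.Supplier;
import org.springframework.stereotype.Component;

@Component
public class RagMetrics {

    private final Timer retrievalTimer;
    private final Timer generationTimer;
    private final Counter cacheHits;
    private final Counter cacheMisses;

    public RagMetrics(MeterRegistry registry) {
        // publishPercentiles precomputes P50/P95/P99 client-side.
        this.retrievalTimer = Timer.builder("rag.retrieval.latency")
                .publishPercentiles(0.5, 0.95, 0.99).register(registry);
        this.generationTimer = Timer.builder("rag.llm.latency")
                .publishPercentiles(0.5, 0.95, 0.99).register(registry);
        this.cacheHits = registry.counter("rag.cache.requests", "result", "hit");
        this.cacheMisses = registry.counter("rag.cache.requests", "result", "miss");
    }

    public <T> T timeRetrieval(Supplier<T> work) { return retrievalTimer.record(work); }
    public <T> T timeGeneration(Supplier<T> work) { return generationTimer.record(work); }
    public void cacheHit() { cacheHits.increment(); }
    public void cacheMiss() { cacheMisses.increment(); }
}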
Log detailed request data: original question, rewritten query, retrieved chunks, prompt length, LLM latency, final answer, citations and a traceId. Build a small regression test set (high‑frequency, high‑risk, boundary, multi‑turn queries) and run it after any change (embedding model, chunking, top‑K, reranker, prompt).
Security, Permissions & Compliance
Enforce permission filtering at the retrieval stage using tenant, department, role and security‑level tags. Prevent prompt injection by sanitising documents and adding a system prompt that tells the model to ignore instruction‑like content. Mask or redact sensitive fields before sending data to external LLM services.
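Enforcing this at retrieval time can be as simple as extra WHERE clauses on the similarity query itself, so unauthorised chunks never reach the prompt (security_level is a hypothetical column from the schema sketch above):

SELECT id, content,
       embedding <=> :query_embedding AS distance   -- cosine distance in pgvector
FROM kb_document_chunk
WHERE tenant_id = :tenant_id
  AND knowledge_base_id = :kb_id
  AND enabled
  AND security_level <= :caller_clearance           -- hypothetical clearance check
ORDER BY embedding <=> :query_embedding
LIMIT 8;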
Troubleshooting Guide
Inaccurate Answers
Check query‑rewrite fidelity, retrieval relevance, chunk size, reranker effectiveness, prompt clarity and model temperature.
High Latency
Break down latency into rewrite, vector search, keyword search, rerank and LLM generation. Optimise prompt length, enable caching and verify that LLM request queuing is not the bottleneck.
Low Cache Hit Rate
Use coarser cache keys, include knowledge‑base version, and improve query normalisation.
Common Pitfalls (Top 10)
Relying only on vector search without a keyword fallback.
Chunking too coarsely – loss of relevance.
Chunking too finely – broken context.
Synchronous indexing during upload – timeouts.
Missing version isolation – mixed results during updates.
Not logging retrieval details – impossible to debug.
Applying permission filters after generation – security breach.
Prompt without citation constraints – hallucinations.
Chasing larger models instead of improving retrieval.
Operating without a systematic evaluation set.
Conclusion
Enterprise‑grade RAG is more than a chat API. It requires robust document ingestion, intelligent chunking, hybrid and filtered retrieval, disciplined prompt engineering, multi‑level caching, rate limiting, observability and security. When these engineering foundations are solid, Spring AI + PGVector can scale from a proof‑of‑concept to a cloud‑native production system that delivers reliable, traceable and cost‑effective AI‑powered knowledge answers.
Ray's Galactic Tech
Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!