From Demo to Production: Building an Enterprise‑Grade RAG System with Spring AI & PGVector
This comprehensive guide explains how to design, implement, and operate a production‑ready Retrieval‑Augmented Generation (RAG) platform using Spring AI and PostgreSQL PGVector, covering architecture, indexing, hybrid retrieval, prompt engineering, scaling, security, observability, deployment, and common pitfalls for enterprise knowledge‑base applications.
Why Enterprises Need a Full RAG System
Connecting a large language model (LLM) to a chat UI is only the first step. In real business scenarios three problems appear:
Private knowledge (policies, SOPs, product docs, logs) is not stored in the model.
LLMs hallucinate, especially in high‑risk domains.
Enterprises require traceability, governance, observability and strict performance guarantees.
Enterprise RAG Goals
Accuracy: minimise hallucinations and make every answer traceable.
Latency: keep retrieval and generation delay predictable.
Throughput: support high‑concurrency queries and bulk ingestion.
Cost: control embedding, inference, storage and cache expenses.
Scalability: grow with document volume, user count and tenant count.
Governance: permissions, audit, canary releases, evaluation and replay.
Operability: metrics, logs, tracing and alerts.
Overall Architecture
The system is built as a layered, closed‑loop pipeline, from the indexing path down to storage:

Offline/async indexing pipeline
    Document parsing, cleaning, chunking, embedding, metadata storage
Application layer (Spring Boot + Spring AI)
    Authentication & authorization
    Rate limiting & circuit breaking
    RAG orchestrator
        Query rewrite service
        Hybrid retrieval (vector + BM25)
        Reranker (cross‑encoder)
        Prompt builder
        LLM generation
    Conversation memory
    Cache (local Caffeine + Redis)
    Observability (Micrometer, Prometheus)
Data layer
    PostgreSQL + PGVector (vector store)
    Full‑text search (BM25)
    Object storage (MinIO / OSS)
Why Spring AI + PGVector
Spring AI integrates AI capabilities into the Spring ecosystem, offering unified access to ChatModel, EmbeddingModel and VectorStore, native Spring Boot configuration, lifecycle management, monitoring and transaction support. This makes the stack easy for Java teams.
PGVector extends PostgreSQL with a vector column, allowing documents, permissions, versions and embeddings to live in a single relational store. It provides ACID guarantees, powerful SQL joins for metadata filtering and lower operational complexity compared with dedicated vector databases.
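For orientation, a minimal application.yml wiring both pieces together might look like the sketch below. It assumes the Spring AI PGVector starter; exact property names vary between Spring AI versions, so treat this as a starting point rather than a definitive configuration.

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/rag
    username: rag
    password: rag
  ai:
    vectorstore:
      pgvector:
        initialize-schema: true   # create the extension/table/index on startup
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536          # must match the embedding model's output size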
Core Technical Principles
Embedding
Embedding maps text to dense vectors in a semantic space. Quality depends on model language support, chunking strategy, noise removal and query rewriting.
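With Spring AI the embedding step itself reduces to a single call. A minimal sketch, assuming an auto‑configured EmbeddingModel bean (in Spring AI 1.x, embed(String) returns a float[]):

import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;

@Service
public class EmbeddingService {

    private final EmbeddingModel embeddingModel;

    public EmbeddingService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    // Maps a cleaned chunk of text to a dense vector in the semantic space.
    public float[] embedChunk(String chunkText) {
        return embeddingModel.embed(chunkText);
    }
}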
Chunking
Long enterprise documents must be split into manageable chunks. Three common strategies:
Fixed‑length chunks (simple but may break semantics).
Recursive chunking based on headings, paragraphs and sentences (default for most cases).
Semantic chunking using embedding similarity (best accuracy, higher cost).
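A minimal sketch of the recursive strategy in plain Java, splitting on headings, then paragraphs, then sentences until each piece fits a size budget (characters are used here as a rough token proxy; a production splitter would also merge undersized pieces and add overlap):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RecursiveChunker {

    // Separators tried in order: markdown-style headings, paragraphs, sentences.
    private static final String[] SEPARATORS = {"\n## ", "\n\n", ". "};

    public List<String> chunk(String text, int maxChars) {
        List<String> out = new ArrayList<>();
        split(text, 0, maxChars, out);
        return out;
    }

    private void split(String text, int level, int maxChars, List<String> out) {
        // Emit the piece once it fits, or once no finer separator is left.
        if (text.length() <= maxChars || level >= SEPARATORS.length) {
            if (!text.isBlank()) out.add(text.strip());
            return;
        }
        for (String part : text.split(Pattern.quote(SEPARATORS[level]))) {
            split(part, level + 1, maxChars, out);
        }
    }
}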
Typical recommendations:
FAQ – small chunks (≈200 tokens).
Policy documents – medium chunks (≈500 tokens).
Technical manuals – hierarchical chunking.
Log files – event‑based chunks.
PGVector Index Types
IVFFlat: fast approximate search for very large collections; requires training on representative data and careful parameter tuning.
HNSW: high recall and stable performance at the cost of higher memory usage; the default choice for most enterprise RAG workloads.
Guideline: use HNSW for roughly 100k–5M chunks; consider IVFFlat only for extremely large, frequently updated datasets.
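In pgvector SQL the two index types look like this (table and column names are placeholders matching the schema sketched in the next section):

-- HNSW: good recall/latency trade-off for most enterprise workloads
CREATE INDEX idx_chunk_embedding_hnsw ON kb_document_chunk
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Per-session recall/latency knob for HNSW queries
SET hnsw.ef_search = 100;

-- IVFFlat alternative for extremely large, frequently rebuilt collections;
-- build it only after the table contains representative data
CREATE INDEX idx_chunk_embedding_ivf ON kb_document_chunk
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);
SET ivfflat.probes = 10;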
Data Model Design
Separate tables for raw documents, document chunks, indexing jobs, conversation sessions and evaluation feedback. This separation simplifies version switching, partial re‑indexing and A/B experiments.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE kb_document ( ... );
CREATE TABLE kb_document_chunk ( ... );
CREATE TABLE rag_query_log ( ... );
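The column lists are elided above and will depend on your domain. One plausible shape for the chunk table, with hypothetical columns kept consistent with the filters used later in this article:

CREATE TABLE kb_document_chunk (
    id                UUID PRIMARY KEY,
    document_id       UUID NOT NULL,          -- FK to kb_document
    tenant_id         TEXT NOT NULL,
    knowledge_base_id TEXT NOT NULL,
    doc_type          TEXT NOT NULL,
    chunk_index       INT  NOT NULL,
    content           TEXT NOT NULL,
    embedding         vector(1536),           -- dimension must match the embedding model
    enabled           BOOLEAN NOT NULL DEFAULT true,
    version           INT NOT NULL,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);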
Production‑Ready Retrieval Pipeline
Hybrid Retrieval Service
The service first performs a vector similarity search, optionally a BM25 keyword search, then fuses the results with Reciprocal Rank Fusion (RRF). The top‑K results are optionally re‑ranked.
public List<RetrievedChunk> retrieve(String tenantId, String knowledgeBaseId, String query, String docType) { ... }
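A sketch of how such a service might be wired. The Retriever abstraction and all names here are hypothetical, standing in for a PGVector similarity query and a BM25 full‑text query:

import java.util.List;

public class HybridRetrievalService {

    // Hypothetical retriever abstraction; back it with a PGVector similarity
    // query and a BM25 full-text query respectively.
    public interface Retriever {
        List<RetrievedChunk> search(String tenantId, String kbId,
                                    String query, String docType, int limit);
    }

    public record RetrievedChunk(String chunkId, String content) {}

    private static final int RECALL_K = 50; // candidates recalled per retriever
    private static final int TOP_K = 8;     // contexts handed to the LLM

    private final Retriever vectorRetriever;
    private final Retriever keywordRetriever;

    public HybridRetrievalService(Retriever vectorRetriever, Retriever keywordRetriever) {
        this.vectorRetriever = vectorRetriever;
        this.keywordRetriever = keywordRetriever;
    }

    public List<RetrievedChunk> retrieve(String tenantId, String knowledgeBaseId,
                                         String query, String docType) {
        var vec = vectorRetriever.search(tenantId, knowledgeBaseId, query, docType, RECALL_K);
        var kw  = keywordRetriever.search(tenantId, knowledgeBaseId, query, docType, RECALL_K);
        // Fuse the two ranked lists with RRF (implementation shown in the next
        // section), then keep only the top-K candidates for reranking/prompting.
        List<RetrievedChunk> fused = ReciprocalRankFusion.fuse(List.of(vec, kw), 60);
        return fused.subList(0, Math.min(TOP_K, fused.size()));
    }
}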
Reciprocal Rank Fusion (RRF)
RRF scores each document by summing 1 / (k + rank) across all result lists, allowing heterogeneous scores to be combined without normalisation.
score(d) = Σ 1 / (k + rank_i(d))
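The formula translates directly into Java. A minimal generic sketch (k is commonly set to 60; elements from different lists are matched by equals/hashCode, which record-based chunk types provide):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class ReciprocalRankFusion {

    private ReciprocalRankFusion() {}

    // score(d) = sum of 1 / (k + rank_i(d)) over every result list containing d.
    // Ranks are 1-based; k damps the dominance of top-ranked items.
    public static <T> List<T> fuse(List<List<T>> rankedLists, int k) {
        Map<T, Double> scores = new LinkedHashMap<>();
        for (List<T> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                scores.merge(list.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<T, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }
}

Because RRF consumes only ranks, BM25 scores and cosine similarities never need to be put on a common scale.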
Query Rewrite Service
A lightweight LLM prompt rewrites the user question into a retrieval‑friendly statement while preserving intent and adding missing context.
public String rewrite(String question, String historySummary) { ... }
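One possible implementation with Spring AI's fluent ChatClient (1.x API; the prompt wording is illustrative, not prescriptive):

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class QueryRewriteService {

    private final ChatClient chatClient;

    // Spring AI auto-configures a ChatClient.Builder bound to the configured model.
    public QueryRewriteService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String rewrite(String question, String historySummary) {
        return chatClient.prompt()
                .system("Rewrite the user question into a self-contained, "
                      + "retrieval-friendly query. Preserve intent, resolve "
                      + "pronouns from the history, add no new facts.")
                .user("History: " + historySummary + "\nQuestion: " + question)
                .call()
                .content();
    }
}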
Prompt Builder
The system prompt forces the LLM to answer only from the provided references, include citations and never fabricate policies or numbers.
public String build(String userQuestion, String rewrittenQuery, String history, List<RetrievedChunk> chunks) { ... }
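A fuller sketch of such a builder in plain Java; the template wording is illustrative:

import java.util.List;

public class PromptBuilder {

    public record RetrievedChunk(String chunkId, String content) {}

    // Builds a grounded prompt: the model may only use the numbered references
    // and must cite them; refusal is preferred over fabrication.
    public String build(String userQuestion, String rewrittenQuery,
                        String history, List<RetrievedChunk> chunks) {
        StringBuilder refs = new StringBuilder();
        for (int i = 0; i < chunks.size(); i++) {
            refs.append("[").append(i + 1).append("] ")
                .append(chunks.get(i).content()).append("\n");
        }
        return """
               Answer ONLY from the numbered references below.
               Cite references like [1]. If the references do not contain
               the answer, say you don't know. Never invent policies or numbers.

               References:
               %s
               Conversation so far: %s
               Search query: %s
               Question: %s
               """.formatted(refs, history, rewrittenQuery, userQuestion);
    }
}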
Answer Orchestration Service
Combines cache lookup, query rewrite, hybrid retrieval, prompt construction, LLM generation, answer caching and query logging into a single workflow.
public RagAnswer answer(String tenantId, String knowledgeBaseId, String sessionId, String userId, String question, String docType) { ... }

REST Controllers (example)
@PostMapping("/documents")
public ResponseEntity<Map<String,Object>> upload(@RequestParam("file") MultipartFile file,
@RequestParam String tenantId,
@RequestParam String knowledgeBaseId,
@RequestParam String docType,
@RequestParam String operator) throws IOException {
UUID docId = ingestionService.submit(tenantId, knowledgeBaseId, docType, operator, file);
return ResponseEntity.accepted().body(Map.of("documentId", docId,
"message", "Document received, indexing asynchronously"));
}
@PostMapping("/chat")
public ResponseEntity<RagAnswer> chat(@RequestBody ChatRequest request) {
RagAnswer answer = ragAnswerService.answer(request.tenantId(), request.knowledgeBaseId(),
request.sessionId(), request.userId(), request.question(), request.docType());
return ResponseEntity.ok(answer);
}Real‑World Business Scenario: After‑Sales Knowledge Assistant
A user asks about a returned product. The query is rewritten, hybrid‑retrieved, re‑ranked and answered with citations, demonstrating the end‑to‑end flow.
High Concurrency & High Availability Design
Key bottlenecks: vector search latency, LLM inference time, embedding throughput and prompt size. Optimisation checklist:
Reduce topK and recallK values.
Introduce two‑level caching (Caffeine + Redis).
Limit context length.
Offload heavy tasks (document upload, OCR, embedding) to asynchronous workers.
Apply rate limiting, bulkheads and graceful degradation.
Caching Strategy
Cache keys should contain tenantId, knowledgeBaseId, docType, a hash of the rewritten query and the knowledge‑base version to avoid stale answers after updates.
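A sketch of such a key builder; the rag:answer prefix and the version parameter's source are assumptions:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class AnswerCacheKeys {

    private AnswerCacheKeys() {}

    // The KB version is part of the key, so re-indexing invalidates stale answers.
    public static String build(String tenantId, String knowledgeBaseId, String docType,
                               String rewrittenQuery, long kbVersion) {
        return "rag:answer:%s:%s:%s:v%d:%s".formatted(
                tenantId, knowledgeBaseId, docType, kbVersion, sha256(rewrittenQuery));
    }

    private static String sha256(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Light normalisation before hashing improves the hit rate.
            byte[] digest = md.digest(s.strip().toLowerCase().getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}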
Rate Limiting, Circuit Breaking, Isolation
Three protection layers:
API‑gateway rate limiting per tenant/user.
Separate thread pools for query handling and ingestion.
Resilience4j circuit breakers with timeouts and retries for external LLM and embedding services.
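With Resilience4j's Spring Boot annotations, the third layer might look like the following sketch. The "llm" instance must be configured in application.yml, and all names are illustrative:

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import org.springframework.stereotype.Service;
import java.util.concurrent.CompletableFuture;

@Service
public class GuardedLlmClient {

    // "llm" refers to a resilience4j instance configured in application.yml.
    // TimeLimiter requires an async return type, hence CompletableFuture.
    @CircuitBreaker(name = "llm", fallbackMethod = "fallback")
    @TimeLimiter(name = "llm")
    @Retry(name = "llm")
    public CompletableFuture<String> generate(String prompt) {
        return CompletableFuture.supplyAsync(() -> callModel(prompt));
    }

    private String callModel(String prompt) {
        // Delegate to the actual ChatModel call here.
        return "...";
    }

    // Degraded answer when the LLM is unavailable or too slow.
    private CompletableFuture<String> fallback(String prompt, Throwable t) {
        return CompletableFuture.completedFuture(
                "The assistant is temporarily unavailable; please try again shortly.");
    }
}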
Database Optimisations for PGVector
Separate vector index from business indexes.
Add high‑frequency filter columns (tenant_id, knowledge_base_id, enabled, doc_type).
Consider logical partitioning by tenant or knowledge base.
Batch inserts for chunk data.
Regular VACUUM / ANALYZE to keep statistics fresh.
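For example, using the hypothetical schema from earlier:

-- B-tree index on the hot filter columns, kept separate from the vector index
CREATE INDEX idx_chunk_filters ON kb_document_chunk
    (tenant_id, knowledge_base_id, doc_type)
    WHERE enabled;

-- Keep planner statistics fresh after bulk ingestion
VACUUM (ANALYZE) kb_document_chunk;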
Deployment Strategies
Local Docker‑Compose
version: "3.9"
services:
postgres:
image: pgvector/pgvector:pg16
environment:
POSTGRES_DB: rag
POSTGRES_USER: rag
POSTGRES_PASSWORD: rag
ports: ["5432:5432"]
volumes:
- pg_data:/var/lib/postgresql/data
redis:
image: redis:7
ports: ["6379:6379"]
minio:
image: minio/minio
command: server /data --console-address ":9001"
environment:
MINIO_ROOT_USER: minio
MINIO_ROOT_PASSWORD: minio123
ports: ["9000:9000", "9001:9001"]
volumes:
pg_data:Kubernetes Deployment
Separate deployments for query service (low latency) and ingestion worker (high throughput). Example deployment for the query service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-query-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-query-service
  template:
    metadata:
      labels:
        app: rag-query-service
    spec:
      containers:
        - name: app
          image: example/enterprise-rag:1.0.0
          ports:
            - containerPort: 8080
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: prod
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080

Horizontal Pod Autoscaling
Scale on CPU, memory, request rate (QPS) and average response time because LLM calls often saturate threads before CPU spikes.
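CPU and memory scaling work out of the box with autoscaling/v2; request-rate and latency signals require a custom-metrics adapter such as Prometheus Adapter. A sketch, where the pods metric name is an assumption that must match your adapter configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-query-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-query-service
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Served by a custom-metrics adapter (e.g. Prometheus Adapter).
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"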
Observability & Evaluation
Collect the following metrics with Micrometer/Prometheus:
QPS, latency percentiles (P50/P95/P99).
Vector‑search latency, LLM generation latency.
Cache hit/miss rates.
Token usage and error rate.
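A sketch of registering these meters with Micrometer (metric names are illustrative):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.function.Supplier;
import org.springframework.stereotype.Component;

@Component
public class RagMetrics {

    private final Timer retrievalTimer;
    private final Timer generationTimer;
    private final Counter cacheHits;
    private final Counter cacheMisses;

    public RagMetrics(MeterRegistry registry) {
        // publishPercentiles precomputes P50/P95/P99 client-side.
        this.retrievalTimer = Timer.builder("rag.retrieval.latency")
                .publishPercentiles(0.5, 0.95, 0.99).register(registry);
        this.generationTimer = Timer.builder("rag.llm.latency")
                .publishPercentiles(0.5, 0.95, 0.99).register(registry);
        this.cacheHits = registry.counter("rag.cache.requests", "result", "hit");
        this.cacheMisses = registry.counter("rag.cache.requests", "result", "miss");
    }

    public <T> T timeRetrieval(Supplier<T> work) { return retrievalTimer.record(work); }
    public <T> T timeGeneration(Supplier<T> work) { return generationTimer.record(work); }
    public void cacheHit() { cacheHits.increment(); }
    public void cacheMiss() { cacheMisses.increment(); }
}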
Log detailed request data: original question, rewritten query, retrieved chunks, prompt length, LLM latency, final answer, citations and a traceId. Build a small regression test set (high‑frequency, high‑risk, boundary, multi‑turn queries) and run it after any change (embedding model, chunking, top‑K, reranker, prompt).
Security, Permissions & Compliance
Enforce permission filtering at the retrieval stage using tenant, department, role and security‑level tags. Prevent prompt injection by sanitising documents and adding a system prompt that tells the model to ignore instruction‑like content. Mask or redact sensitive fields before sending data to external LLM services.
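Enforcing this at retrieval time can be as simple as extra WHERE clauses on the similarity query itself, so unauthorised chunks never reach the prompt (security_level is a hypothetical column from the schema sketch above):

SELECT id, content,
       embedding <=> :query_embedding AS distance   -- cosine distance in pgvector
FROM kb_document_chunk
WHERE tenant_id = :tenant_id
  AND knowledge_base_id = :kb_id
  AND enabled
  AND security_level <= :caller_clearance           -- hypothetical clearance check
ORDER BY embedding <=> :query_embedding
LIMIT 8;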
Troubleshooting Guide
Inaccurate Answers
Check query‑rewrite fidelity, retrieval relevance, chunk size, reranker effectiveness, prompt clarity and model temperature.
High Latency
Break down latency into rewrite, vector search, keyword search, rerank and LLM generation. Optimise prompt length, enable caching and verify that LLM request queuing is not the bottleneck.
Low Cache Hit Rate
Use coarser cache keys, include knowledge‑base version, and improve query normalisation.
Common Pitfalls (Top 10)
Relying only on vector search without a keyword fallback.
Chunking too coarsely – loss of relevance.
Chunking too finely – broken context.
Synchronous indexing during upload – timeouts.
Missing version isolation – mixed results during updates.
Not logging retrieval details – impossible to debug.
Applying permission filters after generation – security breach.
Prompt without citation constraints – hallucinations.
Chasing larger models instead of improving retrieval.
Operating without a systematic evaluation set.
Conclusion
Enterprise‑grade RAG is more than a chat API. It requires robust document ingestion, intelligent chunking, hybrid and filtered retrieval, disciplined prompt engineering, multi‑level caching, rate limiting, observability and security. When these engineering foundations are solid, Spring AI + PGVector can scale from a proof‑of‑concept to a cloud‑native production system that delivers reliable, traceable and cost‑effective AI‑powered knowledge answers.
Ray's Galactic Tech
Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!