Designing a Scalable RAG Storage Architecture: Lessons from a Real‑World Project
The article explains why RAG storage must be layered, describes the four data types involved, presents a typical three‑layer architecture with vector, content, and metadata stores, and shows how the design evolves with scale, multi‑level indexing, update handling, and tenant isolation.
1. Problem Analysis
In Retrieval‑Augmented Generation (RAG) systems the storage layer is often underestimated: many treat it as a simple "store" step sitting somewhere between document chunking, vector embedding, retrieval, and LLM prompting. In production‑grade RAG, storage design goes far beyond choosing a vector database. An interviewer asking about it expects a diagram of the storage components, along with their responsibilities, the data flow between them, and the scaling considerations.
1.1 One Storage Cannot Satisfy All Requirements
RAG must store four distinct data categories, each with different access patterns:
Vector data : high‑dimensional embeddings for ANN search; writes are batch‑oriented, reads require millisecond‑level latency and specialized indexes (HNSW, IVF).
Original text : the chunk content that will be placed into LLM prompts; storing large texts or multiple formats in the vector payload harms index efficiency.
Structured metadata : source, author, timestamps, tags, etc.; accessed via exact‑match queries and range filters, which is where a relational DB excels.
Document‑level management information : file paths, parent‑child relationships, parsing status, versioning; essential for operations but unrelated to retrieval.
Attempting to satisfy all needs with a single component inevitably leads to poor performance for at least one requirement.
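To make the four categories concrete, here is a minimal sketch of them as separate record types. The class and field names are illustrative assumptions, not a schema from the project; the point is only that each category carries different fields and is joined to the others through chunk_id and document_id.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class VectorRecord:
    """Vector data: what the ANN index needs, and nothing else."""
    chunk_id: str
    embedding: list[float]


@dataclass
class ChunkContent:
    """Original text: the content that ends up in the LLM prompt."""
    chunk_id: str
    document_id: str
    chunk_text: str


@dataclass
class ChunkMetadata:
    """Structured metadata: queried by exact match and range filters."""
    chunk_id: str
    source: str
    author: str
    created_at: datetime
    tags: list[str] = field(default_factory=list)


@dataclass
class DocumentRecord:
    """Document-level management information: used by operations, not retrieval."""
    document_id: str
    file_path: str
    parsing_status: str
    version: int = 1
    parent_document_id: Optional[str] = None


# Hypothetical sample rows: one chunk of one document, split across the categories.
doc = DocumentRecord("doc-1", "/kb/handbook.pdf", "parsed")
content = ChunkContent("c1", doc.document_id, "Refunds are processed within 14 days.")
vector = VectorRecord(content.chunk_id, [0.12, 0.88, 0.05])
meta = ChunkMetadata(content.chunk_id, "handbook", "ops-team", datetime(2024, 1, 15))
```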
1.2 Typical Three‑Layer Storage Architecture
The most common production design separates responsibilities into three layers linked by chunk_id and document_id:
Vector Retrieval Layer : dedicated vector DB (Milvus, Qdrant, Weaviate) handling ANN search. Milvus offers distributed deployment, sharding, and index choices (HNSW for high recall, IVF_PQ for memory‑constrained scenarios). For smaller scales, pgvector in PostgreSQL can be used.
Content Storage Layer : stores raw chunk text and parsed structures. Often a relational table (PostgreSQL/MySQL) with columns chunk_id, chunk_text, document_id. Some teams embed the text in the vector DB payload, but this limits full‑text search. Elasticsearch is another option, providing both storage and BM25 keyword search.
Management & Metadata Layer : relational tables for documents, chunk metadata, and document‑chunk mappings. These tables enable operations such as deleting all chunks of an expired document or tracing which document a particular answer references.
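The linkage between the layers can be sketched as follows. This is a simplified illustration, not the project's actual schema: SQLite stands in for PostgreSQL, a plain dict stands in for the vector DB, and the table, column, and function names are assumptions. It shows the two operations mentioned above: tracing a chunk back to its document and removing every chunk of an expired document.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    document_id TEXT PRIMARY KEY,
    file_path   TEXT NOT NULL
);
CREATE TABLE chunks (
    chunk_id    TEXT PRIMARY KEY,
    document_id TEXT NOT NULL REFERENCES documents(document_id),
    chunk_text  TEXT NOT NULL
);
CREATE INDEX idx_chunks_document ON chunks(document_id);
""")

# Stand-in for the vector layer: it holds only chunk_id -> embedding.
vector_index: dict[str, list[float]] = {}


def add_document(document_id: str, file_path: str, chunks: list[tuple]) -> None:
    """chunks is a list of (chunk_id, chunk_text, embedding) tuples."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO documents VALUES (?, ?)", (document_id, file_path))
        conn.executemany(
            "INSERT INTO chunks VALUES (?, ?, ?)",
            [(cid, document_id, text) for cid, text, _ in chunks],
        )
    for cid, _, embedding in chunks:
        vector_index[cid] = embedding


def source_of_chunk(chunk_id: str) -> Optional[str]:
    """Trace a retrieved chunk back to the file it came from."""
    row = conn.execute(
        "SELECT d.file_path FROM chunks c "
        "JOIN documents d ON d.document_id = c.document_id "
        "WHERE c.chunk_id = ?",
        (chunk_id,),
    ).fetchone()
    return row[0] if row else None


def delete_document(document_id: str) -> int:
    """Remove a document from every layer; returns the number of chunks removed."""
    rows = conn.execute(
        "SELECT chunk_id FROM chunks WHERE document_id = ?", (document_id,)
    ).fetchall()
    for (cid,) in rows:
        vector_index.pop(cid, None)
    with conn:
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (document_id,))
        conn.execute("DELETE FROM documents WHERE document_id = ?", (document_id,))
    return len(rows)


add_document("doc-1", "/kb/handbook.pdf",
             [("c1", "Refund policy text", [0.1, 0.2]),
              ("c2", "Shipping policy text", [0.3, 0.4])])
add_document("doc-2", "/kb/faq.md", [("c3", "FAQ text", [0.5, 0.6])])
```

Because the vector layer only knows chunk_id, the document-to-chunk mapping in the relational tables is what makes a document-level delete possible at all.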
1.3 Architecture Evolution with Scale
Prototype (< 10 k chunks) : a single PostgreSQL database with pgvector stores vectors, text, and metadata in one table; a single SQL query handles filtering and ANN ranking.
Growth (10 k – 5 M chunks) : performance degrades; vectors are moved to a dedicated vector store (Milvus Lite or Qdrant) while text and metadata remain in PostgreSQL. Introducing Elasticsearch for keyword search becomes reasonable.
Scale‑out (> 5 M chunks) : adopt distributed vector stores with sharding and replication (Milvus). Partition data by tenant, knowledge base, or time to avoid full scans. Metadata layer adopts read/write separation and index tuning.
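As a rough illustration, the three stages can be written down as a lookup keyed by chunk count. The thresholds are the approximate ones quoted above, and the function and dictionary names are hypothetical; a real decision would also weigh query load, latency targets, and team experience.

```python
# Component choices per stage, as described in the list above.
STAGES = {
    "prototype": {
        "vectors": "PostgreSQL + pgvector",
        "text_and_metadata": "same PostgreSQL table",
    },
    "growth": {
        "vectors": "dedicated vector store (Milvus Lite or Qdrant)",
        "text_and_metadata": "PostgreSQL",
        "keyword_search": "Elasticsearch (optional)",
    },
    "scale-out": {
        "vectors": "distributed Milvus with sharding and replication",
        "text_and_metadata": "PostgreSQL with read/write separation",
        "partitioning": "by tenant, knowledge base, or time",
    },
}


def storage_stage(chunk_count: int) -> str:
    """Map a chunk count to the stage names used in this section."""
    if chunk_count < 0:
        raise ValueError("chunk_count must be non-negative")
    if chunk_count < 10_000:
        return "prototype"
    if chunk_count <= 5_000_000:
        return "growth"
    return "scale-out"


stage = storage_stage(250_000)
plan = STAGES[stage]
```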
1.4 Multi‑Level Index Implementation
The design uses two chunk granularities:
Child chunk (≈256 tokens): indexed in the vector layer for precise ANN matching.
Parent chunk (≈1024 tokens): stores full context without a vector index.
Each record in the chunk table includes parent_chunk_id and chunk_level. Retrieval returns the chunk_id values of the matching child chunks, then follows parent_chunk_id to fetch the parent text for LLM prompting. A three‑level extension adds a document summary chunk (chunk_level “summary”) that is also indexed in the vector DB, enabling a coarse‑to‑fine search.
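The child-to-parent hop can be sketched in a few lines. This is a minimal in-memory illustration: the Chunk class, the chunk_table dict, and expand_to_parents are hypothetical names, and the ANN search itself is assumed to have already returned the child hits in ranked order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    chunk_level: str                     # "child" or "parent"
    text: str
    parent_chunk_id: Optional[str] = None


def expand_to_parents(hit_ids: list[str], chunk_table: dict[str, Chunk]) -> list[str]:
    """Replace ranked child hits with their parent texts, keeping rank order.

    Several children often share one parent, so parents are de-duplicated
    to avoid putting the same context into the prompt twice.
    """
    seen: set[str] = set()
    parent_texts: list[str] = []
    for chunk_id in hit_ids:
        chunk = chunk_table[chunk_id]
        # A hit without a parent (already a parent chunk) is used as-is.
        parent_id = chunk.parent_chunk_id or chunk.chunk_id
        if parent_id not in seen:
            seen.add(parent_id)
            parent_texts.append(chunk_table[parent_id].text)
    return parent_texts


chunk_table = {
    "p1": Chunk("p1", "parent", "Full refund section ..."),
    "p2": Chunk("p2", "parent", "Full shipping section ..."),
    "c1": Chunk("c1", "child", "Refunds take 14 days.", parent_chunk_id="p1"),
    "c2": Chunk("c2", "child", "Refunds need a receipt.", parent_chunk_id="p1"),
    "c3": Chunk("c3", "child", "Shipping is free over $50.", parent_chunk_id="p2"),
}

# Pretend the vector layer returned these child chunk_ids, best match first.
context = expand_to_parents(["c1", "c2", "c3"], chunk_table)
```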
1.5 Index Updates
Document updates require replacing all chunks of the affected document: re‑chunking shifts chunk boundaries, so old and new chunks no longer line up and incremental chunk‑level diffs are impractical. The process:
Delete old chunks, vectors, and metadata for the document_id across all three layers.
Insert newly chunked data, vectors, and metadata.
Wrap the delete‑and‑insert sequence in a transaction (if using a single PostgreSQL instance), or drive it with a compensating state machine in multi‑database setups, so that the three layers end up consistent even when a step fails.
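For the single-database case, the steps above can be sketched as one transaction. SQLite stands in for a single PostgreSQL instance here, embeddings are stored as JSON text purely for the demo, and the table layout and function names are assumptions. The multi-database state machine is not shown.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (document_id TEXT PRIMARY KEY, version INTEGER NOT NULL);
CREATE TABLE chunks (
    chunk_id    TEXT PRIMARY KEY,
    document_id TEXT NOT NULL,
    chunk_text  TEXT NOT NULL,
    embedding   TEXT NOT NULL
);
INSERT INTO documents VALUES ('doc-1', 1);
INSERT INTO chunks VALUES ('doc-1-v1-0', 'doc-1', 'old text', '[0.1, 0.2]');
""")


def replace_document(conn: sqlite3.Connection, document_id: str, new_chunks: list) -> None:
    """Swap all chunks of one document atomically.

    new_chunks is a list of (chunk_id, chunk_text, embedding) tuples.
    `with conn` commits if the block succeeds and rolls back if it raises,
    so readers never see a half-replaced document.
    """
    with conn:
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (document_id,))
        conn.executemany(
            "INSERT INTO chunks VALUES (?, ?, ?, ?)",
            [(cid, document_id, text, json.dumps(emb)) for cid, text, emb in new_chunks],
        )
        conn.execute(
            "UPDATE documents SET version = version + 1 WHERE document_id = ?",
            (document_id,),
        )


def snapshot(conn: sqlite3.Connection, document_id: str) -> tuple:
    """Return (version, sorted chunk_ids) for inspection."""
    version = conn.execute(
        "SELECT version FROM documents WHERE document_id = ?", (document_id,)
    ).fetchone()[0]
    ids = [r[0] for r in conn.execute(
        "SELECT chunk_id FROM chunks WHERE document_id = ? ORDER BY chunk_id",
        (document_id,),
    )]
    return version, ids


replace_document(conn, "doc-1", [
    ("doc-1-v2-0", "new text, part 1", [0.3, 0.4]),
    ("doc-1-v2-1", "new text, part 2", [0.5, 0.6]),
])
```

If any insert fails, for example on a duplicate chunk_id, the delete is rolled back too and the previous version stays searchable.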
Version numbers and timestamps stored in the management tables enable time‑decay weighting during retrieval.
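One possible form of time-decay weighting is an exponential decay with a half-life, sketched below. The formula, the 30-day default, and the function name are assumptions for illustration; the source does not specify which weighting it uses.

```python
from datetime import datetime, timedelta, timezone


def time_decayed_score(similarity: float, updated_at: datetime, now: datetime,
                       half_life_days: float = 30.0) -> float:
    """Scale a similarity score by 0.5 ** (age / half_life).

    A chunk updated exactly one half-life ago keeps half of its score;
    timestamps in the future are treated as age zero.
    """
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = time_decayed_score(0.8, now, now)
month_old = time_decayed_score(0.8, now - timedelta(days=30), now)
```

With these inputs a fresh chunk keeps its score of 0.8, while a chunk last updated 30 days earlier drops to 0.4, so an older chunk needs a clearly better match to outrank a newer one.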
1.6 Permission Isolation
For multi‑tenant or multi‑business‑line deployments, isolation can be achieved:
Physical isolation : separate vector collections or database instances per tenant (high isolation, higher cost).
Logical isolation : a shared vector collection with a tenant_id field in each payload; every query must filter on tenant_id. Milvus lets a field such as tenant_id be designated as the Partition Key, combining the efficiency of logical isolation with the performance of physical partitioning.
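Logical isolation boils down to applying the tenant filter before ranking. The sketch below uses a brute-force cosine search over an in-memory list in place of a real vector DB; the record layout and function names are assumptions, and it does not show the Milvus Partition Key API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(records: list[dict], query: list[float], tenant_id: str,
           top_k: int = 3) -> list[str]:
    """Return the top_k chunk_ids for one tenant only.

    The tenant filter is mandatory and applied before ranking, so a closer
    vector owned by another tenant can never appear in the results.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required for every query")
    candidates = [r for r in records if r["tenant_id"] == tenant_id]
    ranked = sorted(candidates, key=lambda r: cosine(r["embedding"], query), reverse=True)
    return [r["chunk_id"] for r in ranked[:top_k]]


records = [
    {"chunk_id": "a1", "tenant_id": "tenant-a", "embedding": [1.0, 0.0]},
    {"chunk_id": "a2", "tenant_id": "tenant-a", "embedding": [0.6, 0.8]},
    {"chunk_id": "b1", "tenant_id": "tenant-b", "embedding": [0.0, 1.0]},
]

# b1 matches the query exactly, but it belongs to another tenant.
hits = search(records, [0.0, 1.0], tenant_id="tenant-a")
```

Making tenant_id a required argument, rather than an optional filter, is one way to keep a forgotten filter from turning into a cross-tenant leak.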
2. Reference Answer
Our project uses the three‑layer design described above: Milvus for vector retrieval (storing only vectors and chunk_id), PostgreSQL for raw chunk text, and Elasticsearch for BM25 keyword search. The management layer in PostgreSQL holds document and chunk metadata, enabling pre‑filtering and operational tasks. Multi‑level indexing employs 256‑token child chunks for ANN and 1024‑token parent chunks for context retrieval. Updates are performed at the document level with a transactional or state‑machine approach to keep the three layers consistent. Tenant isolation is realized via Milvus Partition Key on tenant_id.
