Mastering RAG: Classic Architecture, Challenges, and Evolution Explained

This article outlines the fundamental RAG workflow—from data indexing and querying to advanced modular designs—highlights key challenges such as retrieval accuracy, model robustness, context limits, and performance, and traces the evolution from naive to modular RAG systems.

1. Classic RAG Architecture and Process

Building on the basic RAG concepts, the minimal logical architecture consists of two main phases, data indexing and data querying, each containing several sub‑steps.

1.1 Data Indexing Phase

The core of RAG is retrieval, so the first step is to prepare searchable content. Traditional keyword search gives way to vector‑based semantic retrieval, where documents are split into chunks, embedded into high‑dimensional vectors, and stored in a vector database.

The indexing pipeline typically includes:

Loading: ingesting knowledge from various sources (structured, semi‑structured, unstructured, web, internal documents, Q&A pairs).

Splitting: breaking large documents into manageable chunks.

Embedding: converting each chunk into a high‑dimensional vector using models such as OpenAI’s text-embedding-3-small.

Indexing: persisting vectors in a vector store (or alternative indexes like knowledge graphs or keyword tables).
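The four indexing steps can be sketched end to end in plain Python. This is a minimal, dependency‑free illustration, not a production pipeline: the `embed` function is a toy hashing stand‑in for a real embedding model such as text-embedding-3-small, the splitter uses fixed‑size character chunks, and the "vector store" is just an in‑memory list.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: hash character
    # trigrams into a fixed-size vector, then L2-normalize it.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        gram = text[i:i + 3].lower()
        h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def split(document: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    # Simplest possible splitter: fixed-size character windows with overlap,
    # so sentences cut at a boundary still appear whole in a neighboring chunk.
    step = chunk_size - overlap
    return [document[i:i + chunk_size] for i in range(0, len(document), step)]

def build_index(documents: list[str]) -> list[tuple[list[float], str]]:
    # The "vector store" here is an in-memory list of (vector, chunk) pairs;
    # a real system would persist these in a vector database.
    index = []
    for doc in documents:
        for chunk in split(doc):
            index.append((embed(chunk), chunk))
    return index
```

Swapping any one stage (a sentence‑aware splitter, a real embedding model, a persistent store) leaves the overall Load → Split → Embed → Index shape unchanged.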

1.2 Data Query Phase

After indexing, the query phase consists of two core steps:

Retrieval: using the vector index to fetch the most relevant chunks (top‑K) based on similarity scores.

Generation: feeding the retrieved context and the user’s question to a large language model, guided by a carefully crafted prompt, to produce the final answer.
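The two query‑phase steps can be sketched as top‑K similarity search followed by prompt assembly. Again this is a self‑contained toy, assuming the index is a list of (normalized vector, chunk) pairs as in the indexing sketch; `embed` here is a word‑hashing placeholder for a real embedding model, and a real system would send the built prompt to an LLM rather than stop at the string.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy embedding: hash words into a fixed-size, L2-normalized vector.
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, index: list[tuple[list[float], str]], k: int = 3) -> list[str]:
    # On normalized vectors, cosine similarity reduces to a dot product.
    q = embed(query)
    scored = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[0])))
    return [chunk for _, chunk in scored[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Ground the model in the retrieved context; the wording of this
    # instruction is one of the main levers on answer quality.
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```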

Additional optional stages include pre‑retrieval processing (query rewriting, routing) and post‑retrieval processing (re‑ranking, filtering) to improve relevance and quality.
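As a rough sketch of post‑retrieval processing, the two helpers below filter out fragments too short to be useful and re‑rank candidates by keyword overlap with the query. The overlap heuristic is purely illustrative; production systems typically use a trained cross‑encoder re‑ranking model instead.

```python
def filter_chunks(chunks: list[str], min_len: int = 20) -> list[str]:
    # Post-retrieval filtering: drop fragments too short to carry context.
    return [c for c in chunks if len(c) >= min_len]

def rerank(query: str, chunks: list[str]) -> list[str]:
    # Post-retrieval re-ranking by keyword overlap with the query;
    # a stand-in for a cross-encoder re-ranking model.
    q_words = set(query.lower().split())
    return sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
```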

2. Challenges Facing RAG Applications

Despite RAG’s simplicity and effectiveness, real‑world deployments encounter several hurdles:

Retrieval accuracy: noisy or contradictory retrieved chunks can degrade generation quality.

Model robustness: the LLM must discern and ignore irrelevant or conflicting information.

Context window limits: token limits restrict how many chunks can be supplied to the model.
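One common mitigation for the context‑window limit is greedy packing: keep the highest‑ranked chunks that still fit a token budget. The sketch below uses whitespace word counts as a stand‑in for a real model tokenizer (such as tiktoken), so the budgets are approximate.

```python
def pack_context(chunks: list[str], max_tokens: int = 100) -> list[str]:
    # Greedily keep the highest-ranked chunks that fit the token budget.
    # Whitespace "tokens" approximate a real tokenizer's counts.
    packed, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            continue  # skip oversized chunks; lower-ranked ones may still fit
        packed.append(chunk)
        used += cost
    return packed
```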

RAG vs. fine‑tuning: deciding when to augment the model with retrieval versus when to fine‑tune it for a domain.

Response latency: additional retrieval and processing steps increase inference time, challenging latency‑sensitive applications.

3. Evolution of RAG Architectures

3.1 Naive RAG follows a linear pipeline: Index → Retrieve → Generate.

3.2 Advanced RAG adds pre‑ and post‑retrieval processing to the basic pipeline, improving relevance and quality.

3.3 Modular RAG breaks the workflow into interchangeable modules and algorithms, allowing flexible composition of custom pipelines that can incorporate diverse techniques such as query rewriting, hybrid indexing, or specialized re‑ranking models.
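The modular idea can be sketched as a pipeline of swappable stage functions that each transform a shared state. Everything here is illustrative: the module names are invented, and the retriever and generator are stubs where real vector search and an LLM call would plug in.

```python
from typing import Callable

# Each module is a function that transforms a shared state dict,
# so pipelines can be composed, reordered, or swapped freely.
Module = Callable[[dict], dict]

def rewrite_query(state: dict) -> dict:
    # Pre-retrieval: a trivial query rewrite (normalize whitespace and case).
    state["query"] = state["query"].strip().lower()
    return state

def retrieve_stub(state: dict) -> dict:
    # Placeholder retriever; swap in vector, keyword, or hybrid search.
    state["chunks"] = [c for c in state["corpus"] if state["query"] in c.lower()]
    return state

def generate_stub(state: dict) -> dict:
    # Placeholder generator; a real module would call an LLM here.
    state["answer"] = " / ".join(state["chunks"]) or "no context found"
    return state

def run_pipeline(modules: list[Module], state: dict) -> dict:
    for module in modules:
        state = module(state)
    return state
```

Because stages share only the state dict, adding a re‑ranking module or replacing the retriever is a one‑line change to the module list rather than a rewrite of the pipeline.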

Tags: RAG, semantic search, AI Architecture, vector indexing, retrieval-augmented-generation
Written by

Architect's Alchemy Furnace

A comprehensive platform that combines Java development and architecture design, guaranteeing 100% original content. We explore the essence and philosophy of architecture and provide professional technical articles for aspiring architects.
