How RAG Works: From Data Prep to LLM Generation Explained
This article breaks down Retrieval‑Augmented Generation (RAG) into its three core stages—data preparation, data retrieval, and LLM generation—showing how document chunking, embedding, vector databases, similarity search, and optional re‑ranking combine to let large language models produce more accurate, knowledge‑grounded answers.
What is Retrieval‑Augmented Generation (RAG)?
RAG combines a traditional information‑retrieval system with a large language model (LLM). The workflow proceeds through three sequential modules: data preparation, data retrieval, and LLM generation.
1. Data preparation
1.1 Text chunking
Uploaded documents (e.g., .txt, .docx, .json, .pdf, .md) are split into small blocks. Chunking can follow paragraph boundaries or a fixed character count. Benefits:
Avoids processing overhead for very large files.
Ensures each block fits within the LLM’s context window.
Enables finer‑grained retrieval compared with treating an entire document as a single item.
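The fixed-character-count strategy above can be sketched in a few lines. The chunk size and overlap values here are illustrative placeholders, not recommendations from any particular system; overlapping chunks is a common way to avoid cutting a sentence's context in half at a boundary.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap.

    chunk_size and overlap are hypothetical defaults; real systems
    tune them to the embedding model and document type.
    """
    chunks = []
    step = chunk_size - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

Paragraph-boundary chunking works the same way, except the split points come from `text.split("\n\n")` instead of a character counter.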
1.2 Converting chunks to embedding vectors
Each chunk is passed through an embedding model to obtain a dense vector. Advantages:
Efficient comparison: similarity between vectors can be computed cheaply with cosine similarity or other distance metrics.
Semantic richness: vectors capture meaning, allowing the system to recognize that “you are a good person” is semantically closer to “you are great” than to “you are bad,” even when word overlap differs.
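Cosine similarity itself is simple to compute. The three-dimensional vectors below are made up purely for illustration (a real embedding model outputs hundreds or thousands of dimensions), but they show the comparison the text describes: the "good person" and "great" vectors point in similar directions, while the "bad" vector does not.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not output from any real model:
vec_good_person = [0.9, 0.1, 0.3]
vec_great       = [0.85, 0.15, 0.35]
vec_bad         = [-0.7, 0.2, 0.1]
```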
1.3 Storing vectors in a vector database
The resulting vectors, together with the original text and its location within the source file, are stored in a vector database. The database can accept new vectors at any time, keeping the knowledge base up‑to‑date.
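A minimal in-memory stand-in shows what each record in such a database holds: the vector, the original chunk text, and its location in the source file. This is only a sketch of the data layout; production vector databases (Milvus, Qdrant, pgvector, etc.) add indexing, persistence, and filtering on top.

```python
class VectorStore:
    """Illustrative in-memory record store, not a real vector database."""

    def __init__(self):
        self.records = []

    def add(self, vector, text, source_file, offset):
        """Append one chunk: its embedding, raw text, and source location."""
        self.records.append({
            "vector": vector,
            "text": text,
            "source": source_file,
            "offset": offset,  # character position of the chunk in the file
        })

store = VectorStore()
store.add([0.1, 0.2, 0.3], "Product XXX weighs 1.2 kg.", "specs.md", offset=0)
```

Because `add` can be called at any time, new documents extend the knowledge base without retraining anything.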
2. Data retrieval
2.1 User query
Example query: “What are the specifications of product XXX?”
2.2 Query embedding
The query is encoded with the same embedding model used for the document chunks, producing a query vector.
2.3 Similarity search
Approximate nearest‑neighbor search retrieves the top K most similar chunks from the vector database. This operation is fast because it works on dense vectors rather than raw text.
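For small collections the search can be done exactly by brute force, as sketched below; real vector databases replace this linear scan with approximate nearest-neighbor indexes (such as HNSW graphs) to stay fast at scale. The records here follow the toy layout from earlier in this article.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vector, records, k=3):
    """Return the k records whose vectors are most similar to the query.

    Exact brute-force scan for illustration; ANN indexes trade a little
    accuracy for much lower latency on large collections.
    """
    ranked = sorted(records, key=lambda r: cosine(query_vector, r["vector"]),
                    reverse=True)
    return ranked[:k]

records = [
    {"vector": [1.0, 0.0], "text": "Product XXX: specifications table."},
    {"vector": [0.0, 1.0], "text": "Shipping and returns policy."},
    {"vector": [0.9, 0.1], "text": "Product XXX weight and dimensions."},
]
hits = top_k([1.0, 0.0], records, k=2)
```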
2.4 Optional re‑ranking (ReRank)
Some deployments (e.g., RagFlow) apply a cross‑encoder ReRank model to rescore the retrieved chunks and promote the most relevant ones. Other open‑source systems such as AnythingLLM may skip this step.
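Structurally, re-ranking is just rescoring and resorting the retrieved chunks with a more expensive query-aware model. The sketch below uses a toy word-overlap function as a stand-in for the cross-encoder; in a real deployment `score_fn` would call a trained re-ranking model that reads the query and chunk together.

```python
def rerank(query, chunks, score_fn):
    """Rescore retrieved chunks with score_fn(query, chunk) and sort descending.

    score_fn is a placeholder for a cross-encoder model's relevance score.
    """
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)

def overlap_score(query, chunk):
    """Toy stand-in: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

reranked = rerank(
    "specifications of product XXX",
    ["Shipping and returns policy.", "the specifications of product xxx"],
    overlap_score,
)
```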
3. LLM generation
The selected chunks are inserted into a prompt template (customizable by the user). The completed prompt is sent to the LLM, which generates the final answer.
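The final assembly step can be sketched as simple string templating. The template wording below is an invented example, not the prompt used by any specific product; the point is that retrieved chunks land in the `{context}` slot and the user's question in `{question}` before the whole thing is sent to the LLM.

```python
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say you do not know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(retrieved_chunks, question):
    """Join the retrieved chunk texts and fill the user-customizable template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    ["Product XXX weighs 1.2 kg and measures 30 x 20 x 5 cm."],
    "What are the specifications of product XXX?",
)
```

The resulting string is what actually reaches the model, which is why editing the template changes answer style and grounding behavior without touching retrieval.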
Fun with Large Models
A master's graduate of Beijing Institute of Technology with four papers in top journals, I previously worked as a developer at ByteDance and Alibaba and now research large models at a major state-owned enterprise. I'm committed to sharing concise, practical experience in large-model development, in the belief that large AI models will become as essential as the PC. Let's start experimenting now!
