How RAG Works: From Data Prep to LLM Generation Explained
This article breaks down Retrieval‑Augmented Generation (RAG) into its three core stages—data preparation, data retrieval, and LLM generation—showing how document chunking, embedding, vector databases, similarity search, and optional re‑ranking combine to let large language models produce more accurate, knowledge‑grounded answers.
What is Retrieval‑Augmented Generation (RAG)?
RAG combines a traditional information‑retrieval system with a large language model (LLM). The workflow proceeds through three sequential modules: data preparation, data retrieval, and LLM generation.
1. Data preparation
1.1 Text chunking
Uploaded documents (e.g., .txt, .docx, .json, .pdf, .md) are split into small blocks. Chunking can follow paragraph boundaries or a fixed character count. Benefits:
Avoids processing overhead for very large files.
Ensures each block fits within the LLM’s context window.
Enables finer‑grained retrieval compared with treating an entire document as a single item.
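The fixed-character-count strategy above can be sketched in a few lines. The chunk size and overlap values here are illustrative placeholders, not recommendations from any particular system; overlapping chunks is a common way to avoid cutting a sentence's context in half at a boundary.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap.

    chunk_size and overlap are hypothetical defaults; real systems
    tune them to the embedding model and document type.
    """
    chunks = []
    step = chunk_size - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

Paragraph-boundary chunking works the same way, except the split points come from `text.split("\n\n")` instead of a character counter.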
1.2 Converting chunks to embedding vectors
Each chunk is passed through an embedding model to obtain a dense vector. Advantages:
Efficient comparison: similarity between vectors can be computed cheaply with cosine similarity or other distance metrics.
Semantic richness: vectors capture meaning, allowing the system to recognize that “you are a good person” is semantically closer to “you are great” than to “you are bad,” even when word overlap differs.
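Cosine similarity itself is simple to compute. The three-dimensional vectors below are made up purely for illustration (a real embedding model outputs hundreds or thousands of dimensions), but they show the comparison the text describes: the "good person" and "great" vectors point in similar directions, while the "bad" vector does not.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not output from any real model:
vec_good_person = [0.9, 0.1, 0.3]
vec_great       = [0.85, 0.15, 0.35]
vec_bad         = [-0.7, 0.2, 0.1]
```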
1.3 Storing vectors in a vector database
The resulting vectors, together with the original text and its location within the source file, are stored in a vector database. The database can accept new vectors at any time, keeping the knowledge base up‑to‑date.
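A minimal in-memory stand-in shows what each record in such a database holds: the vector, the original chunk text, and its location in the source file. This is only a sketch of the data layout; production vector databases (Milvus, Qdrant, pgvector, etc.) add indexing, persistence, and filtering on top.

```python
class VectorStore:
    """Illustrative in-memory record store, not a real vector database."""

    def __init__(self):
        self.records = []

    def add(self, vector, text, source_file, offset):
        """Append one chunk: its embedding, raw text, and source location."""
        self.records.append({
            "vector": vector,
            "text": text,
            "source": source_file,
            "offset": offset,  # character position of the chunk in the file
        })

store = VectorStore()
store.add([0.1, 0.2, 0.3], "Product XXX weighs 1.2 kg.", "specs.md", offset=0)
```

Because `add` can be called at any time, new documents extend the knowledge base without retraining anything.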
2. Data retrieval
2.1 User query
Example query: “What are the specifications of product XXX?”
2.2 Query embedding
The query is encoded with the same embedding model used for the document chunks, producing a query vector.
2.3 Similarity search
Approximate nearest‑neighbor search retrieves the top K most similar chunks from the vector database. This operation is fast because it works on dense vectors rather than raw text.
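For small collections the search can be done exactly by brute force, as sketched below; real vector databases replace this linear scan with approximate nearest-neighbor indexes (such as HNSW graphs) to stay fast at scale. The records here follow the toy layout from earlier in this article.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vector, records, k=3):
    """Return the k records whose vectors are most similar to the query.

    Exact brute-force scan for illustration; ANN indexes trade a little
    accuracy for much lower latency on large collections.
    """
    ranked = sorted(records, key=lambda r: cosine(query_vector, r["vector"]),
                    reverse=True)
    return ranked[:k]

records = [
    {"vector": [1.0, 0.0], "text": "Product XXX: specifications table."},
    {"vector": [0.0, 1.0], "text": "Shipping and returns policy."},
    {"vector": [0.9, 0.1], "text": "Product XXX weight and dimensions."},
]
hits = top_k([1.0, 0.0], records, k=2)
```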
2.4 Optional re‑ranking (ReRank)
Some deployments (e.g., RagFlow) apply a cross‑encoder ReRank model to rescore the retrieved chunks and promote the most relevant ones. Other open‑source systems such as AnythingLLM may skip this step.
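Structurally, re-ranking is just rescoring and resorting the retrieved chunks with a more expensive query-aware model. The sketch below uses a toy word-overlap function as a stand-in for the cross-encoder; in a real deployment `score_fn` would call a trained re-ranking model that reads the query and chunk together.

```python
def rerank(query, chunks, score_fn):
    """Rescore retrieved chunks with score_fn(query, chunk) and sort descending.

    score_fn is a placeholder for a cross-encoder model's relevance score.
    """
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)

def overlap_score(query, chunk):
    """Toy stand-in: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

reranked = rerank(
    "specifications of product XXX",
    ["Shipping and returns policy.", "the specifications of product xxx"],
    overlap_score,
)
```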
3. LLM generation
The selected chunks are inserted into a prompt template (customizable by the user). The completed prompt is sent to the LLM, which generates the final answer.
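The final assembly step can be sketched as simple string templating. The template wording below is an invented example, not the prompt used by any specific product; the point is that retrieved chunks land in the `{context}` slot and the user's question in `{question}` before the whole thing is sent to the LLM.

```python
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say you do not know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(retrieved_chunks, question):
    """Join the retrieved chunk texts and fill the user-customizable template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    ["Product XXX weighs 1.2 kg and measures 30 x 20 x 5 cm."],
    "What are the specifications of product XXX?",
)
```

The resulting string is what actually reaches the model, which is why editing the template changes answer style and grounding behavior without touching retrieval.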
Fun with Large Models
A master's graduate of Beijing Institute of Technology with four papers in top journals, I previously worked as a developer at ByteDance and Alibaba and now research large models at a major state-owned enterprise. I'm committed to sharing concise, practical experience in large-model development, in the belief that large AI models will become as essential as the PC. Let's start experimenting now!
