What Is Retrieval‑Augmented Generation (RAG) and How Does It Power Modern AI?

This article explains Retrieval‑Augmented Generation (RAG), an AI framework that combines traditional information retrieval with large language models, detailing its core workflow—from knowledge preparation, chunking, and embedding to vector database storage and the question‑answering stage—while highlighting key challenges, tools, and optimization strategies.


What Is Retrieval‑Augmented Generation (RAG)?

RAG (Retrieval‑Augmented Generation) is an AI framework that combines the strengths of traditional information‑retrieval systems (e.g., databases) with generative large language models (LLMs). Instead of relying solely on the knowledge stored during LLM training, the system first “looks up” relevant external documents and then generates answers grounded in those sources.

Key challenges RAG addresses:

Knowledge freshness: Overcomes the time‑bound limitation of LLM training data.

Hallucination: Reduces the probability of fabricated answers by providing source references.

Information security: Uses external knowledge bases instead of internal training data, lowering privacy‑leak risk.

Vertical domain knowledge: Allows direct integration of specialized domain information without retraining.

RAG Core Workflow

2.1 Knowledge Preparation

Document parsing: Accept raw documents such as Markdown, PDF, or HTML and extract plain text, handling special formats like code blocks, tables, images, and videos.

Data cleaning & standardization: Remove special characters, tags, noise, and duplicate content; normalize dates and units (e.g., “today” → “2025‑07‑17”). Tools such as NLTK or spaCy are commonly used.

Metadata extraction: Capture auxiliary information (source URL, file name, creation time, author, document type, etc.) to enrich retrieval relevance.
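For a single chunk, the recorded metadata might look like the following (the field names follow the article's running example):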

complete_metadata_chunk1 = {
    'file_path': '/mydocs/roma_intro.md',
    'file_name': 'roma_intro.md',
    'chunk_id': 0,
    'section_title': '# What is ROMA?',
    'subsection_title': '',
    'section_type': 'section',
    'chunking_strategy': 3,
    'content_type': 'product_description',
    'main_entity': 'ROMA',
    'language': 'zh-CN',  # language of the source document
    'creation_date': '2025-07-02',
    'word_count': 42,
    'topics': ['ROMA', 'frontend framework', 'cross-platform development'],
    'entities': {
        'products': ['ROMA', 'Jue language'],
        'platforms': ['iOS', 'Android', 'Web']
    }
}

2.2 Chunking (Content Splitting)

Chunking breaks long documents into smaller pieces to fit LLM token limits and improve retrieval precision. Common strategies:

Size‑based: Fixed character count; simple but may split semantic units.

Paragraph‑based: Keeps whole paragraphs; respects natural structure but leads to uneven chunk sizes.

Semantic‑based: Uses similarity scores to create coherent chunks; computationally expensive.

Hybrid approaches combine multiple methods, and overlapping windows can ensure key information appears in several chunks.

Typical tools: LangChain splitters (RecursiveCharacterTextSplitter, MarkdownTextSplitter), NLTK, spaCy.

# Example of a size-based split (translated from the Chinese original); note
# that the second chunk starts mid-sentence, the typical weakness of fixed-size splitting
Chunk 1: # ROMA Framework Introduction ROMA is a fully self-developed frontend development framework based on a custom DSL (the Jue language).
Chunk 2: A single codebase that runs on iOS, Android, Harmony, and Web, a cross-platform solution.
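As a minimal sketch of how such a splitter is typically driven (assuming the langchain-text-splitters package; the file name reuses the earlier metadata example):

# Size-based splitting with overlap via LangChain's RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("roma_intro.md", encoding="utf-8").read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # target size in characters
    chunk_overlap=50,   # overlap so key information appears in neighboring chunks
    separators=["\n\n", "\n", " ", ""],  # prefer paragraph, then line, then word breaks
)
chunks = splitter.split_text(document_text)  # list of strings, each roughly <= 500 chars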

2.3 Embedding (Vectorization)

Embedding maps each chunk of text to a dense numeric vector so that semantically similar passages land close together, enabling efficient similarity search. Example models (a minimal embedding sketch follows the list):

all‑MiniLM‑L6‑v2 (Sentence‑Transformers on Hugging Face, 384‑dim): Efficient inference, suitable for resource‑constrained environments.

text‑embedding‑ada‑002 (OpenAI, 1536‑dim): High performance, but may have access restrictions in some regions.

BERT embedding (Google, 768‑dim base / 1024‑dim large): Widely used in NLP tasks.

BGE (BAAI General Embedding) (Beijing Academy of Artificial Intelligence, 768‑dim base): Has ranked among the top entries on the Hugging Face MTEB leaderboard.
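A minimal embedding sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model from the list above:

# Embed a few chunks into 384-dim vectors with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "ROMA is a fully self-developed frontend framework based on a custom DSL (the Jue language).",
    "A single codebase runs on iOS, Android, Harmony, and Web.",
]
vectors = model.encode(chunks)   # numpy array of shape (2, 384)
print(vectors.shape)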

2.4 Vector Database Ingestion

Vectors and their metadata are stored in a vector database with an index for fast similarity search. Popular choices (an ingestion sketch follows the list):

ChromaDB – Low complexity, lightweight Python integration, best for prototypes or small projects.

FAISS – Medium complexity, supports billion‑scale vector retrieval with high performance; requires custom integration.

Milvus – High complexity, distributed and multi‑modal support; resource‑intensive, suited for enterprise production.

Pinecone – Low complexity, fully managed with auto‑scaling; higher cost and data stored on third‑party cloud.

Elasticsearch – High complexity, strong full‑text search ecosystem; vector search added later, performance lower than dedicated stores.
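A minimal ingestion sketch using ChromaDB, the lightweight option above; the collection and path names are illustrative:

# Store a chunk plus its metadata in a local ChromaDB collection, then query it.
import chromadb

client = chromadb.PersistentClient(path="./rag_store")         # on-disk store
collection = client.get_or_create_collection(name="roma_docs")

collection.add(
    ids=["chunk-0"],
    documents=["ROMA is a fully self-developed frontend framework ..."],
    metadatas=[{"file_name": "roma_intro.md", "content_type": "product_description"}],
)  # embeddings are computed by the collection's default embedding function

results = collection.query(query_texts=["What is ROMA?"], n_results=3)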

Question‑Answering Stage

3.1 Query Pre‑processing

Intent detection: Classify the query type (fact, recommendation, chit‑chat, etc.).

Query cleaning & standardization: Apply the same preprocessing as in the knowledge‑preparation stage.

Query augmentation: Generate synonyms or expand context using a knowledge base or LLM (a toy sketch of these steps follows).
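A hypothetical sketch of all three steps; the synonym table, intent rules, and function name here are illustrative, not from any particular library:

# Toy query pre-processing: cleaning, naive intent detection, synonym expansion.
SYNONYMS = {"frontend": ["front-end"], "framework": ["library"]}

def preprocess_query(query: str) -> dict:
    cleaned = " ".join(query.strip().lower().split())        # cleaning & standardization
    intent = "chit_chat" if cleaned in {"hi", "hello", "thanks"} else "fact"
    expanded = [cleaned]
    for word, alternatives in SYNONYMS.items():              # query augmentation
        if word in cleaned:
            expanded += [cleaned.replace(word, alt) for alt in alternatives]
    return {"intent": intent, "queries": expanded}

print(preprocess_query("What is the ROMA frontend framework?"))
# {'intent': 'fact', 'queries': ['what is the roma frontend framework?',
#  'what is the roma front-end framework?', 'what is the roma frontend library?']}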

3.2 Retrieval (Recall)

Three retrieval modes are typically combined:

Vector similarity search (cosine similarity).

Keyword (inverted‑index) search.

Hybrid search that merges both results.
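A vector-search request to such a store typically carries parameters like the following (values illustrative):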

{
    "vector": [0.052, -0.021, 0.075, ...],
    "top_k": 3,
    "score_threshold": 0.8,
    "filter": {"doc_type": "技术文档"}
}
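One common way to implement the hybrid merge is reciprocal rank fusion (RRF); a sketch, with illustrative inputs:

# Merge ranked result lists from vector and keyword search with RRF.
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Documents ranked high in any list accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk-3", "chunk-1", "chunk-7"]    # from cosine-similarity search
keyword_hits = ["chunk-1", "chunk-9", "chunk-3"]   # from inverted-index search
print(rrf_merge([vector_hits, keyword_hits]))      # chunks found by both rank first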

3.3 Reranking

A reranker model assigns a relevance score to each retrieved chunk, normalizes the score to [0, 1], and reorders results for higher semantic fidelity.
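A minimal reranking sketch, assuming sentence-transformers; the public ms-marco cross-encoder named here is one example, not necessarily what the article's system uses:

# Score (query, chunk) pairs with a cross-encoder, normalize to [0, 1], reorder.
import math
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What is ROMA?"
candidates = [
    "ROMA is a fully self-developed frontend framework ...",
    "Chunking breaks long documents into smaller pieces ...",
]
raw_scores = reranker.predict([(query, c) for c in candidates])   # model logits
normalized = [1 / (1 + math.exp(-s)) for s in raw_scores]         # sigmoid to [0, 1]
reranked = sorted(zip(normalized, candidates), reverse=True)      # best chunk first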

3.4 Information Integration

Retrieved chunks are formatted into a prompt template, optionally truncating or summarizing long texts to fit the LLM context window. Sources are cited to improve transparency.

Prompt template:
You are a ROMA framework expert. Based on the following context, answer the question.
Reference:
[Doc1] What is ROMA? ROMA is a fully self-developed frontend framework based on a custom DSL (the Jue language); a single codebase runs on iOS, Android, Harmony, and Web.
...
Requirements:
1. Explain step‑by‑step with code examples.
2. Cite the source document version.
3. If the reference does not contain the answer, state that you cannot answer.
User question: What is ROMA?
Answer: {answer}
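A hypothetical sketch of assembling that template while respecting a context budget; the character budget is an illustrative stand-in for real token counting:

# Format retrieved chunks into the prompt, truncating to fit the context window.
MAX_CONTEXT_CHARS = 4000   # illustrative stand-in for a token budget

def build_prompt(question: str, chunks: list[str]) -> str:
    references, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > MAX_CONTEXT_CHARS:
            break                                  # truncate rather than overflow
        references.append(f"[Doc{i}] {chunk}")     # numbered so sources can be cited
        used += len(chunk)
    return (
        "You are a ROMA framework expert. Based on the following context, "
        "answer the question.\nReference:\n" + "\n".join(references) +
        "\nIf the reference does not contain the answer, state that you cannot answer."
        f"\nUser question: {question}\nAnswer:"
    )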

3.5 LLM Generation

The final prompt is sent to an LLM such as GPT‑4 or Claude, which generates the answer.
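A minimal generation sketch, assuming the openai Python package (v1 client); the model name is illustrative:

# Send the assembled prompt to the LLM and read back the grounded answer.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
prompt = "You are a ROMA framework expert. ... User question: What is ROMA?\nAnswer:"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)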

Overall Optimization Tips

Optimizations include mixed chunking strategies, dynamic overlap sizing, and careful prompt engineering to limit token usage and enforce source attribution.

Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.