Mastering RAG: From Quick Start to Deep Optimization Strategies
This article dives into the practical implementation of Retrieval‑Augmented Generation (RAG), covering document chunking, semantic and reverse HyDE indexing, embedding, hybrid search, and re‑ranking techniques, and provides concrete code examples and optimization tips for building high‑performance AI applications.
Overview of RAG and Its Challenges
Retrieval‑Augmented Generation (RAG) is widely used in AI applications, but it is often treated as a black box, making problem diagnosis difficult. Effective RAG pipelines must balance recall and precision while adapting each module to specific scenarios.
Document Chunking
Good retrieval starts with well‑structured knowledge documents. Chunking splits documents into manageable pieces, typically using token‑based or semantic methods. Parameters such as min_split_tokens and max_split_tokens control chunk size, while a similarity threshold ensures semantic cohesion.
import os
import pathlib

from datasets import load_dataset
from semantic_chunkers import StatisticalChunker
from semantic_router.encoders import OpenAIEncoder

# Set local data directory
local_data_dir = pathlib.Path("/Users/jiangdanyang/workspaces/python/MarioPython/src/RAG/dataset/ai-arxiv2")
# Load dataset (download if not cached)
dataset = load_dataset("jamescalam/ai-arxiv2", split="train", cache_dir=str(local_data_dir))
# Initialize encoder
encoder = OpenAIEncoder(name="text-embedding-3-small", openai_api_key=os.getenv("AI_API_KEY"), openai_base_url=os.getenv("AI_API_BASE_URL"))
# Create statistical chunker
chunker = StatisticalChunker(encoder=encoder, min_split_tokens=100, max_split_tokens=500, plot_chunks=True, enable_statistics=True)
# Chunk the first paper in the dataset
chunks_0 = chunker(docs=[dataset["content"][0]], batch_size=500)
Statistics for a sample paper:
Chunking Statistics:
- Total Documents: 474
- Total Chunks: 46
- Chunks by Threshold: 41
- Chunks by Max Size: 4
- Minimum Token Size of Chunk: 54
- Maximum Token Size of Chunk: 495
- Similarity Chunk Ratio: 0.89
Indexing Enhancements
Two main indexing strategies improve retrieval:
Semantic Enhancement: Pass each chunk together with its full document to a strong LLM, ask it to generate a concise situating context, and prepend that context to the chunk.
Reverse HyDE (Hypothetical Document Embeddings): Generate plausible questions for each chunk, embed those questions, and index them so they point back to the chunk. This moves the work offline and improves query matching (see the sketch after the prompts below).
# Prompt for semantic enhancement
DOCUMENT_CONTEXT_PROMPT = """<document>{doc_content}</document>"""
CHUNK_CONTEXT_PROMPT = """Here is the chunk we want to situate within the whole document
<chunk>{chunk_content}</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""Embedding
Embedding
Embedding converts text (or multimodal content) into vectors. Model choice matters: language coverage, vocabulary size, and the semantic space affect downstream retrieval quality. For Chinese text, a Chinese‑trained embedding model is preferred over English‑only models.
# Example: tokenizing Chinese sentences with an English-trained SentenceTransformer
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/Users/jiangdanyang/workspaces/python/Model/all-MiniLM-L6-v2")
first_sentence = "直播分享会AI Living第一场"   # "AI Living live-stream sharing, session 1"
second_sentence = "直播鲁班小组分享第77期"     # "Luban group live-stream sharing, episode 77"
tokenized_first = model.tokenize([first_sentence])
tokenized_second = model.tokenize([second_sentence])
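To make the comparison concrete, a short sketch (using the same model) that encodes both sentences and measures their cosine similarity:
# Encode both sentences and compare them with cosine similarity
import numpy as np

embeddings = model.encode([first_sentence, second_sentence])
cos = float(np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])))
print(f"Cosine similarity: {cos:.4f}")
An English-trained model tends to tokenize Chinese character by character at best, so the similarity it reports for Chinese sentences is unreliable, which is exactly why a Chinese-trained model is recommended above.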
Hybrid Search
Hybrid search combines sparse (BM25/TF‑IDF) and dense (vector) retrieval to improve both keyword matching and semantic relevance. Sparse scores are computed with BM25 and dense scores with a transformer encoder; both are then normalized and weighted (e.g., 0.2 for sparse, 0.8 for dense) to produce a final ranking.
# Sparse BM25 indexing
import bm25s
import Stemmer
from bm25s.tokenization import Tokenizer

english_stemmer = Stemmer.Stemmer("english")
# Assumes corpus_json is a list of {"text": ...} records
corpus_text = [doc["text"] for doc in corpus_json]
sparse_tokenizer = Tokenizer(stemmer=english_stemmer, lower=True, stopwords="english", splitter=r"\w+")
corpus_sparse_tokens = sparse_tokenizer.tokenize(corpus_text, update_vocab=True, return_as="ids")
sparse_index = bm25s.BM25(corpus=corpus_json)
sparse_index.index(corpus_sparse_tokens)
# Dense vector indexing with Qdrant
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(path="/Users/jiangdanyang/workspaces/python/MarioPython/src/RAG/dataset/qdrant_data")
dense_encoder = SentenceTransformer('/Users/jiangdanyang/workspaces/python/Model/all-MiniLM-L6-v2')
collection_name = "hybrid_search"
qdrant.recreate_collection(collection_name=collection_name, vectors_config=models.VectorParams(size=dense_encoder.get_sentence_embedding_dimension(), distance=models.Distance.COSINE))
points = [models.PointStruct(id=idx, vector=dense_encoder.encode(doc["text"]).tolist(), payload=doc) for idx, doc in enumerate(corpus_json)]
qdrant.upload_points(collection_name=collection_name, points=points)
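The weighting step below expects a normalized score array from each retriever. One way to produce them, as a sketch: the query is illustrative, and the align-by-index logic is an assumption rather than part of the original pipeline (exact bm25s/Qdrant call signatures may vary by version).
# Sketch: score the whole corpus with both retrievers, then min-max normalize
import numpy as np

query = "What is the context size of Mixtral?"
k = len(corpus_json)  # score every document so both arrays align by index

# Sparse side: BM25 scores, mapped back to document indices
query_ids = sparse_tokenizer.tokenize([query], update_vocab=False, return_as="ids")
doc_ids, bm25_scores = sparse_index.retrieve(query_ids, corpus=np.arange(k), k=k)
sparse_scores = np.zeros(k)
sparse_scores[doc_ids[0]] = bm25_scores[0]

# Dense side: cosine similarities from Qdrant, aligned by point id
hits = qdrant.search(collection_name=collection_name, query_vector=dense_encoder.encode(query).tolist(), limit=k)
dense_scores = np.zeros(k)
for hit in hits:
    dense_scores[hit.id] = hit.score

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    # Scale to [0, 1]; the epsilon guards against a zero score range
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

sparse_scores_normalized = min_max_normalize(sparse_scores)
dense_scores_normalized = min_max_normalize(dense_scores)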
# Weight the normalized scores: 0.2 sparse, 0.8 dense
alpha = 0.2
weighted_scores = (1 - alpha) * dense_scores_normalized + alpha * sparse_scores_normalized
Re‑ranking
After hybrid retrieval, a cross‑encoder (e.g., a BERT‑based model) scores each query‑document pair to produce a fine‑grained relevance score between 0 and 1. The top‑k results are then used as context for the final generation step.
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("/Users/jiangdanyang/workspaces/python/Model/jina-reranker-v1-tiny-en")
query = "What is the context size of Mixtral?"
pairs = [[query, doc['text']] for doc in hybrid_search_results.values()]
scores = cross_encoder.predict(pairs)
# Select the top-k documents by cross-encoder score
top_k = sorted(zip(scores, hybrid_search_results.values()), key=lambda x: x[0], reverse=True)[:5]
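Finally, the top‑k chunks become the context for generation. A minimal sketch, where the model name and prompt wording are assumptions and the client reuses the AI_API_KEY / AI_API_BASE_URL configuration from earlier:
# Sketch: answer the query using only the re-ranked chunks as context
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("AI_API_KEY"), base_url=os.getenv("AI_API_BASE_URL"))
context = "\n\n".join(doc["text"] for _, doc in top_k)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat-completion model works here
    messages=[
        {"role": "system", "content": f"Answer using only the following context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)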
Conclusion
RAG is a crucial component of modern AI systems, but achieving high performance requires careful tuning of each stage, from document chunking and indexing to hybrid retrieval and cross‑encoder re‑ranking. Starting with a quick‑start pipeline is fine, yet deep optimization tailored to your data and use case yields the best results.