Mastering RAG: From Chunking to Hybrid Search for Better AI Retrieval
This article delves into the implementation details and optimization strategies of Retrieval‑Augmented Generation (RAG), covering document chunking, index enhancement, embedding, hybrid search, and re‑ranking, and provides practical code examples to help developers move from quick deployment to deep performance tuning.
Introduction
The article explores the practical challenges of Retrieval‑Augmented Generation (RAG) in AI applications, noting that RAG is often treated as a black box, making problem diagnosis and performance tuning difficult. It emphasizes the need to balance recall and precision while iteratively optimizing each component of the pipeline.
RAG Overview
RAG combines three core actions: Retrieve (search), Augment (enhance with additional context), and Generate (produce the final answer). An additional embedding step encodes text into vectors for retrieval.
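The three actions and the embedding step can be sketched end to end. This is a toy illustration rather than the article's code: `embed` is a bag-of-words stand-in for a real embedding model, `generate` is a hypothetical stub where an LLM call would go, and the three-sentence corpus is invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=1):
    """Retrieve: rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def augment(query, chunks):
    """Augment: place the retrieved chunks in the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Generate: hypothetical stub; a real pipeline would call an LLM here."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

corpus = [
    "BM25 is a sparse retrieval method based on term frequency.",
    "Cross-encoders score a query and a document together.",
    "Semantic chunking splits documents where the topic changes.",
]
question = "How does BM25 retrieval work?"
top_chunks = retrieve(question, corpus, k=1)
prompt = augment(question, top_chunks)
answer = generate(prompt)
```

Keeping the three functions separate makes it possible to measure and tune each stage on its own, which helps with the black-box problem described in the introduction.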
1. Document Chunking
Effective retrieval starts with well‑structured knowledge documents. Semantic chunking (beyond simple token or paragraph splits) yields more relevant chunks. The article provides a Python example using StatisticalChunker and shows statistics such as total documents, number of chunks, token size ranges, and a similarity‑chunk ratio of 0.89, indicating that most chunks were created semantically.
import os
import pathlib

from datasets import load_dataset
from semantic_chunkers import StatisticalChunker
from semantic_router.encoders import OpenAIEncoder

# Set local data directory
local_data_dir = pathlib.Path("/Users/jiangdanyang/workspaces/python/MarioPython/src/RAG/dataset/ai-arxiv2")
# Load dataset
dataset = load_dataset("jamescalam/ai-arxiv2", split="train", cache_dir=str(local_data_dir))
# Initialize encoder (API key and base URL are read from environment variables)
encoder = OpenAIEncoder(name="text-embedding-3-small", openai_api_key=os.getenv("AI_API_KEY"), openai_base_url=os.getenv("AI_API_BASE_URL"))
# Chunk semantically, keeping each chunk between 100 and 500 tokens
chunker = StatisticalChunker(encoder=encoder, min_split_tokens=100, max_split_tokens=500, plot_chunks=True, enable_statistics=True)
# Chunk only the first document of the dataset
chunks_0 = chunker(docs=[dataset["content"][0]], batch_size=500)
Key parameters include Threshold (similarity lower bound) and WindowSize (number of consecutive documents considered for similarity).
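To make the roles of a similarity threshold and a window size concrete, here is a minimal sketch of a similarity-based splitter. It is not StatisticalChunker's actual algorithm: the bag-of-words similarity is a stand-in for a real encoder, and `chunk_by_similarity` and its parameters are hypothetical names chosen for the illustration.

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words vector (stand-in for an embedding)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk_by_similarity(sentences, threshold=0.2, window_size=2):
    """Start a new chunk when a sentence is less similar than `threshold`
    to the last `window_size` sentences of the current chunk."""
    chunks, current = [], []
    for sentence in sentences:
        if current:
            window = " ".join(current[-window_size:])
            if cosine(bow(window), bow(sentence)) < threshold:
                chunks.append(current)
                current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks

sentences = [
    "Vector search finds similar vectors.",
    "Similar vectors are close in vector space.",
    "Bread needs flour and yeast.",
    "Yeast makes bread dough rise.",
]
chunks = chunk_by_similarity(sentences, threshold=0.2, window_size=2)
# The two search sentences form one chunk and the two baking sentences another.
```

Raising the threshold makes the splitter more eager to cut. In this toy data a threshold of 0.6 puts every sentence in its own chunk.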
2. Index Enhancement
Two techniques are discussed:
Semantic enhancement: send each chunk together with its full document to a strong LLM, ask it to generate a concise context, and append this context to the chunk before indexing.
Reverse HyDE: generate plausible questions for a given chunk, index those questions, and later retrieve the chunk via the generated queries, enabling offline enrichment.
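The Reverse HyDE idea can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: `generate_questions` is a hypothetical stub that returns canned questions where a real pipeline would call an LLM offline, and word-overlap (Jaccard) matching stands in for vector search over the generated questions.

```python
import re

CHUNKS = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Qdrant stores dense vectors and supports cosine distance.",
]

def generate_questions(chunk):
    """Hypothetical stub: a real pipeline would ask an LLM which questions this chunk answers."""
    canned = {
        CHUNKS[0]: ["How does BM25 rank documents?",
                    "Which statistics does BM25 use for scoring?"],
        CHUNKS[1]: ["Where are dense vectors stored?",
                    "Which distance metric does Qdrant support?"],
    }
    return canned.get(chunk, [])

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def build_question_index(chunks):
    """Offline step: index each generated question with a pointer back to its chunk."""
    index = []
    for chunk_id, chunk in enumerate(chunks):
        for question in generate_questions(chunk):
            index.append((tokens(question), chunk_id))
    return index

def retrieve_chunk(query, index, chunks):
    """Online step: match the user query against the questions, then return the linked chunk."""
    q = tokens(query)
    def jaccard(entry):
        question_tokens = entry[0]
        return len(q & question_tokens) / len(q | question_tokens)
    _, best_chunk_id = max(index, key=jaccard)
    return chunks[best_chunk_id]

index = build_question_index(CHUNKS)
hit = retrieve_chunk("How are documents ranked by BM25?", index, CHUNKS)
```

Matching a question against questions tends to be easier than matching a question against declarative text, which is the motivation for indexing the generated questions rather than the chunk alone.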
DOCUMENT_CONTEXT_PROMPT = """<document>{doc_content}</document>"""
CHUNK_CONTEXT_PROMPT = """Here is the chunk we want to situate within the whole document
<chunk>{chunk_content}</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""
3. Embedding
Embedding converts text (or multimodal content) into vectors. Factors affecting quality include the language support of the model, vocabulary size, and the semantic space the model was trained on. The article shows a simple example using SentenceTransformer and highlights issues when using English‑centric models for Chinese text.
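Why an English-centric vocabulary hurts Chinese text can be illustrated without downloading a model. The sketch below is a toy: `VOCAB` is an invented three-entry vocabulary and `toy_tokenize` is a hypothetical simplification (ASCII words stay whole, every other character becomes its own piece), not the tokenizer of any real model. It only shows the mechanism: pieces missing from the vocabulary collapse into `[UNK]`, so different sentences become hard to tell apart.

```python
import re

# Toy vocabulary standing in for an English-centric model's vocabulary (assumption).
VOCAB = {"ai", "living", "77"}

def toy_tokenize(text):
    """Split into ASCII words/numbers and single non-ASCII characters;
    map every piece that is not in VOCAB to [UNK]."""
    pieces = re.findall(r"[a-z0-9]+|[^\x00-\x7f]", text.lower())
    return [p if p in VOCAB else "[UNK]" for p in pieces]

def unk_ratio(tokens):
    """Share of tokens that carry no information for the model."""
    return tokens.count("[UNK]") / len(tokens)

first = toy_tokenize("直播分享会AI Living第一场")
second = toy_tokenize("直播鲁班小组分享第77期")
```

With this toy vocabulary, 8 of the 10 pieces in the first sentence and 10 of the 11 pieces in the second are `[UNK]`, so an embedding built from them would mostly reflect the few surviving English or numeric tokens.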
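from sentence_transformers import SentenceTransformer

# Two Chinese sentences (roughly "Live sharing session AI Living, session 1" and "Live Luban group sharing, issue 77")
first_sentence = "直播分享会AI Living第一场"
second_sentence = "直播鲁班小组分享第77期"
model = SentenceTransformer("/Users/jiangdanyang/workspaces/python/Model/all-MiniLM-L6-v2")
# Tokenize each sentence to inspect how the model's vocabulary handles Chinese text
tokenized_first_sentence = model.tokenize([first_sentence])
tokenized_second_sentence = model.tokenize([second_sentence])
4. Hybrid Search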
Hybrid search combines sparse (BM25/TF‑IDF) and dense (vector) retrieval to improve both keyword matching and semantic relevance. The article provides code for building a BM25 index and a dense vector index with Qdrant, then normalizes and weights the scores (e.g., 0.2 for sparse, 0.8 for dense) to produce a final ranking.
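The normalize-and-weight step can be tried in isolation with plain Python. This is a minimal sketch under stated assumptions: the document IDs and scores are invented, `min_max` and `fuse` are hypothetical helper names, and returning 0.0 when all scores are equal is a design choice made here to avoid dividing by zero.

```python
def min_max(scores):
    """Scale scores to [0, 1]; if all scores are equal, return zeros instead of dividing by zero."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(doc_ids, dense_scores, sparse_scores, alpha=0.2):
    """Weighted sum of normalized scores: alpha for sparse, 1 - alpha for dense."""
    dense_n = min_max(dense_scores)
    sparse_n = min_max(sparse_scores)
    combined = [(1 - alpha) * d + alpha * s for d, s in zip(dense_n, sparse_n)]
    return sorted(zip(doc_ids, combined), key=lambda pair: pair[1], reverse=True)

doc_ids = ["a", "b", "c"]
dense = [0.82, 0.40, 0.61]   # e.g. cosine similarities
sparse = [1.0, 7.5, 3.0]     # e.g. BM25 scores, on a different scale
ranking = fuse(doc_ids, dense, sparse, alpha=0.2)
```

Normalization matters because BM25 scores and cosine similarities live on different scales; without it the larger-scale score would dominate regardless of the weights. With `alpha=0.2` document `a` ranks first because it has the best dense score, while `alpha=1.0` would put `b` first.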
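import json
import bm25s
import snowballstemmer
from bm25s.tokenization import Tokenizer
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

# Load the chunks
with open('/Users/jiangdanyang/workspaces/python/MarioPython/src/RAG/dataset/corpus.json') as f:
    corpus_json = json.load(f)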
corpus_text = [doc["text"] for doc in corpus_json]
# Sparse tokenization and BM25 indexing
english_stemmer = snowballstemmer.stemmer("english")
sparse_tokenizer = Tokenizer(stemmer=english_stemmer, lower=True, stopwords="english", splitter=r"\w+")
corpus_sparse_tokens = sparse_tokenizer.tokenize(corpus_text, update_vocab=True, return_as="ids")
sparse_index = bm25s.BM25(corpus=corpus_json)
sparse_index.index(corpus_sparse_tokens)
# Dense encoding with Qdrant
qdrant = QdrantClient(path="/Users/jiangdanyang/workspaces/python/MarioPython/src/RAG/dataset/qdrant_data")
dense_encoder = SentenceTransformer('/Users/jiangdanyang/workspaces/python/Model/all-MiniLM-L6-v2')
collection_name = "hybrid_search"
qdrant.recreate_collection(collection_name=collection_name, vectors_config=models.VectorParams(size=dense_encoder.get_sentence_embedding_dimension(), distance=models.Distance.COSINE))
qdrant.upload_points(collection_name=collection_name, points=[models.PointStruct(id=idx, vector=dense_encoder.encode(doc["text"]).tolist(), payload=doc) for idx, doc in enumerate(corpus_json)])
# Encode the query and fetch the 10 nearest chunks
query = "What is context size of Mixtral?"
query_vector = dense_encoder.encode(query).tolist()
dense_results = qdrant.search(collection_name=collection_name, query_vector=query_vector, limit=10)
Score normalization and weighting are performed as follows:
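import numpy as np

# Collect the raw scores for each candidate document (a missing score counts as 0)
dense_scores = np.array([doc.get("dense_score", 0) for doc in documents_with_scores])
sparse_scores = np.array([doc.get("sparse_score", 0) for doc in documents_with_scores])
# Min-max normalize both score sets to [0, 1]; this assumes the scores are not all identical, otherwise the denominator is zero
dense_scores_normalized = (dense_scores - np.min(dense_scores)) / (np.max(dense_scores) - np.min(dense_scores))
sparse_scores_normalized = (sparse_scores - np.min(sparse_scores)) / (np.max(sparse_scores) - np.min(sparse_scores))
# alpha is the weight of the sparse score; 1 - alpha is the weight of the dense score
alpha = 0.2
weighted_scores = (1 - alpha) * dense_scores_normalized + alpha * sparse_scores_normalized
5. Re‑ranking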
After hybrid retrieval, a cross‑encoder (e.g., a BERT‑style model) re‑ranks the top documents by computing a relevance score for each (query, doc) pair.
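The re-ranking step itself is a score-and-sort over (query, doc) pairs. The sketch below shows that logic with a pluggable scorer so it runs without a model: `score_pairs` is a hypothetical word-overlap stand-in for a cross-encoder, and `rerank` and the candidate documents are invented for the illustration.

```python
import re

def score_pairs(pairs):
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in the document."""
    scores = []
    for query, doc in pairs:
        q = set(re.findall(r"\w+", query.lower()))
        d = set(re.findall(r"\w+", doc.lower()))
        scores.append(len(q & d) / len(q) if q else 0.0)
    return scores

def rerank(query, docs, top_k=2, scorer=score_pairs):
    """Score every (query, doc) pair, sort by score, and keep the top_k documents."""
    pairs = [[query, doc["text"]] for doc in docs]
    scores = scorer(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [dict(doc, rerank_score=score) for doc, score in ranked[:top_k]]

candidates = [
    {"id": 1, "text": "Hybrid search mixes sparse and dense scores."},
    {"id": 2, "text": "The context size of the model is set in its configuration."},
    {"id": 3, "text": "Chunk size affects retrieval quality."},
]
top = rerank("What is the context size?", candidates, top_k=2)
```

In a real pipeline the `scorer` argument would be replaced by a model-backed function that takes the same list of pairs and returns one score per pair. Because a cross-encoder reads the query and document together, it is usually applied only to the small candidate set produced by hybrid retrieval.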
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder("/Users/jiangdanyang/workspaces/python/Model/jina-reranker-v1-tiny-en")
query = "What is context size of Mixtral?"
pairs = [[query, doc['text']] for doc in hybrid_search_results.values()]
scores = cross_encoder.predict(pairs)
Conclusion
RAG pipelines benefit from a systematic approach: start with quick integration, then deepen understanding of each component—chunking, indexing, embedding, hybrid retrieval, and re‑ranking—and finally iterate based on measured recall and precision. This progression enables developers to move from prototype to production‑grade performance.
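Measuring recall and precision can start with a small labeled set of queries and their relevant chunk IDs. The sketch below computes precision@k and recall@k for one query. The function name and the document IDs are invented for illustration, and the relevance labels are assumed to come from manual annotation.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """precision@k = hits / number of results returned in the top k;
    recall@k = hits / number of relevant documents."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

retrieved = ["d3", "d7", "d1", "d9", "d4"]   # ranked output of the retriever
relevant = {"d1", "d3", "d5"}                # hand-labeled relevant chunks
p, r = precision_recall_at_k(retrieved, relevant, k=3)
```

Here two of the top three results are relevant and two of the three relevant chunks were found, so both values are 2/3. Increasing k to 5 keeps recall at 2/3 but lowers precision to 0.4, which is the trade-off the pipeline stages are tuned against.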
References
https://weaviate.io/blog/hybrid-search-explained
https://www.sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html
https://learning.oreilly.com/videos/advanced-rag/11122024VIDEOPAIML/
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.