Why Your RAG System Underperforms and How to Boost Its Effectiveness by 20%

This article analyzes common shortcomings of RAG pipelines—data preparation, retrieval, and LLM generation—and provides concrete optimization techniques such as advanced chunking, embedding model selection, retrieval parameter tuning, rerank models, and prompt engineering, promising up to a 20% performance gain.

Open‑source RAG framework selection

Four open‑source frameworks – AnythingLLM, Cherry Studio, RAGFlow, and Dify – are compared on data privacy, barrier to entry, document‑preprocessing capability, customizability, and target scenarios. RAGFlow is recommended because it can parse scanned documents and tables, supports visual adjustment of chunks, uses multi‑path retrieval for more accurate answers, and highlights the source passages each answer is drawn from.

Data preparation optimization

High‑quality raw data upload

Supported file formats include .txt, .docx, .json, .pdf, and .md. Uploading every file unprocessed introduces noise that makes it harder for the system to locate accurate answers.

Recommended preprocessing steps:

Stop‑word and special‑character filtering – Use the jieba tokenizer with a stop‑word list (e.g., https://zhuanlan.zhihu.com/p/39437488) to remove low‑information tokens; see the sketch after this list.

Data cleaning and deduplication – Remove structurally complex or duplicate documents with Unstructured.io (https://unstructured.io/).

Large‑model preprocessing – Prompt an LLM to summarise documents, e.g., “You are a document‑summarisation expert, please summarise the uploaded document and output key details.”
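
As an illustration of the first step, here is a minimal Python sketch of stop‑word filtering with jieba. The stopwords.txt path and its one‑word‑per‑line format are assumptions for illustration, not part of the original pipeline.

```python
import jieba

# Assumed: stopwords.txt is a one-word-per-line list such as the one linked above.
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def clean_text(text: str) -> str:
    # Tokenize, drop stop words and whitespace-only tokens.
    tokens = jieba.lcut(text)
    kept = [t for t in tokens if t.strip() and t not in stopwords]
    return "".join(kept)  # Chinese text is usually re-joined without spaces
```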

Finer text chunking

Fixed‑size chunking often breaks semantic continuity, leading to missed information (e.g., a query about the exact reading time of an article cannot be answered because the chunk splits the sentence). Overlap settings mitigate boundary issues, but more sophisticated strategies are needed for complex documents.

Five chunking strategies are presented, each with trade‑offs:

Sentence‑based chunking – Split by punctuation (e.g., .?!) or using NLP libraries like NLTK or SpaCy, then combine consecutive sentences to reach a target size. Preserves linguistic structure but may still cut across multi‑sentence semantics.

Recursive character chunking – Provide a prioritized delimiter list ("\n\n" paragraph break, "\n" newline, space, then individual characters) and split iteratively until the size constraint is met. Keeps larger logical blocks intact but degrades to character‑level splitting for dense text.

Document‑structure chunking – Use HTML/XML tags (e.g., <p>) or Markdown headings as chunk boundaries, preserving hierarchical semantics.

Hybrid strategy – First apply structure‑based splitting, then refine oversized blocks with sentence or recursive character methods.

Semantic‑information chunking – Compute embeddings for adjacent sentences/paragraphs; split where semantic similarity falls below a threshold. Offers the most semantically coherent chunks but incurs high computational cost.
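
As a sketch of the last strategy, the following assumes sentence-transformers with the bge-large-zh-v1.5 model recommended later in this article; the 0.6 similarity threshold is illustrative and should be tuned per corpus.

```python
import re
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

def semantic_chunks(text: str, threshold: float = 0.6) -> list[str]:
    # Split on Chinese/Western sentence-ending punctuation, keeping the marks.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Start a new chunk where adjacent-sentence similarity drops.
        if cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            chunks.append("".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append("".join(current))
    return chunks
```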

Stronger embedding models

The nomic-embed-text model performs poorly for Chinese. Recommended Chinese‑optimized models include gte-large-zh, bge-large-zh-v1.5, m3e-base, and tao8k. Benchmark results (TOP@n) show these models outperform nomic-embed-text. The MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard) provides up‑to‑date rankings, but its English‑centric tests may not reflect Chinese performance.
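
A minimal usage sketch for bge-large-zh-v1.5 via sentence-transformers follows. The query‑side instruction prefix is the one the bge‑zh model cards recommend for retrieval (optional in v1.5); the sample sentences are invented for illustration.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
# Query-side instruction recommended by the bge-zh model cards (optional in v1.5).
instruction = "为这个句子生成表示以用于检索相关文章："

query = model.encode(instruction + "如何提升RAG系统的检索效果？",
                     normalize_embeddings=True)
docs = model.encode(["混合检索将向量相似度与关键词匹配结合。", "今天天气晴朗。"],
                    normalize_embeddings=True)
print(cos_sim(query, docs))  # the retrieval-related passage should score higher
```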

Retrieval optimization

Retrieval parameter tuning (RAGFlow example)

Key parameters:

similarity_threshold – similarity cutoff (0‑1); chunks scoring below it are discarded.

vector_similarity_weight – weight (0‑1) of vector similarity relative to keyword matching in the fused score.

top_k – number of results returned.

score – minimum similarity score a result must reach to be kept.

Typical production values set similarity_threshold and vector_similarity_weight to 0.3‑0.5 to balance terminology and semantic understanding. Adjust top_k and score together (e.g., score=0.7, top_k=3) to control result breadth versus precision.
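
A hypothetical settings snippet mirroring the values above; the key names follow RAGFlow's parameter terminology, but this dict is illustrative rather than RAGFlow's actual configuration API.

```python
# Illustrative retrieval settings, not RAGFlow's real config schema.
retrieval_config = {
    "similarity_threshold": 0.4,      # drop weakly related chunks (0-1)
    "vector_similarity_weight": 0.4,  # 0.4 vector vs. 0.6 keyword match
    "top_k": 3,                       # keep only the best few chunks
    "score": 0.7,                     # minimum fused score to return a result
}
```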

Retrieval algorithm improvement

Hybrid search combines vector similarity (e.g., cosine) with traditional keyword matching and fuses the results. Adjustable weightings allow tailoring to specific use cases for more comprehensive and accurate outcomes.
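
A minimal sketch of the fusion step, assuming the rank_bm25 package for keyword scores; the vector scores come from whatever embedding search is in use, and both score lists are min‑max normalized before mixing.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def minmax(scores):
    # Scale scores to [0, 1] so vector and keyword scores are comparable.
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span else np.zeros_like(scores)

def hybrid_scores(query_tokens, doc_tokens_list, vector_scores, alpha=0.4):
    """alpha weights vector similarity; (1 - alpha) weights BM25 keyword match."""
    bm25 = BM25Okapi(doc_tokens_list)        # keyword index over tokenized docs
    keyword = bm25.get_scores(query_tokens)  # BM25 score per document
    return alpha * minmax(vector_scores) + (1 - alpha) * minmax(keyword)

# For Chinese text, tokenize documents and queries with jieba.lcut first.
```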

Re‑ranking models

Enabling a re‑rank model improves answer relevance at the cost of higher latency. A commonly used model is bge‑reranker‑large, typically deployed via the Xinference serving framework; deployment guidance is available in related blog posts.
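
A minimal re‑ranking sketch using sentence-transformers' CrossEncoder to run bge-reranker-large locally; in production the model is more commonly served through Xinference.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, candidates: list[str], keep: int = 3) -> list[str]:
    # Score each (query, candidate) pair, then keep the highest-scoring chunks.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]
```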

LLM generation optimization

Stronger non‑reasoning models

For Chinese tasks, DeepSeek‑V3‑0324 and Qwen2.5‑72B provide higher generation quality and speed. Reasoning‑oriented models such as DeepSeek‑R1 or QwQ‑32B show no clear quality gain here and respond more slowly.

Prompt engineering

Multi‑query rewrite strategy

Expand the original question into 3‑5 semantically equivalent variants using an LLM. Example prompt:

Your task is to generate 3-5 semantically equivalent but differently worded query variants for the given user question, to help overcome some of the limitations of distance-based similarity search when retrieving relevant documents from a vector database.
Here is the original question:
<question>
{{question}}
</question>
Please generate 3-5 query variants that are semantically equivalent to the original question but phrased differently, separating the alternative questions with newlines.
Write your answer inside <query_variants> tags.

Each sub‑query retrieves its own document fragments, enriching the overall information pool.
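
A sketch of the fan‑out, assuming hypothetical generate_variants (the LLM call using the rewrite prompt above) and vector_search helpers; retrieved chunks are deduplicated by an assumed id attribute.

```python
def multi_query_retrieve(question, generate_variants, vector_search, top_k=3):
    # generate_variants: LLM call using the rewrite prompt above (hypothetical).
    # vector_search: your retriever; assumed to return chunks with an .id field.
    queries = [question] + generate_variants(question)
    seen, merged = set(), []
    for q in queries:
        for chunk in vector_search(q, top_k=top_k):
            if chunk.id not in seen:  # deduplicate across sub-queries
                seen.add(chunk.id)
                merged.append(chunk)
    return merged
```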

Question decomposition strategy

For complex, reasoning‑heavy questions, break them into multiple sub‑questions. Example prompt:

Your task is to generate multiple related sub-questions or sub-queries for the input question, decomposing it into a set of sub-questions or sub-tasks that can each be answered independently.
Here is the input question:
<question>
{{question}}
</question>
Please generate 3-5 search queries related to the question, separated by newlines. Each generated sub-question/sub-query should have a clear topic and be answerable on its own.
Write the generated sub-questions/sub-queries inside <sub_questions> tags.
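
A hypothetical flow for this strategy: each sub‑question gets its own retrieval and answer, and the partial answers are then synthesized; decompose, retrieve, and llm are placeholders for your own components.

```python
def decompose_and_answer(question, decompose, retrieve, llm):
    # decompose: LLM call using the decomposition prompt above (hypothetical).
    subs = decompose(question)
    # Answer each sub-question against its own retrieved context.
    partials = [llm(f"Context:\n{retrieve(s)}\n\nQuestion: {s}") for s in subs]
    notes = "\n".join(f"- {s} -> {a}" for s, a in zip(subs, partials))
    # Synthesize the partial answers into one final response.
    return llm(f"Original question: {question}\nSub-answers:\n{notes}\n"
               f"Synthesize a complete final answer.")
```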

Summary

Applying the described optimizations (advanced chunking, Chinese‑optimized embeddings, retrieval parameter tuning, hybrid search, re‑ranking, stronger LLMs, and refined prompts) can improve RAG system performance by more than 20%. Future work includes exploring knowledge‑graph‑enhanced RAG such as Microsoft's GraphRAG.

Written by

Fun with Large Models

Master's graduate from Beijing Institute of Technology with four papers in top journals; previously a developer at ByteDance and Alibaba, now researching large models at a major state‑owned enterprise. Committed to sharing concise, practical development experience with large AI models, in the belief that they will become as essential as the PC. Let's start experimenting now!
