Boost Enterprise LLM Performance: Solving Common RAG Challenges
This article explains Retrieval‑Augmented Generation for enterprise LLMs, outlines four production‑grade problems, and presents practical solutions such as parent‑child chunking, multi‑vector and multi‑query retrieval, and context‑aware question refinement with concrete prompts and workflow diagrams.
What is RAG and why it matters
Retrieval‑Augmented Generation (RAG) combines semantic search with large language models (LLMs) so that the model can incorporate relevant context from a private knowledge base when generating answers. It addresses the lack of domain‑specific knowledge, reduces hallucinations, avoids costly fine‑tuning, adapts quickly to knowledge updates, and can cite source documents.
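To make the idea concrete, a minimal RAG loop can be sketched end to end. The bag-of-words "embedding", the cosine scorer, and the two-document knowledge base below are toy stand-ins for a real embedding model and vector store:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Private knowledge base the LLM has never seen during training.
knowledge_base = [
    "The annual leave policy grants 15 days after two years of service.",
    "Expense reports must be filed within 30 days of purchase.",
]

def retrieve(question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(knowledge_base, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question):
    """Assemble retrieved context plus the question into a grounded prompt."""
    context = "\n".join(retrieve(question))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )
```

The prompt returned by `build_prompt` is what gets sent to the LLM; grounding the answer in retrieved context is what reduces hallucinations and enables source citation.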
Four common production‑grade RAG problems
Semantic retrieval accuracy – ensuring the most relevant knowledge chunks are retrieved.
Context‑aware knowledge retrieval – handling user queries that depend on prior dialogue or implicit context.
Multimodal content handling – processing images, tables, or other non‑textual data (covered in the next article).
Output quality evaluation – measuring how well the final answer reflects the retrieved knowledge.
Improving semantic retrieval accuracy
Key factors influencing vector recall precision are:
Quality of indexed knowledge – clear, single‑topic chunks improve matching.
Chunk granularity – smaller, semantically coherent pieces are easier to retrieve.
Embedding model quality – use an embedding model that captures the corpus language's semantics well (here, Chinese).
User query quality – precise, unambiguous questions yield better vectors.
Parent‑child chunk strategy
Split each source document into small “child” chunks for embedding and larger “parent” chunks that retain full context. Retrieve child chunks first, then fetch their corresponding parent chunks to provide richer context for the LLM.
Processing flow:
User inputs a question.
The question is embedded into a vector.
Semantic search returns the N most similar child chunks.
Metadata links each child chunk to its parent chunk.
Parent chunks are assembled into the prompt and sent to the LLM.
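The flow above can be sketched as follows. The word-overlap similarity and the `parents`/`children` sample data are toy stand-ins for real vector search and a real document store; what matters is the metadata hop from matched child to parent:

```python
# Parent chunks hold the fuller context; child chunks are what gets embedded.
parents = {
    "p1": "Full section on income tax brackets, rates, and worked examples ...",
    "p2": "Full section on special additional deductions, covering elderly care, "
          "children's education, and more ...",
}
children = [
    {"text": "income tax brackets and rates", "parent_id": "p1"},
    {"text": "special additional deductions for elderly care", "parent_id": "p2"},
]

def retrieve_parents(question, top_n=1):
    """Match small child chunks, then return their larger parent chunks."""
    q = set(question.lower().split())
    # Toy similarity: word overlap (stand-in for semantic search over child vectors).
    scored = sorted(children, key=lambda c: len(q & set(c["text"].split())),
                    reverse=True)
    # Follow metadata from each matched child to its parent chunk, deduplicated.
    seen, context = set(), []
    for child in scored[:top_n]:
        pid = child["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context
```

Matching on the small, focused child chunk keeps retrieval precise, while handing the LLM the parent chunk keeps the generation well-contextualized.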
Multi‑vector retrieval
Generate multiple vectors for each knowledge chunk (e.g., via different prompts or augmentations) to capture richer semantics, then merge and re‑rank the results for higher recall.
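One way to sketch this, assuming each chunk is additionally indexed by an LLM-written summary and a sample question (hard-coded here rather than generated), is to keep several vectors per chunk and merge by taking the best-matching representation:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each chunk is indexed under several representations: the chunk text itself,
# a summary, and a question it could answer (hypothetical LLM augmentations).
index = [
    {
        "chunk": "Employees accrue 15 vacation days after two full years.",
        "vectors": [embed(t) for t in (
            "Employees accrue 15 vacation days after two full years.",
            "vacation day accrual policy",
            "how many vacation days do I get",
        )],
    },
]

def score(question_vec, entry):
    """Merge multi-vector matches by taking the best-scoring representation."""
    return max(cosine(question_vec, v) for v in entry["vectors"])
```

A question phrased very differently from the chunk text can still score highly against the question-style representation, which is the recall gain this technique buys.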
Multi‑query retrieval
Before searching, let the LLM expand the original question into several alternative queries. Run parallel semantic searches for each query, then combine and re‑rank the results (e.g., using Reciprocal Rank Fusion). This approach mitigates the limitation of a single, possibly ambiguous query.
Reference for the RRF re‑ranking algorithm: https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1
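The RRF formula itself is compact: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally set to 60. A minimal sketch (the `doc_a`/`doc_b` identifiers are placeholders):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Results from three LLM-expanded variants of the same question:
fused = rrf_fuse([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
])
```

A document ranked consistently well across all query variants (here `doc_b`) outranks one that topped only a single list, which is exactly how fusion compensates for any one query's ambiguity.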
Context‑aware question refinement
When a follow‑up question relies on previous dialogue, first ask the LLM to rewrite the question with the necessary context, then perform retrieval on the refined query.
You are an intelligent assistant that understands user questions. Based on the dialogue history, refine the user's current question and output the refined version; if there is no dialogue history, or no refinement is needed, output the original question unchanged. Requirements:
1. Never output an empty question
2. Never answer the question directly
3. Do not invent questions or change the original meaning
Example 1: ========
Previous user question: Which country is the city of Beijing in?
Current user question: What about New York?
Refined output: {"new_question":"Which country is the city of New York in?"}
==========
Example 2: ========
Previous user question: How is personal income tax calculated?
Current user question: My annual income is 1,000,000; can you help me calculate it?
Refined output: {"new_question":"My annual income is 1,000,000; please calculate my personal income tax"}
==========
Output the refined question in JSON format, with no extra explanation. Note that new_question must never be empty.
Example transformation: the ambiguous follow-up "Describe the sixth item in detail" becomes "Describe in detail the elderly-support item under the special additional deductions for personal income tax". After refinement, semantic search returns the correct knowledge chunk and the LLM can answer accurately.
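A sketch of wiring this refinement step into a pipeline. The message-building half assumes an OpenAI-style chat message list, and the actual LLM call is omitted; falling back to the original question when the model returns malformed JSON keeps the pipeline robust:

```python
import json

# Stand-in for the full refinement prompt shown above.
REFINE_SYSTEM_PROMPT = (
    "Based on the dialogue history, refine the user's current question and "
    'output JSON of the form {"new_question": "..."} with no extra explanation.'
)

def build_refine_messages(history, question):
    """Assemble a chat payload (OpenAI-style message list) for the refinement call."""
    lines = [f"Previous user question: {q}" for q in history]
    lines.append(f"Current user question: {question}")
    return [
        {"role": "system", "content": REFINE_SYSTEM_PROMPT},
        {"role": "user", "content": "\n".join(lines)},
    ]

def parse_refined(raw):
    """Extract the refined question; return None on malformed or empty output
    so the caller can fall back to the user's original question."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    refined = data.get("new_question", "").strip() if isinstance(data, dict) else ""
    return refined or None
```

The refined question (or the original, if parsing fails) is then embedded and used for semantic search exactly as in the single-turn case.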
Practical considerations
These optimizations trade answer quality for cost and latency:
Potentially verbose output due to multiple retrieved contexts.
Increased latency because of extra LLM calls for query expansion and re‑ranking.
Risk of exceeding the model’s context window and higher inference cost.
The next article will cover multimodal content handling and batch evaluation methods for RAG systems.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
