Advanced RAG Techniques: Boosting Retrieval with Query Translation and Decomposition

This article examines how retrieval-augmented generation suffers when queries are poorly formulated, and presents two strategies to address it: query translation, which generates multiple semantically similar variants of a query, and query decomposition, which breaks a complex question into finer sub-queries. Along the way it covers fan-out retrieval, reciprocal rank fusion, HyDE, step-back prompting, and chain-of-thought retrieval, and explains when to combine them.


Retrieval‑augmented generation (RAG) works by converting a user query into a vector embedding, retrieving similar documents from a vector database, and feeding those documents to a large language model (LLM) to generate an answer.

Basic RAG accuracy is limited by query quality; vague or poorly phrased queries lead to irrelevant retrieval results, causing “garbage in, garbage out”.
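
A minimal sketch of that loop in Python, assuming hypothetical `embed`, `vector_store.search`, and `llm` helpers rather than any specific library:

```python
# Minimal sketch of the basic RAG loop. `embed`, `vector_store.search`, and `llm`
# are hypothetical helpers standing in for a real embedding model, vector database,
# and LLM client.

def basic_rag(query: str, embed, vector_store, llm, top_k: int = 5) -> str:
    query_vector = embed(query)                              # 1. embed the user query
    documents = vector_store.search(query_vector, k=top_k)   # 2. nearest-neighbour retrieval
    context = "\n\n".join(doc.text for doc in documents)     # 3. assemble retrieved context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)                                       # 4. generate the grounded answer
```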

Query Translation

The core idea is to generate multiple semantically‑similar variants of the original query so that the vector search can match documents expressed in different ways, improving recall.

Example: for the question “How can RAG improve LLM response quality?” the system may expand it to:

How does retrieval‑augmented generation work?
Advantages of RAG for large language models
How does retrieval improve LLM accuracy?

These variants keep the original intent but use different wording and angles.
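
A sketch of how the variants might be generated, again treating `llm` as a hypothetical text-completion helper:

```python
# Sketch of query translation: ask the LLM for paraphrased variants of the query.

def translate_query(query: str, llm, n_variants: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following question in {n_variants} different ways, "
        "keeping the original intent. Return one rewrite per line.\n\n"
        f"Question: {query}"
    )
    variants = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return [query] + variants[:n_variants]   # keep the original query in the mix
```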

Fan‑Out Retrieval

In a fan‑out architecture the LLM first generates several query variants, each of which is sent concurrently to the vector store. The results are merged, duplicate documents are removed, and the final context is passed to the LLM. The six‑step workflow is:

User submits query.

LLM generates alternative queries.

All queries are executed in parallel against the vector store.

Results are merged.

Duplicate documents are filtered.

Merged context is fed to the LLM.

Because different phrasings occupy different regions in embedding space, they retrieve overlapping but not identical document sets; parallel execution exploits this diversity.
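
A sketch of the fan-out workflow, under the same assumptions (hypothetical `embed` and `vector_store.search` helpers, documents exposing an `id` attribute):

```python
# Sketch of fan-out retrieval: search all query variants in parallel,
# then merge the result lists and drop duplicate documents by id.
from concurrent.futures import ThreadPoolExecutor

def fan_out_retrieve(queries: list[str], embed, vector_store, top_k: int = 5) -> list:
    def search_one(q: str):
        return vector_store.search(embed(q), k=top_k)

    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(search_one, queries))   # parallel searches

    seen, merged = set(), []
    for results in result_lists:                             # merge and deduplicate
        for doc in results:
            if doc.id not in seen:
                seen.add(doc.id)
                merged.append(doc)
    return merged                                            # context handed to the LLM
```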

Reciprocal Rank Fusion (RRF)

When multiple retrieval streams return overlapping documents with different rankings, simple concatenation can drown out high-quality results. RRF re-scores each document based on its rank position in every list, using the formula

score(d) = Σ_i 1 / (k + rank_i(d))

where rank_i(d) is the document's position in result list i (starting at 1) and k is a smoothing constant, commonly set to 60.

Documents ranked near the top receive larger contributions, so documents that appear early in many lists accumulate the highest total scores, yielding a more reliable final context than naïve merging.
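
A sketch of the fusion step, assuming each result list is ordered best-first and documents expose a stable `id`:

```python
# Sketch of reciprocal rank fusion over several ranked result lists.

def reciprocal_rank_fusion(result_lists: list[list], k: int = 60) -> list:
    scores, docs_by_id = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
            docs_by_id[doc.id] = doc
    ranked_ids = sorted(scores, key=scores.get, reverse=True)  # highest fused score first
    return [docs_by_id[doc_id] for doc_id in ranked_ids]
```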

HyDE (Hypothetical Document Embedding)

HyDE addresses the root cause of inaccurate queries by first prompting the LLM to generate a hypothetical answer or document for the user query. This generated text is then embedded and used for similarity search. Because the synthetic document resembles real content, retrieval accuracy often improves. However, the approach depends on the LLM’s generation quality; small models may produce distorted hypotheses that hurt retrieval.
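
A sketch of the HyDE flow with the same hypothetical helpers:

```python
# Sketch of HyDE: embed a hypothetical answer instead of the raw query.

def hyde_retrieve(query: str, llm, embed, vector_store, top_k: int = 5) -> list:
    hypothetical = llm(
        f"Write a short passage that directly answers this question:\n{query}"
    )
    # Search with the embedding of the synthetic passage, not the query itself
    return vector_store.search(embed(hypothetical), k=top_k)
```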

Query Decomposition

Complex queries that contain multiple sub‑questions cannot be satisfied by a single retrieval pass; the result set is usually incomplete. Query decomposition splits the original query into finer‑grained sub‑queries, retrieves each separately, and merges the results. Decomposition can be high‑level (abstract) or low‑level (step‑wise).
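
A sketch of the decompose-retrieve-merge pattern, reusing the hypothetical `fan_out_retrieve` helper sketched above:

```python
# Sketch of query decomposition: split a complex query into standalone sub-queries,
# retrieve each one, and merge the deduplicated results.

def decompose_and_retrieve(query: str, llm, embed, vector_store) -> list:
    prompt = (
        "Break the following question into the smallest set of standalone "
        f"sub-questions, one per line:\n\n{query}"
    )
    sub_queries = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return fan_out_retrieve(sub_queries, embed, vector_store)
```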

High‑Level Decomposition (Step‑Back Prompting)

Step‑back prompting first asks a higher‑level question, retrieves context for that broader question, and then uses the retrieved information to answer the original specific query.

Example transformation:

Original: How can RAG improve LLM performance?
Step‑back: What limitations do LLMs have without external knowledge?

Establishing a cognitive framework first yields more complete contextual coverage.
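
A sketch of step-back prompting, where the LLM first produces the broader question:

```python
# Sketch of step-back prompting: retrieve context for a broader question,
# then answer the original, more specific query against that context.

def step_back_answer(query: str, llm, embed, vector_store, top_k: int = 5) -> str:
    step_back = llm(
        f"Rewrite this question as a more general, higher-level question:\n{query}"
    )
    broad_docs = vector_store.search(embed(step_back), k=top_k)
    context = "\n\n".join(doc.text for doc in broad_docs)
    return llm(f"Context:\n{context}\n\nUsing the context, answer: {query}")
```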

Low‑Level Decomposition (Chain‑of‑Thought Retrieval)

Chain‑of‑thought retrieval breaks a question into ordered sub‑steps, where the result of each step guides the next retrieval.

For “How does RAG work and how does it differ from fine‑tuning?” the process is:

What is retrieval‑augmented generation?

How does RAG work?

What is fine‑tuning in LLMs?

What are the differences between RAG and fine‑tuning?

Each sub‑query retrieves a focused set of documents; the accumulated understanding steers later steps, and the LLM finally integrates all contexts into a comprehensive answer. This sequential reasoning is especially effective for comparison‑type questions with large conceptual spans.
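
A sketch of this sequential pattern; the sub-query list would come from a decomposition step like the one above, and the note from each step steers the next retrieval:

```python
# Sketch of chain-of-thought retrieval: answer ordered sub-queries one at a time,
# letting each step's findings steer the next search and the final synthesis.

def chain_of_thought_answer(sub_queries: list[str], query: str,
                            llm, embed, vector_store, top_k: int = 3) -> str:
    notes = []
    for sub_q in sub_queries:
        # Fold the latest finding into the search text so results build on each other
        search_text = sub_q if not notes else f"{sub_q}\nKnown so far: {notes[-1]}"
        docs = vector_store.search(embed(search_text), k=top_k)
        context = "\n\n".join(doc.text for doc in docs)
        notes.append(llm(f"Context:\n{context}\n\nBriefly answer: {sub_q}"))
    findings = "\n".join(notes)
    return llm(f"Findings:\n{findings}\n\nNow answer the original question: {query}")
```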

Conclusion

Query translation and query decomposition are complementary rather than mutually exclusive. In practice systems combine fan‑out expansion, RRF re‑ranking, and decomposition pipelines to handle both broad coverage and deep, step‑wise reasoning. The optimal combination depends on typical query complexity, vector store size, and latency tolerance, requiring empirical measurement in the target environment.
