
Boosting RAG Performance: Query Translation & Decomposition Techniques

The article explains two emerging RAG query‑optimization approaches, query translation and query decomposition, detailing fan‑out retrieval, reciprocal rank fusion, HyDE, step‑back prompting, and chain‑of‑thought retrieval, and shows how combining them can improve retrieval relevance while staying within latency requirements in LLM‑augmented systems.

Data Party THU

Mar 23, 2026


Retrieval‑Augmented Generation (RAG) Overview

RAG first converts a user query into a vector embedding, retrieves the most similar documents from a vector store, and supplies those documents as context to a large language model (LLM) for answer generation. The quality of the final answer depends heavily on the relevance of the retrieved documents, which in turn is driven by the quality of the query representation.
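The flow above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a toy bag‑of‑words stand‑in for a real embedding model, the "vector store" is a plain Python list, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the retrieved documents and the question into one LLM prompt."""
    joined = "\n".join(f"- {doc}" for doc in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical three-document knowledge base.
docs = [
    "rag retrieves documents and passes them to the llm",
    "bananas are yellow",
    "fine tuning updates model weights",
]
question = "how does rag use documents"
print(build_prompt(question, retrieve(question, docs, top_k=1)))
```

Because ranking depends entirely on how the query is represented, any change to the query text changes what `retrieve` returns, which is the lever the techniques below pull on.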

Query Translation

Query translation expands a single user query into several semantically similar variants, increasing the chance of matching relevant documents.

Example variants for “How can RAG improve LLM response quality?”:

How does Retrieval‑Augmented Generation work?

Advantages of RAG for large language models

How does retrieval boost LLM accuracy?
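One way to produce such variants is to prompt an LLM and parse its reply. The sketch below covers only the prompt construction and parsing; the prompt wording and both function names are assumptions for illustration, and the LLM call itself is left out.

```python
import re


def build_translation_prompt(query: str, n: int = 3) -> str:
    """Build a prompt asking an LLM for n rewrites of the query (hypothetical wording)."""
    return (
        f"Rewrite the following question in {n} different ways that keep its meaning.\n"
        "Return one rewrite per line.\n\n"
        f"Question: {query}"
    )


def parse_variants(llm_output: str, original: str) -> list[str]:
    """Turn the LLM reply into a clean list of queries, original first."""
    variants = [original]
    seen = {original.strip().lower()}
    for line in llm_output.splitlines():
        # Strip list markers such as "1.", "2)", "-", "*" from the start of the line.
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            variants.append(cleaned)
    return variants


# A canned reply standing in for a real LLM response.
reply = (
    "1. How does Retrieval-Augmented Generation work?\n"
    "2. Advantages of RAG for large language models\n"
    "- How does retrieval boost LLM accuracy?\n"
)
for variant in parse_variants(reply, "How can RAG improve LLM response quality?"):
    print(variant)
```

Keeping the original query in the list is a design choice: if the rewrites drift from the user's intent, the original still contributes its own results.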

Fan‑Out Retrieval

The LLM generates multiple query variants, each sent in parallel to the vector database. The retrieved result sets are merged, duplicate documents are removed, and the consolidated context is passed back to the LLM.

1. The user submits the original query.

2. The LLM produces alternative queries.

3. All alternatives are executed concurrently as similarity searches.

4. Results are combined into a single list.

5. Duplicate documents are filtered out.

6. The final context is fed to the LLM for answer generation.
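The retrieval, merge, and de‑duplication steps above might look like the following sketch. It assumes a `search` callable that takes a query string and returns `(doc_id, text)` pairs; that signature, the `Doc` alias, and the toy index are all hypothetical stand‑ins for a real vector database client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Doc = tuple[str, str]  # (doc_id, text); a hypothetical record shape


def fan_out_retrieve(
    queries: list[str],
    search: Callable[[str], list[Doc]],
    max_workers: int = 4,
) -> list[Doc]:
    """Run one search per query variant concurrently, then merge and dedupe."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the result lists in the same order as the queries.
        result_lists = list(pool.map(search, queries))

    merged: list[Doc] = []
    seen_ids: set[str] = set()
    for results in result_lists:
        for doc_id, text in results:
            if doc_id not in seen_ids:  # keep the first occurrence only
                seen_ids.add(doc_id)
                merged.append((doc_id, text))
    return merged


# Toy index standing in for a vector database (hypothetical data).
FAKE_INDEX: dict[str, list[Doc]] = {
    "how does rag work": [("d1", "RAG pipeline overview"), ("d2", "Vector search basics")],
    "advantages of rag": [("d2", "Vector search basics"), ("d3", "Grounding reduces hallucination")],
}


def fake_search(query: str) -> list[Doc]:
    return FAKE_INDEX.get(query, [])


merged_docs = fan_out_retrieve(list(FAKE_INDEX), fake_search)
print([doc_id for doc_id, _ in merged_docs])  # → ['d1', 'd2', 'd3']
```

Threads suit this step because each search is typically an I/O‑bound network call. De‑duplicating by first occurrence discards rank information, which is the gap that rank‑based fusion addresses.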

Reciprocal Rank Fusion (RRF)

RRF re‑scores documents by their rank positions across multiple retrieval streams rather than by raw similarity scores, which may not be comparable between streams. The score for a document d is computed as:

$$\text{score}(d) = \sum_{i=1}^{N} \frac{1}{k + \text{rank}_i(d)}$$

where rank_i(d) is the 1‑based position of d in the i‑th result list, N is the number of streams, and k is a smoothing constant (commonly 60) that limits how much a single top‑ranked position can dominate the fused score. A stream that does not return d contributes nothing to its score. Documents that appear near the top in several streams accumulate higher scores, yielding a more reliable merged context.
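The formula translates almost directly into code. In this minimal sketch each stream is a list of document IDs ordered best‑first; the function name and the sample rankings are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; a document absent from a list adds nothing for it.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Three hypothetical result lists, one per query variant.
streams = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(streams)
print([doc_id for doc_id, _ in fused])  # → ['b', 'a', 'c', 'd']
```

Here "b" wins because it is ranked first twice and second once (1/61 + 1/61 + 1/62), edging out "a" (1/61 + 1/62 + 1/63), while "d" appears in only one stream and scores just 1/63.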

HyDE (Hypothetical Document Embeddings)

HyDE addresses poorly phrased queries by first prompting the LLM to generate a hypothetical answer document. The generated text is then embedded, and the embedding is used for similarity search against the vector store. This approach can improve retrieval relevance because the synthetic document is often stylistically closer to real knowledge base entries. However, its effectiveness depends on the LLM’s generation quality; smaller models may produce misleading hypotheses that degrade retrieval.
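A rough sketch of the idea follows, assuming a `generate` callable that wraps an LLM. The toy bag‑of‑words `embed`, the prompt wording, and the canned `fake_generate` reply are all hypothetical; a real system would embed the hypothetical passage with the same model used to index the knowledge base.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    docs: list[str],
    top_k: int = 1,
) -> list[str]:
    """Search with the embedding of a generated answer instead of the raw query."""
    hypothetical = generate(
        f"Write a short passage that answers the question.\nQuestion: {query}"
    )
    vector = embed(hypothetical)  # embed the hypothetical document, not the query
    ranked = sorted(docs, key=lambda d: cosine(vector, embed(d)), reverse=True)
    return ranked[:top_k]


def fake_generate(prompt: str) -> str:
    # Canned reply standing in for an LLM call.
    return "retrieval adds external documents so the model hallucination rate drops"


docs = [
    "fine tuning updates model weights on new data",
    "retrieval grounds the model in external documents and reduces hallucination",
]
print(hyde_retrieve("why rag good", fake_generate, docs))
```

In this toy setup the terse query "why rag good" shares no words with either document, while the generated passage overlaps heavily with the relevant one. The same mechanism explains the failure mode noted above: if the generated passage is off‑topic, the search follows it.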

Query Decomposition

Complex queries often contain multiple sub‑questions that cannot be satisfied by a single retrieval pass. Decomposition splits the original query into finer‑grained sub‑queries, retrieves each independently, and merges the results.

High‑Level Decomposition (Step‑Back Prompting)

Step‑back prompting first asks a more abstract, higher‑level question derived from the original query, retrieves context for that abstract question, and then uses the richer context to answer the specific original query. Example:

Original: “How can RAG improve LLM performance?”

Step‑back query: “What limitations do LLMs have without external knowledge?”
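The two‑stage flow can be sketched as below. The `llm` and `retrieve` callables, both prompt templates, and the fake implementations are assumptions for illustration. This version retrieves only for the step‑back question, as described above; a variant could also retrieve for the original question and combine both contexts.

```python
from typing import Callable

# Hypothetical prompt wording; tune for the model in use.
STEP_BACK_TEMPLATE = (
    "Rewrite the question as a more general question about the underlying "
    "concepts.\nQuestion: {question}\nGeneral question:"
)
ANSWER_TEMPLATE = (
    "Background:\n{context}\n\n"
    "Using the background above, answer the question.\nQuestion: {question}"
)


def step_back_answer(
    question: str,
    llm: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
) -> str:
    """Retrieve context for an abstracted question, then answer the original."""
    step_back_question = llm(STEP_BACK_TEMPLATE.format(question=question)).strip()
    context = retrieve(step_back_question)
    prompt = ANSWER_TEMPLATE.format(context="\n".join(context), question=question)
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    # Canned behaviour standing in for a real model.
    if prompt.startswith("Rewrite the question"):
        return "What limitations do LLMs have without external knowledge?"
    return "PROMPT SENT TO LLM:\n" + prompt


def fake_retrieve(query: str) -> list[str]:
    return [f"(passage retrieved for: {query})"]


print(step_back_answer("How can RAG improve LLM performance?", fake_llm, fake_retrieve))
```

The cost is one extra LLM call before retrieval, so this pattern trades latency for broader context.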

Low‑Level Decomposition (Chain‑of‑Thought Retrieval)

Chain‑of‑thought retrieval breaks a query into an ordered series of sub‑steps, where each step’s retrieval result informs the next step. For a question such as “How does RAG compare with fine‑tuning?”, the steps might be:

1. Understand the concept of RAG.

2. Retrieve detailed mechanics of RAG.

3. Retrieve information about fine‑tuning of LLMs.

4. Compare RAG with fine‑tuning.

Each sub‑step yields its own document set; the LLM integrates all contexts to produce a comprehensive answer.
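A minimal sketch of the sequential loop follows, assuming the sub‑steps have already been produced (for example by an LLM) and that a `retrieve` callable exists. Carrying findings forward by appending raw passages to the next query is a deliberate simplification; a real system would more likely have the LLM summarize or rewrite the query at each step.

```python
from typing import Callable


def chain_of_thought_retrieve(
    steps: list[str],
    retrieve: Callable[[str], list[str]],
) -> list[tuple[str, list[str]]]:
    """Retrieve for each sub-step in order, feeding earlier findings forward."""
    findings: list[str] = []
    trace: list[tuple[str, list[str]]] = []
    for step in steps:
        # Append what earlier steps found so this retrieval can build on it.
        query = step if not findings else f"{step} (context: {'; '.join(findings)})"
        docs = retrieve(query)
        trace.append((step, docs))
        findings.extend(docs)
    return trace


def fake_retrieve(query: str) -> list[str]:
    # Stand-in retriever: echoes the step so the data flow is visible.
    step_only = query.split(" (context:")[0]
    return [f"doc for '{step_only}'"]


steps = [
    "Understand the concept of RAG.",
    "Retrieve detailed mechanics of RAG.",
    "Retrieve information about fine-tuning of LLMs.",
    "Compare RAG with fine-tuning.",
]
for step, docs in chain_of_thought_retrieve(steps, fake_retrieve):
    print(step, "->", docs)
```

Unlike fan‑out retrieval, these steps cannot run in parallel, since each query depends on the previous results. The returned trace holds the per‑step document sets that the LLM would then integrate into the final answer.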

Practical Considerations

Query translation and query decomposition are complementary, and production systems often combine them: fan‑out expansion broadens coverage, RRF merges the resulting ranked lists without relying on raw similarity scores, and decomposition handles multi‑step reasoning for complex questions. Each technique adds LLM calls or retrieval passes, so the optimal configuration depends on query complexity, vector store size, and latency requirements, and should be evaluated empirically for each deployment.


