6 Practical Context‑Engineering Techniques to Tame RAG Hallucinations
This article explains why retrieval‑augmented generation (RAG) models often hallucinate, introduces the concept of context engineering, and details six practical techniques (selective retrieval, context compression, hierarchical layout, dynamic query rewriting, memory and state management, and tool‑aware context) along with their trade‑offs and real‑world impact.
What Is Context Engineering?
Context engineering is the practice of deciding, at runtime, what information an LLM sees, when it sees it, and in what structure, turning the context into a dynamic pipeline rather than a static prompt. It focuses on providing the right documents, compressing long texts into task‑specific summaries, re‑phrasing ambiguous queries, injecting cross‑session memory, anchoring answers with real‑time tools, and organizing inputs so the model knows what matters most.
Selective Retrieval: Stop Over‑Filling the Context
Feeding a large number of documents (e.g., 50) into the context leads to the "lost in the middle" effect, where the model attends mainly to the beginning and end of the token stream. The correct approach is to score, re‑rank, and prune documents so that only relevant, non‑redundant snippets enter the context window.
Relevance re‑ranking: An initial vector or keyword search returns the top 50 results; a cross‑encoder then reads the query and each document jointly to produce a more accurate ranking, keeping only the top 5.
Redundancy removal: Cluster embeddings and drop chunks with cosine similarity > 0.9, eliminating duplicate facts.
Task‑aware filtering: Use metadata (document type, last update, product version, region, department) to filter before retrieval.
Example: a query for the latest refund policy initially returns 50 mixed‑quality chunks, many outdated or contradictory. After applying region='CN' and updated_at >= 2025‑01‑01, only 10 remain; re‑ranking keeps 5, and redundancy removal leaves 3 high‑quality chunks, improving accuracy by 15‑30% and reducing token usage by 20‑40%.
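A minimal sketch of this filter, re-rank, and prune pipeline, assuming chunks arrive as dictionaries with text, region, and updated_at fields; the sentence-transformers model names and thresholds are illustrative, not prescriptive.

import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def select_chunks(query, chunks, region="CN", min_date="2025-01-01",
                  top_k=5, sim_threshold=0.9):
    # 1. Task-aware filtering: drop chunks on metadata before any scoring.
    candidates = [c for c in chunks
                  if c["region"] == region and c["updated_at"] >= min_date]

    # 2. Relevance re-ranking: the cross-encoder reads (query, chunk) pairs jointly.
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = [c for _, c in sorted(zip(scores, candidates),
                                   key=lambda pair: pair[0], reverse=True)][:top_k]

    # 3. Redundancy removal: skip chunks whose embedding is nearly identical
    #    (cosine similarity above the threshold) to one already kept.
    kept, kept_vecs = [], []
    for c in ranked:
        vec = embedder.encode(c["text"], normalize_embeddings=True)
        if all(float(np.dot(vec, v)) <= sim_threshold for v in kept_vecs):
            kept.append(c)
            kept_vecs.append(vec)
    return kept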
Context Compression: Make Every Token Count
Long documents exceed context limits and dilute attention. Compressing them into task‑focused dense summaries can cut 50‑75% of tokens while preserving or improving accuracy.
Constrained LLM summarization: Ask the model to summarize only facts after a certain date or about a specific topic.
Sentence‑level scoring: Use a small model (e.g., BERT) to score each sentence’s relevance to the query and keep the top 20% (Context‑Preserving Compression).
Hierarchical summarization: Summarize each chapter, then combine chapter summaries into a meta‑summary, selecting the appropriate level based on token budget.
Example: retrieving rate‑limit information from a 30‑page API doc. Three relevant chapters are compressed to 100, 100, and 300 tokens respectively, yielding a total of 500 tokens that contain only the needed numbers, avoiding the need to scan the entire document.
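A minimal sketch of the sentence-level scoring variant, using a small bi-encoder as a stand-in for the BERT-style scorer mentioned above; the model name and the naive sentence splitting are simplifying assumptions.

import numpy as np
from sentence_transformers import SentenceTransformer

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def compress(query, document, keep_ratio=0.2):
    # Split into sentences (naively on periods) and score each against the query.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    q_vec = scorer.encode(query, normalize_embeddings=True)
    s_vecs = scorer.encode(sentences, normalize_embeddings=True)
    scores = s_vecs @ q_vec  # cosine similarity per sentence

    # Keep the top 20% of sentences, restored to their original order.
    k = max(1, int(len(sentences) * keep_ratio))
    top_idx = sorted(np.argsort(scores)[-k:])
    return ". ".join(sentences[i] for i in top_idx) + "."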
Hierarchical Layout: Structure Conveys Importance
Instead of a wall of text, divide the context into clear sections (system rules, task description, user profile, retrieved documents, tool outputs). This mirrors the structure of research papers (abstract, intro, methods, discussion) and helps the model allocate attention correctly.
Typical layout:
[System Rules]
[Task]
[User Profile / Memory]
[Retrieved Context]
[Tool Outputs]
[Question]

Experiments show that a structured layout improves accuracy by 10‑20% across domains because the model sees instructions first, data last, matching its learned attention bias.
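A minimal sketch of assembling that layout programmatically; the section order mirrors the template above, and the argument names are illustrative.

def build_prompt(system_rules, task, memory, retrieved_chunks, tool_outputs, question):
    # Each section is labelled so the model can tell rules, data, and the
    # actual question apart; empty sections are simply omitted.
    sections = [
        ("System Rules", system_rules),
        ("Task", task),
        ("User Profile / Memory", memory),
        ("Retrieved Context", "\n\n".join(retrieved_chunks)),
        ("Tool Outputs", tool_outputs),
        ("Question", question),
    ]
    return "\n\n".join(f"[{name}]\n{body}" for name, body in sections if body)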
Dynamic Query Rewriting: Fix Vague Questions
Ambiguous user queries lack keywords, entities, or time ranges. Rewriting them into precise search queries before retrieval dramatically boosts relevance.
Clarification first: In multi‑turn agents, ask the user for missing details (time period, competitors, metrics).
HyDE (Hypothetical Document Embeddings): Generate a plausible answer, embed it, and use that embedding for retrieval.
Multi‑query expansion: Produce 3‑5 rewrites covering semantic variants, retrieve with each, then deduplicate.
Example: "How did we perform last quarter compared to competitors?" becomes "Compare Q4 2024 revenue growth of Company X vs competitors A, B, C in the internal financial report" before retrieval.
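A minimal sketch of multi-query expansion and HyDE, using the OpenAI chat API as one possible LLM backend; the model name and prompt wording are assumptions, and any chat-capable model would do.

from openai import OpenAI

client = OpenAI()

def llm(prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat-capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def expand_queries(question, n=3):
    # Multi-query expansion: n precise rewrites covering semantic variants,
    # each sent to retrieval before results are deduplicated.
    text = llm(f"Rewrite the question below into {n} precise search queries, one per "
               f"line, adding entities, metrics, and time ranges where implied:\n{question}")
    return [line.strip() for line in text.splitlines() if line.strip()]

def hyde_query(question):
    # HyDE: embed a short hypothetical answer instead of the raw question.
    return llm(f"Write a short, plausible answer to: {question}")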
Memory & State: Preserve Relationships, Not Just Facts
Retrieval answers the current question; memory stores the user's relationship to entities across sessions. Three memory types:
Scenario memory: Summarized past dialogues (≈200 tokens) highlighting key decisions.
Semantic memory: Vector‑stored past interactions that help discover patterns.
Preference memory: Stable facts such as user’s risk tolerance, preferred tech stack, or domain.
By compressing a 50‑turn conversation into a few scenario summaries and stable preferences, the agent avoids exceeding context limits while retaining personalization.
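A minimal sketch of a per-user memory store covering the three types; the field names are illustrative, and summarize stands in for whatever LLM call the pipeline already uses.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class UserMemory:
    summarize: Callable[[str], str]                        # LLM summarization hook
    scenario: list[str] = field(default_factory=list)      # ~200-token dialogue summaries
    semantic: list[str] = field(default_factory=list)      # raw transcripts, embedded and searched later
    preferences: dict[str, str] = field(default_factory=dict)  # stable facts (risk tolerance, preferred stack)

    def end_session(self, transcript: str) -> None:
        # Scenario memory: compress the finished dialogue into a short summary
        # that keeps only the key decisions.
        self.scenario.append(self.summarize(
            "Summarize the key decisions in this conversation in under 200 tokens:\n"
            + transcript))
        # Semantic memory: keep the raw transcript for later embedding and search.
        self.semantic.append(transcript)

    def context_block(self) -> str:
        # What actually enters the [User Profile / Memory] section of the prompt.
        prefs = "\n".join(f"- {k}: {v}" for k, v in self.preferences.items())
        return "Preferences:\n" + prefs + "\n\nPast sessions:\n" + "\n".join(self.scenario)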
Tool‑Aware Context: Anchor Answers in Reality
Integrating real‑time tool or API outputs (prices, weather, stock data), for example via the Model Context Protocol (MCP), reduces hallucinations. Tools should return structured JSON rather than raw text, and their results should be placed in a dedicated [Tool Outputs] block.
Example tool response:
{
  "get_live_price": {
    "symbol": "HDFCBANK",
    "price": 1842.50,
    "currency": "INR",
    "timestamp": "2025-02-19T14:30:00Z",
    "change": "+2.3%",
    "volume": 12500000
  },
  "get_news": {
    "articles": [
      {
        "headline": "HDFC Bank Announces ₹19 Dividend",
        "summary": "Board approves dividend of ₹19 per share for FY2024",
        "date": "2025-02-19",
        "source": "Economic Times"
      }
    ]
  }
}

When placed in the context, the model can answer with exact numbers without fabricating data.
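A minimal sketch of wrapping such a tool result in the dedicated [Tool Outputs] block; the payload shown is abbreviated from the example above, and the surrounding prompt text is an assumption.

import json

# Abbreviated version of the payload above; in practice the full JSON
# returned by the tools would be inserted verbatim.
tool_response = {
    "get_live_price": {"symbol": "HDFCBANK", "price": 1842.50, "currency": "INR"},
    "get_news": {"articles": [{"headline": "HDFC Bank Announces ₹19 Dividend"}]},
}

tool_block = "[Tool Outputs]\n" + json.dumps(tool_response, indent=2, ensure_ascii=False)
prompt = (
    "[System Rules]\nAnswer only from the provided context.\n\n"
    + tool_block
    + "\n\n[Question]\nHow did HDFC Bank trade today, and was there any news?"
)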
Decision Framework
Each technique has a cost‑benefit profile:
Selective retrieval: Best for large corpora (>1,000 docs) or when the context nears its token limit.
Compression: Ideal for very long documents (>5,000 tokens) where token cost is a concern.
Hierarchical layout: Suits multi‑agent systems or contexts with mixed sources.
Query rewriting: Works when users pose vague or domain‑specific questions.
Memory: Essential for conversational agents with repeated interactions.
Tool‑aware context: Required when answers depend on up‑to‑date data.
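A minimal sketch of that decision framework as a checklist function; the thresholds mirror the rules of thumb above, and the input flags are assumptions about what a team would measure for its own pipeline.

def recommend_techniques(corpus_size, avg_doc_tokens, multi_agent,
                         vague_queries, conversational, needs_live_data):
    # Map the rules of thumb above onto a pipeline's measurable traits.
    picks = []
    if corpus_size > 1000:
        picks.append("selective retrieval")
    if avg_doc_tokens > 5000:
        picks.append("context compression")
    if multi_agent:
        picks.append("hierarchical layout")
    if vague_queries:
        picks.append("dynamic query rewriting")
    if conversational:
        picks.append("memory & state")
    if needs_live_data:
        picks.append("tool-aware context")
    return picks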
Conclusion
All six context‑engineering techniques improve RAG reliability, but each adds computational overhead (re‑ranking, extra LLM calls, storage, API latency). Teams should balance accuracy gains against cost, often opting for a simpler pipeline that meets business constraints.