Mastering RAG Prompt Engineering: Prevent Hallucinations and Boost Accuracy

This article dissects the unique challenges of RAG prompting, presents a systematic System/User Prompt design with strong constraints and citation requirements, compares constraint strengths with quantitative hallucination rates, and offers long‑context compression strategies and rigorous testing methods to ensure reliable LLM answers.


1. Why RAG Prompting Differs from Ordinary Prompting

Unlike standard LLM prompts that let the model rely on its pre‑trained knowledge, RAG prompts must force the model to answer solely from the retrieved documents. If the model mixes its parametric knowledge with the retrieved text, hallucinations occur.

In a financial‑insurance Q&A system covering 5,000 contracts, retrieval for a question about the waiting period returned the correct document stating 180 days, yet a weak prompt led the model to answer 90 days, mixing in external knowledge and risking real‑world errors.

2. System Prompt Design: Role + Rules + Negative Constraints

Role definition: "You are an intelligent customer‑service assistant for a financial‑insurance company." This bounds the model’s domain.

Key rule: "You can ONLY answer based on the provided reference documents." The word "only" is crucial.

Constraint list (each explained):

Positive constraint: Use only information explicitly mentioned in the reference documents.

Rejection clause: If the documents lack sufficient information, state that you cannot answer fully.

Citation requirement: Cite sources in the format [Source: Document Name page X] to force the model to locate the exact text.

Numeric verification: All numbers must match the original document exactly.

Confidence marker: Prefix uncertain statements with "According to the document".

These rules are written in strong language ("only", "must not") to minimize flexibility.
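
Assembled from the role, key rule, and constraints above, a minimal sketch of such a System Prompt might look like the following (the exact wording is illustrative, not the verbatim production prompt):

SYSTEM_PROMPT = """You are an intelligent customer-service assistant for a financial-insurance company.

Rules:
1. You can ONLY answer based on the provided reference documents.
2. Use only information explicitly mentioned in the reference documents.
3. If the documents lack sufficient information, state that you cannot answer fully.
4. Cite sources in the format [Source: Document Name page X].
5. All numbers must match the original document exactly.
6. Prefix uncertain statements with "According to the document".
"""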

3. User Prompt Design: Structured Question + Context

The User Prompt formats each retrieved snippet with an index and source, separates snippets with double newlines, and places the user question after the context. Example implementation:

def build_rag_prompt(query: str, retrieved_docs: list) -> str:
    # Number each snippet and attach its source so the model has citation anchors.
    context_parts = []
    for i, doc in enumerate(retrieved_docs, 1):
        context_parts.append(f"Reference Document {i} (source: {doc['source']}):\n{doc['text']}")
    # Double newlines keep document boundaries unambiguous.
    context = "\n\n".join(context_parts)
    # The question goes last, followed by a repeated "answer from the documents" instruction.
    user_prompt = (
        f"Reference Documents:\n{context}\n\n"
        f"User Question: {query}\n\n"
        f"Please answer based on the above documents. If insufficient, say so."
    )
    return user_prompt

Key details:

Numbered documents give clear citation anchors.

Double‑newline separation reduces boundary confusion.

Placing the question last lowers the chance of the model activating its parametric knowledge before reading the context.

Repeating the "answer based on documents" instruction at the end reinforces the constraint.
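
For illustration, calling build_rag_prompt with two made‑up snippets shows the resulting layout (document names and contents below are hypothetical):

docs = [
    {"source": "Policy Handbook page 12",
     "text": "The waiting period for critical-illness coverage is 180 days."},
    {"source": "Claims Guide page 4",
     "text": "Claims must be filed within 30 days of diagnosis."},
]
print(build_rag_prompt("How long is the waiting period?", docs))
# The two numbered documents appear first, separated by blank lines, and the
# question comes last, followed by the repeated "answer from the documents" instruction.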

4. Constraint Strength Comparison

Four formulations were tested on 200 samples:

Weak: "Reference the above documents when answering." – Hallucination rate 18%.

Medium: "Primarily use the reference documents and try not to add external content." – Hallucination rate 12%.

Strong: "Only use information explicitly mentioned in the reference documents; do not answer anything not covered." – Hallucination rate 9%.

Strong + Citation: Same strong wording plus mandatory source citation – Hallucination rate drops to 7%, citation coverage rises from 32% to 91%.

A bad‑case example shows the weak prompt inventing a 30‑day waiting period, while the strong + citation prompt returns the correct value with proper sources.
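
One way to reproduce this kind of comparison is to keep everything else fixed and swap only the constraint sentence; the sketch below assumes you supply your own pipeline and hallucination check as callables (answer_fn and is_hallucinated are hypothetical hooks, not part of the article's code):

from typing import Callable

CONSTRAINT_VARIANTS = {
    "weak": "Reference the above documents when answering.",
    "medium": "Primarily use the reference documents and try not to add external content.",
    "strong": ("Only use information explicitly mentioned in the reference documents; "
               "do not answer anything not covered."),
    "strong_cite": ("Only use information explicitly mentioned in the reference documents; "
                    "do not answer anything not covered. "
                    "Cite sources in the format [Source: Document Name page X]."),
}

def hallucination_rate(samples: list, constraint: str,
                       answer_fn: Callable[[dict, str], str],
                       is_hallucinated: Callable[[str, dict], bool]) -> float:
    # answer_fn runs the full RAG pipeline with `constraint` appended to the System Prompt;
    # is_hallucinated compares the answer against the sample's gold evidence.
    flags = [is_hallucinated(answer_fn(s, constraint), s) for s in samples]
    return sum(flags) / len(flags)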

5. Handling Over‑Long Contexts

When retrieved snippets exceed the model’s token window (e.g., 10‑15 fragments > 32K tokens), three strategies are used:

Truncation: Keep only the top‑3 most relevant snippets – simple but may lose crucial info.

Compression (preferred): Use an LLM to summarise each fragment in a query‑aware manner, preserving only information relevant to the current question.

def compress_context(docs: list, query: str, max_tokens: int = 2000) -> str:
    compressed = []
    for doc in docs:
        if len(doc['text']) > 500:
            # Query-aware summarisation: keep only what is relevant to the question.
            summary = llm.invoke(
                f"Extract key information related to '{query}' from the following text "
                f"(max 50 characters):\n{doc['text']}"
            )
            compressed.append(summary.content)
        else:
            # Short snippets are kept verbatim.
            compressed.append(doc['text'])
    return "\n\n".join(compressed)

Batch processing: Split snippets into batches, get intermediate answers, then merge them with a final LLM call – highest latency but handles extreme lengths.
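
A minimal sketch of the batch strategy, assuming the same llm client used in compress_context above (batch size and prompt wording are illustrative):

def answer_in_batches(docs: list, query: str, batch_size: int = 5) -> str:
    # Stage 1: answer the question against each batch of snippets separately.
    partial_answers = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        context = "\n\n".join(d['text'] for d in batch)
        resp = llm.invoke(
            f"Reference Documents:\n{context}\n\nUser Question: {query}\n"
            f"Answer only from the documents above; if they are insufficient, say so."
        )
        partial_answers.append(resp.content)
    # Stage 2: merge the intermediate answers with one final call.
    merged = "\n\n".join(partial_answers)
    final = llm.invoke(
        f"Combine the partial answers below into one consistent answer to '{query}'. "
        f"Discard anything they do not support.\n\n{merged}"
    )
    return final.content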

These methods raised successful request handling from 45% to 92% for >32K token inputs.

[Figure: hallucination rate before and after prompt optimization]

6. Systematic Prompt Evaluation

Prompt versions are stored in a Git repository with clear commit messages describing changes and metric impacts. Evaluation metrics focus on faithfulness – the proportion of answers that can be directly traced to the retrieved documents.

Test set composition (200 samples):

Questions with a clear answer in the documents.

Questions with partial answers.

Questions with no answer (model should refuse).

Measuring faithfulness across all three categories reveals hidden failure modes where models fabricate answers when no source exists.
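
One simple way to compute faithfulness per category is to tag each sample with its type and score answers with a grounding check; in the sketch below, answer_fn and is_faithful are hypothetical hooks for your pipeline and your judging method (e.g., an LLM‑as‑judge or string‑level evidence matching):

from collections import defaultdict
from typing import Callable

def faithfulness_by_category(samples: list,
                             answer_fn: Callable[[dict], str],
                             is_faithful: Callable[[str, dict], bool]) -> dict:
    # Each sample carries 'question', 'docs', and a 'category' in
    # {"answerable", "partial", "unanswerable"}.
    scores = defaultdict(list)
    for s in samples:
        answer = answer_fn(s)
        # For "unanswerable" samples, is_faithful should return True only when
        # the model correctly refuses instead of fabricating an answer.
        scores[s['category']].append(is_faithful(answer, s))
    return {cat: sum(flags) / len(flags) for cat, flags in scores.items()}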

7. How to Answer RAG Prompt Design in an Interview

A concise 60‑second framework:

Explain the unique RAG challenge – cutting off parametric knowledge.

Describe System Prompt responsibilities (role, strong constraints, citation).

Describe User Prompt responsibilities (structured context, ordering).

Compare weak vs. strong constraints and highlight citation impact on hallucination rates.

Briefly mention long‑context strategies (query‑aware compression).

Note the testing loop: version control, 200‑sample benchmark, faithfulness metric, bad‑case analysis.

This demonstrates both conceptual understanding and practical experience.

Conclusion

Effective RAG prompting requires a layered approach: a System Prompt that enforces role, strong “only” constraints, and mandatory citations; a User Prompt that cleanly packages retrieved snippets and the query; robust handling of long contexts via compression or batching; and a disciplined evaluation pipeline using versioned prompts and faithfulness metrics. Mastering these steps turns RAG from a fragile prototype into a reliable production‑grade AI assistant.

Tags: LLM, RAG, Context Compression, hallucination reduction, system prompt, User Prompt
Written by

Wu Shixiong's Large Model Academy

We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.
