How to Jump‑Start a RAG System Without Any Labeled Data

Building a Retrieval‑Augmented Generation (RAG) system from scratch without existing QA pairs requires a systematic cold‑start approach that creates synthetic QA data, establishes baseline metrics, iteratively improves via expert labeling and real user feedback, and ensures document quality for reliable evaluation.

Wu Shixiong's Large Model Academy

1. The Real Challenge of Cold‑Start: No Answers, Yet Evaluation Needed

Setting up a RAG pipeline is technically easy—chunk documents, embed them, store them, and connect an LLM. The true difficulty lies in assessing the system’s current quality and identifying optimization directions when no historical QA pairs exist.

Missing evaluation baselines include:

Recall@K – whether the correct document is retrieved.

Faithfulness – how much the LLM’s answer drifts from the source.

Iteration reference – how to tell if a parameter change improves or degrades performance.

Relying on ad-hoc manual spot checks leads to biased, overly simple test cases and hides the difficulties real users actually face. The core cold-start task is therefore to build a reliable evaluation benchmark quickly, without any labeled data.

2. Synthetic QA Generation: Creating an Evaluation Benchmark from Documents

When no QA pairs are available, generate them automatically. Feed each document chunk to an LLM and ask it to produce 2‑3 realistic user questions with exact answers sourced from the chunk. Record the question, answer, and evidence as a triple.

import json

def generate_qa_from_chunks(
    chunks: list,
    llm,
    questions_per_chunk: int = 2
) -> list:
    """Generate QA pairs from document chunks for cold-start evaluation."""
    qa_pairs = []
    for chunk in chunks:
        prompt = f"""Based on the following document content, generate {questions_per_chunk} questions that real users might plausibly ask, along with the exact answers.

Document content:
{chunk['content']}

Requirements:
1. Each question must be a concrete question with an explicit answer in the document; do not ask vague, summarizing questions.
2. Questions should mimic how real users phrase things (colloquial, grounded in a scenario).
3. Answers must come entirely from the document; do not add anything that is not in it.
4. If the content is not suitable for meaningful questions, return an empty list.

Return JSON in this format:
[
   {{"question": "...", "answer": "...", "evidence": "key sentence from the document"}},
   ...
]"""
        result = llm.generate(prompt)
        try:
            pairs = json.loads(result)
        except json.JSONDecodeError:
            # Skip chunks whose output is not valid JSON.
            continue
        for pair in pairs:
            qa_pairs.append({
                "question": pair["question"],
                "answer": pair["answer"],
                "evidence": pair["evidence"],
                "source_chunk_id": chunk["id"],
                "source_doc_id": chunk["doc_id"],
                "generated": True
            })
    return qa_pairs

In a financial‑insurance training project with 5,000 contract documents, this process produced roughly 8,000 candidate QA pairs; after filtering low‑quality and duplicate items, 2,100 pairs formed the initial evaluation set. The entire step required about three hours of LLM calls and no human labeling.
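
The filtering pass itself can stay simple: drop pairs with missing fields and near-duplicate questions. A minimal sketch, assuming an embed helper that returns normalized sentence-embedding vectors and an illustrative similarity threshold:

import numpy as np

def filter_synthetic_qa(qa_pairs: list, embed, sim_threshold: float = 0.9) -> list:
    """Drop incomplete pairs and near-duplicate questions from the synthetic candidates."""
    kept, kept_vecs = [], []
    for qa in qa_pairs:
        # Basic sanity check: every field the evaluation set relies on must be present.
        if not qa.get("question") or not qa.get("answer") or not qa.get("evidence"):
            continue
        vec = embed(qa["question"])
        # Near-duplicate check: cosine similarity against questions already kept.
        if any(float(np.dot(vec, v)) > sim_threshold for v in kept_vecs):
            continue
        kept.append(qa)
        kept_vecs.append(vec)
    return kept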

Synthetic QA pairs are biased toward questions that have explicit answers in the source text, while real users often ask cross‑document, reasoning‑heavy, or unanswerable questions. Recognize this as a starting point, not a final benchmark.

Figure 1

3. Quickly Establishing a Baseline

With the initial evaluation set, run a simple baseline to know the system’s starting point. The baseline does not need to be optimal—just measurable.

Step 1: Run the simplest configuration. Use default chunk size (512 tokens), a basic embedding model (e.g., text‑embedding‑3‑small), Top‑5 retrieval, no reranking, and a standard prompt. This “naïve RAG” becomes the reference.
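
Written down as configuration, the baseline is nothing more than explicit defaults that later experiments can be diffed against (the field names below are illustrative, and the zero chunk overlap is an assumption):

BASELINE_CONFIG = {
    "chunk_size": 512,                            # fixed-size token chunks
    "chunk_overlap": 0,                           # assumed: no overlap in the naive baseline
    "embedding_model": "text-embedding-3-small",  # basic embedding model
    "top_k": 5,                                   # Top-5 retrieval
    "reranker": None,                             # no reranking
    "prompt": "standard answer-from-context",     # plain prompt, no tricks
}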

Step 2: Evaluate on the 2,100 synthetic QA pairs and record three core metrics:

Context Recall: whether the document containing the correct answer was retrieved.

Faithfulness: whether the generated answer stays faithful to the retrieved document.

Answer Correctness: end-to-end correctness of the final answer.

Step 3: Identify bottlenecks and set optimization priorities. If Context Recall is low, focus on retrieval improvements; if Faithfulness is low, improve the generation side. In most cold‑starts, retrieval is the weakest link.

In the training project, the baseline yielded Context Recall 0.67, Faithfulness 0.71, and Answer Correctness 0.58, clearly indicating that retrieval needed the first boost.
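
A minimal loop for producing these numbers might look like the sketch below. Context Recall falls out of the source_chunk_id recorded during generation; Faithfulness and Answer Correctness are scored here through hypothetical judge_faithfulness and judge_correctness helpers (an LLM-as-judge prompt or a ragas-style metric would slot in there):

def evaluate_baseline(qa_pairs: list, rag_system, llm) -> dict:
    """Run the synthetic QA set through the RAG system and aggregate the three core metrics."""
    recall_hits, faith_scores, correct_scores = 0, [], []
    for qa in qa_pairs:
        result = rag_system.query(qa["question"])
        retrieved_ids = {chunk["id"] for chunk in result["retrieved_chunks"]}
        # Context Recall: the chunk this QA pair was generated from must be among the retrieved chunks.
        recall_hits += int(qa["source_chunk_id"] in retrieved_ids)
        # Assumed LLM-as-judge helpers returning scores in [0, 1].
        faith_scores.append(judge_faithfulness(result["answer"], result["retrieved_chunks"], llm))
        correct_scores.append(judge_correctness(result["answer"], qa["answer"], llm))
    n = len(qa_pairs)
    return {
        "context_recall": recall_hits / n,
        "faithfulness": sum(faith_scores) / n,
        "answer_correctness": sum(correct_scores) / n,
    }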

4. Iterative Cold‑Start Strategy: From Synthetic to Real Data

Cold‑start is not a one‑off task; it evolves through three phases.

Phase 1 (Weeks 1‑2): Pure synthetic data. Use LLM‑generated QA pairs to run the system, establish tooling, and find a coarse chunking and retrieval strategy.

Phase 2 (Weeks 3‑4): Expert‑driven micro‑annotation. Recruit 3‑5 domain experts to label 20‑30 high‑impact questions each, covering the difficult cases that synthetic data missed. The resulting set of 100‑150 high‑quality pairs becomes the “golden test set.” When the annotation budget is this small, candidates are worth ranking so that experts see the most uncertain and most clearly wrong answers first:

def prioritize_annotation_candidates(
    qa_pairs: list,
    rag_system,
    llm
) -> list:
    """Select the most valuable annotation candidates when budget is limited."""
    candidates = []
    for qa in qa_pairs:
        result = rag_system.query(qa["question"])
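        # estimate_answer_confidence / evaluate_answer_correctness are assumed LLM-as-judge
        # helpers returning scores in [0, 1], defined elsewhere in the evaluation tooling.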
        confidence = estimate_answer_confidence(
            result["answer"],
            result["retrieved_chunks"],
            llm
        )
        correctness = evaluate_answer_correctness(
            result["answer"],
            qa["answer"],
            llm
        )
        candidates.append({
            **qa,
            "current_answer": result["answer"],
            "confidence": confidence,
            "correctness": correctness,
            "annotation_priority": (1 - confidence) + (1 - correctness)
        })
    return sorted(candidates, key=lambda x: x["annotation_priority"], reverse=True)
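
In practice only the head of that ranking goes to the experts, for example:

# Hand the highest-priority items (lowest confidence, lowest correctness) to the 3-5 experts.
candidates = prioritize_annotation_candidates(qa_pairs, rag_system, llm)
golden_candidates = candidates[:150]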

Phase 3 (Ongoing): Real user data takeover. After deployment, continuously collect real user queries and feedback. Each month add 50‑100 high‑quality real examples to the evaluation set and retire overlapping synthetic items.
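
Retiring overlapping synthetic items can reuse the same embedding-similarity idea; the sketch below assumes an embed helper returning normalized vectors and an illustrative overlap threshold:

import numpy as np

def merge_real_examples(eval_set: list, real_examples: list, embed, overlap_threshold: float = 0.85) -> list:
    """Add curated real-user examples and retire synthetic items they make redundant."""
    real_vecs = [embed(ex["question"]) for ex in real_examples]
    kept = []
    for qa in eval_set:
        if qa.get("generated"):
            vec = embed(qa["question"])
            # Retire a synthetic item when a real example asks essentially the same question.
            if any(float(np.dot(vec, rv)) > overlap_threshold for rv in real_vecs):
                continue
        kept.append(qa)
    return kept + real_examples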

The overarching logic is: synthetic data creates the initial “from zero to something” foundation, expert annotation fills blind spots, and real data ensures long‑term reliability.

Figure 2

5. Document Quality: The Key to Cold‑Start Success

Many teams blame poor retrieval when the real culprit is low‑quality source documents. Four typical problems and their remedies are:

OCR errors in scanned PDFs. Use a dedicated OCR tool (e.g., PaddleOCR) and clean the extracted text before indexing.

Broken document structure. Replace simple PDF parsers with structure‑aware tools (e.g., unstructured.io) to preserve headings, tables, and clause numbers.

Chunk boundaries cutting through critical information. Split by semantic paragraphs and merge based on similarity rather than fixed token counts (see the sketch after this list).

Duplicate or outdated versions. Implement version control to avoid contradictory answers from multiple document editions.
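
For the chunk-boundary problem mentioned above, one way to split by semantic paragraphs is to merge adjacent paragraphs while they remain similar and under the size budget. A rough sketch (embed, the word-count proxy for tokens, and the merge threshold are all illustrative assumptions):

import numpy as np

def semantic_chunks(paragraphs: list, embed, max_tokens: int = 512, merge_threshold: float = 0.75) -> list:
    """Merge adjacent paragraphs into chunks while they stay semantically similar and within the size budget."""
    chunks, current, current_vec = [], [], None
    for para in paragraphs:
        vec = embed(para)
        # Rough size estimate; a real tokenizer would be more accurate.
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and (float(np.dot(current_vec, vec)) < merge_threshold or size > max_tokens):
            chunks.append("\n".join(current))
            current, current_vec = [], None
        current.append(para)
        if current_vec is None:
            # Anchor the chunk on its first paragraph's vector (simplest possible choice).
            current_vec = vec
    if current:
        chunks.append("\n".join(current))
    return chunks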

In the same training project, a three‑day document‑quality remediation (OCR re‑processing, structural parsing, deduplication, version management) lifted Context Recall from 0.67 to 0.79, outperforming any retrieval‑algorithm tweak.

import re

def assess_document_quality(doc_path: str) -> dict:
    """Quickly evaluate document quality and flag issues before ingestion."""
    issues = []
    # extract_text / count_garbled_chars are text-extraction helpers defined elsewhere in the pipeline.
    content = extract_text(doc_path)
    garbled_ratio = count_garbled_chars(content) / max(len(content), 1)
    if garbled_ratio > 0.02:
        issues.append({
            "type": "ocr_quality",
            "severity": "high" if garbled_ratio > 0.05 else "medium",
            "detail": f"Garbled character ratio {garbled_ratio:.1%}, recommend re-OCR"
        })
    # Clause numbers ("Article N" in Chinese contracts) without any pipe-delimited rows
    # suggest that tables or document hierarchy were flattened during extraction.
    has_clause_numbers = bool(re.search(r'第[一二三四五六七八九十\d]+条', content))
    has_raw_table = bool(re.search(r'\|\s*\w+\s*\|', content))
    if has_clause_numbers and not has_raw_table:
        issues.append({
            "type": "structure_loss",
            "severity": "medium",
            "detail": "Detected clause numbers without table structure, possible loss of hierarchy"
        })
    return {
        "doc_path": doc_path,
        "quality_score": 1.0 - min(garbled_ratio * 10, 0.5),
        "issues": issues,
        "recommended_action": "reprocess" if any(i["severity"] == "high" for i in issues) else "ok"
    }
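
A typical use is to gate ingestion on the report (doc_paths here stands for whatever list of source files the pipeline ingests):

# Index only documents that pass the quality gate; send the rest back for reprocessing.
reports = [assess_document_quality(path) for path in doc_paths]
needs_reprocessing = [r["doc_path"] for r in reports if r["recommended_action"] == "reprocess"]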

Figure 3

6. How to Answer the RAG Cold‑Start Interview Question

Start by stating the real difficulty (≈20 seconds): the missing evaluation baseline, not just “connecting documents to an LLM.”

Explain synthetic QA generation, its workflow, filtering, and inherent bias (≈1 minute).

Outline the three‑phase iterative strategy: synthetic → expert‑annotated → real‑user data (≈1 minute).

Conclude with document‑quality governance, citing the four quality dimensions and the observed Recall improvement (≈30 seconds).

Conclusion

Cold‑starting a RAG system is a common real‑world scenario that is rarely covered in textbook examples. The ability to quickly build an evaluation benchmark from raw documents, identify optimization directions, and iteratively replace synthetic data with high‑quality human feedback distinguishes a competent engineer and satisfies interview expectations.
