How the DB3 Team Won the Meta CRAG RAG Challenge: Prompts, Retrieval, and LoRA Fine‑Tuning
This article analyzes the Meta Comprehensive RAG (CRAG) benchmark, detailing its three tasks, evaluation metrics, and the champion DB3 team's end‑to‑end solution that combines data preprocessing, dual‑stage retrieval, prompt engineering, LoRA‑based fine‑tuning, and public data augmentation to achieve top scores across all tasks.
Background
GPT‑4’s factual accuracy on questions about rapidly changing information is often below 35%. Large language models (LLMs) hallucinate because of biased training data, limited context understanding, and knowledge‑representation constraints. Reducing hallucinations is essential for trustworthy LLM‑based agents.
CRAG Benchmark Overview
The Meta Comprehensive RAG (CRAG) challenge provides a rigorous evaluation protocol for Retrieval‑Augmented Generation (RAG) systems. It covers five domains, eight question types, and a mix of head, torso, and tail entities to test reasoning and synthesis. Each query has a 30‑second time budget.
Task 1 – Web‑Based Retrieval Summarization
Participants receive five webpages per question and must extract and summarize the relevant information.
Task 2 – Knowledge‑Graph & Web Fusion
A simulated API gives access to a domain‑specific knowledge graph (KG). Participants query the KG and combine the structured results with web data.
Task 3 – End‑to‑End RAG
Each question comes with 50 webpages plus API access, increasing noise and requiring efficient selection of the most useful pieces.
Evaluation Metrics
Perfect: correct answer with no hallucination.
Acceptable: useful answer with minor errors.
Missing: no concrete answer (e.g., “I don’t know”).
Incorrect: wrong or irrelevant answer.
Scoring: Perfect = 1, Acceptable = 0.5, Missing = 0, Incorrect = ‑1, so a wrong answer costs more than abstaining. The overall score is a macro‑average weighted by entity popularity (weights undisclosed).
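The scoring rule can be sketched in a few lines. This is a minimal illustration that takes a plain unweighted mean; the popularity weights are undisclosed, so they are left out, and the function name and label strings are assumptions made for the example.

```python
# Per-answer scores as listed in the article.
SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}

def mean_score(labels):
    """Average the per-answer scores for a list of grader labels (unweighted)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(SCORES[label] for label in labels) / len(labels)

# Two systems that differ only in how they handle the question they cannot answer.
abstains = ["perfect", "perfect", "acceptable", "missing"]
guesses = ["perfect", "perfect", "acceptable", "incorrect"]
print(mean_score(abstains))  # (1 + 1 + 0.5 + 0) / 4 = 0.625
print(mean_score(guesses))   # (1 + 1 + 0.5 - 1) / 4 = 0.375
```

Because an incorrect answer scores ‑1, replacing a wrong guess with “I don’t know” raises the average, which is why refusal behavior matters so much in this benchmark.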
Champion Solution (DB3 Team)
The DB3 team from Peking University achieved first place on all three tasks, with scores of 28.4%, 42.7%, and 47.8%, respectively.
Task 1 Pipeline
Data preprocessing: BeautifulSoup extracts the raw text, CharacterTextSplitter (LangChain) splits it into child chunks (~200 tokens) and parent chunks (~700 tokens), and ParentDocumentRetriever preserves the parent‑child relationships.
Retriever: bge‑base‑en‑v1.5 retrieves the top‑50 passages. The parent_chunk_size setting determines how many parent chunks are fed to the LLM (e.g., size 2000 → 5 chunks, size 1000 → 10 chunks).
Reranker: bge‑reranker‑v2‑m3.
Public data augmentation: Domain‑specific tables are pre‑processed into natural‑language statements keyed by entity. The movie domain uses Oscar awards plus the full MovieLens data; the finance domain uses US stock PE, market cap, and EPS; the music domain uses Grammy awards.
Prompt engineering & SFT: The basic prompt includes token_limit, query_time, and a <doc> token holding the concatenated public and web retrieval results (truncated to 4000 tokens). Controlled prompts instruct the model to refuse to answer when the question is invalid or the knowledge is absent. SFT labels are “invalid question”, the ground‑truth answer, or “I don’t know”. LoRA adapters fine‑tune Llama‑3‑8B‑instruct, and keeping multiple adapters enables rapid switching between sub‑tasks. Inference is accelerated with vLLM (compatibility issues were noted).
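The data flow through these stages can be sketched without any model dependencies. The code below is an illustration under stated assumptions, not the team's implementation: whitespace‑separated words stand in for tokens, a toy word‑overlap score stands in for both bge‑base‑en‑v1.5 and bge‑reranker‑v2‑m3, and the prompt wording, the closing </doc> marker, the max_parents cap, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical template; the article only names token_limit, query_time and <doc>.
PROMPT = (
    "Answer using only the references. Reply 'invalid question' if the premise "
    "is false and 'I don't know' if the references lack the answer.\n"
    "Answer in at most {token_limit} tokens. Current time: {query_time}\n"
    "<doc>\n{doc}\n</doc>\nQuestion: {query}\n"
)

def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def build_parent_child(text, parent_size=700, child_size=200):
    """Return (parents, children); each child records its parent's index."""
    tokens = text.split()  # assumption: whitespace words approximate tokens
    parents, children = [], []
    for parent_id, parent_tokens in enumerate(chunk(tokens, parent_size)):
        parents.append(" ".join(parent_tokens))
        for child_tokens in chunk(parent_tokens, child_size):
            children.append({"parent_id": parent_id, "text": " ".join(child_tokens)})
    return parents, children

def overlap_score(query, text):
    """Toy relevance score (shared lowercase words); stands in for the real models."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def select_parents(query, children, top_k=50, max_parents=5):
    """Rank children, keep the top_k, then return distinct parent ids best-first."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c["text"]), reverse=True)
    selected = []
    for child in ranked[:top_k]:
        if child["parent_id"] not in selected:
            selected.append(child["parent_id"])
        if len(selected) >= max_parents:
            break
    return selected[:max_parents]

def rows_to_statements(rows, entity_key, template):
    """Turn table rows into sentences grouped by lowercased entity name."""
    index = defaultdict(list)
    for row in rows:
        index[row[entity_key].lower()].append(template.format(**row))
    return index

def build_prompt(query, query_time, public_facts, web_passages, token_limit, doc_limit=4000):
    """Concatenate public and web evidence, truncate it, and fill the template."""
    doc_tokens = " ".join(public_facts + web_passages).split()[:doc_limit]
    return PROMPT.format(token_limit=token_limit, query_time=query_time,
                         doc=" ".join(doc_tokens), query=query)

# Made-up page and table row, used only to exercise the functions.
page = " ".join(f"filler{i}" for i in range(1400)) + " Example Film was directed by Jane Doe"
parents, children = build_parent_child(page)
query = "who directed example film"
ids = select_parents(query, children, max_parents=2)
facts = rows_to_statements(
    [{"film": "Example Film", "year": 2001, "category": "Best Picture"}],
    "film", "{film} won the Oscar for {category} in {year}.")
prompt = build_prompt(query, "2024-03-01", facts["example film"],
                      [parents[i] for i in ids], token_limit=75)  # 75 is illustrative
print(len(parents), len(children), ids)
```

Matching is done on small child chunks because they embed more precisely, while the larger parent chunk is what reaches the prompt so the model sees enough surrounding context. In a real system the overlap score would be replaced by an embedding search followed by a cross‑encoder reranker.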
Task 2 & 3 Strategy
API results are prioritized. If the simulated KG API returns an answer other than “I don’t know”, that answer is used directly; otherwise the system falls back to the Task 1 web‑retrieval pipeline. For Task 3, a reranker first selects the top 5 of the 50 web pages, which then go through the same processing.
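The routing logic is simple enough to sketch directly. In this illustration the KG answerer, the web pipeline, and the page scorer are passed in as stub callables; their names and signatures are assumptions, and the abstention check is one plausible way to detect an “I don’t know” reply.

```python
def is_abstention(answer):
    """True when the answer is empty or an 'I don't know' style refusal."""
    normalized = (answer or "").strip().lower().replace("\u2019", "'")
    return normalized in {"", "i don't know", "i don't know."}

def top_pages(query, pages, score_fn, k=5):
    """Keep the k pages the scoring function ranks highest."""
    return sorted(pages, key=lambda page: score_fn(query, page), reverse=True)[:k]

def answer(query, pages, kg_answer_fn, web_answer_fn, score_fn):
    """Prefer the KG answer; otherwise fall back to the web pipeline."""
    kg_answer = kg_answer_fn(query)
    if not is_abstention(kg_answer):
        return kg_answer
    return web_answer_fn(query, top_pages(query, pages, score_fn))

# Stubs standing in for the real components.
pages = [f"page {i}" for i in range(50)]
score = lambda q, page: int(page.split()[1])          # pretend relevance score
web = lambda q, kept: f"web answer from {len(kept)} pages"
kg_hit = lambda q: "1994"
kg_miss = lambda q: "I don\u2019t know"

print(answer("some question", pages, kg_hit, web, score))
print(answer("some question", pages, kg_miss, web, score))
```

Putting the structured source first makes sense when the KG is cleaner than the web pages, and trimming 50 pages to 5 before chunking keeps the Task 3 path within the per‑query time budget.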
Knowledge‑Graph Retrieval Module
The module has the LLM generate normalized API calls, parses the responses, and converts them to natural language. The movie schema includes PERSON, MOVIE, CAST, CREW, and OSCAR tables. Instead of full SQL, the team uses a lightweight normalized API with operators such as cmp(gender,male), sort(condition,sort_key), and len to support multi‑hop queries.
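To make the idea of a normalized API concrete, here is a small sketch of three operators over an in‑memory table. The operator semantics, the op_ prefixes (used to avoid shadowing Python built‑ins), the argument order, and the sample rows are all assumptions; the article names the operators but does not specify their exact behavior.

```python
# Made-up rows standing in for the PERSON table.
PERSON = [
    {"name": "Person A", "gender": "male", "birth_year": 1970},
    {"name": "Person B", "gender": "female", "birth_year": 1965},
    {"name": "Person C", "gender": "male", "birth_year": 1950},
]

def op_cmp(rows, field, value):
    """Filter: keep rows whose field equals value, as in cmp(gender,male)."""
    return [row for row in rows if str(row[field]) == value]

def op_sort(rows, sort_key, descending=False):
    """Order rows by sort_key."""
    return sorted(rows, key=lambda row: row[sort_key], reverse=descending)

def op_len(rows):
    """Count rows."""
    return len(rows)

# Chaining operators gives a multi-hop style query: filter, then sort, then pick.
males = op_cmp(PERSON, "gender", "male")
print(op_len(males))                            # 2
print(op_sort(males, "birth_year")[0]["name"])  # Person C
```

A small, fixed operator vocabulary like this is easier for an 8B model to emit correctly than free‑form SQL, and each call is trivial to validate before execution.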
API Generation Prompt
The prompt supplies Schema_info, API_rules, the query string, and a few hand‑picked in‑context examples. After generating 100 synthetic examples, erroneous ones are added back for robustness. The same approach is applied across all five domains.
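A prompt of this shape can be assembled as follows. The section labels, the example format, and the sample schema and rules text are hypothetical; only the four ingredients (schema information, API rules, in‑context examples, and the query) come from the description above.

```python
def build_api_prompt(schema_info, api_rules, examples, query):
    """Assemble schema, rules and few-shot examples into one generation prompt."""
    shots = "\n".join(f"Query: {q}\nAPI: {call}" for q, call in examples)
    return (f"Schema:\n{schema_info}\n\nRules:\n{api_rules}\n\n"
            f"Examples:\n{shots}\n\nQuery: {query}\nAPI:")

examples = [
    ("How many male persons are in the table?", "len(cmp(gender,male))"),
]
api_prompt = build_api_prompt(
    schema_info="PERSON(name, gender, birth_year)",
    api_rules="Use only cmp, sort and len. Output a single call.",
    examples=examples,
    query="How many female persons are in the table?",
)
print(api_prompt)
```

Ending the prompt on the bare "API:" label nudges the model to complete with a call only, which simplifies parsing.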
Fine‑Tuning the API Generator
Ground‑truth API pairs are first generated by GPT‑4 and manually verified. These high‑quality pairs are used to LoRA‑fine‑tune the LLM, improving API generation under the competition’s time budget.
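A quick calculation shows why LoRA suits this setting. LoRA freezes a weight matrix of shape d_out × d_in and trains two low‑rank factors of shapes d_out × r and r × d_in, so it adds r × (d_in + d_out) parameters per adapted matrix. The rank of 16 below is an illustrative assumption, since the article does not report the team's LoRA hyperparameters; 4096 is the hidden size of Llama‑3‑8B.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)

d = 4096                           # hidden size of Llama-3-8B
full = d * d                       # parameters in one square projection matrix
lora = lora_params(d, d, rank=16)  # rank 16 is an illustrative assumption
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

With adapters this small, several of them can sit alongside one copy of the base model, which is consistent with the article's point about switching quickly between sub‑tasks.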
Insights
The champion’s approach leverages strong database expertise: extensive API redesign, careful prompt crafting to suppress hallucinations, and SFT with higher‑level LLM assistance. The solution is practical rather than flashy, showing that modern LLMs have markedly improved but still require systematic engineering for reliable, production‑grade performance.
Resources
Competition page: https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024
Paper: https://arxiv.org/pdf/2410.00005
Code repository: https://gitlab.aicrowd.com/jiazunchen/kdd2024cup-crag-db3
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
