How We Won the RAG Challenge: Multi‑Router & Dynamic Knowledge Base Techniques Revealed
This article details the end‑to‑end design, parsing tricks, vector database setup, retrieval strategies, prompt engineering, and LLM reranking that powered the winning solution in a company‑annual‑report question‑answering competition.
Competition Overview
The task was to build a question‑answering system that could answer 100 randomly generated queries about 100 randomly selected company annual reports (PDFs up to 1,000 pages) within 2.5 hours. Each answer had to be exact, cite the source page, and match a predefined data type (bool, int, float, string, or list of strings).
Winning Architecture
The champion system added two routers and an LLM reranking module on top of a standard RAG pipeline. The full architecture diagram is shown below.
The generated answer set can be inspected at
https://github.com/IlyaRice/RAG-Challenge-2/blob/main/data/erc2_set/answers_1st_place_o3-mini.json.
RAG Quick Guide
Retrieval‑Augmented Generation (RAG) combines a large language model (LLM) with an external knowledge base to extend the model’s factual coverage.
Basic RAG Pipeline
Parsing : Convert PDFs to clean text, preserving tables, headings, and multi‑column layouts.
Ingestion : Load the cleaned documents into a vector store.
Retrieval : Perform semantic search to fetch relevant chunks.
Answering : Augment the LLM prompt with retrieved context and generate the final answer.
1. Parsing (Parsing)
PDF parsing proved difficult due to rotated tables, complex layouts, and font‑encoding issues. Key challenges included preserving table structure, retaining headings and bullet lists, handling multi‑column text, and extracting images or formulas.
Interesting PDF parsing problems we encountered but did not have time to solve:
Large tables rotated 90° causing garbled output.
Font‑encoding errors that behaved like a Caesar cipher with varying shift values.
We evaluated dozens of parsers (open‑source, commercial, and ML‑based) and found that none handled all edge cases perfectly. The best performer was Docling , which is co‑developed by IBM.
To meet the 2.5‑hour deadline we accelerated parsing with a GPU‑enabled VM (RTX 4090) rented from Runpod at $0.70 per hour. Parsing all 100 reports took roughly 40 minutes, a speed the team considered “extremely high.”
2. Ingestion (Ingestion)
After parsing, the text was cleaned with a series of regular expressions to remove noise and fix malformed fragments. Table serialization was explored but ultimately not used in the final system because Docling already produced high‑quality tables.
Chunking
Each page was split into 300‑token chunks (≈15 sentences) with a 50‑token overlap to avoid cutting important information. Metadata stored the original page number for later citation.
Vectorization
Chunks were embedded with text-embedding-3-large and stored in FAISS. The flat index ( IndexFlatIP) was chosen for its exact inner‑product similarity, which aligns with cosine similarity when vectors are normalized. For larger collections (>100 k vectors) we would consider IVFFlat or HNSW, accepting a trade‑off between speed and precision.
class RetrievalRankingSingleBlock(BaseModel):
"""Rank retrieved text block relevance to a query."""
reasoning: str = Field(
description="Analysis of the block, identifying key information and how it relates to the query"
)
relevance_score: float = Field(
description="Relevance score from 0 to 1, where 0 is Completely Irrelevant and 1 is Perfectly Relevant"
)We used the following configuration to control the pipeline:
class RunConfig:
use_serialized_tables: bool = False
parent_document_retrieval: bool = False
use_vector_dbs: bool = True
use_bm25_db: bool = False
llm_reranking: bool = False
llm_reranking_sample_size: int = 30
top_n_retrieval: int = 10
api_provider: str = "openai"
answering_model: str = "gpt-4o-mini-2024-07-18"3. Retrieval (Retrieval)
The retriever first performed a vector search to obtain the top‑N chunks, then applied additional logic:
Hybrid search (vector + BM25) was tested but discarded because it often reduced quality.
Cross‑encoder reranking was explored but abandoned due to API cost and latency.
LLM reranking proved effective: each candidate chunk was sent to the LLM with a prompt asking for a relevance score (0‑1). Scores were combined using vector_weight = 0.3 and llm_weight = 0.7.
The final assembled retriever followed these steps:
Vectorize the query.
Retrieve the top 30 chunks, deduplicate by page.
Pass the selected pages to the LLM reranker.
Adjust page scores with the weighted average.
Return the top 10 pages, prefix each with its page number, and concatenate them.
4. Augmentation (Augmentation)
Prompt templates were stored in prompts.py and modularized into core system instructions, Pydantic schema definitions, one‑shot/few‑shot examples, and context‑insertion blocks. A small helper function assembled the required pieces at runtime, allowing rapid experimentation with different prompt configurations.
5. Generation (Generation)
Generation consisted of three sub‑steps:
Query‑to‑Database Routing
Because each company had its own vector store, we extracted the company name from the query using re.search() and directed the retrieval to the corresponding database, reducing the search space by a factor of 100.
Prompt Routing
Four prompt variants were created for the four answer types (numeric, name, list‑of‑strings, boolean). An if‑else selector chose the appropriate variant based on the expected data type.
Compound Query Routing
For comparative questions (e.g., “Which company had higher revenue?”) the system first generated sub‑questions for each company, answered them independently, and then fed the sub‑answers back to the LLM to produce the final comparison.
Chain‑of‑Thought (CoT) & Structured Output
We forced the LLM to emit a JSON object with four fields:
step_by_step_analysis : detailed reasoning.
reasoning_summary : concise summary of the analysis.
relevant_pages : list of cited page numbers.
final_answer : the answer formatted exactly as required (numeric, "N/A", etc.).
If the model returned malformed JSON, a fallback validator ( schema.model_validate(answer)) re‑prompted the model until the output conformed to the schema, achieving near‑100 % compliance even with 8‑b models.
System Speed and Quality
The competition required answering all 100 questions within 10 minutes. By batching 25 queries per request and using the cheap gpt‑4o‑mini model (≈2 M tokens/min), the entire run completed in about 2 minutes, far exceeding the time limit.
Extensive ablation experiments showed that the LLM reranking step and carefully crafted prompts contributed the most to accuracy, while table serialization actually hurt performance because Docling already produced high‑quality tables.
Conclusion
The victory was not due to a single magical trick but to a systematic, data‑driven approach: high‑quality parsing, efficient vector search, intelligent routing, LLM reranking, and meticulously engineered prompts. The key insight is that “the magic of RAG lies in the details”—understanding the task deeply enables modest models to achieve top‑tier results.
All code has been open‑sourced, enabling anyone to reproduce the pipeline and adapt it to similar document‑question‑answering challenges.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
