Building Scalable Enterprise RAG: Lessons, Pitfalls, and Proven Solutions
This article shares practical lessons from building a large‑scale enterprise RAG system, covering imperfect data, document quality scoring, hierarchical chunking, metadata design, semantic‑search failures, open‑source model choices, and table handling to achieve reliable AI‑driven search.
1. Introduction
This article summarizes an enterprise‑level AI RAG project for a regulated mid‑size company (≈ 1,000 employees), focusing on practical experience beyond basic tutorials.
2. Reality: Your Data Is Not Perfect
Quick background: Companies of this size typically store 10,000 – 50,000 documents in SharePoint or legacy systems dating back to 2005. The data is neither clean nor a curated knowledge base; it consists of decades of business documents that must be made searchable.
Document quality detection: a key point no one discusses
Most tutorials assume perfect PDFs, but in reality enterprise documents are messy. A pharmaceutical client had research papers from 1995, scanned typewritten documents on which OCR barely worked, mixed with modern clinical trial reports of 500+ pages containing tables and figures.
We discovered that document quality scoring must happen before any processing.
Our solution classifies documents into three categories:
Clean PDF (perfect text extraction): run the full hierarchical pipeline.
Decent document (some OCR artifacts): basic chunking with cleanup.
Poor document (handwritten scans): fixed‑length chunks plus manual review flags.
We built a simple scoring system that evaluates text‑extraction quality, OCR artifacts, and format consistency, routing documents accordingly. This single change fixed more retrieval issues than swapping any embedding model.
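A minimal sketch of that scoring-and-routing step in Python; the regex, weights, and thresholds here are illustrative assumptions, not our production values:

```python
import re

def quality_score(text: str) -> float:
    """Score extraction quality from 0 (unusable) to 1 (clean)."""
    if not text.strip():
        return 0.0
    # OCR artifacts: stray non-word symbols usually mean a bad scan.
    artifacts = len(re.findall(r"[^\w\s.,;:()%/-]", text)) / len(text)
    # Format consistency: many single-character "words" suggest broken extraction.
    words = text.split()
    fragments = sum(1 for w in words if len(w) == 1) / max(len(words), 1)
    return max(0.0, 1.0 - 5.0 * artifacts - 2.0 * fragments)

def route_document(text: str) -> str:
    """Send each document down the pipeline its quality can support."""
    score = quality_score(text)
    if score >= 0.8:
        return "full_hierarchical_pipeline"      # clean PDF
    if score >= 0.5:
        return "basic_chunking_with_cleanup"     # decent document
    return "fixed_chunks_manual_review_flag"     # poor document
```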
3. Why Fixed‑Size Chunking Is Usually Wrong
Every tutorial says “split into 512 tokens with overlap!” In reality documents have structure. Research methods differ from conclusions; financial reports contain executive summaries and detailed tables. Ignoring structure leads to chunks that cut sentences in half or mix unrelated concepts.
We had to build hierarchical chunking that preserves structure (a minimal sketch follows the list):
Document layer (title, author, date, type)
Chapter layer (abstract, methods, results)
Paragraph layer (200‑400 tokens)
Sentence layer (for precise queries)
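As a data structure, the hierarchy can be as simple as a recursive chunk type; the field names here are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    level: str        # "document" | "chapter" | "paragraph" | "sentence"
    text: str
    metadata: dict    # title/author/date/type at the document layer
    children: list["Chunk"] = field(default_factory=list)

# A 500-page report becomes one document chunk, whose children are chapter
# chunks ("Abstract", "Methods", "Results"), whose children are 200-400-token
# paragraph chunks, whose children are individual sentences.
```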
Key insight: Query complexity should dictate the retrieval level. Broad questions stay at the paragraph level; precise questions like “What is the exact dosage in Table 3?” require sentence‑level precision.
We trigger exact mode with keywords such as exact, specific, table. If confidence is low, the system automatically drills down to finer chunks.
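A minimal sketch of that routing logic, assuming the retriever exposes a top similarity score as the confidence signal; the keyword set and threshold are illustrative:

```python
EXACT_TRIGGERS = {"exact", "specific", "table"}   # extend per domain

def pick_retrieval_level(query: str, top_similarity: float) -> str:
    """Choose chunk granularity from the query and retriever confidence."""
    if EXACT_TRIGGERS & set(query.lower().split()):
        return "sentence"              # precise query: finest chunks
    if top_similarity < 0.6:           # low confidence: drill down
        return "sentence"
    return "paragraph"                 # broad question: stay coarse

# pick_retrieval_level("What is the exact dosage in Table 3?", 0.9) -> "sentence"
```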
4. Metadata Schema: More Important Than Your Embedding Model
We spent about 40 % of development time on metadata, yielding the highest ROI.
Most people treat metadata as an afterthought, but query context is critical. A pharma researcher asking about “pediatric research” needs completely different documents than someone asking about “adult population”.
We built domain‑specific metadata schemas:
For pharmaceutical documents:
Document type (research paper, regulatory filing, clinical trial)
Drug class
Patient population (pediatric, adult, elderly)
Regulatory category (FDA, EMA)
Therapeutic area (cardiovascular, oncology)
For financial documents:
Time period (e.g., fiscal year 2023, Q1)
Financial metric (revenue, EBITDA)
Business segment
Geographic region
Important reminder: Do not use large models to extract metadata—they are unstable. Simple keyword matching is more reliable. For example, if a query contains “FDA”, filter on regulatory_category: FDA; if it mentions “pediatric”, apply the patient‑population filter.
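A minimal sketch of that keyword matching, reusing the pharma schema fields from above; the trigger lists are illustrative:

```python
PHARMA_TRIGGERS = {
    "regulatory_category": {"fda": "FDA", "ema": "EMA"},
    "patient_population": {"pediatric": "pediatric", "adult": "adult",
                           "elderly": "elderly"},
}

def metadata_filters(query: str) -> dict:
    """Derive hard filters from literal keyword hits; no LLM involved."""
    q = query.lower()
    filters = {}
    for field_name, triggers in PHARMA_TRIGGERS.items():
        for keyword, value in triggers.items():
            if keyword in q:
                filters[field_name] = value
    return filters

# metadata_filters("FDA guidance for pediatric patients")
# -> {"regulatory_category": "FDA", "patient_population": "pediatric"}
```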
Start with 100‑200 core terms per domain and expand based on mismatched queries, often with help from domain experts.
5. When Semantic Search Fails (Spoiler: It Happens Often)
Pure semantic‑search failure rates are much higher than people admit—15‑20 % in regulated fields such as pharma or law, versus the commonly quoted 5 %.
Major failure modes:
Abbreviation ambiguity: CAR means “chimeric antigen receptor” in oncology but “computer‑assisted radiology” in imaging papers.
Exact technical queries: For a question like “What is the exact dosage in Table 3?”, semantic search may return conceptually similar content yet miss the precise table reference.
Cross‑document citation chains: Drug A studies often cite drug B interaction data; semantic search typically does not capture these relationships.
Solution: Build a hybrid approach. During processing, construct a graph layer that tracks document relationships. After semantic retrieval, verify whether the retrieved document has related documents that might contain a better answer.
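A minimal sketch of the post‑retrieval graph check, with the citation graph reduced to a plain adjacency dict and hypothetical document IDs:

```python
# citations: doc_id -> documents it cites (built during ingestion)
CITATIONS: dict[str, list[str]] = {
    "drug_a_study_2021": ["drug_b_interactions_2019"],
}

def expand_with_citations(retrieved: list[str], max_hops: int = 1) -> list[str]:
    """After semantic retrieval, pull in directly cited documents too."""
    expanded = list(retrieved)
    frontier = retrieved
    for _ in range(max_hops):
        frontier = [cited
                    for doc in frontier
                    for cited in CITATIONS.get(doc, [])
                    if cited not in expanded]
        expanded.extend(frontier)
    return expanded

# expand_with_citations(["drug_a_study_2021"])
# -> ["drug_a_study_2021", "drug_b_interactions_2019"]
```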
6. Why We Chose Open‑Source Models (Especially Qwen)
Many assume GPT‑4o or o3‑mini are always superior, but enterprise customers face constraints:
Cost: With > 50,000 documents and thousands of daily queries, API fees explode.
Data sovereignty: Pharma and finance cannot send sensitive data to external APIs.
Domain terminology: General models hallucinate on unseen technical terms.
After domain fine‑tuning, Qwen QwQ‑32B performed surprisingly well: it is about 85 % cheaper than GPT‑4o under high‑throughput workloads, runs entirely on‑prem, can be fine‑tuned on medical/financial terminology, and offers stable latency without API rate‑limiting.
Fine‑tuning is straightforward—use domain QA pairs for supervised training, e.g., “What are the contraindications of drug X?” with the answer from the FDA guideline. Simple supervised fine‑tuning works better than complex methods like RAFT, provided the training data is clean.
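As a sketch, the training data can be as plain as a JSONL file of prompt/response pairs; the file name and example are illustrative, and the response would be real guideline text rather than a placeholder:

```python
import json

qa_pairs = [
    {
        "prompt": "What are the contraindications of drug X?",
        "response": "<answer copied verbatim from the relevant FDA guideline>",
    },
    # ... a few thousand curated pairs per domain
]

# Standard SFT trainers consume this JSONL directly; no RAFT-style
# retrieval augmentation is needed when the pairs are clean.
with open("pharma_sft.jsonl", "w") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair) + "\n")
```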
7. Table Handling: The Hidden Nightmare
Enterprise documents are full of complex tables—financial models, clinical trial data, compliance matrices. Standard RAG pipelines either ignore tables or flatten them into unstructured text, losing critical relationships.
Tables often contain the most valuable information. Financial analysts need exact quarterly numbers; researchers need dosing information from clinical tables. Failing to process tables means missing half the enterprise value.
Our approach (a sketch of the embedding step follows this list):
Treat tables as independent entities with a dedicated pipeline.
Use heuristics (whitespace layout patterns, grid structure) for table detection.
Simple tables → convert to CSV; complex tables → preserve hierarchical relationships in metadata.
Dual‑embedding strategy: embed the structured data and also embed a semantic description of the table.
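A minimal sketch of the dual‑embedding step; embed() is a placeholder for whatever on‑prem embedding model you run, and the payload fields are assumptions:

```python
def embed(text: str) -> list[float]:
    """Stand-in for your on-prem embedding model."""
    return [float(ord(c)) for c in text[:16]]  # placeholder, not a real model

def index_table(rows: list[list[str]], caption: str) -> dict:
    """Store one structured vector and one semantic vector per table."""
    csv_text = "\n".join(",".join(cells) for cells in rows)
    description = f"Table: {caption}. Columns: {', '.join(rows[0])}."
    return {
        "structured_embedding": embed(csv_text),    # preserves exact values
        "semantic_embedding": embed(description),   # what the table is about
        "payload": {"csv": csv_text, "caption": caption},
    }
```

The structured vector keeps exact numbers findable for queries like “Q3 revenue”; the semantic vector lets conceptual queries (“quarterly performance”) still land on the right table.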
8. Core Take‑aways
Document quality detection first. Do not process all enterprise documents the same way; assess quality before any transformation.
Metadata before embeddings. Poor metadata ruins retrieval regardless of vector quality; invest in domain‑specific metadata schemas.
Hybrid retrieval is mandatory. Pure semantic search fails frequently in specialized domains; combine rule‑based fallbacks and document‑relationship mapping.
Tables are critical. If you cannot handle table data correctly, you will lose a large portion of enterprise value.
9. Conclusion
Enterprise‑grade RAG is more an engineering challenge than a pure machine‑learning problem. Most failures stem from underestimating document‑processing hurdles, metadata complexity, and production infrastructure needs. When done right, ROI is substantial—teams have reduced document‑search time from hours to minutes.
If you encounter similar walls during implementation, feel free to reach out and discuss solutions.