Advanced LlamaIndex Indexing, Routing, and Multimodal RAG: A Practical Guide
This article walks through a real‑world contract‑review RAG project, diagnosing low recall, redesigning the system with multiple indexes, a RouterQueryEngine, re‑ranking, knowledge‑graph integration, multimodal support, incremental updates, and a rigorous evaluation framework that boosted recall from 60 % to 92 %.
Background and Initial Failure
A legal team needed a contract‑review RAG system that, given a new contract, retrieves similar historical clauses and provides risk hints. The first implementation used VectorStoreIndex + SimpleDirectoryReader with 512‑token overlapping chunks and OpenAI text-embedding-3-small. After deployment, recall was only 60 %, and the legal users complained that the results were "no better than not using it".
Two weeks of debugging revealed five root causes:
Uniform chunking: token‑based splitting broke semantic structures such as "Party A/Party B" (甲方/乙方) and "whereas/therefore" (鉴于/因此).
Similarity‑only retrieval: exhaustive queries like "all force‑majeure clauses" could not be satisfied by pure vector similarity.
Table loss: tabular clauses (e.g., penalty tables) turned into meaningless character blobs after PDF parsing.
Cross‑contract references: separate indexes could not resolve references between contracts.
Missing multimodal data: scanned PDFs lacked a text layer, so pure vector search returned nothing.
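The first root cause can be illustrated with a small sketch. It is not the project's code: it uses character offsets as a stand-in for tokens, and the function names and sample clauses are hypothetical. It reports which clauses are not contained in full by any fixed-size window, which is the situation that breaks clause-level semantics.

```python
def fixed_size_windows(length: int, size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) offsets of fixed-size windows with the given overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [(start, min(start + size, length)) for start in range(0, length, step)]


def broken_clauses(clauses: list[str], size: int, overlap: int) -> list[str]:
    """Return the clauses that no single window contains in full."""
    windows = fixed_size_windows(sum(len(c) for c in clauses), size, overlap)
    broken, position = [], 0
    for clause in clauses:
        start, end = position, position + len(clause)
        position = end
        # A clause survives only if some window covers it from start to end.
        if not any(w_start <= start and end <= w_end for w_start, w_end in windows):
            broken.append(clause)
    return broken


# Hypothetical sample clauses; sizes are in characters, not tokens.
clauses = [
    "Article 1. Party A shall pay within 30 days. ",
    "Article 2. Neither party is liable for delay caused by force majeure, "
    "provided that written notice is given. ",
    "Article 3. This contract is governed by the agreed law. ",
]
for clause in broken_clauses(clauses, size=60, overlap=10):
    print("split across chunks:", clause[:20])
```

Overlap helps short clauses that happen to sit near a boundary, but any clause longer than the window is always split, whatever the overlap.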
Redesign: Multi‑Index + Router Architecture
The team rebuilt the pipeline with the following components:
Multiple indexes (semantic, keyword, knowledge‑graph, SQL, list) co‑existing.
A RouterQueryEngine that selects one or more indexes per query.
Fusion retrieval (Reciprocal Rank Fusion) to merge results.
Reranking with bge‑reranker‑large (top‑n = 5).
Separate handling for tables and knowledge‑graph extraction.
Multimodal vector store for image + text retrieval.
These changes lifted recall from 60 % to 92 % and made the system usable for the legal team.
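The fusion step can be sketched in a few lines of plain Python. This is a minimal illustration of Reciprocal Rank Fusion, not the project's implementation: the function name is hypothetical, the smoothing constant k=60 is a commonly used default assumed here, and ties are broken alphabetically only to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60, top_k: int = 15) -> list[str]:
    """Merge several ranked lists of ids; each hit adds 1 / (k + rank) to the id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ids with equal scores are ordered alphabetically.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_k]


vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "a", "d"]
kg_hits = ["b", "e"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits, kg_hits]))
```

Because the score depends only on rank positions, the lists can come from retrievers whose raw scores are not comparable (cosine similarity, keyword counts, graph hops). An id that appears in several lists is counted once, which gives deduplication for free.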
Five‑Layer Model
The overall architecture can be visualised as five layers:
Query Interface: chat(), query(), aquery() (async).
Router Layer: RouterQueryEngine with a single selector (LLMSingleSelector) or a multi selector (LLMMultiSelector).
Index Layer: VectorStoreIndex, KeywordTableIndex, KnowledgeGraphIndex, SQLDatabase, ListIndex, MultiModalVectorStoreIndex.
Ingestion & Transformation: Readers → Splitters → Metadata extraction → Embedding.
Storage Layer: Vector stores (Chroma, Qdrant, Milvus, Pinecone) + document store + graph store.
End‑to‑End Example
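Query: "列出所有提到‘不可抗力’的合同条款,并按风险等级排序" ("List all contract clauses that mention 'force majeure', sorted by risk level").
[User] "List all contract clauses that mention 'force majeure', sorted by risk level"
↓
[RouterQueryEngine] parse intent → select VectorIndex, KeywordIndex, KGIndex
↓ (parallel retrieval)
[VectorIndex] top_k=20 similar clauses
[KeywordIndex] top_k=20 clauses containing the keyword
[KGIndex] retrieve relations of the force_majeure entity
↓
[ReciprocalRankFusion] merge and deduplicate → top_k=15
↓
[Reranker] bge‑reranker‑large → top_n=5
↓
[Response Synthesizer] LLM synthesizes the answer and returns source links
Key Code Snippets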
1. Index Selection Decision Tree
# index_selector.py
from llama_index.core import (
    VectorStoreIndex, KeywordTableIndex, KnowledgeGraphIndex,
    SQLDatabase, ListIndex,
)
from llama_index.core.indices import MultiModalVectorStoreIndex

def select_index(documents, query_type: str):
    """Build the index (or data source) that matches the query type."""
    if query_type == "semantic_similarity":
        return VectorStoreIndex.from_documents(documents)
    elif query_type == "exact_keyword":
        return KeywordTableIndex.from_documents(documents)
    elif query_type == "structured_query":
        # SQLDatabase wraps a database connection, not an index; the URI is elided in the source.
        return SQLDatabase.from_uri(...)
    elif query_type == "entity_relation":
        return KnowledgeGraphIndex.from_documents(documents)
    elif query_type == "full_context":
        return ListIndex.from_documents(documents)
    elif query_type == "multimodal":
        return MultiModalVectorStoreIndex.from_documents(documents)
    raise ValueError(f"Unknown query type: {query_type}")
2. RouterQueryEngine Setup (single vs. multi)
# router_setup.py
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector, LLMMultiSelector
from llama_index.core.tools import QueryEngineTool

# vector_index, keyword_index and kg_index are assumed to have been built beforehand.
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(),
    name="vector_search",
    description="Semantic similarity search. Suited to fuzzy queries, cross-document associations, and conceptual questions.",
)
keyword_tool = QueryEngineTool.from_defaults(
    query_engine=keyword_index.as_query_engine(),
    name="keyword_search",
    description="Exact keyword matching. Suited to entity names, reference numbers, and terms the user states explicitly.",
)
kg_tool = QueryEngineTool.from_defaults(
    query_engine=kg_index.as_query_engine(),
    name="kg_search",
    description="Entity-relation reasoning. Suited to queries such as 'which contracts reference X'.",
)
# Single selector: each query is routed to exactly one tool.
single_router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, keyword_tool, kg_tool],
    verbose=True,
)
# Multi selector: a query may be routed to several tools and the answers combined.
multi_router = RouterQueryEngine(
    selector=LLMMultiSelector.from_defaults(),
    query_engine_tools=[vector_tool, keyword_tool, kg_tool],
    verbose=True,
)
3. FusionRetriever (RRF)
# fusion_retriever.py
from llama_index.core.retrievers import (
    VectorIndexRetriever, KeywordTableSimpleRetriever,
    KGTableRetriever, QueryFusionRetriever,
)

# vector_index, keyword_index and kg_index are assumed to have been built beforehand.
vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=10)
# Keyword-table retriever: keyword matching, not BM25 scoring.
keyword_retriever = KeywordTableSimpleRetriever(index=keyword_index, num_chunks_per_query=10)
kg_retriever = KGTableRetriever(index=kg_index, num_chunks_per_query=5)
fusion_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, keyword_retriever, kg_retriever],
    num_queries=4,  # 4 queries in total: the original plus LLM-generated variants
    mode="reciprocal_rerank",  # merge result lists with Reciprocal Rank Fusion
    use_async=True,
)
nodes = fusion_retriever.retrieve("不可抗力条款")  # "force majeure clauses"
4. Two‑Stage Reranking
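# reranker_setup.py
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Stage 1: wide recall from the vector index (vector_index is assumed to exist).
retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=20)
# Stage 2: a cross-encoder rescores the 20 candidates and keeps the best 5.
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-large",
    top_n=5,
    device="cuda",
)
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker],
)
5. Knowledge‑Graph Index (Neo4j)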
# kg_index.py
from llama_index.core import KnowledgeGraphIndex, StorageContext
# Neo4jGraphStore ships in the llama-index-graph-stores-neo4j integration package.
from llama_index.graph_stores.neo4j import Neo4jGraphStore

graph_store = Neo4jGraphStore(
    username="neo4j",
    password="...",
    url="bolt://localhost:7687",
    database="contracts",
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)
# Extract up to 10 triples per chunk and store them in Neo4j.
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    include_embeddings=True,
)
# Answer from the graph triples only, without the source text of each chunk.
query_engine = kg_index.as_query_engine(
    include_text=False,
    retriever_mode="keyword",
    response_mode="tree_summarize",
)
6. Multimodal RAG
# multimodal_rag.py
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex

documents = SimpleDirectoryReader("./data/contracts_pdfs").load_data()
# text_store and image_store are assumed to be vector stores created beforehand.
storage_context = StorageContext.from_defaults(
    vector_store=text_store,   # text embeddings (text-embedding-3)
    image_store=image_store,   # CLIP image embeddings
)
mm_index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
retriever = mm_index.as_retriever(
    similarity_top_k=5,
    image_similarity_top_k=3,
)
results = retriever.retrieve("违约金的支付方式")  # "how penalties are paid"
7. Semantic Splitter vs. Custom Clause Splitter
# semantic_splitter.py
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)
nodes = splitter.get_nodes_from_documents(documents)

# contract_splitter.py
import re

class ContractClauseSplitter:
    """Split a contract along its clause structure."""
    # Matches clause headings such as "第十二条 不可抗力" ("Article 12 Force Majeure").
    CLAUSE_PATTERN = re.compile(r"(第[一二三四五六七八九十百]+条\s*[、\.]?\s*[^\n]+)")

    def split(self, text: str) -> list[str]:
        chunks, current_title, current_body = [], "", []
        for line in text.split("\n"):
            if self.CLAUSE_PATTERN.match(line.strip()):
                # A new heading closes the clause collected so far,
                # including a clause that has a heading but no body lines.
                if current_title or current_body:
                    chunks.append(f"{current_title}\n" + "\n".join(current_body))
                current_title = line.strip()
                current_body = []
            else:
                current_body.append(line)
        if current_title or current_body:
            chunks.append(f"{current_title}\n" + "\n".join(current_body))
        return chunks

splitter = ContractClauseSplitter()
chunks = splitter.split(contract_text)
8. Incremental Indexing
# incremental_index.py
from llama_index.core import (
    SimpleDirectoryReader, StorageContext, VectorStoreIndex, load_index_from_storage,
)

# Initial build (initial_docs is assumed to be a list of loaded documents)
index = VectorStoreIndex.from_documents(initial_docs)
index.storage_context.persist(persist_dir="./storage")
# Later: load the persisted index and add new docs
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
new_docs = SimpleDirectoryReader(input_files=["new_contract.pdf"]).load_data()
for doc in new_docs:
    index.insert(doc)  # incremental insert
index.storage_context.persist(persist_dir="./storage")  # save the inserted documents
# Periodic full rebuild (recommended weekly)
index = VectorStoreIndex.from_documents(
    all_docs,
    store_nodes_override=True,
)
Evaluation Framework
Two‑stage evaluation combines offline retrieval metrics (MRR, Hit Rate, NDCG) with generation quality (Faithfulness, Relevancy, Answer Similarity). A golden set of ~100 queries covering typical legal scenarios is used for batch evaluation.
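To make the offline retrieval metrics concrete, here is a plain-Python sketch with binary relevance. It is an illustration under assumptions, not the evaluator's internals: the function names and the golden-set shape (a list of retrieved-id lists paired with relevant-id sets) are hypothetical, and library implementations may differ in details such as the NDCG normalisation.

```python
import math


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def hit(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant id appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0


def ndcg(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG@k: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def evaluate(golden: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    """Average each metric over all (retrieved, relevant) pairs of the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    n = len(golden)
    return {
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in golden) / n,
        "hit_rate": sum(hit(r, rel, k) for r, rel in golden) / n,
        "ndcg": sum(ndcg(r, rel, k) for r, rel in golden) / n,
    }


print(evaluate([(["a", "b", "c"], {"b"}), (["x", "y"], {"z"})]))
```

MRR rewards putting the first relevant clause near the top, Hit Rate only asks whether anything relevant was retrieved at all, and NDCG also accounts for how many relevant clauses were found and where.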
# evaluation.py
import asyncio
from llama_index.core.evaluation import (
    RetrieverEvaluator, FaithfulnessEvaluator, RelevancyEvaluator, BatchEvalRunner,
)

# Stage 1: retrieval metrics on the golden dataset (queries with expected node ids).
retriever_eval = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate", "precision", "recall", "ndcg"],
    retriever=index.as_retriever(similarity_top_k=10),
)
# Dataset-level evaluation is exposed as a coroutine.
eval_result = asyncio.run(retriever_eval.aevaluate_dataset(golden_dataset))
# Stage 2: generation quality, judged by an LLM.
faith_eval = FaithfulnessEvaluator()
relevancy_eval = RelevancyEvaluator()
runner = BatchEvalRunner({"faithfulness": faith_eval, "relevancy": relevancy_eval}, workers=4)
results = runner.evaluate_queries(
    query_engine=query_engine,
    queries=[g["query"] for g in golden],  # golden: list of records with a "query" field
)
Metrics thresholds for production:
Retrieval MRR ≥ 0.7, Hit Rate ≥ 0.85, NDCG ≥ 0.75.
Faithfulness ≥ 0.85, Relevancy ≥ 0.90.
P95 latency ≤ 4 s, cost per query ≤ $0.01.
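These thresholds lend themselves to an automated release gate. The sketch below is one possible shape, not the team's tooling: the metric keys and the function name are hypothetical, while the limits are the ones listed above.

```python
# Limits taken from the thresholds above; "min" means the value must not fall
# below the limit, "max" means it must not exceed it. Key names are hypothetical.
THRESHOLDS = {
    "mrr": ("min", 0.70),
    "hit_rate": ("min", 0.85),
    "ndcg": ("min", 0.75),
    "faithfulness": ("min", 0.85),
    "relevancy": ("min", 0.90),
    "p95_latency_s": ("max", 4.0),
    "cost_per_query_usd": ("max", 0.01),
}


def failed_checks(measured: dict[str, float]) -> list[str]:
    """Return one message per threshold that is missing or violated."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = measured.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} is above {limit}")
    return failures


measured = {
    "mrr": 0.72, "hit_rate": 0.90, "ndcg": 0.80,
    "faithfulness": 0.90, "relevancy": 0.92,
    "p95_latency_s": 3.1, "cost_per_query_usd": 0.008,
}
print(failed_checks(measured) or "all thresholds met")
```

Treating a missing metric as a failure keeps a partially run evaluation from passing the gate by accident.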
Online Deployment Checklist
Golden set ≥ 100 queries covering at least five typical query types.
Offline evaluation for each index (MRR, Hit Rate).
End‑to‑end evaluation (Faithfulness + Relevancy) passes thresholds.
Monitoring panels for query type distribution, recall latency, and alerting on P95 > 4 s or Faithfulness drop > 10 %.
Versioned indexes to allow rollback of embedding models.
Incremental indexing pipeline verified (new docs searchable within 5 min).
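The two alert conditions in the monitoring item can be expressed as a small check. This is a sketch under assumptions: P95 is computed with the nearest-rank method, the "drop > 10 %" is read as a relative drop against a stored baseline, and all names are hypothetical.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def alerts(latencies_s: list[float], faithfulness_now: float, faithfulness_baseline: float) -> list[str]:
    """Return the names of the alert conditions that currently fire."""
    fired = []
    if p95(latencies_s) > 4.0:
        fired.append("p95_latency")
    # Assumption: the drop is measured relative to the baseline value.
    if faithfulness_baseline > 0:
        drop = (faithfulness_baseline - faithfulness_now) / faithfulness_baseline
        if drop > 0.10:
            fired.append("faithfulness_drop")
    return fired


print(alerts([1.2] * 18 + [5.0, 6.0], faithfulness_now=0.80, faithfulness_baseline=0.90))
```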
Common Pitfalls and Solutions
Pitfall 1: Uniform token chunking destroys semantics
Solution: Use a custom clause splitter for legal documents or SemanticSplitterNodeParser for generic texts, and attach metadata such as clause_id for filtering.
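As a sketch of the metadata step, the snippet below derives a clause_id from the heading line of each chunk and filters on it. The record layout, the contract_id field, and the function names are assumptions made for illustration; the heading pattern follows the "第…条" convention used by the clause splitter.

```python
import re

# Clause headings of the form "第三条 ..." ("Article 3 ...").
CLAUSE_HEADING = re.compile(r"^第[一二三四五六七八九十百]+条")


def clause_metadata(chunk: str, contract_id: str) -> dict:
    """Build a metadata record from the first line of a clause chunk."""
    first_line = chunk.split("\n", 1)[0].strip()
    match = CLAUSE_HEADING.match(first_line)
    return {
        "contract_id": contract_id,  # hypothetical identifier of the source contract
        "clause_id": match.group(0) if match else None,
        "title": first_line,
    }


def filter_by_clause(records: list[dict], clause_id: str) -> list[dict]:
    """Keep only the records that belong to the given clause."""
    return [r for r in records if r["clause_id"] == clause_id]


chunks = ["第三条 不可抗力\n因不可抗力导致的延误,双方互不承担责任。", "前言\n本合同由双方签署。"]
records = [clause_metadata(c, contract_id="C-001") for c in chunks]
print(filter_by_clause(records, "第三条"))
```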
Pitfall 2: Mismatched embedding models between indexing and querying
Solution: Centralise the embedding model in a config file and assert consistency at startup.
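A startup check of this kind might look like the following sketch. It assumes the index build writes a small manifest recording the embedding model and vector size; that manifest, its keys, and the names below are hypothetical. The 1536 dimensions are the default output size of text-embedding-3-small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    model: str
    dimensions: int


def assert_same_embedding(configured: EmbeddingConfig, manifest: dict) -> None:
    """Fail fast if the persisted index was built with a different embedding setup."""
    stored = EmbeddingConfig(manifest["embedding_model"], manifest["embedding_dimensions"])
    if stored != configured:
        raise RuntimeError(f"Index was built with {stored}, but queries would use {configured}")


# Single source of truth, imported by both the indexing job and the query service.
CONFIGURED = EmbeddingConfig(model="text-embedding-3-small", dimensions=1536)

# Hypothetical manifest written next to the persisted index at build time.
manifest = {"embedding_model": "text-embedding-3-small", "embedding_dimensions": 1536}
assert_same_embedding(CONFIGURED, manifest)
```

Raising at startup is deliberate: a mismatch does not crash retrieval, it silently returns poor neighbours, so it is cheaper to refuse to serve.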
Pitfall 3: Tables become garbled text
Solution: Parse PDFs with a dedicated parser (LlamaParse, Unstructured.io), store tables as structured CSV/DataFrame, and add metadata like {"contains_table": true}.
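Once a parser has returned the table as rows, storing it in structured form is straightforward. The sketch below assumes the rows arrive as lists of strings with the header first; the record layout, the extra metadata keys, and the sample penalty table are made up for illustration.

```python
import csv
import io


def table_to_record(rows: list[list[str]], source: str, page: int) -> dict:
    """Serialise parsed table rows to CSV text and attach filterable metadata."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return {
        "text": buffer.getvalue(),
        "metadata": {
            "contains_table": True,
            "source": source,    # hypothetical: file the table came from
            "page": page,        # hypothetical: page number reported by the parser
            "columns": rows[0],  # header row, useful for column-aware queries
        },
    }


def record_to_rows(record: dict) -> list[list[str]]:
    """Recover the original rows from a stored record."""
    return list(csv.reader(io.StringIO(record["text"])))


rows = [["Days late", "Penalty"], ["1-7", "0.5% of contract value"], ["8-30", "2% of contract value"]]
record = table_to_record(rows, source="supply_contract.pdf", page=12)
print(record["text"])
```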
Pitfall 4: Context window overflow
Solution: Apply post‑processing filters (similarity cutoff) and a second‑stage reranker to reduce the number of nodes before synthesis.
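The filtering idea can be sketched without any library: drop nodes below a similarity cutoff, then fill a token budget in score order. The cutoff of 0.7, the budget of 3000 tokens, and the four-characters-per-token estimate are illustrative assumptions, not values from the project.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return len(text) // 4 + 1


def select_context(
    nodes: list[tuple[str, float]],
    similarity_cutoff: float = 0.7,
    max_tokens: int = 3000,
) -> list[str]:
    """Keep the best-scoring node texts that clear the cutoff and fit the budget."""
    kept, used = [], 0
    for text, score in sorted(nodes, key=lambda node: node[1], reverse=True):
        if score < similarity_cutoff:
            break  # nodes are sorted, so everything after this is below the cutoff
        cost = estimate_tokens(text)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a smaller node may still fit
        kept.append(text)
        used += cost
    return kept


nodes = [("clause A " * 5, 0.91), ("clause B " * 50, 0.82), ("clause C " * 5, 0.76), ("clause D", 0.40)]
print(len(select_context(nodes, max_tokens=40)))
```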
Pitfall 5: Low‑quality knowledge‑graph triples
Solution: Use a stricter extraction prompt that limits triples to concrete entities and caps the number per chunk.
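The same constraints can also be enforced after extraction, as a safety net behind the prompt. The following filter is a sketch of that idea and is not described in the source: the list of vague terms, the function name, and the deduplication rule are assumptions; the cap of 10 mirrors max_triplets_per_chunk in the knowledge-graph snippet.

```python
# Hypothetical stop-list of placeholders that do not name a concrete entity.
VAGUE_ENTITIES = {"it", "this", "that", "they", "the party", "the contract"}


def clean_triples(triples: list[tuple[str, str, str]], max_per_chunk: int = 10) -> list[tuple[str, str, str]]:
    """Drop empty, vague, or duplicate triples and cap the number kept per chunk."""
    seen, kept = set(), []
    for subject, predicate, obj in triples:
        subject, predicate, obj = subject.strip(), predicate.strip(), obj.strip()
        if not (subject and predicate and obj):
            continue
        if subject.lower() in VAGUE_ENTITIES or obj.lower() in VAGUE_ENTITIES:
            continue
        key = (subject.lower(), predicate.lower(), obj.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((subject, predicate, obj))
        if len(kept) == max_per_chunk:
            break
    return kept


raw = [
    ("Contract A", "references", "Contract B"),
    ("it", "applies to", "Contract B"),
    ("contract a", "references", "contract b"),
    ("Contract A", "", "Supplier X"),
]
print(clean_triples(raw))
```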
Pitfall 6: Multimodal index cost explosion
Solution: Downscale images, embed only key images with CLIP/SigLIP, and limit image_top_k to 2‑3.
Optimization Roadmap
Short term: Router + multi‑index + reranking (current production).
Mid term: LLM‑driven automatic index selection, dynamic semantic splitting.
Long term: Cross‑modal reasoning, automated hyper‑parameter tuning (DPO/RAGAS), heterogeneous federated retrieval.
Cheat‑Sheet
Index choice: Small homogeneous corpus → VectorStoreIndex; Structured docs → add KGIndex; Multimodal → MultiModalVectorStoreIndex; Cross‑source → RouterQueryEngine.
Router selector: LLMMultiSelector for ambiguous intents, LLMSingleSelector for clear intents.
Reranking: bge‑reranker‑large, top_n=3‑5.
Splitting: Use SemanticSplitterNodeParser (generic) or custom ContractClauseSplitter (legal).
Evaluation: RetrieverEvaluator + FaithfulnessEvaluator + RelevancyEvaluator; monitor latency and cost.
Incremental updates: index.insert(doc) for single docs, weekly full rebuild for consistency.
Ops Community
A leading IT operations community where professionals share and grow together.
