Advanced LlamaIndex Indexing, Routing, and Multimodal RAG: A Practical Guide

This article walks through a real‑world contract‑review RAG project, diagnosing low recall, redesigning the system with multiple indexes, a RouterQueryEngine, re‑ranking, knowledge‑graph integration, multimodal support, incremental updates, and a rigorous evaluation framework that boosted recall from 60 % to 92 %.

Ops Community

Jun 23, 2026

Background and Initial Failure

A legal team needed a contract-review RAG system that, given a new contract, retrieves similar historical clauses and flags risks. The first implementation used VectorStoreIndex and SimpleDirectoryReader with overlapping 512-token chunks and OpenAI text-embedding-3-small. After deployment, recall was only 60 %, and the legal users complained that the results were "no better than not using it".

Two weeks of debugging revealed five root causes:

Uniform chunking : token-based splitting broke semantic structures such as "甲方/乙方" (Party A/Party B) and "鉴于/因此" (whereas/therefore).

Similarity‑only retrieval : queries like "all force‑majeure clauses" could not be satisfied by pure vector similarity.

Table loss : tabular clauses (e.g., penalty tables) turned into meaningless character blobs after PDF parsing.

Cross‑contract references : separate indexes could not resolve references between contracts.

Missing multimodal data : scanned PDFs lacked a text layer, so pure vector search returned nothing.

Redesign: Multi‑Index + Router Architecture

The team rebuilt the pipeline with the following components:

Multiple indexes (semantic, keyword, knowledge‑graph, SQL, list) co‑existing.

A RouterQueryEngine that selects one or more indexes per query.

Fusion retrieval (Reciprocal Rank Fusion) to merge results.

Reranking with bge‑reranker‑large (top‑n = 5).

Separate handling for tables and knowledge‑graph extraction.

Multimodal vector store for image + text retrieval.

These changes lifted recall from 60 % to 92 % and made the system usable for the legal team.

Five‑Layer Model

The overall architecture can be visualised as five layers:

Query Interface : chat(), query(), aquery() (async).

Router Layer : RouterQueryEngine with LLMSingleSelector or LLMMultiSelector.

Index Layer : VectorStoreIndex, KeywordTableIndex, KnowledgeGraphIndex, SQLDatabase, ListIndex, MultiModalVectorStoreIndex.

Ingestion & Transformation : Readers → Splitters → Metadata extraction → Embedding.

Storage Layer : Vector stores (Chroma, Qdrant, Milvus, Pinecone) + document store + graph store.

End‑to‑End Example

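Query: "列出所有提到‘不可抗力’的合同条款，并按风险等级排序" ("List all contract clauses that mention 'force majeure' and sort them by risk level").

[User] "List all contract clauses that mention 'force majeure' and sort them by risk level"
↓
[RouterQueryEngine] parse intent → select VectorIndex, KeywordIndex, KGIndex
↓ (parallel retrieval)
[VectorIndex] top_k=20 similar clauses
[KeywordIndex] top_k=20 clauses containing the keyword
[KGIndex] retrieve relations of the force_majeure entity
↓
[ReciprocalRankFusion] merge and deduplicate → top_k=15
↓
[Reranker] bge-reranker-large → top_n=5
↓
[Response Synthesizer] LLM synthesizes the answer and returns source links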

Key Code Snippets

1. Index Selection Decision Tree

# index_selector.py
from llama_index.core import (
    VectorStoreIndex, KeywordTableIndex, KnowledgeGraphIndex,
    SQLDatabase, ListIndex,
)
from llama_index.core.indices import MultiModalVectorStoreIndex

def select_index(documents, query_type: str):
    """Build the index (or SQL wrapper) that matches the query type."""
    if query_type == "semantic_similarity":
        return VectorStoreIndex.from_documents(documents)
    elif query_type == "exact_keyword":
        return KeywordTableIndex.from_documents(documents)
    elif query_type == "structured_query":
        # Returns a database wrapper, not an index; the URI is elided in the source.
        return SQLDatabase.from_uri(...)
    elif query_type == "entity_relation":
        return KnowledgeGraphIndex.from_documents(documents)
    elif query_type == "full_context":
        return ListIndex.from_documents(documents)
    elif query_type == "multimodal":
        return MultiModalVectorStoreIndex.from_documents(documents)
    # Fail loudly instead of silently returning None.
    raise ValueError(f"Unknown query_type: {query_type!r}")

2. RouterQueryEngine Setup (single vs. multi)

# router_setup.py
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector, LLMMultiSelector
from llama_index.core.tools import QueryEngineTool

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(),
    name="vector_search",
    description="Semantic similarity search. Suited to fuzzy queries, cross-document links, and conceptual questions."
)

keyword_tool = QueryEngineTool.from_defaults(
    query_engine=keyword_index.as_query_engine(),
    name="keyword_search",
    description="Exact keyword matching. Suited to entity names, reference numbers, and terms the user states explicitly."
)

kg_tool = QueryEngineTool.from_defaults(
    query_engine=kg_index.as_query_engine(),
    name="kg_search",
    description="Entity-relation reasoning. Suited to queries such as 'which contracts reference X'."
)

single_router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, keyword_tool, kg_tool],
    verbose=True,
)

multi_router = RouterQueryEngine(
    selector=LLMMultiSelector.from_defaults(),
    query_engine_tools=[vector_tool, keyword_tool, kg_tool],
    verbose=True,
)

3. FusionRetriever (RRF)

# fusion_retriever.py
from llama_index.core.retrievers import (
    KeywordTableSimpleRetriever,
    KGTableRetriever,
    QueryFusionRetriever,
    VectorIndexRetriever,
)

vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=10)
# Keyword-table lookup (not BM25); num_chunks_per_query caps the returned chunks.
keyword_retriever = KeywordTableSimpleRetriever(index=keyword_index, num_chunks_per_query=10)
kg_retriever = KGTableRetriever(index=kg_index, num_chunks_per_query=5)

fusion_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, keyword_retriever, kg_retriever],
    similarity_top_k=15,  # size of the fused list, as in the end-to-end flow
    num_queries=4,  # total queries including the original, so 3 generated variants
    mode="reciprocal_rerank",
    use_async=True,
)

nodes = fusion_retriever.retrieve("不可抗力条款")  # "force majeure clauses"
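To make the merge step less of a black box, here is a minimal pure-Python sketch of Reciprocal Rank Fusion. It is not LlamaIndex's implementation: the function name, the clause ids, and the constant k=60 (a commonly used default) are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_k=15):
    """Merge several ranked id lists into one list using RRF.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the best hit in a list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the output is stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [node_id for node_id, _ in fused[:top_k]]

# Hypothetical clause ids returned by three retrievers.
vector_hits = ["c12", "c07", "c33"]
keyword_hits = ["c07", "c45", "c12"]
kg_hits = ["c07", "c90"]

merged = reciprocal_rank_fusion([vector_hits, keyword_hits, kg_hits])
print(merged)
```

A clause that several retrievers agree on (c07 here) rises to the top even though it is not first in every list, which is the behaviour the fusion step relies on.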

4. Two‑Stage Reranking

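# reranker_setup.py
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Stage 1: cast a wide net with cheap vector similarity.
retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=20)

# Stage 2: re-score the 20 candidates with a cross-encoder and keep the best 5.
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-large",
    top_n=5,
    device="cuda",
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker],
)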

5. Knowledge‑Graph Index (Neo4j)

# kg_index.py
from llama_index.core import KnowledgeGraphIndex, StorageContext
from llama_index.graph_stores.neo4j import Neo4jGraphStore  # pip install llama-index-graph-stores-neo4j

graph_store = Neo4jGraphStore(
    username="neo4j",
    password="...",
    url="bolt://localhost:7687",
    database="contracts",
)

storage_context = StorageContext.from_defaults(graph_store=graph_store)

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    include_embeddings=True,
)

query_engine = kg_index.as_query_engine(
    include_text=False,
    retriever_mode="keyword",
    response_mode="tree_summarize",
)

6. Multimodal RAG

# multimodal_rag.py
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex

documents = SimpleDirectoryReader("./data/contracts_pdfs").load_data()

# text_store and image_store are vector stores created elsewhere.
storage_context = StorageContext.from_defaults(
    vector_store=text_store,    # text embeddings (text-embedding-3)
    image_store=image_store,    # image embeddings (CLIP)
)

mm_index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

retriever = mm_index.as_retriever(
    similarity_top_k=5,
    image_similarity_top_k=3,
)

results = retriever.retrieve("违约金的支付方式")  # "how liquidated damages are paid"

7. Semantic Splitter vs. Custom Clause Splitter

# semantic_splitter.py
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)

nodes = splitter.get_nodes_from_documents(documents)

# contract_splitter.py
import re

class ContractClauseSplitter:
    """Split a contract along its clause structure."""
    # Matches headings such as "第十二条 不可抗力" (Article 12, Force Majeure).
    CLAUSE_PATTERN = re.compile(r"(第[一二三四五六七八九十百]+条\s*[、\.]?\s*[^\n]+)")

    def split(self, text: str) -> list[str]:
        chunks, current_title, current_body = [], "", []
        for line in text.split("\n"):
            if self.CLAUSE_PATTERN.match(line.strip()):
                # Flush the previous clause, even if it has a title but no body.
                if current_title or current_body:
                    chunks.append(f"{current_title}\n" + "\n".join(current_body))
                current_title = line.strip()
                current_body = []
            else:
                current_body.append(line)
        if current_title or current_body:
            chunks.append(f"{current_title}\n" + "\n".join(current_body))
        return chunks

splitter = ContractClauseSplitter()
chunks = splitter.split(contract_text)

8. Incremental Indexing

# incremental_index.py
from llama_index.core import (
    SimpleDirectoryReader, StorageContext, VectorStoreIndex, load_index_from_storage,
)

# Initial build
index = VectorStoreIndex.from_documents(initial_docs)
index.storage_context.persist(persist_dir="./storage")

# Later load and add new docs
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
new_docs = SimpleDirectoryReader(input_files=["new_contract.pdf"]).load_data()
for doc in new_docs:
    index.insert(doc)  # incremental insert
index.storage_context.persist(persist_dir="./storage")  # save the inserted docs

# Periodic full rebuild (recommended weekly)
index = VectorStoreIndex.from_documents(
    all_docs,
    store_nodes_override=True,
)
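Incremental pipelines also need to know which contracts changed since the last run. The sketch below shows one way to decide that with content hashes. The function names and the in-memory dictionaries are assumptions for illustration; in LlamaIndex, refresh_ref_docs performs comparable bookkeeping when documents carry stable ids.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(known_hashes, current_docs):
    """Compare the hashes saved at the last run with the current corpus.

    known_hashes: dict of doc_id -> hash recorded at the last indexing run
    current_docs: dict of doc_id -> current text
    Returns three sorted lists of doc ids: (to_insert, to_update, to_delete).
    """
    to_insert, to_update = [], []
    for doc_id, text in current_docs.items():
        digest = content_hash(text)
        if doc_id not in known_hashes:
            to_insert.append(doc_id)
        elif known_hashes[doc_id] != digest:
            to_update.append(doc_id)
    to_delete = [doc_id for doc_id in known_hashes if doc_id not in current_docs]
    return sorted(to_insert), sorted(to_update), sorted(to_delete)

# Hypothetical state: "a" was edited, "b" is unchanged, "c" was removed, "d" is new.
known = {"a": content_hash("old"), "b": content_hash("same"), "c": content_hash("gone")}
current = {"a": "new", "b": "same", "d": "fresh"}

to_insert, to_update, to_delete = plan_updates(known, current)
print(to_insert, to_update, to_delete)
```

Only the ids in to_insert and to_update need re-embedding, which keeps the cost of each incremental run proportional to what actually changed.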

Evaluation Framework

Two‑stage evaluation combines offline retrieval metrics (MRR, Hit Rate, NDCG) with generation quality (Faithfulness, Relevancy, Answer Similarity). A golden set of ~100 queries covering typical legal scenarios is used for batch evaluation.

# evaluation.py
import asyncio

from llama_index.core.evaluation import (
    RetrieverEvaluator, FaithfulnessEvaluator, RelevancyEvaluator, BatchEvalRunner,
)

retriever_eval = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate", "precision", "recall", "ndcg"],
    retriever=index.as_retriever(similarity_top_k=10),
)

# Dataset-level evaluation is async; golden_dataset is an EmbeddingQAFinetuneDataset.
eval_results = asyncio.run(retriever_eval.aevaluate_dataset(golden_dataset))

faith_eval = FaithfulnessEvaluator()
relevancy_eval = RelevancyEvaluator()

runner = BatchEvalRunner({"faithfulness": faith_eval, "relevancy": relevancy_eval}, workers=4)
results = runner.evaluate_queries(
    query_engine=query_engine,
    queries=list(golden_dataset.queries.values()),
)
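For readers who want to see what the retrieval metrics measure, the sketch below computes hit rate, reciprocal rank, and NDCG by hand with binary relevance. It is an illustration rather than the library's code (whose NDCG may differ in detail), and the clause ids are made up.

```python
import math

def hit_rate(retrieved, expected):
    """1.0 if any expected id was retrieved, else 0.0."""
    return 1.0 if any(doc_id in expected for doc_id in retrieved) else 0.0

def reciprocal_rank(retrieved, expected):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0

def ndcg(retrieved, expected):
    """NDCG with binary relevance: a relevant hit at rank r gains 1 / log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved, start=1)
        if doc_id in expected
    )
    ideal_hits = min(len(expected), len(retrieved))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Two hypothetical golden queries: (retrieved ids in rank order, expected ids).
cases = [
    (["c3", "c1", "c9"], {"c1"}),  # relevant clause found at rank 2
    (["c2", "c4"], {"c7"}),        # relevant clause missed entirely
]

mean_hit_rate = sum(hit_rate(r, e) for r, e in cases) / len(cases)
mrr = sum(reciprocal_rank(r, e) for r, e in cases) / len(cases)
mean_ndcg = sum(ndcg(r, e) for r, e in cases) / len(cases)
print(mean_hit_rate, mrr, round(mean_ndcg, 4))
```

Hit rate only asks whether the right clause appeared at all, while MRR and NDCG also reward placing it near the top, which is why the three are tracked together.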

Metrics thresholds for production:

Retrieval MRR ≥ 0.7, Hit Rate ≥ 0.85, NDCG ≥ 0.75.

Faithfulness ≥ 0.85, Relevancy ≥ 0.90.

P95 latency ≤ 4 s, cost per query ≤ $0.01.
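The thresholds above can be encoded as a release gate. The sketch below is one possible shape for such a check; the metric key names and the sample measurements are hypothetical, while the limits are the ones listed above.

```python
# Lower bounds: the measured value must be at least this high.
MIN_THRESHOLDS = {
    "mrr": 0.70,
    "hit_rate": 0.85,
    "ndcg": 0.75,
    "faithfulness": 0.85,
    "relevancy": 0.90,
}
# Upper bounds: the measured value must not exceed this.
MAX_THRESHOLDS = {
    "p95_latency_s": 4.0,
    "cost_per_query_usd": 0.01,
}

def failed_checks(metrics):
    """Return the names of metrics that are missing or outside their limit."""
    failures = []
    for name, floor in MIN_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or value < floor:
            failures.append(name)
    for name, ceiling in MAX_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or value > ceiling:
            failures.append(name)
    return failures

# Hypothetical measurements from one evaluation run.
candidate = {
    "mrr": 0.74, "hit_rate": 0.91, "ndcg": 0.78,
    "faithfulness": 0.88, "relevancy": 0.87,
    "p95_latency_s": 3.2, "cost_per_query_usd": 0.008,
}

failures = failed_checks(candidate)
print(failures)
```

Treating a missing metric as a failure keeps a broken evaluation job from passing the gate by accident.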

Online Deployment Checklist

Golden set ≥ 100 queries covering at least five typical query types.

Offline evaluation for each index (MRR, Hit Rate).

End‑to‑end evaluation (Faithfulness + Relevancy) passes thresholds.

Monitoring panels for query type distribution and retrieval latency, with alerting on P95 > 4 s or a Faithfulness drop > 10 %.

Versioned indexes to allow rollback of embedding models.

Incremental indexing pipeline verified (new docs searchable within 5 min).
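The two alert conditions in the checklist can be sketched as follows. The nearest-rank percentile method and the reading of "drop > 10 %" as a relative drop against a baseline are assumptions, as are the function names and sample values.

```python
def p95(latencies):
    """95th percentile by the nearest-rank method."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def triggered_alerts(latencies_s, faithfulness_now, faithfulness_baseline):
    """Return the names of the alert rules that fire for this window."""
    alerts = []
    if p95(latencies_s) > 4.0:
        alerts.append("p95_latency")
    # Assumption: the 10 % drop is measured relative to the baseline score.
    if faithfulness_baseline > 0:
        drop = (faithfulness_baseline - faithfulness_now) / faithfulness_baseline
        if drop > 0.10:
            alerts.append("faithfulness_drop")
    return alerts

# Hypothetical monitoring windows of 20 requests each.
healthy = triggered_alerts([1.0] * 19 + [5.0], 0.89, 0.90)
degraded = triggered_alerts([1.0] * 18 + [5.0] * 2, 0.80, 0.90)
print(healthy, degraded)
```

A single slow request out of twenty does not move the P95, but two do, which is the point of alerting on a percentile rather than on the maximum.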

Common Pitfalls and Solutions

Pitfall 1: Uniform token chunking destroys semantics

Solution: Use a custom clause splitter for legal documents or SemanticSplitterNodeParser for generic texts, and attach metadata such as clause_id for filtering.
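As an illustration of attaching clause metadata, the sketch below splits on clause headings and tags each chunk with a clause_id taken from the heading. The dictionary layout, the contract_id field, and the sample contract text are assumptions, not the project's actual schema.

```python
import re

# Heading prefix such as "第二条" (Article 2).
CLAUSE_TITLE = re.compile(r"^第[一二三四五六七八九十百]+条")

def split_with_metadata(text, contract_id):
    """Split contract text on clause headings and tag every chunk."""
    chunks = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = CLAUSE_TITLE.match(stripped)
        if match:
            current = {
                "text": stripped,
                "metadata": {"contract_id": contract_id, "clause_id": match.group(0)},
            }
            chunks.append(current)
        elif current is not None:
            current["text"] += "\n" + stripped
        # Lines before the first heading (the preamble) are skipped in this sketch.
    return chunks

# Made-up two-clause contract: Article 1 (definitions), Article 2 (force majeure).
sample = "第一条 定义\n甲方指委托方。\n第二条 不可抗力\n因不可抗力不能履行合同的，免除责任。"

chunks = split_with_metadata(sample, contract_id="demo-001")
force_majeure = [c for c in chunks if c["metadata"]["clause_id"] == "第二条"]
print(len(chunks), force_majeure[0]["text"])
```

With clause_id stored as metadata, a query can be restricted to one clause type before any similarity search runs.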

Pitfall 2: Mismatched embedding models between indexing and querying

Solution: Centralise the embedding model in a config file and assert consistency at startup.
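One way to implement the startup assertion is to store a small manifest next to the index and compare it with the service configuration. The manifest file name and the function names below are hypothetical; the dimensions are the default output sizes of the two OpenAI models.

```python
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = "embedding_manifest.json"  # hypothetical file name

def write_manifest(index_dir, model_name, dimensions):
    """Record which embedding model built the index stored in index_dir."""
    manifest = {"model": model_name, "dimensions": dimensions}
    (Path(index_dir) / MANIFEST_NAME).write_text(json.dumps(manifest), encoding="utf-8")

def assert_embedding_matches(index_dir, model_name, dimensions):
    """Fail fast at startup if the query-time model differs from the index-time one."""
    raw = (Path(index_dir) / MANIFEST_NAME).read_text(encoding="utf-8")
    manifest = json.loads(raw)
    expected = (manifest["model"], manifest["dimensions"])
    if expected != (model_name, dimensions):
        raise RuntimeError(
            f"Index was built with {expected}, but the service is configured "
            f"with {(model_name, dimensions)}; rebuild the index or fix the config."
        )

with tempfile.TemporaryDirectory() as index_dir:
    write_manifest(index_dir, "text-embedding-3-small", 1536)
    assert_embedding_matches(index_dir, "text-embedding-3-small", 1536)  # same model: passes
    try:
        assert_embedding_matches(index_dir, "text-embedding-3-large", 3072)
        mismatch_detected = False
    except RuntimeError:
        mismatch_detected = True

print(mismatch_detected)
```

Failing at startup is cheaper than discovering the mismatch through silently degraded recall.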

Pitfall 3: Tables become garbled text

Solution: Parse PDFs with a dedicated parser (LlamaParse, Unstructured.io), store tables as structured CSV/DataFrame, and add metadata like {"contains_table": true}.
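Once a parser has returned the table as rows, storing it in structured form is straightforward. The sketch below serialises a table to CSV text and attaches the metadata flag; the record layout, column names, and penalty values are invented for illustration.

```python
import csv
import io

def table_to_record(header, rows, clause_id):
    """Serialise a parsed table to CSV text and tag it for later filtering."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return {
        "text": buffer.getvalue(),
        "metadata": {
            "contains_table": True,
            "clause_id": clause_id,
            "columns": list(header),
        },
    }

# Hypothetical penalty table extracted from a contract.
record = table_to_record(
    header=["days_late", "penalty_rate"],
    rows=[["1-7", "0.05%"], ["8-30", "0.1%"]],
    clause_id="penalty",
)
print(record["text"])
```

Keeping the header row with the values means a retrieved chunk still says what each number refers to, which is exactly what was lost when the table was flattened into a character blob.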

Pitfall 4: Context window overflow

Solution: Apply post‑processing filters (similarity cutoff) and a second‑stage reranker to reduce the number of nodes before synthesis.
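The filtering idea can be sketched without any framework: drop nodes under a similarity cutoff, then fill a token budget from the best node down. The cutoff, the budget, and the tuple layout are assumptions; in LlamaIndex the cutoff part corresponds to SimilarityPostprocessor.

```python
def trim_context(nodes, similarity_cutoff=0.7, max_tokens=3000):
    """Keep the best-scoring nodes that clear the cutoff and fit the budget.

    nodes: list of (text, score, token_count) tuples.
    """
    kept, used = [], 0
    for text, score, tokens in sorted(nodes, key=lambda node: node[1], reverse=True):
        if score < similarity_cutoff:
            break  # everything after this point scores lower still
        if used + tokens > max_tokens:
            continue  # too large for the remaining budget; a smaller node may still fit
        kept.append(text)
        used += tokens
    return kept

# Hypothetical retrieved nodes: (text, similarity score, token count).
candidates = [
    ("A", 0.90, 1200),
    ("B", 0.80, 1500),
    ("C", 0.75, 800),
    ("D", 0.50, 100),
]

context = trim_context(candidates)
print(context)
```

Here C clears the cutoff but does not fit in the remaining budget, and D fits but is not relevant enough, so only A and B reach the synthesizer.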

Pitfall 5: Low‑quality knowledge‑graph triples

Solution: Use a stricter extraction prompt that limits triples to concrete entities and caps the number per chunk.
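A stricter prompt might look like the sketch below. The wording is illustrative and not the team's actual prompt. The placeholder names max_knowledge_triplets and text are chosen to mirror what LlamaIndex's default triplet template is believed to use, so that the string could be wrapped in a PromptTemplate and passed as kg_triple_extract_template; verify both against the installed version.

```python
STRICT_TRIPLET_PROMPT = (
    "Extract at most {max_knowledge_triplets} knowledge triplets from the text below.\n"
    "Rules:\n"
    "- Subject and object must be concrete entities named in the text "
    "(parties, contracts, clauses, amounts, dates).\n"
    "- Skip pronouns and generic nouns such as 'it' or 'the party'.\n"
    "- Write one triplet per line as (subject, predicate, object).\n"
    "- If nothing qualifies, return an empty answer.\n"
    "---------------------\n"
    "Text: {text}\n"
    "Triplets:\n"
)

def build_prompt(text, max_triplets=5):
    """Fill the template for one chunk, capping the number of triplets."""
    return STRICT_TRIPLET_PROMPT.format(max_knowledge_triplets=max_triplets, text=text)

# Hypothetical chunk of contract text.
prompt = build_prompt("Contract A-17 references the penalty clause of Contract B-02.")
print(prompt)
```

Lowering the cap and naming the allowed entity types trades some coverage for a cleaner graph, which matters more when the graph is used for cross-contract references.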

Pitfall 6: Multimodal index cost explosion

Solution: Downscale images, embed only key images with CLIP/SigLIP, and limit image_top_k to 2‑3.

Optimization Roadmap

Short term: Router + multi‑index + reranking (current production).

Mid term: LLM‑driven automatic index selection, dynamic semantic splitting.

Long term: Cross‑modal reasoning, automated hyper‑parameter tuning (DPO/RAGAS), heterogeneous federated retrieval.

Cheat‑Sheet

Index choice : Small homogeneous corpus → VectorStoreIndex; Structured docs → add KGIndex; Multimodal → MultiModalVectorStoreIndex; Cross‑source → RouterQueryEngine.

Router selector : LLMMultiSelector for ambiguous intents, LLMSingleSelector for clear intents.

Reranking : bge‑reranker‑large, top_n=3‑5.

Splitting : Use SemanticSplitterNodeParser (generic) or custom ContractClauseSplitter (legal).

Evaluation : RetrieverEvaluator + FaithfulnessEvaluator + RelevancyEvaluator; monitor latency and cost.

Incremental updates : index.insert(doc) for single docs, weekly full rebuild for consistency.

