Advanced LlamaIndex Indexing, Routing, and Multimodal RAG Strategies

The article walks through a real‑world legal‑contract RAG project that stalled at 60% recall, diagnoses five root causes, and demonstrates how combining multiple LlamaIndex indexes, a Router, fusion retrieval, re‑ranking, knowledge‑graph and multimodal support raises recall to 92% while outlining evaluation metrics, latency trade‑offs, and practical deployment checklists.

MaGe Linux Operations

Jun 21, 2026

Background and problem

A contract‑review RAG system was built for a legal department. The goal was to retrieve similar clauses from 1,000 historical contracts and provide risk hints for a new contract. The initial pipeline used VectorStoreIndex + SimpleDirectoryReader, 512‑token overlapping chunks, and OpenAI text-embedding-3-small. After deployment the recall was only 60 % and the legal team said the results were "no better than not using the system".

Two weeks of debugging revealed five root causes:

Chunking by token length split semantically important sections (e.g., clause headings) in half.

Similarity‑only retrieval ignored structural constraints, so queries like "all force‑majeure clauses" returned only similar sentences, not every relevant clause.

Table loss – tabular clauses became a garbled string after PDF parsing.

Cross‑contract references were broken because each contract was indexed separately.

Missing multimodal support – scanned PDFs without a text layer could not be retrieved.
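
The first root cause is easy to reproduce without any library. The sketch below is only an illustration: it uses whitespace-separated words as a stand-in for tokens and a made-up two-clause text, whereas the real pipeline used a tokenizer and 512-token windows.

```python
def fixed_window_chunks(words, size, overlap):
    """Slide a fixed-size window over the words, ignoring document structure."""
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


# Hypothetical contract text with two clauses.
text = (
    "Article 7 Payment. The buyer pays within thirty days of invoice. "
    "Article 8 Force Majeure. Neither party is liable for delays "
    "caused by events beyond its control."
)

chunks = fixed_window_chunks(text.split(), size=12, overlap=2)
for chunk in chunks:
    print(repr(chunk))
```

The first window ends on a dangling "Article", and the force-majeure clause is spread over the second and third windows, so no single chunk contains the clause as a whole.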

Rebuilt five‑layer architecture

┌──────────────────────────────────────────────────────────────────┐
│ ① Query Interface Layer (chat()/query()/aquery())                │
│   └─ streaming APIs                                            │
└──────────────────────────────────────────────────────────────────┘
                ↓
┌──────────────────────────────────────────────────────────────────┐
│ ② Router Layer (RouterQueryEngine)                              │
│   └─ SingleSelector / MultiSelector                              │
└──────────────────────────────────────────────────────────────────┘
                ↓ ↑
┌──────────────────────────────────────────────────────────────────┐
│ ③ Index Layer                                                   │
│   ├─ Vector (semantic)   ├─ Keyword (BM25)   ├─ Knowledge Graph │
│   ├─ SQL               ├─ List            ├─ MultiModal (image+text)│
└──────────────────────────────────────────────────────────────────┘
                ↓ ↑
┌──────────────────────────────────────────────────────────────────┐
│ ④ Ingestion & Transformation                                    │
│   Readers → Splitters → Metadata Extraction → Embedding          │
└──────────────────────────────────────────────────────────────────┘
                ↓ ↑
┌──────────────────────────────────────────────────────────────────┐
│ ⑤ Storage Layer (vector DB + document store + graph store)      │
└──────────────────────────────────────────────────────────────────┘

The rebuilt pipeline combined multiple indexes, a router, re‑ranking, table handling and a knowledge‑graph backend, raising recall from 60 % to 92 %.

Index selection decision tree

# index_selector.py
from llama_index.core import (
    VectorStoreIndex, KeywordTableIndex, KnowledgeGraphIndex,
    SQLDatabase, SummaryIndex,
)
from llama_index.core.indices import MultiModalVectorStoreIndex

def select_index(documents, query_type: str):
    """Return the index (or SQL wrapper) that matches the query type."""
    if query_type == "semantic_similarity":
        return VectorStoreIndex.from_documents(documents)
    elif query_type == "exact_keyword":
        return KeywordTableIndex.from_documents(documents)
    elif query_type == "structured_query":
        # Not an index: SQLDatabase wraps an existing database for text-to-SQL engines
        return SQLDatabase.from_uri(...)
    elif query_type == "entity_relation":
        return KnowledgeGraphIndex.from_documents(documents)
    elif query_type == "full_context":
        return SummaryIndex.from_documents(documents)  # formerly ListIndex
    elif query_type == "multimodal":
        return MultiModalVectorStoreIndex.from_documents(documents)
    raise ValueError(f"Unknown query type: {query_type}")

RouterQueryEngine: single vs. multi selector

# router_setup.py
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector, LLMMultiSelector
from llama_index.core.tools import QueryEngineTool

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(),
    name="vector_search",
    description="Semantic similarity search. Suitable for fuzzy queries, cross‑doc relevance, conceptual questions."
)

keyword_tool = QueryEngineTool.from_defaults(
    query_engine=keyword_index.as_query_engine(),
    name="keyword_search",
    description="Exact keyword match. Suitable when the user specifies a concrete entity, ID or term."
)

kg_tool = QueryEngineTool.from_defaults(
    query_engine=kg_index.as_query_engine(),
    name="kg_search",
    description="Entity‑relation reasoning. Suitable for queries like 'which contracts cite X'."
)

# Single‑selector router – picks one tool per query
single_router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, keyword_tool, kg_tool],
    verbose=True,
)

# Multi‑selector router – can run several tools in parallel
multi_router = RouterQueryEngine(
    selector=LLMMultiSelector.from_defaults(),
    query_engine_tools=[vector_tool, keyword_tool, kg_tool],
    verbose=True,
)

# Example usage
response = single_router.query("force majeure clause")  # expected choice: keyword_tool (the LLM selector decides)
response = multi_router.query("list all contracts mentioning data security and show their citation graph")  # expected choice: vector + KG, results combined
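
The difference between the two selectors can be shown with a toy stand-in. The sketch below does not call an LLM and is not how LLMSingleSelector or LLMMultiSelector work internally; it scores each tool by overlap with a hypothetical cue-word list, then shows "pick the best one" versus "pick every tool that qualifies".

```python
# Hypothetical cue words per tool; a real router lets the LLM read the
# tool descriptions instead of matching keywords.
TOOL_CUES = {
    "vector_search": {"similar", "concept", "about", "mentioning"},
    "keyword_search": {"clause", "id", "term"},
    "kg_search": {"cite", "cites", "citation", "graph"},
}


def score_tools(query):
    """Count how many cue words of each tool appear in the query."""
    words = set(query.lower().split())
    return {name: len(words & cues) for name, cues in TOOL_CUES.items()}


def select_single(query):
    """Single selection: exactly one tool, the highest-scoring one."""
    scores = score_tools(query)
    return max(scores, key=scores.get)


def select_multi(query, min_score=1):
    """Multi selection: every tool whose score reaches the minimum."""
    scores = score_tools(query)
    return [name for name, score in scores.items() if score >= min_score]


print(select_single("force majeure clause"))
print(select_multi(
    "list all contracts mentioning data security and show their citation graph"
))
```

With a multi selector the answers of all chosen tools have to be merged afterwards, which costs extra LLM calls; that is the latency trade-off to weigh against the better coverage.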

FusionRetriever: multi‑way retrieval fusion

# fusion_retriever.py
from llama_index.core.retrievers import (
    VectorIndexRetriever, KeywordTableSimpleRetriever,
    KGTableRetriever, QueryFusionRetriever,
)

vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=10)
# Keyword-table lookup stands in for BM25 here; a true BM25 retriever ships in a separate package
bm25_retriever = KeywordTableSimpleRetriever(index=keyword_index, num_chunks_per_query=10)
kg_retriever = KGTableRetriever(index=kg_index, num_chunks_per_query=5)

fusion_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever, kg_retriever],
    similarity_top_k=10,         # size of the fused result list
    num_queries=4,               # 4 queries in total: the original plus generated variants
    mode="reciprocal_rerank",    # reciprocal rank fusion
    use_async=True,
)

nodes = fusion_retriever.retrieve("force majeure clause")
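
To make the "reciprocal_rerank" mode less of a black box, here is a minimal pure-Python sketch of reciprocal rank fusion. It is an illustration of the technique, not the library's implementation: the node IDs are made up, and k=60 is the constant commonly used for RRF, which I am assuming rather than quoting from LlamaIndex.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked ID lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the order stable.
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))


# Hypothetical results from three retrievers for the same query.
vector_hits = ["c8", "c12", "c3"]
keyword_hits = ["c12", "c8", "c40"]
kg_hits = ["c12", "c5"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits, kg_hits])
print(fused)
```

Only ranks matter, never raw scores, which is why the method can combine retrievers whose scores live on incomparable scales (cosine similarity, keyword counts, graph hits).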

Two‑stage re‑ranking

# reranker_setup.py
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Stage 1: coarse vector retrieval (top_k=20)
retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=20)

# Stage 2: rerank top_n=5 with a cross‑encoder
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-large",
    top_n=5,
    device="cuda",
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker],
)

response = query_engine.query("...")

Effect: General RAG recall@5 improved from 70 % to 85 %; specialized domains (legal/medical) improved from 55 % to 78 %.

Knowledge‑graph index (Neo4j backend)

# kg_index.py
from llama_index.core import KnowledgeGraphIndex, StorageContext
from llama_index.graph_stores.neo4j import Neo4jGraphStore  # pip install llama-index-graph-stores-neo4j

graph_store = Neo4jGraphStore(
    username="neo4j",
    password="...",
    url="bolt://localhost:7687",
    database="contracts",
)

storage_context = StorageContext.from_defaults(graph_store=graph_store)
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    include_embeddings=True,
)

query_engine = kg_index.as_query_engine(
    include_text=False,
    retriever_mode="keyword",
    response_mode="tree_summarize",
)

Custom triple‑extraction prompt (to avoid generic relations):

CUSTOM_KG_TMPL = """
Extract (subject, relation, object) triples from the text.
Constraints:
1. Subject and object must be concrete entities mentioned in the text.
2. Relation must be a short verb phrase (e.g., "cites", "restricts").
3. Discard generic relations like "contains" or "belongs to".
4. Do not hallucinate.
5. Return at most 5 most important triples.

Text: {text}

Output JSON list:
[ {"subject": "...", "relation": "...", "object": "..."}, ... ]
"""

Multimodal RAG (images + text + tables)

# multimodal_rag.py
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex

documents = SimpleDirectoryReader("./data/contracts_pdfs").load_data()
# text_store and image_store are vector stores created elsewhere
storage_context = StorageContext.from_defaults(
    vector_store=text_store,     # text‑embedding‑3
    image_store=image_store,     # CLIP embeddings
)
mm_index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

retriever = mm_index.as_retriever(
    similarity_top_k=5,
    image_similarity_top_k=3,
)

results = retriever.retrieve("payment method for breach of contract")

Use CLIP or SigLIP for image embeddings (not a text model).

Run text and image retrievers in parallel for each query.

Pass retrieved images to multimodal LLMs such as GPT‑4o or Gemini‑1.5 during synthesis.

Token cost: a high‑resolution image ≈ 1,700 tokens.

Semantic chunking vs. token chunking

# semantic_splitter.py
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)

nodes = splitter.get_nodes_from_documents(documents)

Compared with SimpleNodeParser, semantic splitting yields 10‑20 % higher recall.

Custom contract clause splitter

# contract_splitter.py
import re

class ContractClauseSplitter:
    """Split Chinese contracts by clause headings like "第八条"."""
    CLAUSE_PATTERN = re.compile(r"(第[一二三四五六七八九十百]+条\s*[、\.]?\s*[^\n]+)")

    def split(self, text: str) -> list[str]:
        chunks, cur_title, cur_body = [], "", []
        for line in text.split("\n"):
            if self.CLAUSE_PATTERN.match(line.strip()):
                # A new heading closes the previous clause (or the preamble)
                if cur_title or cur_body:
                    chunks.append(f"{cur_title}\n" + "\n".join(cur_body))
                cur_title = line.strip()
                cur_body = []
            else:
                cur_body.append(line)
        # Flush the last clause, even if it consists of a heading only
        if cur_title or cur_body:
            chunks.append(f"{cur_title}\n" + "\n".join(cur_body))
        return chunks

splitter = ContractClauseSplitter()
chunks = splitter.split(contract_text)

Incremental indexing

# incremental_index.py
from llama_index.core import (
    SimpleDirectoryReader, StorageContext, VectorStoreIndex, load_index_from_storage,
)

# Initial build
index = VectorStoreIndex.from_documents(initial_docs)
index.storage_context.persist(persist_dir="./storage")

# Load and update
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
new_docs = SimpleDirectoryReader(input_files=["new_contract.pdf"]).load_data()
for doc in new_docs:
    index.insert(doc)  # fast incremental insert

# Periodic full rebuild (e.g., weekly)
index = VectorStoreIndex.from_documents(all_docs, store_nodes_override=True)
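
Incremental inserts only stay cheap if unchanged files are skipped. One way to decide what to do with each incoming document is a content-hash registry, sketched below. The registry layout and function names are my own assumptions and are independent of LlamaIndex; the resulting lists would drive index.insert() for new files and a delete-then-insert for changed ones.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(registry, incoming):
    """Compare incoming documents with the hashes recorded at the last indexing run.

    registry: doc_id -> hash of the version already indexed
    incoming: doc_id -> current text
    """
    plan = {"insert": [], "refresh": [], "unchanged": []}
    for doc_id, text in incoming.items():
        if doc_id not in registry:
            plan["insert"].append(doc_id)
        elif registry[doc_id] != content_hash(text):
            plan["refresh"].append(doc_id)
        else:
            plan["unchanged"].append(doc_id)
    return plan


# Hypothetical state: two contracts already indexed, three arriving.
registry = {
    "contract_001.pdf": content_hash("old text"),
    "contract_002.pdf": content_hash("same text"),
}
incoming = {
    "contract_001.pdf": "new text",
    "contract_002.pdf": "same text",
    "contract_003.pdf": "brand new",
}
plan = plan_updates(registry, incoming)
print(plan)
```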

Evaluation suite (recall, faithfulness, latency)

# evaluation.py
from llama_index.core.evaluation import (
    RetrieverEvaluator, FaithfulnessEvaluator,
    RelevancyEvaluator, BatchEvalRunner,
)

golden = [
    {"query": "What is the definition of force majeure?",
     "expected_sources": ["contract_001.pdf#第8条", "contract_005.pdf#第12条"]},
    # ... up to 100 queries
]

retriever_eval = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate", "precision", "recall", "ndcg"],
    retriever=index.as_retriever(similarity_top_k=10),
)
# expected_ids are matched against retrieved node IDs, so the golden sources
# must use the same IDs that were assigned to the nodes at ingestion time
retriever_results = [
    retriever_eval.evaluate(query=g["query"], expected_ids=g["expected_sources"])
    for g in golden
]

faith_eval = FaithfulnessEvaluator()
relevancy_eval = RelevancyEvaluator()
runner = BatchEvalRunner({"faithfulness": faith_eval, "relevancy": relevancy_eval}, workers=4)
run_results = runner.evaluate_queries(query_engine=query_engine, queries=[g["query"] for g in golden])
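
It is worth knowing what the retrieval metrics actually compute before trusting a dashboard. The following is a plain-Python sketch of hit rate, reciprocal rank and recall@k for a single query, using made-up node IDs; averaging reciprocal rank over all golden queries gives MRR.

```python
def hit_rate(retrieved, expected):
    """1.0 if at least one expected ID was retrieved, else 0.0."""
    return 1.0 if any(node_id in expected for node_id in retrieved) else 0.0


def reciprocal_rank(retrieved, expected):
    """1 / rank of the first relevant result; 0.0 if none was retrieved."""
    for rank, node_id in enumerate(retrieved, start=1):
        if node_id in expected:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved, expected, k):
    """Share of the expected IDs that appear in the top k results."""
    if not expected:
        return 0.0
    found = set(retrieved[:k]) & set(expected)
    return len(found) / len(set(expected))


# Hypothetical ranking for one query with three relevant nodes.
retrieved = ["n7", "n1", "n9", "n2"]
expected = {"n1", "n2", "n3"}

print(hit_rate(retrieved, expected))
print(reciprocal_rank(retrieved, expected))
print(recall_at_k(retrieved, expected, k=3))
print(recall_at_k(retrieved, expected, k=4))
```

Hit rate is satisfied by a single relevant result, so for "return every relevant clause" queries recall@k is the metric that matches what the legal team complained about.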

Production health thresholds (must be met):

Retrieval MRR ≥ 0.7

Hit Rate @K ≥ 0.85

NDCG @10 ≥ 0.75

Faithfulness ≥ 0.85

Relevancy ≥ 0.90

P50 latency ≤ 1.5 s, P95 latency ≤ 4 s

Token cost per query ≤ $0.01
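
These thresholds are easiest to enforce when they live in code and gate the release. The sketch below checks the quality and latency thresholds listed above; the metric key names, the nearest-rank percentile method and the sample numbers are assumptions of mine, not part of any library.

```python
# Minimum quality scores, taken from the list above.
THRESHOLDS = {
    "mrr": 0.70, "hit_rate": 0.85, "ndcg": 0.75,
    "faithfulness": 0.85, "relevancy": 0.90,
}
# Maximum latency in seconds per percentile.
LATENCY_LIMITS_S = {50: 1.5, 95: 4.0}


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct % of the samples."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def health_failures(metrics, latencies_s):
    """Return a human-readable list of violated thresholds (empty means healthy)."""
    failures = []
    for name, minimum in THRESHOLDS.items():
        value = metrics.get(name, 0.0)
        if value < minimum:
            failures.append(f"{name} {value:.2f} < {minimum:.2f}")
    for pct, limit in LATENCY_LIMITS_S.items():
        observed = nearest_rank_percentile(latencies_s, pct)
        if observed > limit:
            failures.append(f"P{pct} latency {observed:.2f}s > {limit:.2f}s")
    return failures


# Hypothetical evaluation run: good scores, one slow outlier.
metrics = {"mrr": 0.74, "hit_rate": 0.90, "ndcg": 0.80,
           "faithfulness": 0.88, "relevancy": 0.93}
latencies = [0.8] * 18 + [3.0, 5.0]
print(health_failures(metrics, latencies))
```

A non-empty list can fail the CI job directly, which keeps "must be met" from becoming a matter of opinion.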

Deployment checklist

Golden set ≥ 100 queries covering at least five typical intents.

Each index passes offline evaluation (MRR, Hit Rate) against the thresholds.

End‑to‑end evaluation (Faithfulness ≥ 0.85, Relevancy ≥ 0.90) passes.

Monitoring panel shows query‑type distribution and retrieval latency percentiles.

Alert if P95 latency > 4 s for 5 min.

Alert if Faithfulness score drops > 10 %.

Index versioning to allow rollback after embedding model change.

Incremental indexing pipeline can ingest new docs within 5 min.

Common pitfalls and fixes

Pitfall 1 – Token‑based chunking breaks semantic structure

Symptom: Retrieved snippets are incomplete clause fragments.

Root cause: Using TokenTextSplitter with a fixed 512‑token window.

Solution:

For legal/academic documents, use a custom clause splitter (see ContractClauseSplitter).

For generic text, switch to SemanticSplitterNodeParser (10‑20 % better recall).

Add metadata such as clause_id or section for downstream filtering.
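
The metadata step is what makes structural queries such as "all force‑majeure clauses" answerable by a filter instead of by similarity. A minimal sketch follows, assuming chunks that start with a heading like "第八条 不可抗力"; the dict-based node shape and the sample clauses are illustrative and not a LlamaIndex type.

```python
import re

# Heading such as "第八条 不可抗力": clause number, optional separator, title.
HEADING = re.compile(r"^(第[一二三四五六七八九十百]+条)\s*[、\.]?\s*(.*)$")


def to_node(chunk, source):
    """Wrap a clause chunk in a dict carrying clause_id and section metadata."""
    first_line = chunk.split("\n", 1)[0].strip()
    match = HEADING.match(first_line)
    metadata = {
        "source": source,
        "clause_id": match.group(1) if match else None,
        "section": match.group(2).strip() if match else None,
    }
    return {"text": chunk, "metadata": metadata}


# Hypothetical clause chunks from one contract.
clause_chunks = [
    "第七条 付款方式\n买方应在收到发票后三十日内付款。",
    "第八条 不可抗力\n因不可抗力导致的迟延,双方互不承担责任。",
]
nodes = [to_node(chunk, "contract_001.pdf") for chunk in clause_chunks]

# Exhaustive structural lookup: every force-majeure clause, not just the most similar ones.
force_majeure = [n for n in nodes if n["metadata"]["section"] == "不可抗力"]
print(force_majeure[0]["metadata"])
```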

Pitfall 2 – Inconsistent embedding models

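Symptom: The index was built with text-embedding-3-small but queries are embedded with text-embedding-3-large. The two models produce vectors in different spaces (and, by default, of different dimensions: 1536 vs. 3072), so similarity scores become meaningless or the lookup fails outright.

Solution: Centralise the embedding model in a config file and assert consistency at startup.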

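# config.py
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

EMBED_MODEL = "text-embedding-3-small"
EMBED_DIM = 1536

# One global embedding model for both indexing and querying
embed_model = OpenAIEmbedding(model=EMBED_MODEL)
Settings.embed_model = embed_model

# `index` is the index loaded at startup; _embed_model is a private
# attribute and may change between library versions
assert index._embed_model.model_name == EMBED_MODEL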
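
A check that does not depend on library internals is to store the embedding configuration next to the persisted index and compare it on every load. The sketch below assumes a hypothetical embedding.json file inside the persist directory; the file name and helper functions are mine.

```python
import json
import tempfile
from pathlib import Path

FINGERPRINT_FILE = "embedding.json"  # hypothetical file name


def write_fingerprint(persist_dir, model, dim):
    """Record which embedding model built the index stored in persist_dir."""
    path = Path(persist_dir) / FINGERPRINT_FILE
    path.write_text(json.dumps({"model": model, "dim": dim}))


def check_fingerprint(persist_dir, model, dim):
    """Raise if the configured model differs from the one that built the index."""
    stored = json.loads((Path(persist_dir) / FINGERPRINT_FILE).read_text())
    configured = {"model": model, "dim": dim}
    if stored != configured:
        raise RuntimeError(
            f"Index was built with {stored}, but the service is configured with {configured}"
        )


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    write_fingerprint(tmp, "text-embedding-3-small", 1536)
    check_fingerprint(tmp, "text-embedding-3-small", 1536)  # passes silently
    try:
        check_fingerprint(tmp, "text-embedding-3-large", 3072)
    except RuntimeError as err:
        mismatch_message = str(err)

print(mismatch_message)
```

The same file doubles as the version marker needed for rolling back after an embedding model change.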

Pitfall 3 – Table loss after PDF parsing

Symptom: Tabular clauses become a single unreadable string.

Solution:

Use a dedicated PDF parser (LlamaParse, Unstructured.io) that preserves tables as markdown.

Store tables separately as structured data (CSV/DataFrame) and index them with KeywordTableIndex.

Add metadata like {"contains_table": true, "table_type": "fee_structure"} for selective retrieval.

# LlamaParse example
from llama_parse import LlamaParse
# result_type="markdown" returns tables as markdown tables
parser = LlamaParse(api_key="llx-...", result_type="markdown")
docs = parser.load_data("contract.pdf")
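
Once tables survive parsing as markdown, turning them into structured rows is straightforward. The sketch below handles a simple pipe table without escaped pipes or merged cells; the fee table and its values are invented for illustration.

```python
def parse_markdown_table(markdown):
    """Convert a simple markdown pipe table into a list of row dicts."""
    lines = [
        line.strip()
        for line in markdown.strip().splitlines()
        if line.strip().startswith("|")
    ]

    def cells(line):
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = cells(lines[0])
    rows = []
    for line in lines[1:]:
        values = cells(line)
        # Skip the separator row (|----|:---:|), which contains only -, : and spaces.
        if all(set(value) <= set("-: ") for value in values):
            continue
        rows.append(dict(zip(header, values)))
    return rows


# Hypothetical fee table as it might come back from a markdown-preserving parser.
table_md = """
| Item | Fee (RMB) |
|------|-----------|
| Setup | 5000 |
| Annual maintenance | 1200 |
"""
rows = parse_markdown_table(table_md)
print(rows)
```

Each row can then be stored as CSV or a DataFrame record, with the table-level metadata attached for selective retrieval.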

Pitfall 4 – Context‑window overflow

Symptom: Ten retrieved nodes (~500 tokens each, ~5,000 tokens in total) exceed a 4k‑token LLM limit, truncating crucial information.

Solution: Apply a post‑processor pipeline to filter and re‑rank before synthesis.

# response synthesis with filtering and re‑ranking before the LLM call
from llama_index.core.postprocessor import SentenceTransformerRerank, SimilarityPostprocessor
from llama_index.core.response_synthesizers import get_response_synthesizer
synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    use_async=True,
)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    response_synthesizer=synthesizer,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7),  # drop weak matches
        SentenceTransformerRerank(top_n=3),              # keep the best three
    ],
)
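
The post-processors above cap the number of nodes, not the number of tokens. If an explicit budget is wanted, a greedy filter like the sketch below can run as a last step. It assumes each node already carries a relevance score and a token count; the 4,096-token window and the 1,000 tokens reserved for prompt and answer are illustrative numbers.

```python
def fit_to_budget(nodes, max_context_tokens, reserved_tokens):
    """Keep the highest-scoring nodes whose combined size fits the remaining budget."""
    budget = max_context_tokens - reserved_tokens
    kept, used = [], 0
    for node in sorted(nodes, key=lambda n: n["score"], reverse=True):
        if used + node["tokens"] > budget:
            continue  # too large for what is left; a smaller node may still fit
        kept.append(node)
        used += node["tokens"]
    return kept, used


# Ten hypothetical nodes of 500 tokens each, as in the symptom above.
candidates = [
    {"id": f"n{i}", "score": 1.0 - i * 0.05, "tokens": 500} for i in range(10)
]
kept, used = fit_to_budget(candidates, max_context_tokens=4096, reserved_tokens=1000)
print([node["id"] for node in kept], used)
```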

Pitfall 5 – Low‑quality knowledge‑graph triples

Symptom: KG retrieval returns generic triples like ("contract", "contains", "clause").

Root cause: Prompt for triple extraction is too vague.

Solution: Use a stricter prompt (see CUSTOM_KG_TMPL) and cap the number of triples per chunk.

Pitfall 6 – Multimodal index cost explosion

Symptom: 1,000 high‑resolution images (~1,700 tokens each) cost $30 per query.

Solution:

Resize images to ≤ 1024 × 1024 and compress to JPEG quality 80.

Use CLIP/SigLIP embeddings instead of text models.

Index only key images (diagrams, figures) and skip decorative graphics.

During retrieval, limit image_top_k to 2‑3.
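
The effect of limiting the number of images is plain arithmetic. The sketch below uses the ~1,700 tokens per image quoted above; the price per million tokens is a parameter on purpose, because I am not assuming any particular model's pricing.

```python
TOKENS_PER_IMAGE = 1700  # approximate figure for a high-resolution image


def image_tokens(n_images, tokens_per_image=TOKENS_PER_IMAGE):
    """Input tokens consumed by passing n_images to a multimodal LLM."""
    return n_images * tokens_per_image


def cost_usd(tokens, usd_per_million_tokens):
    """Convert a token count into dollars for a given (caller-supplied) price."""
    return tokens / 1_000_000 * usd_per_million_tokens


tokens_all = image_tokens(1000)   # every indexed image sent to the LLM
tokens_top3 = image_tokens(3)     # only the top 3 retrieved images

print(tokens_all, tokens_top3, tokens_all // tokens_top3)
```

Sending three images instead of a thousand cuts the image tokens from 1.7 million to about 5 thousand per query, so the retrieval limit matters far more than image compression.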

Optimization roadmap

Each line below reads current → mid‑term → long‑term.

Indexing: single index + vector search → Router + multi‑index + re‑ranking → LLM‑driven automatic index selection.

Chunking: static token → semantic/structured → LLM‑driven dynamic.

Modality: text‑only RAG → multimodal (image + text + table) → cross‑modal reasoning.

Evaluation: offline only → online A/B + user feedback → automated optimisation (DPO / RAGAS).

Retrieval: synchronous → async + caching + pre‑warming → proactive recommendation.

Storage: single vector store → hybrid (vector + BM25 + graph) → federated heterogeneous retrieval.

One‑page cheat sheet

┌──────────────────────────────────────────────────────────┐
│  LlamaIndex Advanced Cheat Sheet                           │
├──────────────────────────────────────────────────────────┤
│  Index choice: small homogeneous → VectorStoreIndex       │
│                structured docs → + KGIndex                │
│                multimodal → MultiModalVectorStoreIndex    │
│                cross‑source → RouterQueryEngine           │
│  Router: LLMMultiSelector (complex) / LLMSingleSelector│
│  Re‑ranking: bge‑reranker‑large, top_n=3‑5                │
│  Chunking: SemanticSplitter (generic) / custom clause split│
│  Evaluation: RetrieverEvaluator + FaithfulnessEvaluator│
│  Latency: P95 ≤ 4 s, monitor percentiles + alerts       │
│  Optimisation: coarse top_k=20 → re‑rank top_n=5 → synth   │
│  Incremental: index.insert() per doc, weekly full rebuild │
└──────────────────────────────────────────────────────────┘

Final note

Bottom line: VectorStoreIndex is a good starter, but complex, heterogeneous corpora need multiple indexes + Router + re‑ranking + systematic evaluation. Without an evaluation framework, any claimed improvement is self‑deception.
