How JoyAgent Enables Multimodal RAG for Enterprise Knowledge Management

JoyAgent, JD's open‑source intelligent‑agent platform, now adds multimodal Retrieval‑Augmented Generation (RAG) capabilities, combining graph‑based knowledge, hierarchical chunking, and vision‑language models to handle text, images, tables, and API data for enterprise knowledge processing and evaluation.


Background

Traditional Retrieval‑Augmented Generation (RAG) works well for pure text but cannot directly handle structured or visual information such as tables, charts, scanned images, or data that lives in real‑time systems (CRM, ERP). When applied to enterprise knowledge bases, this limitation leads to incomplete retrieval and hallucinated answers from large language models.

Key Challenges

Multimodal Gap: Enterprise documents contain images, embedded tables, and diagrams that a text‑only RAG pipeline cannot interpret, causing loss of critical knowledge.

Data Quality and Dynamics: Knowledge bases often have inconsistent formats, outdated entries, and frequent updates. Core business data is stored in dynamic services that traditional RAG cannot query.

JoyAgent Multimodal Knowledge Management Architecture

The system is divided into a Knowledge Processing Layer and a Knowledge Usage Layer.

Knowledge Processing Layer

Temporal Knowledge Graph

JoyAgent integrates graphiti, a framework for building time‑aware knowledge graphs. It supports incremental updates and dual‑time modeling (event time vs. ingestion time), enabling reasoning over evolving business data without full re‑indexing.

Handles heterogeneous, evolving data sources for decision‑making and automation.

Provides temporal attributes useful for audit compliance, supply‑chain trend prediction, and historical analyses.
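The sketch below shows how time‑stamped business events could be fed into graphiti and queried. It follows graphiti's published quick‑start; the connection details, episode content, and exact parameter names are assumptions that may vary by version.

```python
import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType

async def main():
    # graphiti persists the temporal graph in Neo4j (credentials are placeholders).
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")
    await graphiti.build_indices_and_constraints()

    # Each episode carries a reference_time (event time); graphiti tracks
    # ingestion time separately, giving the dual-time model described above.
    await graphiti.add_episode(
        name="q3-supplier-report",
        episode_body="Supplier Acme delayed Q3 shipments by two weeks.",
        source=EpisodeType.text,
        source_description="ERP export",
        reference_time=datetime(2024, 9, 30, tzinfo=timezone.utc),
    )

    # Hybrid search over the graph; new episodes are merged incrementally,
    # so no full re-index is required.
    for edge in await graphiti.search("Which suppliers delayed shipments in Q3?"):
        print(edge.fact)

asyncio.run(main())
```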

Multi‑Document, Multi‑Source Support

Supports a wide range of file formats (Excel, Word, PDF, PPT, images) and video content (automatic speech recognition and key‑frame extraction). A unified document schema and specialized parsers convert each input into a consistent structure for downstream indexing.

API‑based ingestion is also available, allowing agents to discover and invoke external services (e.g., ERP, CRM) on demand.
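The unified schema itself is not published in the article, so the following is a hypothetical sketch of what a common chunk record with per‑format parser registration could look like; every name here is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DocChunk:
    """Hypothetical unified record that every parser emits for indexing."""
    doc_id: str
    modality: str              # "text", "table", "image", "video_frame", ...
    content: str               # raw text, a serialized table, or an asset path
    hierarchy: list[str] = field(default_factory=list)  # e.g. ["Ch. 2", "2.3 Logistics"]
    source_format: str = ""    # "pdf", "docx", "xlsx", "pptx", ...

PARSERS: dict[str, Callable[[str], list[DocChunk]]] = {}

def parser_for(ext: str):
    """Register a parser so ingestion can dispatch on file extension."""
    def register(fn):
        PARSERS[ext] = fn
        return fn
    return register

@parser_for(".xlsx")
def parse_excel(path: str) -> list[DocChunk]:
    # A real parser would read sheets and serialize each table region.
    return [DocChunk(doc_id=path, modality="table",
                     content="<serialized sheet>", source_format="xlsx")]

def ingest(path: str) -> list[DocChunk]:
    ext = path[path.rfind("."):].lower()
    return PARSERS[ext](path)
```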

Knowledge Usage Layer – Multimodal RAG

Multi‑Structure Indexing

Combines three complementary indexes:

Graph‑based GraphRAG for entity‑edge reasoning.

Tag‑based keyword index for fast lookup.

Embedding index for dense vector similarity.
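The article does not say how hits from these three indexes are merged; reciprocal‑rank fusion is one common choice, sketched below with fabricated chunk IDs.

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Reciprocal-rank fusion: each index contributes 1/(k + rank) per chunk."""
    scores = defaultdict(float)
    for ranked in result_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

graph_hits = ["c7", "c2", "c9"]   # GraphRAG entity-edge matches
tag_hits   = ["c2", "c5"]         # tag/keyword index
dense_hits = ["c2", "c7", "c5"]   # embedding similarity
print(rrf_fuse([graph_hits, tag_hits, dense_hits]))  # "c2" ranks first
```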

Hierarchical chunking preserves semantic boundaries (headings, sections) in long documents, improving retrieval relevance.

Hierarchical Chunk Index

Content is split according to its logical hierarchy (e.g., chapter → section → paragraph). This fine‑grained partitioning enables the retriever to locate precise passages while maintaining context for downstream generation.
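JoyAgent's actual splitter is not shown in the article; the rough sketch below assumes markdown‑style headings and tags each chunk with its heading path so retrieved passages keep their context.

```python
import re

def hierarchical_chunks(text: str, max_chars: int = 800) -> list[dict]:
    """Split along heading boundaries; each chunk records its heading path."""
    path: list[str] = []
    chunks: list[dict] = []
    buf: list[str] = []

    def flush():
        if buf:
            chunks.append({"path": " > ".join(path), "text": "\n".join(buf).strip()})
            buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:  # a heading starts a new chunk and updates the hierarchy path
            flush()
            level = len(m.group(1))
            path[:] = path[:level - 1] + [m.group(2)]
        else:
            buf.append(line)
            if sum(len(l) for l in buf) > max_chars:  # oversize sections split further
                flush()
    flush()
    return chunks
```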

GraphRAG Retrieval

Entities and relationships are stored in a graph, allowing multi‑hop reasoning such as product sales → customer feedback → supply‑chain adjustment. Temporal edges record validity periods, so queries can target a specific historical state. Reported experiments show up to a 35 % accuracy boost on complex queries compared with flat vector retrieval.
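A toy version of this idea, using networkx as a stand‑in for JoyAgent's actual graph store: edges carry validity periods, and a two‑hop walk only follows edges valid on the chosen historical date. The entities and dates are fabricated.

```python
from datetime import date
import networkx as nx

g = nx.MultiDiGraph()
# Temporal edges record validity periods, matching the dual-time model.
g.add_edge("ProductA", "CustomerX", relation="purchased_by",
           valid_from=date(2024, 1, 1), valid_to=date(2024, 6, 30))
g.add_edge("CustomerX", "ComplaintTicket42", relation="filed",
           valid_from=date(2024, 3, 5), valid_to=date(9999, 1, 1))

def neighbors_at(graph, node, as_of):
    """One hop, restricted to edges valid on the query date."""
    for _, dst, data in graph.out_edges(node, data=True):
        if data["valid_from"] <= as_of <= data["valid_to"]:
            yield dst, data["relation"]

# Two-hop walk: product -> customer -> feedback, as of a historical date.
as_of = date(2024, 4, 1)
for cust, _ in neighbors_at(g, "ProductA", as_of):
    for node, rel in neighbors_at(g, cust, as_of):
        print("ProductA ->", cust, f"-[{rel}]->", node)
```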

Agentic Search

Instead of a single‑turn “retrieve‑then‑read” flow, the system can invoke an agent that plans multiple steps, decides whether to use a traditional retriever or an agentic tool, and orchestrates external APIs as needed. Users can enable or disable this mode via configuration.
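In spirit, agentic search replaces the single retrieval call with a plan‑act loop. The schematic below is not JoyAgent's interface: `llm` stands for any text‑completion callable and `tools` maps tool names to callables.

```python
def agentic_search(query: str, llm, tools: dict, max_steps: int = 4) -> str:
    """Plan-act loop: each step the LLM picks a tool (or FINISH) and an argument."""
    evidence: list[str] = []
    for _ in range(max_steps):
        decision = llm(
            f"Question: {query}\nEvidence so far: {evidence}\n"
            f"Reply with '<tool>: <argument>' using one of {sorted(tools)}, "
            f"or 'FINISH' when the evidence is sufficient."
        )
        action, _, arg = decision.partition(":")
        if action.strip().upper() == "FINISH":
            break
        evidence.append(tools[action.strip()](arg.strip()))
    return llm(f"Answer the question '{query}' using this evidence: {evidence}")

# Usage sketch: tools = {"chunk_retriever": search_fn, "crm_api": crm_fn}
# answer = agentic_search("Top complaint driver last quarter?", llm_fn, tools)
```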

Multimodal Retrieval

Vision‑language models (VLMs) are integrated to handle image‑based queries. The workflow, illustrated in the sketch after the steps, is:

1. Submit the user query and any uploaded images to the VLM.

2. Merge VLM answers with text‑based chunk results.

3. Pass the combined context to the LLM for final generation.
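JoyAgent's internal VLM interface is not shown in the article, so the sketch below expresses the three steps through an OpenAI‑compatible client as a stand‑in; the model name, retriever, and image paths are placeholders.

```python
import base64
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible VLM endpoint

def vlm_describe(query: str, image_path: str, model: str = "gpt-4o") -> str:
    """Step 1: send the user query plus an uploaded image to the VLM."""
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": query},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

def answer(query: str, image_paths: list[str], retrieve_text_chunks) -> str:
    """Steps 2-3: merge VLM answers with text-chunk hits, then generate."""
    visual = [vlm_describe(query, p) for p in image_paths]
    chunks = retrieve_text_chunks(query)
    prompt = (f"Question: {query}\nVisual evidence: {visual}\n"
              f"Retrieved passages: {chunks}\nAnswer:")
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```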

Supported multimodal tools include:

Image Q&A – direct question answering, summarization, or translation on uploaded images.

Image Search – vector similarity search for images or text queries.

Agents can combine these tools with a textual search utility to address scenarios such as detecting anomalies in warehouse photos and retrieving related handling procedures.
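The article does not name the model behind Image Search; a CLIP‑style dual encoder is one plausible realization, sketched here with Hugging Face transformers. The model choice and file names are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths: list[str]) -> torch.Tensor:
    """Build a unit-normalized image index (one row per image)."""
    inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def image_search(query: str, paths: list[str], index: torch.Tensor):
    """Rank indexed images by cosine similarity to a text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (index @ q.T).squeeze(1)
    order = scores.argsort(descending=True)
    return [(paths[i], scores[i].item()) for i in order]

# Usage sketch: index = embed_images(["dock_01.jpg", "dock_02.jpg"])
# print(image_search("water damage on pallets", ["dock_01.jpg", "dock_02.jpg"], index))
```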

Evaluation

On the public DoubleBench benchmark, JoyAgent was compared with MDocAgent, Colqwen‑gen, ViDoRAG, and M3DOCRAG. Answers were graded by GPT‑4o on a 0‑10 scale; scores ≥7 were counted as correct. JoyAgent achieved a 76.2 % correctness rate, outperforming the other multimodal QA systems.
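The grading rule reduces to a simple threshold count, sketched below; the scores are illustrative placeholders, not benchmark data.

```python
def correctness_rate(judge_scores: list[int], threshold: int = 7) -> float:
    """Fraction of 0-10 judge scores at or above the pass threshold."""
    return sum(s >= threshold for s in judge_scores) / len(judge_scores)

print(f"{correctness_rate([9, 6, 8, 7, 3]):.1%}")  # 60.0% on these toy grades
```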

Internal tests on 150 queries across 500 enterprise documents also showed higher accuracy and relevance than competing solutions (Coze, IMA).

Future Directions

Planned work includes tighter integration of Agentic and GraphRAG capabilities so that agents can dynamically choose between graph‑based retrieval and external tool execution. Continued advances in multimodal processing will expand support for additional data types, moving toward a comprehensive enterprise‑wide intelligent system.

Open‑Source Resources

Repository: https://github.com/jd-opensource/joyagent-jdgenie

Multimodal RAG documentation: https://github.com/jd-opensource/joyagent-jdgenie/blob/data_agent/README_mrag.md

Illustrations

Traditional RAG architecture
Hierarchical Chunk Index
Multimodal Retrieval Workflow
Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.