How Knowledge Graphs Are Transforming Multi‑Modal AI: A Deep Survey
This comprehensive survey examines over 300 recent papers on knowledge‑graph‑driven multi‑modal learning and multi‑modal knowledge graphs, outlining key tasks, datasets, benchmarks, challenges, and future directions, while highlighting the impact of large language models and multimodal pre‑training techniques.
Introduction
The survey analyzes more than 300 papers published between 2020 and 2023, focusing on two major research directions: knowledge‑graph‑driven multimodal learning (KG4MM) and multimodal knowledge graphs (MM4KG). It defines the basic concepts of knowledge graphs (KGs) and multimodal knowledge graphs (MMKGs), discusses construction and evolution pipelines, and reviews KG‑aware multimodal tasks such as image classification, visual question answering, and multimodal KG completion. The paper also provides task definitions and benchmark datasets, and highlights emerging trends, including large language models (LLMs) and multimodal pre‑training.
Paper title: Knowledge Graphs Meet Multi‑Modal Learning: A Comprehensive Survey
ArXiv link: http://arxiv.org/abs/2402.05391
Project repository: https://github.com/zjukg/KG-MM-Survey
Pages: 54, Citations: 617, Tables: 11, Figures: 13
KG‑driven Multimodal Learning (KG4MM)
Understanding & Reasoning Tasks
Visual Question Answering (VQA)
Visual Question Generation
Visual Dialog
Classification Tasks
Image Classification
Fake News Detection
Movie Genre Classification
Content Generation Tasks
Image Captioning
Visual Storytelling
Conditional Text‑to‑Image Generation
Scene Graph Generation
Retrieval Tasks
Cross‑Modal Retrieval
Visual Referring Expressions & Grounding
KG‑aware Multimodal Pre‑training
Structure‑knowledge‑aware Pre‑training
Knowledge‑graph‑aware Pre‑training
Multimodal Knowledge Graphs (MM4KG)
Resources
Publicly available MMKGs are listed in the survey, together with construction methods that combine image annotation with KG symbols or align image‑derived triples to large‑scale KGs.
Acquisition & Construction
Acquisition pipelines extract multimodal triples from images and align them with existing KGs, enabling large‑scale, triple‑level multimodal data generation.
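To make the acquisition step above more concrete, here is a minimal sketch of aligning image‑derived triples to an existing KG. It is an illustration under stated assumptions, not the survey's method: the `MMKG` container, the `align_image_triples` helper, and the label‑normalization rule are all hypothetical, and the triples are assumed to have already been extracted from the image by some upstream model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


@dataclass
class MMKG:
    """Hypothetical container: symbolic triples plus images attached to entities."""
    triples: set = field(default_factory=set)
    images: dict = field(default_factory=dict)  # entity name -> list of image references

    def add_triple(self, head, relation, tail):
        self.triples.add(Triple(head, relation, tail))

    def attach_image(self, entity, image_ref):
        self.images.setdefault(entity, []).append(image_ref)


def normalize(label):
    # Assumed matching rule: case-insensitive, spaces treated as underscores.
    return label.strip().lower().replace(" ", "_")


def align_image_triples(kg, image_ref, extracted, known_entities):
    """Keep only extracted triples whose head and tail both match known KG entities."""
    index = {normalize(e): e for e in known_entities}
    kept = []
    for head, relation, tail in extracted:
        h = index.get(normalize(head))
        t = index.get(normalize(tail))
        if h is None or t is None:
            continue  # no KG entity to align with, so the triple is dropped
        kg.add_triple(h, relation, t)
        kg.attach_image(h, image_ref)
        kg.attach_image(t, image_ref)
        kept.append((h, relation, t))
    return kept


kg = MMKG()
kept = align_image_triples(
    kg,
    "img_001.jpg",
    [("Golden Retriever", "sitting_on", "sofa"), ("unicorn", "near", "Sofa")],
    ["golden_retriever", "sofa"],
)
print(kept)
```

Exact label matching is the simplest possible alignment rule; a real pipeline would more plausibly use entity linking or embedding similarity at this step.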
Core MMKG Tasks
Multimodal Entity Alignment
Multimodal Entity Linking & Disambiguation
Multimodal Knowledge Graph Completion
Multimodal Knowledge Graph Reasoning
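As an illustration of the first task in the list above, the following sketch aligns entities across two MMKGs by fusing a structural embedding with a visual embedding and matching by cosine similarity. This is one common family of approaches rather than a specific method from the survey; the function names, the weighted‑concatenation fusion, and the toy vectors are assumptions made for the example.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(struct_vec, visual_vec, visual_weight=0.5):
    """Weighted concatenation of normalized structural and visual embeddings."""
    s = [(1.0 - visual_weight) * x for x in l2_normalize(struct_vec)]
    v = [visual_weight * x for x in l2_normalize(visual_vec)]
    return s + v


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def align_entities(source, target, visual_weight=0.5):
    """Map each source entity to the target entity with the most similar fused embedding."""
    fused_target = {name: fuse(s, v, visual_weight) for name, (s, v) in target.items()}
    matches = {}
    for name, (s, v) in source.items():
        query = fuse(s, v, visual_weight)
        matches[name] = max(fused_target, key=lambda t: cosine(query, fused_target[t]))
    return matches


# Toy embeddings: (structural vector, visual vector) per entity.
source = {
    "Paris_en": ([1.0, 0.0], [0.9, 0.1]),
    "Berlin_en": ([0.0, 1.0], [0.1, 0.9]),
}
target = {
    "Paris_fr": ([0.9, 0.1], [1.0, 0.0]),
    "Berlin_de": ([0.1, 0.9], [0.0, 1.0]),
}
matches = align_entities(source, target)
print(matches)
```

The `visual_weight` parameter controls how much the image modality contributes; setting it to 0 reduces the sketch to purely structural alignment.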
Fusion & Inference
MMKG Fusion techniques
MMKG Inference methods
MMKG‑driven Applications
Retrieval
Pre‑training for multimodal large language models (MLLMs)
AI for Science
Industry applications
Challenges and Opportunities
Construction & Acquisition
Key questions include how to obtain ideal multimodal knowledge, what features an ideal MMKG should have, and whether MMKGs provide unique benefits beyond LLMs.
Feature Refinement & Hierarchical Design
Future MMKGs should be hierarchical, allowing automatic decomposition of large multimodal data and supporting fine‑grained semantic segmentation (e.g., using Segment Anything).
Abstract vs. Concrete Concepts
Abstract concepts may correspond to abstract visual representations, while concrete concepts align with specific images; visual frequency influences representation fidelity.
Storage Efficiency
MMKGs require significantly more storage than traditional KGs, posing challenges for efficient data handling across tasks.
Quality Control
MMKGs face modality‑aware quality issues such as noisy, outdated, or misaligned images; regular updates and quality‑scoring mechanisms are needed.
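One way to picture a quality‑scoring mechanism of the kind mentioned above is to combine an image‑entity alignment score with a freshness term that decays over time. The sketch below is purely illustrative: the scoring formula, the one‑year half‑life, the review threshold, and the record fields are assumptions, not values from the survey.

```python
from datetime import date


def quality_score(alignment, fetched_on, today, half_life_days=365):
    """Multiply an alignment score in [0, 1] by an exponential freshness decay."""
    age_days = max((today - fetched_on).days, 0)
    freshness = 0.5 ** (age_days / half_life_days)
    return alignment * freshness


def flag_for_review(records, today, threshold=0.4):
    """Return image references whose combined score falls below the threshold."""
    return [
        r["image"]
        for r in records
        if quality_score(r["alignment"], r["fetched_on"], today) < threshold
    ]


today = date(2024, 1, 1)
records = [
    {"image": "a.jpg", "alignment": 0.9, "fetched_on": date(2023, 12, 1)},  # recent, well aligned
    {"image": "b.jpg", "alignment": 0.9, "fetched_on": date(2021, 1, 1)},   # well aligned but old
    {"image": "c.jpg", "alignment": 0.2, "fetched_on": date(2024, 1, 1)},   # fresh but misaligned
]
flagged = flag_for_review(records, today)
print(flagged)
```

In this toy setup an image is flagged either because it is outdated or because it is poorly aligned with its entity, which mirrors the two kinds of issue named in the paragraph.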
KG4MM Specific Tasks
Multi‑modal Content Generation
Multi‑modal Task Integration
Scaling MMKGs for Multi‑modal Tasks
Unlocking Large‑Scale MMKG Potential
MM4KG Specific Tasks
MMKG Fusion
MMKG Inference
Transferring Multi‑modal Tasks to the MMKG Paradigm
Using Multi‑modal Tasks to Augment In‑MMKG Tasks
Large Language Models (LLMs) in the Context of KG/MMKG
Fine‑tuning
Hallucination mitigation
Agent development
Retrieval‑augmented generation (RAG)
Model editing
Preference alignment
MMKG refinement
Mixture‑of‑Experts (MoE) for MMKGs
