How Knowledge Graphs Are Transforming Multi‑Modal AI: A Deep Survey

This comprehensive survey examines over 300 recent papers on knowledge‑graph‑driven multi‑modal learning and multi‑modal knowledge graphs, outlining key tasks, datasets, benchmarks, challenges, and future directions, while highlighting the impact of large language models and multimodal pre‑training techniques.

NewBeeNLP

Introduction

The survey analyzes more than 300 papers published between 2020 and 2023, focusing on two major research directions: knowledge‑graph‑driven multimodal learning (KG4MM) and multimodal knowledge graphs (MM4KG). It defines the basic concepts of knowledge graphs (KGs) and multimodal knowledge graphs (MMKGs), discusses construction and evolution pipelines, and reviews KG‑aware multimodal tasks such as image classification and visual question answering alongside MMKG tasks such as multimodal KG completion. It also provides task definitions and benchmark datasets, and highlights emerging trends including large language models (LLMs) and multimodal pre‑training.

Paper title: Knowledge Graphs Meet Multi‑Modal Learning: A Comprehensive Survey
ArXiv link: http://arxiv.org/abs/2402.05391
Project repository: https://github.com/zjukg/KG-MM-Survey
Pages: 54, References: 617, Tables: 11, Figures: 13
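
To make the KG and MMKG concepts from the introduction concrete, here is a minimal sketch of one way to represent them: a KG as a set of (head, relation, tail) triples, and an MMKG as the same triples plus images attached to entities. The class names, method names, and example file name are hypothetical and chosen for illustration; the survey does not prescribe this layout.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """One KG fact: (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str


@dataclass
class MultimodalKG:
    """Hypothetical minimal MMKG: symbolic triples plus images per entity."""
    triples: set = field(default_factory=set)
    # entity name -> list of image references (illustrative file names)
    images: dict = field(default_factory=dict)

    def add_triple(self, head, relation, tail):
        self.triples.add(Triple(head, relation, tail))

    def attach_image(self, entity, image_ref):
        self.images.setdefault(entity, []).append(image_ref)

    def entities(self):
        # Every entity that appears as a head or a tail.
        return {t.head for t in self.triples} | {t.tail for t in self.triples}

    def neighbors(self, entity):
        # Outgoing (relation, tail) pairs for one entity.
        return {(t.relation, t.tail) for t in self.triples if t.head == entity}


mmkg = MultimodalKG()
mmkg.add_triple("Eiffel Tower", "located_in", "Paris")
mmkg.add_triple("Paris", "capital_of", "France")
mmkg.attach_image("Eiffel Tower", "eiffel_tower_01.jpg")
```

Dropping the `images` mapping leaves an ordinary KG, which is the sense in which an MMKG extends a KG with additional modalities.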

KG‑driven Multimodal Learning (KG4MM)

Understanding & Reasoning Tasks

Visual Question Answering (VQA)

Visual Question Generation

Visual Dialog

Classification Tasks

Image Classification

Fake News Detection

Movie Genre Classification

Content Generation Tasks

Image Captioning

Visual Storytelling

Conditional Text‑to‑Image Generation

Scene Graph Generation

Retrieval Tasks

Cross‑Modal Retrieval

Visual Referring Expressions & Grounding

KG‑aware Multimodal Pre‑training

Structure‑knowledge‑aware Pre‑training

Knowledge‑graph‑aware Pre‑training

Multimodal Knowledge Graphs (MM4KG)

Resources

The survey lists publicly available MMKGs together with their construction methods, which either annotate images with KG symbols or align image‑derived triples to large‑scale KGs.

Acquisition & Construction

Acquisition pipelines extract multimodal triples from images and align them with existing KGs, enabling large‑scale, triple‑level multimodal data generation.
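
The alignment step described above can be sketched roughly as follows. This is an illustration under strong assumptions, not the method of any particular paper: the triples are taken as already extracted from images, the entity list and image IDs are made up, and plain string similarity from Python's standard `difflib` stands in for whatever entity-linking model a real pipeline would use.

```python
from difflib import get_close_matches

# Hypothetical canonical entity names from an existing KG.
KG_ENTITIES = ["dog", "frisbee", "person", "park"]


def link_mention(mention, kg_entities, cutoff=0.8):
    """Map a surface mention to the closest KG entity, or None if no match."""
    lookup = {e.lower(): e for e in kg_entities}
    matches = get_close_matches(mention.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


def align_triples(extracted, kg_entities, cutoff=0.8):
    """Split image-derived triples into KG-aligned and unaligned ones.

    `extracted` holds (image_id, head, relation, tail) tuples. A triple is
    aligned only if both its head and tail link to a KG entity.
    """
    aligned, unaligned = [], []
    for image_id, head, relation, tail in extracted:
        h = link_mention(head, kg_entities, cutoff)
        t = link_mention(tail, kg_entities, cutoff)
        if h is not None and t is not None:
            # Keep the image ID so the fact stays grounded in its source image.
            aligned.append((h, relation, t, image_id))
        else:
            unaligned.append((image_id, head, relation, tail))
    return aligned, unaligned


extracted = [
    ("img_001.jpg", "Dog", "catches", "frisbee"),
    ("img_002.jpg", "persons", "sits_in", "Park"),
    ("img_003.jpg", "unicorn", "eats", "dog"),
]
aligned, unaligned = align_triples(extracted, KG_ENTITIES)
```

Keeping the unaligned triples rather than discarding them leaves room for a later pass that adds new entities to the KG or sends the triples for manual review.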

Core MMKG Tasks

Multimodal Entity Alignment

Multimodal Entity Linking & Disambiguation

Multimodal Knowledge Graph Completion

Multimodal Knowledge Graph Reasoning

Fusion & Inference

MMKG Fusion techniques

MMKG Inference methods

MMKG‑driven Applications

Retrieval

Pre‑training for multimodal large language models (MLLMs)

AI for Science

Industry applications

Challenges and Opportunities

Construction & Acquisition

Key questions include how to obtain ideal multimodal knowledge, what features an ideal MMKG should have, and whether MMKGs provide unique benefits beyond LLMs.

Feature Refinement & Hierarchical Design

Future MMKGs should be hierarchical, allowing automatic decomposition of large multimodal data and supporting fine‑grained semantic segmentation (e.g., using Segment Anything).

Abstract vs. Concrete Concepts

Abstract concepts may correspond to abstract visual representations, while concrete concepts align with specific images; visual frequency influences representation fidelity.

Storage Efficiency

MMKGs require significantly more storage than traditional KGs, posing challenges for efficient data handling across tasks.

Quality Control

MMKGs face modality‑aware quality issues such as noisy, outdated, or misaligned images; regular updates and quality‑scoring mechanisms are needed.
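
As one possible reading of a quality‑scoring mechanism, the sketch below combines two signals for each entity image: how well the image matches the entity (cosine similarity between precomputed embeddings) and how recent the image is (exponential decay with a half-life). The formula, the function names, the half-life, and the threshold are all assumptions made for illustration; the survey does not specify a scoring scheme.

```python
import math
from datetime import date


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def quality_score(entity_vec, image_vec, captured, today, half_life_days=365.0):
    """Hypothetical score in [0, 1]: alignment multiplied by freshness."""
    # Misaligned images get low (clamped at zero) alignment.
    alignment = max(0.0, cosine(entity_vec, image_vec))
    # Outdated images decay: the score halves every `half_life_days`.
    age_days = max(0, (today - captured).days)
    freshness = 0.5 ** (age_days / half_life_days)
    return alignment * freshness


def flag_low_quality(records, today, threshold=0.3):
    """Return the image IDs whose score falls below the threshold.

    `records` holds (image_id, entity_vec, image_vec, captured_date) tuples.
    """
    return [
        image_id
        for image_id, entity_vec, image_vec, captured in records
        if quality_score(entity_vec, image_vec, captured, today) < threshold
    ]
```

Flagged images could then be queued for replacement during the regular updates mentioned above; in practice the embeddings would come from a vision-language encoder, which is outside the scope of this sketch.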

KG4MM Specific Tasks

Multi‑modal Content Generation

Multi‑modal Task Integration

Scaling MMKGs for Multi‑modal Tasks

Unlocking Large‑Scale MMKG Potential

MM4KG Specific Tasks

MMKG Fusion

MMKG Inference

Transferring Multi‑modal Tasks to the MMKG Paradigm

Using Multi‑modal Tasks to Augment In‑MMKG Tasks

Large Language Models (LLMs) in the Context of KG/MMKG

Fine‑tuning

Hallucination mitigation

Agent development

Retrieval‑augmented generation (RAG)

Model editing

Preference alignment

MMKG refinement

Mixture‑of‑Experts (MoE) for MMKGs

