Knowledge Graph‑Based Multimodal Semantic Understanding at Baidu
This article outlines how Baidu applies its large‑scale knowledge graph in AI: why multimodal semantic understanding is needed, the challenges of text and video comprehension, and the technical solutions (entity annotation, conceptualization, knowledge networks, and multimodal fusion) that support search, recommendation, and visual question answering.
Knowledge graphs are becoming increasingly valuable for artificial intelligence applications, and Baidu has built a massive, general‑purpose knowledge graph that is widely used in search, recommendation, and intelligent interaction products. As text, speech, and visual technologies advance, knowledge graphs face new challenges and opportunities in complex knowledge representation and multimodal semantic understanding.
Background : Baidu’s video products (information flow, short videos, iQiyi, etc.) require deep multimodal semantic understanding to support the core video business. Traditional perception‑only methods (face detection, OCR) cannot capture fine‑grained user interests or contextual knowledge; knowledge‑enhanced semantic understanding is intended to close this gap.
Goals and Value : By leveraging knowledge graphs, Baidu aims to perform multidimensional semantic analysis of users and resources, providing reasoning and computation capabilities for downstream intelligent applications. The value lies in truly understanding the knowledge behind resources and enabling computation and inference based on the graph.
Text Semantic Understanding : Text is parsed along three dimensions: entities, concepts, and relations. The steps are entity recognition, linking to the knowledge graph, concept generalization, and relation extraction. Challenges include sparse short texts, new entities not yet covered by the graph, and diverse business scenarios. Solutions involve knowledge‑enhanced annotation, deep neural networks, and componentized operators that can be combined to suit each scenario.
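To illustrate what componentized operators could look like, here is a minimal sketch in which each step is a function that takes a document dictionary and returns an enriched copy, so a business scenario can assemble only the steps it needs. The lexicon, the concept table, and all function names are hypothetical and stand in for real models; this is not Baidu's implementation.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Operator = Callable[[Doc], Doc]

# Hypothetical toy resources standing in for a real knowledge graph.
ENTITY_LEXICON = {"Baidu": "org/baidu", "ERNIE": "model/ernie"}
CONCEPTS = {
    "org/baidu": "technology company",
    "model/ernie": "pre-trained language model",
}


def recognize_entities(doc: Doc) -> Doc:
    """Find mentions by simple substring lookup (a stand-in for a neural tagger)."""
    text = str(doc["text"])
    mentions = [name for name in ENTITY_LEXICON if name in text]
    return {**doc, "mentions": mentions}


def link_entities(doc: Doc) -> Doc:
    """Map each mention to a knowledge-graph identifier."""
    return {**doc, "entities": [ENTITY_LEXICON[m] for m in doc["mentions"]]}


def generalize_concepts(doc: Doc) -> Doc:
    """Attach a higher-level concept to each linked entity."""
    return {**doc, "concepts": [CONCEPTS[e] for e in doc["entities"]]}


def build_pipeline(operators: List[Operator]) -> Callable[[str], Doc]:
    """Compose operators in order into a single callable."""
    def run(text: str) -> Doc:
        doc: Doc = {"text": text}
        for op in operators:
            doc = op(doc)
        return doc
    return run


# Two scenarios sharing the same operators but with different depth.
full = build_pipeline([recognize_entities, link_entities, generalize_concepts])
light = build_pipeline([recognize_entities, link_entities])

full_result = full("Baidu trained ERNIE")
light_result = light("Baidu trained ERNIE")
```

Because every operator has the same signature, adding relation extraction or swapping the lookup for a model only changes the list passed to `build_pipeline`.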
Entity Annotation : Mentions are recognized in the input text, candidate entities are ranked using knowledge‑based embeddings that combine attributes and structural relations, and the top candidate is linked to the knowledge base. Recognition of new entities is improved with knowledge‑supervised data generation and Baidu’s ERNIE pre‑trained model.
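The candidate‑ranking step can be sketched as a similarity search between a context vector and one embedding per candidate entity. The sketch below assumes cosine similarity as the scoring function and uses made‑up three‑dimensional vectors; the article does not say which scoring function or embedding model Baidu uses, and the entity identifiers are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(
    context_vec: List[float], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every candidate against the context and sort best-first."""
    scored = [(eid, cosine(context_vec, vec)) for eid, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up embeddings for the ambiguous mention "Apple".
candidates = {
    "Apple_(company)": [0.9, 0.1, 0.0],
    "Apple_(fruit)": [0.1, 0.9, 0.2],
}
# Made-up context vector for a sentence about a phone launch.
context_vec = [0.8, 0.2, 0.1]

ranked = rank_candidates(context_vec, candidates)
best_entity = ranked[0][0]  # the candidate that would be linked
```

In practice the candidate vectors would encode the attributes and structural relations mentioned above, and a score threshold would be needed to decide when no candidate fits and the mention should be treated as a new entity.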
Conceptualization : Beyond named entity recognition, the system identifies the most appropriate higher‑level concept for each entity in context, using a knowledge network and random walk to infer suitable concepts.
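One way to picture concept inference by random walk is a random walk with restart over a small network that mixes entities and concepts: the walk restarts at the terms found in the text, and the candidate concept that accumulates the most probability is chosen. The graph, the restart probability, and the function names below are assumptions made for illustration; the article does not specify the walk variant Baidu uses.

```python
from typing import Dict, Iterable, List


def random_walk_with_restart(
    graph: Dict[str, List[str]],
    seeds: Iterable[str],
    restart: float = 0.3,
    iterations: int = 50,
) -> Dict[str, float]:
    """Approximate the stationary distribution by repeated propagation.

    Every node in `graph` must have at least one neighbor.
    """
    seed_set = set(seeds)
    restart_dist = {
        node: (1.0 / len(seed_set) if node in seed_set else 0.0) for node in graph
    }
    prob = dict(restart_dist)
    for _ in range(iterations):
        # With probability `restart`, jump back to a seed node.
        nxt = {node: restart * restart_dist[node] for node in graph}
        # Otherwise move to a uniformly chosen neighbor.
        for node, neighbors in graph.items():
            share = (1.0 - restart) * prob[node] / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share
        prob = nxt
    return prob


# Toy undirected network (adjacency lists); contents are illustrative only.
graph = {
    "Apple": ["company", "fruit"],
    "iPhone": ["company", "smartphone"],
    "company": ["Apple", "iPhone"],
    "fruit": ["Apple", "pie"],
    "smartphone": ["iPhone"],
    "pie": ["fruit"],
}

# The text mentions both "Apple" and "iPhone", so both act as restart nodes.
scores = random_walk_with_restart(graph, seeds=["Apple", "iPhone"])
candidate_concepts = ["company", "fruit"]
best_concept = max(candidate_concepts, key=lambda c: scores[c])
```

Here "company" wins because it is reachable from both context terms, while "fruit" is supported only by "Apple"; changing the seeds to "Apple" and "pie" would shift probability toward "fruit".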
Knowledge Networks : Four sub‑networks (isA, co‑occurrence, lexical, semantic) are constructed to provide rich entity, concept, and attribute relationships for downstream tasks.
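As a rough picture of how four sub‑networks might be stored together, the sketch below keeps one adjacency map per relation type so that a downstream task can query only the sub‑network it needs. The class, its methods, and the sample edges are hypothetical; only the four relation names come from the talk.

```python
from collections import defaultdict
from typing import Dict, Set

RELATION_TYPES = ("isA", "co-occurrence", "lexical", "semantic")


class KnowledgeNetwork:
    """A tiny typed graph: one adjacency map per relation type."""

    def __init__(self) -> None:
        self._edges: Dict[str, Dict[str, Set[str]]] = {
            kind: defaultdict(set) for kind in RELATION_TYPES
        }

    def add_edge(self, kind: str, source: str, target: str,
                 symmetric: bool = False) -> None:
        """Add an edge to one sub-network; reject unknown relation types."""
        if kind not in self._edges:
            raise ValueError(f"unknown relation type: {kind}")
        self._edges[kind][source].add(target)
        if symmetric:
            self._edges[kind][target].add(source)

    def neighbors(self, node: str, kind: str) -> Set[str]:
        """Return a copy of the neighbors of `node` in one sub-network."""
        return set(self._edges[kind].get(node, set()))


net = KnowledgeNetwork()
net.add_edge("isA", "Andy Lau", "actor")  # directed: entity -> concept
net.add_edge("co-occurrence", "Andy Lau", "Infernal Affairs", symmetric=True)
net.add_edge("lexical", "Andy Lau", "Lau Tak-wah", symmetric=True)

concepts_of_andy = net.neighbors("Andy Lau", "isA")
cooccurring = net.neighbors("Infernal Affairs", "co-occurrence")
```

Keeping the sub‑networks separate means the conceptualization step can walk isA and co‑occurrence edges without being diluted by lexical variants.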
Video Semantic Understanding : Videos are converted into knowledge sub‑graphs, enriched with knowledge, and processed with reasoning and conflict detection. The pipeline includes multimodal perception (visual, audio, text), knowledge linking, multimodal fusion, and reasoning to achieve deep understanding, supporting recommendation, search, and content generation.
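The perception, linking, and fusion stages can be sketched as follows: each modality emits labeled detections with a confidence, labels are mapped to knowledge‑graph entities, and evidence for the same entity is merged across modalities. The alias table, the detections, and the noisy‑OR fusion rule are assumptions chosen for illustration, since the talk does not state how Baidu fuses modalities; the reasoning and conflict‑detection stages are left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Detection = Tuple[str, str, float]  # (modality, label, confidence)

# Hypothetical alias table mapping surface labels to entity identifiers.
ALIASES = {
    "yao ming": "person/yao_ming",
    "basketball": "sport/basketball",
}


def link(detections: List[Detection]) -> List[Detection]:
    """Replace labels with entity ids, dropping labels the graph does not know."""
    linked = []
    for modality, label, confidence in detections:
        entity_id = ALIASES.get(label.lower())
        if entity_id is not None:
            linked.append((modality, entity_id, confidence))
    return linked


def fuse(linked: List[Detection]) -> Dict[str, float]:
    """Noisy-OR fusion: an entity is missed only if every detection is wrong."""
    miss_prob: Dict[str, float] = defaultdict(lambda: 1.0)
    for _, entity_id, confidence in linked:
        miss_prob[entity_id] *= 1.0 - confidence
    return {entity_id: 1.0 - miss for entity_id, miss in miss_prob.items()}


detections = [
    ("visual", "Yao Ming", 0.6),      # e.g. from face recognition
    ("text", "Yao Ming", 0.5),        # e.g. from OCR or the title
    ("audio", "basketball", 0.7),     # e.g. from the speech transcript
    ("visual", "unknown logo", 0.9),  # not in the alias table, so dropped
]

entity_scores = fuse(link(detections))
```

Two moderately confident detections of the same person from different modalities combine into a stronger signal than either alone, which is the basic benefit of fusing after linking rather than before.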
Video Understanding Graph : Unlike traditional graphs, this graph emphasizes themes, entity facets, and scenes, built through ontology construction, knowledge mining, relation establishment, graph construction, and quality control.
Applications : The technology powers visual question answering (VQA) with a multi‑granularity cross‑modal attention mechanism, cross‑media generation for images and videos, and improves short‑video source retrieval by combining fingerprinting with semantic verification.
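The idea of combining fingerprinting with semantic verification for source retrieval can be sketched as a two‑stage filter: a cheap fingerprint comparison shortlists near‑duplicates, then a semantic check confirms that the shortlisted video is about the same entities. The 8‑bit fingerprints, the Hamming and Jaccard measures, the thresholds, and the index contents are all assumptions for illustration, not details from the talk.

```python
from typing import Dict, List, Set, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two entity sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_sources(
    query_fp: int,
    query_entities: Set[str],
    index: Dict[str, Tuple[int, Set[str]]],
    max_bits: int = 4,
    min_overlap: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (video_id, overlap) for videos passing both stages, best first."""
    matches = []
    for video_id, (fingerprint, entities) in index.items():
        # Stage 1: fingerprint filter.
        if hamming(query_fp, fingerprint) > max_bits:
            continue
        # Stage 2: semantic verification on linked entities.
        overlap = jaccard(query_entities, entities)
        if overlap >= min_overlap:
            matches.append((video_id, overlap))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical index: video id -> (fingerprint, entities found in the video).
index = {
    "src_1": (0b10110010, {"Yao Ming", "NBA", "Rockets"}),
    "src_2": (0b10110011, {"cooking", "noodles"}),  # similar bits, wrong topic
    "src_3": (0b01001101, {"Yao Ming", "NBA"}),     # right topic, different bits
}

sources = find_sources(0b10110110, {"Yao Ming", "NBA"}, index)
source_ids = [video_id for video_id, _ in sources]
```

The second stage removes fingerprint collisions such as `src_2`, while the first stage keeps the semantic comparison limited to a small shortlist.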
Summary : The talk covered the importance of multimodal semantic understanding, detailed Baidu’s knowledge‑graph‑driven approaches for text and video, and demonstrated various applications such as video understanding graphs, VQA, and cross‑media generation.
DataFunTalk