
Large‑Model‑Driven Data Governance: Technical Outlook and Research Highlights

This article reviews the rising importance of data quality for large models, explores data‑centric AI, large‑model pre‑training data engineering, and presents recent Fudan University research on using large models to improve data governance across multiple domains such as attribute normalization, geographic cleaning, compliance checking, and multimodal retrieval.

DataFunSummit

Introduction – The rapid development of large‑model technology has raised data‑quality requirements to a new level, prompting researchers to investigate whether large models themselves can assist in data cleaning and governance. This article provides a technical outlook on large‑model‑driven data‑governance techniques and shares research hotspots from Fudan University.

1. Data‑Centric Artificial Intelligence – Proposed by Andrew Ng in 2021, data‑centric AI emphasizes systematic data engineering over model‑centric approaches. High‑quality, diverse, and well‑governed data have become decisive factors for AI performance, especially in the era of large models where most effort shifts to data governance.

2. Data Engineering in Large‑Model Pre‑Training – Pre‑training large models such as GPT‑3 demands massive data collection, cleaning, and curation: roughly 45 TB of raw Common Crawl text was reduced to about 570 GB after filtering. Data diversity, multimodal sources, and synthetic data generation (e.g., Sora's video‑image‑text triples) are equally crucial, and scaling laws show that model performance improves predictably with data volume.
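To make the cleaning step concrete, here is a minimal sketch of the kind of pipeline such curation involves, exact-deduplication after normalization plus a quality filter. The heuristics and the 0.6 threshold are illustrative assumptions, not the filters actually used for GPT's corpus:

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so near-identical documents hash alike.
    return re.sub(r"\s+", " ", text).strip().lower()

def quality_score(text: str) -> float:
    # Toy heuristic: fraction of alphabetic/space characters. Real pipelines
    # use classifier-based quality filters; this stands in for one.
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)

def clean_corpus(docs, min_score=0.6):
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if key in seen:
            continue  # drop exact duplicates (after normalization)
        seen.add(key)
        if quality_score(doc) >= min_score:
            kept.append(doc)  # keep only documents passing the quality filter
    return kept

corpus = [
    "Large models need clean data.",
    "Large  models need clean data.",  # duplicate after whitespace collapse
    "@@## 1234 $$%% ~~~",              # low-quality noise, filtered out
]
print(clean_corpus(corpus))  # → ['Large models need clean data.']
```

Production pipelines add fuzzy deduplication (e.g., MinHash), language identification, and toxicity filtering on top of this basic skeleton.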

3. Large‑Model‑Driven Data‑Governance Technologies – Data governance faces challenges such as multi‑industry standards, multimodal data types, privacy, compliance, and dynamic policy alignment. Small models struggle with these tasks, whereas large models bring four key advantages: extensive knowledge from massive training, rapid domain adaptation via fine‑tuning, multimodal handling, and emerging agent capabilities for automated planning.

4. Research Progress at Fudan University

Attribute‑value normalization: using large models to recognize diverse expressions of the same attribute (e.g., gender) and unify them.

Geographic data cleaning: leveraging language understanding of large models together with GIS techniques to standardize address information.

Design‑drawing compliance checking: employing large models as offline engines to parse national and corporate standards and transform them into executable rules.

Multimodal entity linking in live‑stream commerce: linking heterogeneous product mentions to concrete items for user assistance.

Cross‑modal image‑text retrieval: building fine‑grained datasets and enhancing retrieval performance with large models.

Multimodal education knowledge‑graph construction: extracting and linking educational resources (text, image, audio, video) into a unified graph.
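As a minimal sketch of the first item above, attribute-value normalization: the large model is abstracted as any callable that answers a classification prompt, with a dictionary-backed stand-in (`fake_llm`) playing that role here. The prompt format, fallback rule, and synonym table are illustrative assumptions, not the actual Fudan system:

```python
def normalize_attribute(value: str, canonical: list[str], ask_llm) -> str:
    """Map a raw attribute value onto one of the canonical labels.

    `ask_llm` is any callable answering a classification prompt; in a real
    system it would wrap a large-model API (hypothetical interface).
    """
    prompt = (
        f"Which of {canonical} does the attribute value {value!r} express? "
        "Answer with exactly one option."
    )
    answer = ask_llm(prompt).strip()
    # Keep the raw value when the model's answer is not a canonical label.
    return answer if answer in canonical else value

# Stand-in for a large model: a lookup over known gender spellings.
SYNONYMS = {"M": "male", "man": "male", "F": "female", "woman": "female"}

def fake_llm(prompt: str) -> str:
    for raw, label in SYNONYMS.items():
        if repr(raw) in prompt:
            return label
    return "unknown"

for raw in ["M", "woman", "unparsable"]:
    print(raw, "->", normalize_attribute(raw, ["male", "female"], fake_llm))
# M -> male, woman -> female, unparsable -> unparsable
```

The same pattern, constrain the model to a closed label set and fall back to the raw value on uncertainty, extends naturally to the address standardization and rule extraction items listed above.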

5. Summary and Outlook – Large‑model pre‑training relies on data‑governance techniques, and data governance in turn increasingly depends on large models. Small‑model approaches cannot meet the growing demand for massive, high‑quality data. Future research should focus on secure, trustworthy large models, advanced data cleaning, compliance automation, and agent‑driven decision planning for complex governance scenarios.

Thank you for your attention.

Tags: Data Engineering, AI, Large Models, Data Governance, Knowledge Graphs, Multimodal Data
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
