Why High‑Quality Data Is the New Breakthrough for Large‑Scale AI Models
At the 2025 Inclusion·Bund Conference forum, leading scholars and industry experts discussed how high‑quality data and AI act as dual engines that reshape model training, improve performance, and drive the next evolution of intelligent systems.
High‑Quality Data as the New Breakthrough for Large Models
At the “Data meets AI: Dual‑Engine of the Intelligent Era” forum of the 2025 Inclusion·Bund Conference, experts from academia and industry argued that data drives AI while AI, in turn, drives a new evolution of data, making the deep fusion of the two the key direction for the intelligent era.
The forum was co‑hosted by the Chinese Association for Artificial Intelligence, Shanghai Jiao Tong University and Ant Group.
Why Data Quality Matters
Professor Xiao Yanghua of Fudan University warned of a “data wall”: unlabeled corpora contribute less and less to model performance, and the cost‑benefit ratio of simply enlarging datasets keeps declining. He advocated a shift from expert‑experience‑driven data work to a quantitative, self‑evolving data science, likening the breakthrough the field needs to Tu Youyou’s discovery of artemisinin.
He demonstrated a method that uses syntactic‑complexity metrics and cumulative‑distribution sampling to select high‑quality corpora: training on the top 20 % of a 1‑billion‑token financial corpus improved domain QA accuracy by 1.7 % over pre‑training on the full data.
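The talk did not include code, but the selection idea can be sketched as follows: score each document with a complexity metric, then keep only the top quantile of the empirical score distribution. The scoring heuristic and the 20 % cutoff below are illustrative placeholders, not Prof. Xiao's published pipeline.

```python
# Quality-based corpus selection: keep documents whose complexity score falls
# in the top quantile of the empirical (cumulative) score distribution.
# The scoring heuristic and cutoff are illustrative, not the speaker's method.
import numpy as np

def syntactic_complexity(doc: str) -> float:
    # Placeholder metric: average sentence length in words as a crude proxy
    # for syntactic complexity.
    sentences = [s for s in doc.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return float(np.mean([len(s.split()) for s in sentences])) if sentences else 0.0

def select_top_quantile(docs, keep_fraction=0.2):
    scores = np.array([syntactic_complexity(d) for d in docs])
    cutoff = np.quantile(scores, 1.0 - keep_fraction)   # empirical-CDF threshold
    return [d for d, s in zip(docs, scores) if s >= cutoff]

corpus = [
    "Rates rose.",
    "Because the central bank signalled tightening, long-dated yields rose while credit spreads widened.",
    "The fund reported quarterly results.",
]
high_quality = select_top_quantile(corpus, keep_fraction=0.2)
```

In practice the retained slice would feed continued pre‑training, with the held‑out domain QA benchmark used to compare against training on the full corpus.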
Insights from Other Leaders
Prof. Zhai Guangtao (Shanghai Jiao Tong University) emphasized that both refined and synthetic data must put quality first, with “experience quality” evaluated for human and machine consumers alike.
Li Ke, CEO of Haitai Ruisheng, described the data industry’s shift from labor‑intensive to technology‑ and knowledge‑intensive work, citing motion‑capture, autonomous‑driving annotation, and chain‑of‑thought datasets as examples of high‑quality data serving many sectors.
Shan Dongming, Chairman of KuPas Technology, introduced the VALID² criteria (vitality, authenticity, volume, integrity, diversity, knowledge density) for high‑quality datasets and outlined a systematic reconstruction of corpora.
Yang Haibo, President of Lightwheel AI, argued that embodied intelligence requires data volumes orders of magnitude larger than LLMs need, and that synthetic data must satisfy realism, human‑in‑the‑loop demonstration, rich scenarios, and closed‑loop verification.
Zhao Junbo, Head of the Data Intelligence Lab at Ant Technology Research Institute, presented a “Rubric‑as‑Reward” RL mechanism that uses only 5k data points and 10k rating criteria to build efficient RL loops, reducing reliance on massive SFT data.
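The mechanics were not spelled out in the talk, but the core idea of scoring a model response against a rubric can be sketched as below. The rubric items, weights, and toy judge are hypothetical stand‑ins, not Ant’s implementation.

```python
# Minimal sketch of a rubric-scored reward: each rubric item is a criterion
# judged pass/fail by some judge function (here a toy keyword check), and the
# reward is the weighted fraction of criteria satisfied, in [0, 1].
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    description: str          # e.g. "answer cites the relevant regulation"
    weight: float = 1.0

def rubric_reward(response: str, rubric: List[RubricItem],
                  judge: Callable[[str, str], bool]) -> float:
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if judge(response, item.description))
    return earned / total if total else 0.0

# Toy judge: criterion counts as satisfied if its last keyword appears in the response.
toy_judge = lambda resp, crit: crit.split()[-1].lower() in resp.lower()
rubric = [RubricItem("mentions liquidity"), RubricItem("mentions risk", weight=2.0)]
print(rubric_reward("The main concern is liquidity risk.", rubric, toy_judge))  # 1.0
```

A reward of this form can then drive a standard policy‑gradient RL loop in place of a learned reward model, which is the sense in which rubrics substitute for large SFT datasets.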
Xu Lei, CTO of LanceDB, showcased the open‑source multimodal data‑lake format Lance, which combines file‑level and table‑level features to support zero‑copy evolution and fast point queries, enabling parallel feature engineering on petabyte‑scale video data.
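As a rough illustration of the point‑query workflow, the sketch below writes a small table to a Lance dataset and reads individual rows back with the Python bindings (pylance). The schema and file name are made up, and the calls follow the project’s public examples rather than anything shown in the talk.

```python
# Write a small "multimodal-style" table to Lance and fetch rows by index.
# Requires: pip install pylance pyarrow
import pyarrow as pa
import lance

table = pa.table({
    "id": [0, 1, 2],
    "caption": ["a cat", "a dog", "a bird"],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

lance.write_dataset(table, "captions.lance", mode="overwrite")

ds = lance.dataset("captions.lance")
# Fast random access by row index: the "point query" the format is built for.
print(ds.take([0, 2]).to_pydict())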
Chen Chuan, Senior Director at NVIDIA, described GPU‑accelerated solutions for efficient data processing, from text to multimodal content, in support of generative AI workloads.
Round‑Table Conclusions
Experts agreed that as computing paradigms evolve, data infrastructure must be rebuilt and redefined, both to address current challenges and to anticipate future ones. Deep integration of data and AI, robust standards, and quality‑assessment frameworks are essential to unlock the full potential of intelligent technologies.