
Streaming Data Pipelines and Scaling Laws for Efficient Large‑Model Training

The article discusses the challenges of training ever‑larger AI models on internet‑scale data, critiques traditional batch ETL pipelines, and proposes a streaming data‑flow architecture with dynamic data selection and a shared‑memory/Alluxio middle layer to decouple data processing from model training, improving efficiency and scalability.

DataFunSummit

Zhang Songxin, a research scholar at Southern University of Science and Technology and a senior algorithm expert at UCloud, presents his work on efficient distributed training frameworks for large models, highlighting achievements such as the SUS‑Chat‑34B fine‑tuning process and top rankings on the Open LLM Leaderboard.

He explains that scaling laws indicate that modern large models require internet‑scale data for training, yet most existing training pipelines still rely on small‑scale data paradigms, creating a mismatch between data volume and model size.
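For context (this parametric form is standard in the scaling‑law literature, e.g. the Chinchilla analysis, and is not quoted from the talk itself), the expected loss is commonly modeled in terms of parameter count $N$ and training tokens $D$:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because both error terms must shrink together, compute‑optimal training grows the token count $D$ roughly in proportion to $N$, which is why internet‑scale corpora become unavoidable as models grow — and why a data pipeline sized for the small‑data paradigm becomes the bottleneck.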

The talk outlines the difficulties of handling multi‑modal data at petabyte scales, including storage constraints, complex ETL pipelines, and the inability to simply pre‑fetch data to GPU clusters, especially when data must be filtered, labeled, or re‑captioned.

To address these issues, a streaming training approach is proposed: data is loaded asynchronously from a distributed source, decoupling data processing from model training, allowing the training process to start as soon as a small data chunk arrives.
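The producer–consumer decoupling described here can be sketched in plain Python with a bounded queue; the chunk fetcher, queue size, and per‑chunk "optimizer step" below are illustrative stand‑ins, not details from the talk:

```python
import queue
import threading
import time

def fetch_chunks(out_queue, chunk_count):
    """Producer: pull small chunks from a (simulated) remote data source."""
    for i in range(chunk_count):
        time.sleep(0.01)  # stand-in for network / ETL latency per chunk
        out_queue.put([f"sample-{i}-{j}" for j in range(4)])
    out_queue.put(None)  # sentinel: the stream is exhausted

def train_stream(chunk_queue):
    """Consumer: begin training as soon as the first chunk lands."""
    steps = 0
    while True:
        chunk = chunk_queue.get()
        if chunk is None:
            break
        steps += 1  # stand-in for one optimizer step per chunk
    return steps

# Bounded buffer decouples data processing speed from training speed.
q = queue.Queue(maxsize=8)
producer = threading.Thread(target=fetch_chunks, args=(q, 16))
producer.start()
steps = train_stream(q)
producer.join()
```

The key property is that `train_stream` consumes its first chunk while later chunks are still in flight, so training start time no longer depends on the full dataset being staged.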

The architecture incorporates a shared‑memory or Alluxio‑based middle layer, a metadata database, and a lakehouse storage system, enabling consistent data distribution, checkpoint management, and seamless migration across cloud platforms.

Benefits include zero startup overhead for training, reduced inter‑node network load through intra‑node shared memory broadcasts, and simplified checkpoint handling via the database.
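The intra‑node shared‑memory broadcast can be illustrated with Python's standard `multiprocessing.shared_memory` module; the single‑process "leader/peer" roles and the 16‑byte payload are simplifying assumptions for the sketch (real ranks would be separate processes):

```python
from multiprocessing import shared_memory

# Leader rank on a node fetches a data chunk over the network once
# and publishes it into node-local shared memory.
payload = bytes(range(16))  # stand-in for a fetched data chunk
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Peer ranks on the same node attach by name and read locally,
# so the chunk crosses the inter-node network only once.
peer = shared_memory.SharedMemory(name=shm.name)
data = bytes(peer.buf[:len(payload)])
total = sum(data)

# Cleanup: peers close their handles; the leader unlinks the segment.
peer.close()
shm.close()
shm.unlink()
```

This is the mechanism behind the reduced inter‑node load: each chunk is fetched once per node rather than once per GPU rank.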

A dynamic data‑selection mechanism is introduced, where the model’s state (loss, optimizer metrics, etc.) feeds back to the data pipeline, allowing the system to prioritize useful data and discard noisy or irrelevant samples.
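A minimal sketch of such a feedback loop, assuming a simple loss‑gating policy (the windowed thresholds below are hypothetical, not the talk's actual selection criterion): samples the model already fits well are dropped as uninformative, and extreme outliers are dropped as likely noise.

```python
from collections import deque

class LossGatedSelector:
    """Keep samples whose loss suggests they are still informative,
    relative to a running mean over a sliding window."""

    def __init__(self, window=100, low=0.5, high=3.0):
        self.history = deque(maxlen=window)
        self.low, self.high = low, high

    def keep(self, sample_loss):
        if not self.history:
            self.history.append(sample_loss)
            return True  # no statistics yet: accept by default
        mean = sum(self.history) / len(self.history)
        self.history.append(sample_loss)
        # Drop already-learned samples (loss far below the mean)
        # and clip extreme outliers (loss far above the mean).
        return self.low * mean <= sample_loss <= self.high * mean

selector = LossGatedSelector()
losses = [1.0, 1.1, 0.9, 0.1, 5.0, 1.2]
kept = [l for l in losses if selector.keep(l)]
```

In the framework described in the talk, the gating signal would come from the live training process (per‑sample loss, optimizer state) rather than a fixed threshold, but the control flow — model state feeding back into the data stream — is the same.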

The overall framework, named "navigation," aims to guide large‑model training by continuously adapting the data stream based on model feedback, thereby improving efficiency, scalability, and the ability to handle ever‑growing model and data sizes.

Tags: Large Models, scaling laws, AI infrastructure, data pipelines, multimodal data, streaming training
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
