Why AI Data Needs a New Approach: Managing Large‑Model Datasets with MaxCompute

This article summarizes a Cloudwise conference talk. It explains how AI data differs from traditional big data in organization, cost, and comprehension, and describes why Tongyi Lab built a unified large-model data platform on MaxCompute, detailing its architecture and processing pipelines.

Three Characteristics that Distinguish AI Data from Traditional Data

The first characteristic is that AI data lacks a standard organization. Unlike traditional big-data wide tables processed with SQL, AI data must capture complex relationships:

Video data often must be split into sub-clips while preserving parent-child links and associated tracks, titles, and subtitles.

Image data for matting requires mapping relationships between source images and their extracted foregrounds.

Multi-turn dialogues may interleave text, video, and audio, forming list-type structures.

Text data organization varies across scenarios, demanding flexible schemas.
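The relationships above can be modeled with flexible, nested records rather than flat wide tables. A minimal sketch in Python (field names are illustrative, not the platform's actual schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VideoClip:
    clip_id: str
    parent_id: Optional[str]   # parent-child link back to the source video
    start_ms: int
    end_ms: int
    title: str = ""
    subtitles: List[str] = field(default_factory=list)

@dataclass
class DialogueTurn:
    role: str                  # "user" or "assistant"
    modality: str              # "text", "image", "audio", or "video"
    content: str               # text, or a URI for non-text payloads

@dataclass
class Dialogue:
    dialogue_id: str
    turns: List[DialogueTurn] = field(default_factory=list)

# A sub-clip keeps a parent-child link to its source video.
clip = VideoClip("v1-c3", parent_id="v1", start_ms=30000, end_ms=45000)

# A multi-turn dialogue mixes modalities in a list-type structure.
dlg = Dialogue("d7", turns=[
    DialogueTurn("user", "image", "oss://bucket/cat.jpg"),
    DialogueTurn("user", "text", "What breed is this cat?"),
    DialogueTurn("assistant", "text", "It looks like a British Shorthair."),
])
```

None of this fits a fixed-column wide table cleanly, which is why flexible schemas are a core requirement.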

The second characteristic is high cost: extensive manual labeling, copyright‑cleared acquisition, large‑scale storage for multimodal data, GPU‑intensive processing, and cross‑region data movement all increase total expense.

The third characteristic is high comprehension cost; AI data requires multimodal understanding—video involves visual, audio, and textual streams, demanding OCR, ASR, and specialized models to extract meaning.

Building a Data Processing Platform on MaxCompute

Tongyi Lab chose MaxCompute because it provides unified data management for large‑model projects such as Tongyi Qianwen and Tongyi Wanxiang, supporting petabyte‑scale storage, DataWorks pipelines, rich built‑in UDFs, and multi‑language development (Python, Java).

Since 2020, the lab has built its data platform on MaxCompute. External sources (purchased, manually labeled, and publicly downloadable data) are first standardized on MaxCompute, so algorithm teams can understand the data without extra effort.

After standardization, a data marketplace stores both raw and high‑quality datasets. On top of this marketplace, MaxCompute pipelines process data using a library of operators (e.g., Minhash deduplication, language detection) and specialized pipelines for web, image, and other modalities.

After processing, a clean-train-evaluate data flywheel iteratively refines cleaning strategies until quality standards are met, then feeds the data to large-model training, continuously enriching the marketplace.
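The flywheel's control flow can be sketched as a simple loop. Everything here is illustrative (the cleaning rule, quality metric, and thresholds are made up for the example, not Tongyi Lab's actual pipeline):

```python
def flywheel(raw_docs, clean, evaluate, quality_target=0.9, max_rounds=5):
    """Repeatedly clean and evaluate until quality is good enough."""
    docs = raw_docs
    for _ in range(max_rounds):
        docs = clean(docs)        # apply the current cleaning strategy
        score = evaluate(docs)    # e.g. a held-out quality metric
        if score >= quality_target:
            break                 # good enough to feed model training
    return docs, score

# Toy operators: drop very short (noisy) lines, score by average length.
docs, score = flywheel(
    ["good long document text", "spam", "another solid training sample"],
    clean=lambda ds: [d for d in ds if len(d.split()) >= 3],
    evaluate=lambda ds: min(1.0, sum(len(d.split()) for d in ds) / (5 * len(ds))),
)
```

In practice the evaluate step involves training and benchmarking a model, and a low score prompts a revised cleaning strategy rather than a mechanical re-run.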

The resulting solution delivers high‑quality training data for Tongyi Qianwen and Tongyi Wanxiang.

Tags: large models, Data Management, MaxCompute, AI data
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
