
How MaxCompute Evolves Data Platforms for AI: Architecture, Features, and Real‑World Cases

The article explains how Alibaba Cloud's MaxCompute transforms a traditional data warehouse into a cloud‑native, multimodal Data+AI platform by introducing a four‑layer architecture, SQL‑based AI functions, the Python‑native MaxFrame framework, and a series of industry case studies that demonstrate performance gains and flexible resource scheduling.

DataFunTalk

MaxCompute, Alibaba Cloud's core big‑data compute platform, is being re‑engineered for the AI era. Its architecture is divided into four layers—data, model, compute, and engine—each addressing specific AI requirements.

Data Layer

The platform stores both structured and unstructured data, supporting multimodal formats (audio, video, images) via a BLOB field type. It connects to external storage engines such as OSS and Hologres through Object Table and other APIs, enabling unified metadata management without moving data.

Model Layer

MaxCompute hosts traditional machine‑learning models (XGBoost, LightGBM) and open‑source large models (Qwen, DeepSeek‑R1‑Distill‑Qwen). It also integrates commercial flagship models from Alibaba Cloud's Bailian platform, giving users a single place for model registration, versioning, and serving.

Compute Layer

The compute layer offers hybrid CPU/GPU scheduling: users declare the resources a job needs, and the platform allocates them. This meets the heavy compute demands of multimodal AI workloads.
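To make the idea of a declarative resource request concrete, here is a minimal sketch in plain Python. It is not MaxCompute's API: `ResourceSpec` and `choose_pool` are hypothetical names invented for illustration, and the routing rule (any GPU request goes to a GPU pool) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical declarative description of what a job needs."""
    cpu_cores: int
    memory_gb: int
    gpus: int = 0  # 0 means a CPU-only job


def choose_pool(spec: ResourceSpec) -> str:
    """Toy scheduler rule (an assumption): route by whether GPUs are requested."""
    if spec.cpu_cores <= 0 or spec.memory_gb <= 0 or spec.gpus < 0:
        raise ValueError("cpu_cores and memory_gb must be positive; gpus must be >= 0")
    return "gpu" if spec.gpus > 0 else "cpu"


# The caller states what it needs; the scheduler decides where it runs.
etl_job = ResourceSpec(cpu_cores=8, memory_gb=32)
inference_job = ResourceSpec(cpu_cores=4, memory_gb=64, gpus=1)
print(choose_pool(etl_job), choose_pool(inference_job))
```

The point of the declarative style is that the job description carries no placement logic, so the same request can be satisfied by whichever pool the scheduler picks.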

Engine Layer

Two primary compute interfaces are provided:

SQL Engine: The SQL AI function lets analysts invoke large models directly from SQL for offline inference, lowering the barrier to AI adoption.

MaxFrame: A Python‑native distributed computing framework compatible with Pandas, XGBoost, LightGBM, and other open‑source libraries. MaxFrame runs on MaxCompute’s elastic compute resources and integrates tightly with DataWorks, custom Docker images, and the MaxCompute Notebook.

Development Experience

Developers can install MaxFrame locally via pip install maxframe and work in VS Code or Jupyter. DataWorks Notebook offers a Magic Command to start/stop MaxFrame sessions. The platform also supports PyODPS3 for job submission and provides stable, interactive development through deep integration with DataWorks.
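The local workflow described above might look roughly like the following sketch. It follows the session pattern shown in MaxFrame's public quick-start examples (`new_session`, `md.read_odps_table`, `md.to_odps_table`), but the function name, the table and column arguments, and the `MAXCOMPUTE_PROJECT` / `MAXCOMPUTE_ENDPOINT` environment variable names are assumptions made for illustration. Running it requires `pip install maxframe` and valid MaxCompute credentials.

```python
import os


def add_prefix_on_maxcompute(source_table, target_table, column, prefix):
    """Read a MaxCompute table, prefix one string column, write the result back."""
    # Imported lazily so this module still loads where maxframe is not installed.
    import maxframe.dataframe as md
    from maxframe import new_session
    from odps import ODPS

    # Credentials come from the environment; the last two variable names
    # are hypothetical and should match your own setup.
    o = ODPS(
        os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
        os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
        project=os.environ["MAXCOMPUTE_PROJECT"],
        endpoint=os.environ["MAXCOMPUTE_ENDPOINT"],
    )
    session = new_session(o)
    try:
        df = md.read_odps_table(source_table)
        # Pandas-style expression; it is evaluated remotely when execute() is called.
        df[column] = prefix + df[column]
        md.to_odps_table(df, target_table).execute()
    finally:
        # Always release the remote session, even if the job fails.
        session.destroy()
```

The DataFrame operations are lazy: nothing runs on the cluster until `execute()` is called, which is why the session must stay open until that point.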

Key Use Cases

Large‑model data preprocessing: A leading LLM provider processed petabyte‑scale data with a 300,000‑core job. MinHash operators ran more than 50% faster, and the job could scale elastically up to 1.6 million cores, well above the provider's 1 million‑core requirement.

Automotive embodied intelligence: Using MaxFrame, a customer processed multimodal sensor data (images, video, radar, GPS) more than 40% faster than with single‑node Python pipelines, thanks to elastic resource allocation and distributed processing.

Multimodal image labeling: By invoking the SQL AI function and MaxFrame’s built‑in AI Function, the platform performed automatic image tagging and embedding generation for downstream retrieval, running large‑model inference directly on stored multimodal tables.
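The MinHash operators mentioned in the large‑model preprocessing case are commonly used to find near‑duplicate documents in a training corpus. The following is a small single‑machine sketch of the general technique, not MaxCompute's or MaxFrame's operator. The function names, the 3‑word shingle size, and the choice of 128 hash functions are assumptions for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(token, seed):
    """64-bit hash of a token; the seed is used as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=128):
    """For each of num_perm hash functions, keep the minimum hash over the set."""
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the sleepy dog near the river bank"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(f"estimated similarity: {estimate_jaccard(sig_a, sig_b):.2f}")
```

Because each signature has a fixed length regardless of document size, comparing signatures is cheap and the per-document work is independent, which is what makes this step a natural fit for large distributed jobs.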

Conclusion

MaxCompute delivers an end‑to‑end Data+AI capability that spans storage, model management, compute, and engine layers. Its cloud‑native, elastic, and high‑performance design enables enterprises—from large‑model providers to autonomous‑driving firms—to build AI data assets and deploy intelligent applications at scale.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
