
How MaxCompute Evolves for Python & AI: From SDK to Native Distributed Engine

This article traces MaxCompute's decade‑long evolution, from the early PyODPS SDK to the native Distributed Python Engine. It highlights the challenges big‑data platforms face in the AI era and showcases Data+AI solutions and real‑world case studies in multimodal processing, massive text deduplication, and autonomous‑driving data pipelines.

Alibaba Cloud Big Data AI Platform

MaxCompute Evolution for Python & AI Computing

MaxCompute launched in 2010. In 2015, the PyODPS SDK enabled Python job submission. In 2017, a DataFrame layer with a Pandas‑like API was added. In 2019, the open‑source Mars framework brought distributed NumPy, Pandas, and Scikit‑learn support. In 2023, MaxFrame, a new native distributed engine compatible with Pandas, was released for Data+AI workloads. In 2025, the Distributed Python Engine (DPE) introduced native Python UDFs, heterogeneous resource scheduling, and multimodal data processing.

Challenges for Big Data Platforms in the AI Era

Generative AI drives demand for fast, low‑cost inference on production data, massive compute for model pre‑training, AI‑enhanced data governance, and agile development experiences.

MaxCompute Solutions for Data+AI

MaxFrame provides a Pandas‑compatible distributed framework with built‑in data‑processing and model‑development operators, and it integrates with MaxCompute Notebook and DataWorks Notebook for interactive development. Custom image support lets users package dependencies and models for consistent execution. AI Function offers low‑cost access to built‑in large‑model services (e.g., Tongyi Qianwen 3, DeepSeek‑R1) and to user‑provided models for tasks such as text generation, classification, extraction, sentiment analysis, and multimodal perception.

Core Data+AI Capabilities

MaxFrame runs inside the MaxCompute cluster, reading internal tables directly to avoid data movement, and can scale to hundreds of thousands of cores for rapid job execution. It supports Pandas‑style DataFrame and Tensor operations, and provides a seamless Python development experience.
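As a rough illustration of why a DataFrame‑style aggregation can spread across many cores, the toy sketch below splits rows into partitions, aggregates each partition in parallel, and merges the partial results. It uses only the Python standard library and is not MaxFrame's API or implementation; the function names `partial_sum` and `distributed_group_sum` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_sum(rows):
    """Aggregate one partition: sum the amount for each key."""
    acc = Counter()
    for key, amount in rows:
        acc[key] += amount
    return acc


def distributed_group_sum(rows, n_partitions=4):
    """Split rows into partitions, aggregate each in parallel, then merge."""
    # Round-robin split; a real engine would partition by storage layout or key.
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    total = Counter()
    for part in partials:
        total.update(part)  # Counter.update adds values for matching keys
    return dict(total)


# Hypothetical (key, amount) rows standing in for a table.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = distributed_group_sum(rows)
print(result)
```

Because each partition is aggregated independently, adding workers shortens the per‑partition step, while the merge step only handles the much smaller partial results.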

Applications in Data+AI

Multimodal Data Processing

Using MaxCompute’s elastic resources and MaxFrame, a video‑to‑frame pipeline processed millions of video files for a large‑model pre‑training project, achieving an order‑of‑magnitude speedup.
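The two planning steps of a video‑to‑frame job can be sketched in plain Python: deciding which frames to sample from each video, and sharding the file list across workers. This is a minimal sketch under assumed parameters, not the pipeline used in the project; the file names, the sampling interval, and the helpers `frame_indices` and `assign_videos` are hypothetical, and actual decoding would need a video library.

```python
def frame_indices(duration_s, fps, every_s):
    """Return the indices of frames sampled once every `every_s` seconds."""
    total_frames = int(duration_s * fps)
    step = max(1, round(every_s * fps))  # never step by less than one frame
    return list(range(0, total_frames, step))


def assign_videos(paths, n_workers):
    """Round-robin video paths across workers so shares differ by at most one."""
    return [paths[i::n_workers] for i in range(n_workers)]


# Hypothetical file names; a real job would list objects from storage.
videos = [f"video_{i:03d}.mp4" for i in range(10)]
shards = assign_videos(videos, n_workers=3)
idx = frame_indices(duration_s=10, fps=30, every_s=2)

print([len(s) for s in shards])  # [4, 3, 3]
print(idx)                       # [0, 60, 120, 180, 240]
```

Since every video is handled independently, the work parallelizes cleanly: adding workers shrinks each shard, which is the kind of workload elastic resources suit.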

Massive Web‑Text Deduplication for Large‑Model Pre‑training

Custom text‑deduplication operators in MaxFrame reduced a 30‑billion‑record, 8 TB dataset to unique content in three hours on 4,000 CU, twice the performance of on‑premises solutions.
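The article does not say which algorithm the custom operators use. As one common baseline, the sketch below performs exact deduplication by hashing normalized text and keeping the first record for each fingerprint. The normalization rules and the helper names `normalize`, `fingerprint`, and `deduplicate` are assumptions made for illustration; near‑duplicate detection (for example MinHash) would need a different approach.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text):
    """Stable SHA-256 fingerprint of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        fp = fingerprint(rec)
        if fp not in seen:
            seen.add(fp)
            unique.append(rec)
    return unique


# Hypothetical sample documents.
docs = ["Hello  world", "hello world", "Another page", "HELLO WORLD ", "another page"]
unique_docs = deduplicate(docs)
print(unique_docs)  # ['Hello  world', 'Another page']
```

At scale, the usual pattern is to shuffle records by fingerprint so that all copies of a document land on the same worker, which lets each worker deduplicate its own key range without a shared `seen` set.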

Autonomous Driving Data Pre‑processing

MaxFrame simplified BAG‑file parsing, cleaning, slicing, and sample generation for an autonomous‑driving OEM, delivering a 40% performance boost and elastic scaling to tens of thousands of cores.
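The cleaning and slicing stages can be sketched on already‑parsed messages. The example assumes each message is a `(timestamp, topic, payload)` tuple and that slices are fixed‑length time windows; the topic names, window length, and the helpers `clean` and `slice_messages` are hypothetical, and BAG parsing itself is out of scope here.

```python
from collections import defaultdict


def clean(messages, required_topics):
    """Drop messages with a missing payload or a topic outside the required set."""
    return [m for m in messages if m[2] is not None and m[1] in required_topics]


def slice_messages(messages, window_s):
    """Group messages into fixed-length time windows.

    Window k covers [start + k * window_s, start + (k + 1) * window_s),
    where start is the earliest timestamp. Empty windows are skipped.
    """
    if not messages:
        return []
    messages = sorted(messages, key=lambda m: m[0])
    start = messages[0][0]
    windows = defaultdict(list)
    for ts, topic, payload in messages:
        windows[int((ts - start) // window_s)].append((ts, topic, payload))
    return [windows[k] for k in sorted(windows)]


# Hypothetical parsed messages: (timestamp in seconds, topic, payload).
msgs = [
    (0.0, "/camera", b"f0"),
    (0.5, "/lidar", b"p0"),
    (1.2, "/camera", None),   # missing payload, removed by clean()
    (2.1, "/camera", b"f1"),
    (2.4, "/debug", b"x"),    # topic not required, removed by clean()
    (4.0, "/lidar", b"p1"),
]
kept = clean(msgs, {"/camera", "/lidar"})
slices = slice_messages(kept, window_s=2.0)
print([len(s) for s in slices])  # [2, 1, 1]
```

Each slice can then be turned into a training sample independently, so sample generation parallelizes per file and per window.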

These capabilities illustrate MaxCompute’s evolution toward a native Python and AI‑centric platform that supports large‑scale data processing, model development, and multimodal workloads.

