
How MaxCompute Powers Intelligent Data Warehousing in the Data+AI Era

This article summarizes a meetup talk by Alibaba Cloud expert Yu Deshui, detailing MaxCompute’s evolution, serverless architecture, AI‑enabled features, and the platform’s comprehensive solutions—including OpenLake, MaxFrame, Object Table, near‑real‑time computing, and AI Functions—to address the challenges of modern data‑centric AI workloads.

Alibaba Cloud Big Data AI Platform

01. MaxCompute Introduction

MaxCompute is Alibaba Cloud’s self‑developed big‑data computing platform, originally called ODPS and known internally as “Cloud Ladder 2”. Over about 15 years it has evolved into a serverless, cloud‑native service that requires no resource reservation, offering elastic resource scheduling, multi‑tenant isolation, layered security, and data encryption.

It supports offline analysis, incremental and near‑real‑time scenarios, and integrates AI capabilities for data processing, BI, data exploration, and data science. Access is provided via rich SDKs, Open API, Console, and integration with PAI Studio and DataWorks for data governance, lineage, and job submission.

The compute engine includes a proprietary MaxCompute SQL engine, open‑source engines such as Apache Spark, and a custom distributed Python engine. Data is managed through unified metadata, stored on Alibaba Cloud’s Pangu storage or data lakes, and accessed via a storage‑compute separation architecture. Users can read data through the Storage API and leverage third‑party engines.

02. Challenges for Data‑Warehouse Platforms in the Data+AI Era

Demand for generative AI is now widespread, yet deploying large models in production remains technically challenging even though open‑source models are abundant.

Data‑for‑AI requires warehouses to efficiently handle massive structured, semi‑structured, and unstructured data for model pre‑training.

AI‑for‑Data calls for intelligent query optimization, materialized view automation, and diagnostic capabilities.

The rapid iteration of Data+AI development also demands more agile development, testing, and deployment environments.

03. MaxCompute Solutions for Data+AI Scenarios

Data Management: Integration with OpenLake provides an open, controllable lake‑warehouse that combines OSS‑based storage with the DLF metadata platform, supporting structured, semi‑structured, and unstructured data, secure access, CRUD operations, and I/O acceleration.

Python Distributed Computing (MaxFrame): A unified Python API compatible with Pandas, XGBoost, and other ML operators, automatically distributed across MaxCompute’s elastic resources, enabling efficient large‑scale data processing, visualization, scientific computing, and ML/AI development.

Interactive Development Environment: An out‑of‑the‑box notebook‑like environment with built‑in diagnostic analysis to improve development agility.

Image Management Platform: Supports custom UDF images to align development and production environments.

Distributed Computing Framework MaxFrame

Earlier Python tools like PyODPS suffered from limited compatibility, inflexible deployment, and cumbersome operations. MaxFrame resolves these issues by supporting multiple underlying engines (SQL, DPE, PAI DLC/EAS) and allowing developers to write a single Python codebase for data preprocessing, model training, and inference.

Object Table: Enhancing Unstructured Data Processing

Object Table enables SQL‑style access to OSS file metadata, versioned caching via Meta Table, document functions for content reading, and large‑scale distributed computation based on metadata partitioning. It also supports writing structured data back to warehouses and is usable from MaxFrame.

SQL reads OSS file metadata as tables

Meta Table caches and versions metadata for efficient filtering

Document Function reads file content and supports UDF processing

SQL engine parallelizes based on metadata for fast processing

Supports writing structured data to internal or external tables

MaxFrame can leverage Object Table
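
The idea of treating file metadata as table rows and parallelizing on it can be illustrated with a small, self-contained Python sketch. This is not MaxCompute's implementation: the `ObjectMeta` record, the `filter_objects` and `partition_by_size` helpers, and the greedy size-balancing strategy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectMeta:
    """Hypothetical metadata row for one file in object storage."""
    key: str   # object path inside the bucket
    size: int  # object size in bytes


def filter_objects(rows, suffix, max_size):
    """Metadata-only filter, analogous to a SQL WHERE clause over file metadata."""
    return [r for r in rows if r.key.endswith(suffix) and r.size <= max_size]


def partition_by_size(rows, num_workers):
    """Greedily assign files (largest first) to the least-loaded worker.

    This balancing rule is an assumption; the article only says that the
    engine parallelizes based on metadata.
    """
    parts = [[] for _ in range(num_workers)]
    loads = [0] * num_workers
    for row in sorted(rows, key=lambda r: r.size, reverse=True):
        target = loads.index(min(loads))  # first worker with the smallest load
        parts[target].append(row)
        loads[target] += row.size
    return parts


if __name__ == "__main__":
    catalog = [
        ObjectMeta("cam/a.jpg", 400),
        ObjectMeta("cam/b.jpg", 300),
        ObjectMeta("notes/c.txt", 100),
        ObjectMeta("cam/d.jpg", 200),
        ObjectMeta("cam/e.jpg", 100),
    ]
    images = filter_objects(catalog, suffix=".jpg", max_size=1000)
    for worker_id, part in enumerate(partition_by_size(images, num_workers=2)):
        print(worker_id, [m.key for m in part])
```

Because filtering and partitioning touch only metadata, no file content is read until a worker actually processes its share of the files.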

Near‑Real‑Time Computing + Full Incremental Integration

The MCQA 2.0 interactive query engine isolates resources through quota groups and time‑sharing and delivers up to 2× faster query performance. Incremental computation and MV‑Pipeline orchestration enable real‑time or custom incremental refreshes. DeltaTable supports near‑real‑time writes with minute‑level checkpoint intervals and automatic file management (auto‑compaction, auto‑sorting), so newly ingested data becomes queryable through SQL within 1–5 minutes.
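
The principle behind an incremental refresh can be sketched in a few lines of Python: instead of recomputing an aggregate over all data, only the rows written since the last checkpoint are applied. The `IncrementalSum` class and the append-only list used as a write log are hypothetical stand-ins, not MaxCompute or DeltaTable APIs.

```python
from collections import defaultdict


class IncrementalSum:
    """Toy materialized view: SUM(amount) GROUP BY key, refreshed incrementally.

    Assumes the source is an append-only log of (key, amount) rows.
    """

    def __init__(self):
        self.totals = defaultdict(int)
        self.checkpoint = 0  # number of log rows already folded into totals

    def refresh(self, log):
        """Apply only the rows appended since the last checkpoint.

        Returns the number of rows processed in this refresh.
        """
        delta = log[self.checkpoint:]
        for key, amount in delta:
            self.totals[key] += amount
        self.checkpoint = len(log)
        return len(delta)


if __name__ == "__main__":
    log = [("a", 1), ("b", 2)]
    view = IncrementalSum()
    view.refresh(log)          # first refresh reads both rows
    log.append(("a", 5))
    view.refresh(log)          # second refresh reads only the new row
    print(dict(view.totals))
```

The cost of each refresh depends on the size of the delta rather than on the size of the full table, which is what makes frequent refreshes affordable.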

AI Function: GenAI Capability

MaxCompute offers AI Function APIs that integrate Alibaba Cloud’s Feitian large model, simplifying generative AI data processing. A demo shows driving‑camera images stored in a data lake being processed via an AI Function API: users set model parameters, generate prompts, use MaxFrame to read Object Table data, invoke the AI Function for image analysis, and store results back into MaxCompute tables—all via concise API calls without complex deployment.
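
The flow of the demo (set model parameters, build a prompt, read objects, invoke the model, store results) can be outlined as plain Python. Every name here is hypothetical: `ModelParams`, `build_prompt`, and `run_pipeline` are illustrative, and `stub_model` is a placeholder that stands in for a real model call. It does not reflect the actual AI Function or MaxFrame API.

```python
from dataclasses import dataclass


@dataclass
class ModelParams:
    """Hypothetical model settings; field names are assumptions."""
    model_name: str
    temperature: float = 0.0


def build_prompt(image_key):
    """Generate the instruction sent to the model for one image."""
    return f"Describe the road conditions visible in image '{image_key}'."


def stub_model(params, prompt, image_bytes):
    """Placeholder for a real model invocation; returns a canned string."""
    return f"[{params.model_name}] analyzed {len(image_bytes)} bytes"


def run_pipeline(objects, params, model=stub_model):
    """For each (key, bytes) pair: build a prompt, call the model, collect a result row."""
    results = []
    for key, data in objects:
        prompt = build_prompt(key)
        results.append({"key": key, "prompt": prompt, "answer": model(params, prompt, data)})
    return results


if __name__ == "__main__":
    frames = [("cam/0001.jpg", b"\x00" * 10), ("cam/0002.jpg", b"\x00" * 20)]
    for row in run_pipeline(frames, ModelParams(model_name="demo-model")):
        print(row["key"], "->", row["answer"])
```

In a real deployment the result rows would be written back to a table rather than printed, and the `model` argument would be swapped for the platform's own inference call.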

MaxFrame LLM Operator for Text Deduplication

The LLM operator provides efficient text deduplication using MinHash + LSH, generating hash bands to cluster similar documents and retain a single representative. Tested on the FineWeb‑edu dataset (3 billion rows, 8 TB), it completed deduplication in 3 hours using 4000 CU, demonstrating strong performance.
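
A minimal, single-machine sketch of MinHash + LSH deduplication is shown below to make the technique concrete. It is not the MaxFrame operator: the 3-word shingles, the 32-value signature, the 8 bands, the BLAKE2b-based hashing, and the union-find clustering are all assumptions picked for illustration.

```python
import hashlib

NUM_HASHES = 32  # signature length (assumed)
BANDS = 8        # number of LSH bands (assumed); 4 signature rows per band


def shingles(text, k=3):
    """Return the set of k-word shingles of a lower-cased document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """64-bit hash of a token; the seed selects one of NUM_HASHES hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """MinHash signature: the minimum hash value per seeded hash function."""
    tokens = shingles(text)
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def deduplicate(docs):
    """Cluster documents that share at least one LSH band; keep the first of each cluster."""
    rows = NUM_HASHES // BANDS
    buckets = {}                     # (band index, band values) -> first doc index seen
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for idx, doc in enumerate(docs):
        sig = minhash_signature(doc)
        for band in range(BANDS):
            key = (band, sig[band * rows:(band + 1) * rows])
            if key in buckets:
                a, b = find(buckets[key]), find(idx)
                if a != b:
                    parent[max(a, b)] = min(a, b)  # lowest index becomes the root
            else:
                buckets[key] = idx

    keep = sorted({find(i) for i in range(len(docs))})
    return [docs[i] for i in keep]


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "The quick brown fox JUMPS over the lazy dog",
        "completely different text about data warehouses and lakes",
    ]
    print(deduplicate(corpus))
```

Two documents become candidates for merging only when all signature values in at least one band match, so more bands with fewer rows each makes the filter more permissive, while fewer bands with more rows makes it stricter. A distributed version would shuffle documents by band key instead of using an in-memory dictionary.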

Intelligent Data Warehouse Capability Overview

The platform offers AI‑driven intelligent diagnosis, materialized view automation, performance tuning, and data layout optimization, collectively enhancing warehouse metrics and capabilities.
