How ODPS Evolved Over 15 Years into a Next‑Gen AI‑Ready Big Data Platform
This article chronicles ODPS's 15‑year journey from its exploratory beginnings to a modern, AI‑enabled big data platform, detailing its four development phases, architectural layers, SQL engine upgrades, real‑time processing, lakehouse integration, and the new Data+AI capabilities offered by MaxCompute and DataWorks.
ODPS 15‑Year Evolution Overview
ODPS recently celebrated its 15th anniversary, a journey spanning four distinct development stages. The first stage (2009‑2010) was the exploration period, in which the first line of code was written and ODPS 1.0 was released. The second stage focused on development efficiency and distributed computing, scaling a single cluster to 5,000 machines and supporting the Alibaba Group's cloud migration.
The third stage ushered in the era of big‑data democratization, expanding single‑cluster capacity beyond 10,000 nodes, supporting Hadoop federation queries, and introducing MaxCompute LakeHouse 2.0. The fourth stage aligns with the AI wave, emphasizing heterogeneous computing for big data and AI, and enhancing multimodal data processing capabilities.
ODPS Architectural Layers
The ODPS family consists of four layers: data integration, data storage, compute, and data application/consumption. The integration layer includes DataWorks and Data Ingestion, enabling data from various sources to be ingested into MaxCompute storage, with support for Flink CDC, DataHub, and Tunnel.
The storage layer comprises built‑in MaxCompute storage and open‑source OSS data lake storage, unified under MaxMeta for metadata management. It supports multiple compute engines such as ODPS SQL, Spark, MapReduce, and MaxFrame, and can interoperate with Hologres, MaxQA, and Flink for interactive and streaming queries.
The application layer connects to Quick BI, DataWorks Data Service, and DataV for data consumption.
SQL Engine Upgrade and Optimization
Recent years have seen a comprehensive upgrade of the ODPS SQL engine, delivering richer functionality, higher performance, and lower cost. New capabilities include complex data type handling, expanded time‑type formats, flexible type and format conversion, and advanced table features such as near‑real‑time DeltaTable, in both PK‑based incremental and Append variants.
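As an illustrative sketch of the PK‑based DeltaTable feature (the table name is made up, and DDL details such as the exact table properties vary by MaxCompute version, so treat this as an assumption to verify against the SQL reference), a delta table that accepts near‑real‑time upserts keyed on a primary key might be declared like this:

```sql
-- Sketch: a primary-key delta table for incremental upserts.
-- Property names are assumptions; check the current MaxCompute DDL docs.
CREATE TABLE orders_delta (
    order_id BIGINT NOT NULL,
    status   STRING,
    amount   DECIMAL(18, 2),
    PRIMARY KEY (order_id)
) TBLPROPERTIES ("transactional" = "true");
```

Rows written with the same `order_id` are then merged by the engine rather than appended, which is what enables the near‑real‑time incremental semantics described above.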
ODPS now offers Auto Partitioning, time‑based partition functions, and dynamic partition pruning. More than 30 built‑in functions have been added, covering date/time, string, and binary conversion, along with semi‑structured data operations such as JSON_LENGTH and JSON_CONTAINS. Syntax enhancements include more flexible DQL, improved GROUP BY and PIVOT, richer CTE and subquery support, and expanded DML capabilities such as MERGE INTO, multi‑step UPDATE/INSERT, and DELETE FROM with aliases.
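The newer DML and JSON capabilities can be sketched roughly as follows; the tables and columns are illustrative, and exact function signatures should be checked against the MaxCompute SQL reference:

```sql
-- Upsert changes from a staging table into a target table with MERGE INTO.
MERGE INTO orders AS t
USING orders_staging AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.amount = s.amount
WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.status, s.amount);

-- Inspect semi-structured payloads with the new JSON functions.
SELECT order_id,
       JSON_LENGTH(payload) AS field_count
FROM   orders
WHERE  JSON_LENGTH(payload) > 0;
```

The MERGE INTO statement collapses what previously required separate UPDATE and INSERT passes into a single atomic operation, which is the kind of multi‑step DML consolidation the upgrade targets.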
Real‑Time and Near‑Real‑Time Processing
ODPS‑MaxCompute is advancing toward near‑line and near‑real‑time processing with Delta Live MV and MaxQA, delivering a more efficient and real‑time big‑data platform. The Delta Live MV architecture integrates incremental and full‑load computation, automatically selecting the optimal mode based on declared logic, thereby optimizing resource usage and performance.
MaxQA provides near‑real‑time query capabilities and integrates with Hologres, Flink, and other streaming products for high‑efficiency query processing.
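A Delta Live MV declaration could look roughly like the following; the names are illustrative, and the DDL shown is an assumption based on standard materialized‑view syntax rather than documented Delta Live MV syntax:

```sql
-- Sketch: a materialized view over a delta table. Per the architecture
-- described above, the engine chooses incremental vs. full refresh
-- automatically from the declared logic.
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT ds, shop_id, SUM(amount) AS total_amount
FROM   sales_delta
GROUP  BY ds, shop_id;
```

The key point is that the user declares only the query; the decision between incremental maintenance and full recomputation is left to the engine.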
Lakehouse Integration and Multi‑Modal Data Support
With AI’s rise, ODPS now supports both structured and unstructured data, the latter accounting for over 80% of data volume. Through MaxMeta, MaxStorage, MaxStorageAPI, and MaxCatalogAPI, ODPS unifies metadata management for built‑in storage and open‑source lake formats such as Paimon, Iceberg, and Delta Lake, as well as image and video data.
This unified access enables batch and near‑real‑time incremental computation across all data types using existing SQL and Python engines.
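Assuming a Paimon or Iceberg table has already been registered through MaxMeta (the registration DDL is omitted here, and `lake_catalog.lake_events` and `dim_users` are illustrative names, not from the source), a single query can then span lake and built‑in storage:

```sql
-- Join an open-format lake table with a built-in MaxCompute table
-- in one federated query.
SELECT u.region,
       COUNT(*) AS events
FROM   lake_catalog.lake_events e   -- external: Iceberg/Paimon on OSS
JOIN   dim_users u                  -- built-in MaxCompute storage
ON     e.user_id = u.user_id
GROUP  BY u.region;
```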
Data + AI Integration
ODPS introduced a Data + AI product strategy, allowing AI functions to operate on multimodal data via Object Table. Users can connect internal, remote, or uploaded models (e.g., Qwen, DeepSeek, XGBoost) through SQL AI Function, enabling large‑scale AI inference for content generation, information extraction, and multimodal analysis.
The platform supports heterogeneous computing with both CPU and GPU resources, offering development in SQL for data engineers and Python for data scientists, all under a unified execution environment.
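A hedged sketch of what large‑scale inference over an Object Table might look like in SQL; `ai_extract` and both table names are hypothetical stand‑ins for the platform's AI Function interface, not documented syntax:

```sql
-- Hypothetical: apply a connected model (e.g. Qwen) to unstructured
-- objects referenced by an Object Table, row by row, at scale.
SELECT object_path,
       ai_extract('summarize this product description', content) AS summary
FROM   product_docs_object_table;
```

The design intent described above is that the same SQL surface data engineers already use becomes the entry point for model inference, while data scientists reach the same data and models through Python.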
DataWorks: New Development Paradigm
DataWorks provides a unified notebook supporting both SQL and Python, enabling a Data + AI workflow with multi‑engine orchestration, mixed CPU/GPU resource scheduling, and seamless access to massive MaxCompute data. The Copilot feature offers AI‑assisted SQL generation, completion, and optimization, boosting developer productivity by over 30%.
DataWorks also includes AI agents (MCP Server) for intelligent table discovery, metadata enrichment, chart generation, code generation, AI code review, and automated ETL task creation, as well as ChatBI for conversational analytics that empowers non‑technical users to obtain insights without writing SQL.
Conclusion
Over the past 15 years, ODPS has continuously integrated with ecosystems such as Quick BI, Metabase, Tableau, and various industry solutions across manufacturing, communications, transportation, gaming, retail, logistics, automotive, and finance. Looking forward, ODPS aims to deepen its AI integration, delivering more AI‑driven applications and computations within the Data + AI big‑data processing paradigm.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.