Evolution of Taobao’s Big Data Platform: From RAC to MaxCompute

This article chronicles the 13-year evolution of Taobao's big data platform across three phases: a single-node Oracle setup and the Tianwang scheduler, a Hadoop-based "Cloud Ladder 1" architecture with real-time analytics, and the current MaxCompute/ODPS era with cross-region projects and advanced data services.

Architecture Digest

Since its launch in 2003, Taobao has grown rapidly, and behind its aggressive business expansion lies a continuously evolving big data platform that handles data collection, processing, and application. The platform has gone through three major stages, each addressing new technical challenges.

Figure 1: Three phases of the data warehouse platform

First stage – RAC era (pre-2008): Taobao initially ran on a single-node Oracle database, which quickly proved insufficient for the growing workload. In 2008 a RAC cluster was introduced and grew from 4 to 20 nodes, becoming one of the world's largest RAC clusters and forming the first data-warehouse architecture. ETL was performed with Oracle stored procedures, and job scheduling relied on Crontab. The sheer number of SQL scripts run each day caused reliability issues, prompting the team to develop the "Tianwang" scheduling system.

Figure 2: Tianwang scheduling system architecture

Figure 3: Prototype of the Tianwang scheduler
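
The article does not describe how Tianwang works internally. As a rough illustration of why a dedicated scheduler can be more reliable than Crontab, the sketch below assumes the common design in which each job declares its upstream jobs and runs only after they succeed, instead of at a fixed clock time. The function name `run_in_dependency_order` and the job names in the usage note are hypothetical and are not taken from Taobao's system.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_in_dependency_order(jobs, deps):
    """Run jobs in dependency order and report a status for each one.

    jobs: dict mapping job name -> zero-argument callable
    deps: dict mapping job name -> set of upstream job names
    Returns a dict mapping job name -> "ok", "failed", or "skipped".
    """
    unknown = {up for ups in deps.values() for up in ups} - set(jobs)
    if unknown:
        raise ValueError(f"dependencies on undefined jobs: {sorted(unknown)}")

    # Each node maps to the set of jobs that must finish before it.
    graph = {name: set(deps.get(name, ())) for name in jobs}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        # Skip a job if any upstream job failed or was itself skipped.
        if any(status[up] != "ok" for up in graph[name]):
            status[name] = "skipped"
            continue
        try:
            jobs[name]()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
    return status
```

With plain time-based scheduling, a report job fires at its scheduled minute even if the load job before it is late or has failed. In this sketch, a call such as `run_in_dependency_order(jobs, {"daily_report": {"clean_logs"}})` marks `daily_report` as skipped whenever `clean_logs` does not succeed, so stale or partial data is not processed silently.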

Second stage – Hadoop era (2009-2013): With the launch of Taobao Mall (now Tmall) and rapid traffic growth, the RAC cluster could no longer handle the massive volume of log data. The team evaluated Greenplum and Hadoop, ultimately choosing Hadoop for its linear scalability and open-source nature. In early 2010 the "Cloud Ladder 1" Hadoop cluster was built, and all Oracle stored procedures were rewritten as Hive and MapReduce jobs. New data products such as Quantum Statistics, Data Cube, and the real-time computing platform Galaxy were released, enabling live data dashboards for events like Double 11. To solve data-synchronization problems, the team created tools such as DataX, Dbsync, and TT.

Figure 4: Cloud Ladder 1 data‑warehouse architecture

Figure 5: Data‑synchronization tools in the Hadoop era

The Tianwang scheduler was continuously improved to support hourly and minute‑level scheduling, automatic alerts, and integration with DQC, data‑map, and lineage systems.

Third stage – MaxCompute (ODPS) era (2010-present): In parallel with the Hadoop effort, Alibaba Cloud developed its own ODPS system (later renamed MaxCompute). Initially called "Cloud Ladder 2", it coexisted with Cloud Ladder 1. In 2013 the "5K project" tackled a cross-region cluster migration, a world first at that scale. Following its success, the "Moon-landing project" moved all group-wide data processing to ODPS, and Hadoop was decommissioned by 2015. The platform then expanded to provide public-cloud big-data services.

During this period the team also launched the “Kongming Lantern” solution, a unified data‑service framework that reduced redundancy, standardized metrics, and enabled self‑service data access for internal business units and external partners such as Gaode Maps and Alibaba Health.

Figure 6: Kongming Lantern solution architecture

By 2014 the group‑level public data layer was built, integrating services from Taobao, 1688, ICBU, and AE, and delivering products like the DIGO data‑portal. Today, data permeates every corner of Alibaba, supporting AI initiatives and driving the next “data‑intelligence” era.
