How Alibaba Scaled Real‑Time Data Processing for Double 11: Architecture & Lessons
This article details Alibaba's real‑time computing architecture for the 2016 Double 11 event, covering background, core components such as DRC, TT, Galaxy, OTS, XTool and OneService, and explains optimization techniques, fault‑tolerance strategies, stress‑testing practices, and future upgrade plans to handle massive streaming data workloads.
1. Real‑Time Computing Architecture
1.1 Background
During the 2016 Double 11 event, Alibaba operated three real‑time data live‑screens: one for media, one for merchants, and one for internal operations. Each screen required extremely high data precision, high throughput, low latency, zero error, and high stability. The payment peak reached 120,000 transactions per second, processing billions of records in 24 hours, all of which had to be displayed in real time and be audit‑ready for regulators.
1.2 Overall Real‑Time Processing Chain Architecture
Key components include:
DRC (Data Replication Center) : Alibaba’s proprietary data‑flow product for heterogeneous database real‑time sync and change‑data capture.
TT (TimeTunnel) : A reliable, scalable message communication platform based on producer‑consumer‑topic model.
GALAXY : A global stream‑processing platform handling most of Alibaba’s real‑time tasks, offering millisecond‑level latency, linear cluster scaling, fault tolerance, multi‑tenant isolation, and processing trillions of messages daily.
OTS (Open Table Service) : A massive structured and semi‑structured storage service built on the Feitian distributed computing system, providing strong consistency, cross‑table transactions, views, and pagination for high‑volume, low‑latency queries.
1.3 XTool Aggregation Component
XTool abstracts common aggregation operations (distinct, sum, count, max, min, average, ranking) and supports real‑time multi‑table joins, static dimension joins, and window management. Built on Storm’s Trident semantics, XTool enables configuration‑driven topology definition via XML without writing code, offering features such as replay, continuation, exactly‑once semantics, input throttling, automatic hash‑bucket deduplication, and Bloom‑filter based primary‑key uniqueness.
Performance optimizations include LRU‑based in‑memory caching, Bloom‑filter keys to avoid unnecessary reads, and leveraging localOrShuffle to move computation closer to storage nodes, dramatically reducing serialization overhead.
1.4 OneService
OneService is a unified data service platform offering three main capabilities:
Simple data query service (supports HBase, MySQL, Phoenix, OpenSearch).
Complex data query service (e.g., OneID, GProfile).
Real‑time data push service (high‑performance JSONP/WebSocket interfaces for traffic and transaction streams).
High‑availability is ensured through multi‑datacenter deployment, HSF group isolation, and rapid query‑path switching between redundant HBase clusters.
2. Big Data Chain Resilience
2.1 Real‑Time Task Optimization
Allocate exclusive resources for tasks that spend >80% of time contending for shared pools.
Choose appropriate caching mechanisms to minimize database reads/writes.
Merge computation units to reduce topology depth and serialization overhead.
Share in‑memory objects to avoid costly string copying.
Balance high throughput with low latency by merging ACK and I/O operations when acceptable.
2.2 Data Link Assurance
The real‑time pipeline (data sync → computation → storage → service) is protected by multi‑link, multi‑datacenter, and cross‑region disaster‑recovery setups. Automatic comparison of results across parallel links detects anomalies, triggering instant failover to a backup link without user impact.
2.3 Stress Testing
Before Double 11, Alibaba conducts extensive online stress tests:
Data flood testing simulates peak traffic by replaying hours‑or‑days worth of data in a short burst.
Product testing replays all backend read URLs at a target QPS of 500, iteratively optimizing server performance.
Frontend stability testing runs the live‑screen UI continuously for 8–24 hours, monitoring memory, CPU, and detecting JS memory leaks.
3. Architecture Upgrade for the Future
To meet growing data volumes, Alibaba plans to migrate XTool to Apache Beam, enabling execution on multiple runners such as Flink, Apex, Spark, and Gearpump. This will provide a portable, graphical flow‑definition tool, lower development cost, improve operational efficiency, and strengthen Alibaba’s contribution to the open‑source streaming community.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
