How Cainiao Built a Scalable Real‑Time Data Warehouse with Flink
Facing growing order volumes and strict timeliness demands, Cainiao’s tech team overhauled its real‑time data warehouse by redesigning data models, adopting Flink for streaming computation, upgrading data services, and exploring innovative tools, sharing practical lessons and future directions for large‑scale logistics analytics.
1. Previous Real‑Time Data Architecture
The early real‑time stack relied on Alibaba Cloud JStorm and Spark Streaming. While they handled many scenarios, logistics supply‑chain workloads exposed serious limitations: tangled internal data models, high development cost, siloed "chimney" development, inconsistent data across business lines, and difficulty reusing models.
Data Model Issues
Confusing hierarchical layers within business‑line models caused high data‑usage cost.
Demand‑driven siloed development prevented reuse and led to high compute cost.
Cross‑line data inconsistencies resulted in large consistency gaps.
Vertical models made BI queries cumbersome.
Real‑Time Computation Issues
JStorm and Spark Streaming could satisfy most cases but struggled with logistics‑specific logic and could not simultaneously guarantee functionality, performance, stability, and rapid fault recovery.
Data Service Issues
Real‑time data was sunk into MySQL, HBase, etc., making queries and guarantees inflexible.
BI permission control and end‑to‑end reliability were unreliable.
2. Data Model Upgrade
Inspired by offline warehouses, Cainiao introduced a layered model for real‑time data.
1) Model Layering
Data is first collected from sources such as MySQL and placed into the TT messaging middleware. Using TT and HBase dimension tables, a wide fact table is generated and written back to TT. Two downstream layers are derived:
Light aggregation layer – aggregates data across multiple dimensions.
Heavy aggregation layer – serves large‑screen dashboards.
2) Pre‑Split (Pre‑Routing)
A shared data‑middle layer aggregates all business lines, then each line performs its own split to create business‑specific intermediate layers. This reduces duplicate computation and enables horizontal split such as inbound vs. outbound supply‑chain streams.
3) Cainiao Supply‑Chain Real‑Time Model
The public middle layer stores global order, logistics detail, and aggregated metrics. A split task then extracts business‑specific streams (domestic, import, export supply‑chain). This clear separation simplifies identification of tables for dashboards versus analytical queries.
3. Computation Engine Improvements
In 2017 the team migrated from JStorm/Spark to Apache Flink, gaining several powerful capabilities tailored to logistics scenarios.
Key Flink Advantages
Full‑SQL support – developers can write complex streaming logic using familiar SQL syntax.
State‑based retraction – handles order cancellations and re‑assignments automatically.
CEP (Complex Event Processing) – enables timeout statistics and pattern detection.
Batch‑stream hybrid – seamless handling of both real‑time and batch workloads.
1) Magic Retraction
Flink’s last_value function retrieves the most recent non‑null value for a key, triggering a retraction when the value changes. This solves the problem of orders that are cancelled after being counted as valid.
2) Real‑Time Timeout Statistics
To count orders that exceed a six‑hour outbound‑to‑pickup window, the team used Flink’s Timer Service. When a timeout event fires, the timer reads the stored state and emits a synthetic timeout record, enabling accurate real‑time KPI calculation.
Typical pseudo‑code:
processElement(event):
state.update(event)
timerService.registerEventTimeTimer(event.timestamp + timeout)
onTimer(timestamp):
timeoutEvent = state.read()
emit(timeoutEvent)3) From Manual to Intelligent Optimization
Data skew during the map‑shuffle phase was mitigated by hash‑based pre‑aggregation and a three‑step aggregation strategy (MiniBatch, LocalGlobal, PartialFinal). Recent Flink releases also provide built‑in skew‑avoidance.
Resource configuration was simplified using Flink’s AutoScaling and batch‑mode stress testing. In peak (big‑promotion) scenarios, QPS is pre‑estimated, jobs are restarted, and Flink automatically benchmarks required resources.
4. Data Service Upgrade
The team built a unified middleware called TianGong that standardizes database connections, permission control, and end‑to‑end reliability.
1) NoSQL → TgSQL
TianGong translates NoSQL sources (e.g., HBase) into SQL‑like queries, allowing BI and operations teams to work with familiar relational syntax.
2) Cross‑Source Data Conversion
Using TianGong SQL, data from heterogeneous sources can be joined directly, eliminating the need for custom Java ETL code.
3) Service Guarantee Enhancements
Unified SQL‑DSL interface for all databases.
Automatic detection and throttling of slow queries.
Active‑passive dual‑write strategy for high‑traffic tables.
White‑list based rate limiting for critical users.
5. Exploration of Other Technical Tools
A one‑click real‑time load‑testing tool was built to simulate peak traffic and generate reports, reducing manual restart and source/sink swapping.
Flink‑based monitoring now tracks latency, checkpoint health, and TPS alerts automatically.
6. Future Development and Thoughts
Cainiao plans to deepen batch‑stream convergence and explore AI‑driven analytics. Flink’s batch mode will allow direct ingestion of MaxCompute dimension tables without persisting to HBase, simplifying state recovery after job restarts.
Challenges such as out‑of‑order data when merging real‑time and offline streams are addressed with custom UDFs that prioritize real‑time data while falling back to offline snapshots.
Further ambitions include leveraging Alink for business‑level intelligence and extending Flink’s smart‑optimization features (e.g., auto‑skew handling, resource auto‑scaling) to more logistics scenarios.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
