Big Data 18 min read

How Cainiao Built a Scalable Real‑Time Data Warehouse with Flink

Facing growing order volumes and strict timeliness demands, Cainiao’s tech team overhauled its real‑time data warehouse by redesigning data models, adopting Flink for streaming computation, upgrading data services, and exploring innovative tools, sharing practical lessons and future directions for large‑scale logistics analytics.

dbaplus Community
dbaplus Community
dbaplus Community
How Cainiao Built a Scalable Real‑Time Data Warehouse with Flink

1. Previous Real‑Time Data Architecture

The early real‑time stack relied on Alibaba Cloud JStorm and Spark Streaming. While they handled many scenarios, logistics supply‑chain workloads exposed serious limitations: tangled internal data models, high development cost, siloed "chimney" development, inconsistent data across business lines, and difficulty reusing models.

Data Model Issues

Confusing hierarchical layers within business‑line models caused high data‑usage cost.

Demand‑driven siloed development prevented reuse and led to high compute cost.

Cross‑line data inconsistencies resulted in large consistency gaps.

Vertical models made BI queries cumbersome.

Real‑Time Computation Issues

JStorm and Spark Streaming could satisfy most cases but struggled with logistics‑specific logic and could not simultaneously guarantee functionality, performance, stability, and rapid fault recovery.

Data Service Issues

Real‑time data was sunk into MySQL, HBase, etc., making queries and guarantees inflexible.

BI permission control and end‑to‑end reliability were unreliable.

2. Data Model Upgrade

Inspired by offline warehouses, Cainiao introduced a layered model for real‑time data.

1) Model Layering

Data is first collected from sources such as MySQL and placed into the TT messaging middleware. Using TT and HBase dimension tables, a wide fact table is generated and written back to TT. Two downstream layers are derived:

Light aggregation layer – aggregates data across multiple dimensions.

Heavy aggregation layer – serves large‑screen dashboards.

Model layering diagram
Model layering diagram

2) Pre‑Split (Pre‑Routing)

A shared data‑middle layer aggregates all business lines, then each line performs its own split to create business‑specific intermediate layers. This reduces duplicate computation and enables horizontal split such as inbound vs. outbound supply‑chain streams.

Pre‑split model diagram
Pre‑split model diagram

3) Cainiao Supply‑Chain Real‑Time Model

The public middle layer stores global order, logistics detail, and aggregated metrics. A split task then extracts business‑specific streams (domestic, import, export supply‑chain). This clear separation simplifies identification of tables for dashboards versus analytical queries.

Supply‑chain real‑time model
Supply‑chain real‑time model

3. Computation Engine Improvements

In 2017 the team migrated from JStorm/Spark to Apache Flink, gaining several powerful capabilities tailored to logistics scenarios.

Key Flink Advantages

Full‑SQL support – developers can write complex streaming logic using familiar SQL syntax.

State‑based retraction – handles order cancellations and re‑assignments automatically.

CEP (Complex Event Processing) – enables timeout statistics and pattern detection.

Batch‑stream hybrid – seamless handling of both real‑time and batch workloads.

1) Magic Retraction

Flink’s last_value function retrieves the most recent non‑null value for a key, triggering a retraction when the value changes. This solves the problem of orders that are cancelled after being counted as valid.

Retraction example
Retraction example

2) Real‑Time Timeout Statistics

To count orders that exceed a six‑hour outbound‑to‑pickup window, the team used Flink’s Timer Service. When a timeout event fires, the timer reads the stored state and emits a synthetic timeout record, enabling accurate real‑time KPI calculation.

Timeout statistics diagram
Timeout statistics diagram

Typical pseudo‑code:

processElement(event):
    state.update(event)
    timerService.registerEventTimeTimer(event.timestamp + timeout)

onTimer(timestamp):
    timeoutEvent = state.read()
    emit(timeoutEvent)

3) From Manual to Intelligent Optimization

Data skew during the map‑shuffle phase was mitigated by hash‑based pre‑aggregation and a three‑step aggregation strategy (MiniBatch, LocalGlobal, PartialFinal). Recent Flink releases also provide built‑in skew‑avoidance.

Skew mitigation diagram
Skew mitigation diagram

Resource configuration was simplified using Flink’s AutoScaling and batch‑mode stress testing. In peak (big‑promotion) scenarios, QPS is pre‑estimated, jobs are restarted, and Flink automatically benchmarks required resources.

4. Data Service Upgrade

The team built a unified middleware called TianGong that standardizes database connections, permission control, and end‑to‑end reliability.

1) NoSQL → TgSQL

TianGong translates NoSQL sources (e.g., HBase) into SQL‑like queries, allowing BI and operations teams to work with familiar relational syntax.

NoSQL to TgSQL conversion
NoSQL to TgSQL conversion

2) Cross‑Source Data Conversion

Using TianGong SQL, data from heterogeneous sources can be joined directly, eliminating the need for custom Java ETL code.

Cross‑source join example
Cross‑source join example

3) Service Guarantee Enhancements

Unified SQL‑DSL interface for all databases.

Automatic detection and throttling of slow queries.

Active‑passive dual‑write strategy for high‑traffic tables.

White‑list based rate limiting for critical users.

Service guarantee architecture
Service guarantee architecture

5. Exploration of Other Technical Tools

A one‑click real‑time load‑testing tool was built to simulate peak traffic and generate reports, reducing manual restart and source/sink swapping.

Flink‑based monitoring now tracks latency, checkpoint health, and TPS alerts automatically.

6. Future Development and Thoughts

Cainiao plans to deepen batch‑stream convergence and explore AI‑driven analytics. Flink’s batch mode will allow direct ingestion of MaxCompute dimension tables without persisting to HBase, simplifying state recovery after job restarts.

Challenges such as out‑of‑order data when merging real‑time and offline streams are addressed with custom UDFs that prioritize real‑time data while falling back to offline snapshots.

Further ambitions include leveraging Alink for business‑level intelligence and extending Flink’s smart‑optimization features (e.g., auto‑skew handling, resource auto‑scaling) to more logistics scenarios.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataFlinkStreamingdata modelingLogistics
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.