Evolution and Practices of Cainiao's Real‑Time Data Warehouse for International Import Business
This article describes the highly complex logistics scenario behind Cainiao's international import business and traces the evolution from an offline warehouse to real-time data warehouse versions 1.0 and 2.0. It covers the layered architecture, the main technical challenges (multi-source joins, state explosion, and out-of-order processing), and the concrete solutions: Flink state features, a logical middle layer, UNION ALL in place of joins, deduplication, the timer service, and batch-stream hybrid processing.
The import logistics chain at Cainiao involves many business nodes, massive daily data volumes, and long‑running fulfillment processes, making real‑time data warehouse construction highly challenging.
Background : A domestic buyer places an order with an overseas seller; the goods then pass through overseas customs, trunk transport, domestic customs, and final delivery, with Cainiao coordinating resources across the entire chain. Rapid growth in order volume and long fulfillment cycles demand accurate, timely data integration.
Real‑time Data Processing Pipeline : Business databases or log sources are ingested via tools like Sqoop/DataX into a message middleware (similar to Kafka). A real‑time compute engine (Flink) consumes the messages, performs transformations, and writes results to query services such as ADB (OLAP) and HBase/Lindorm for dashboards.
Evolution Timeline :
2014 – Offline warehouse with daily reports.
2015 – Hourly reports.
2016 – Early real‑time metrics using JStorm.
2017 – Adoption of Blink (Alibaba's internal fork of Apache Flink) and launch of real-time detail tables.
2018 – Release of Real‑time Warehouse 1.0.
2020 – Upgrade to Real‑time Warehouse 2.0 to address inflexibility, poor extensibility, and misuse of Blink features.
Architecture of Warehouse 2.0 :
Pre‑processing layer abstracts complex source logic.
Detail layer unifies models across business lines.
Summary layer provides lightweight and heavyweight aggregations for OLAP and real‑time dashboards.
Interface services expose unified APIs.
Data applications include real‑time screens, reports, and push notifications.
Key Challenges & Solutions :
Multiple business lines & models – Built a unified logical middle‑layer to extract common entities and reduce duplicated development.
Massive data sources & state size – Used Flink keyed state and operator state, deduplicated records at ingestion with row_number, and kept only the latest valid record per key so that state stays small.
Complex multi‑stream joins – Replaced many joins with UNION ALL followed by a single GROUP BY, dramatically cutting state usage and latency.
Out‑of‑order processing – Ensured partition‑wise ordering by hashing on primary keys, limited parallelism per partition, and used Flink Timer Service to generate timeout events without external middleware.
Heavy metric calculations – Adopted batch‑stream hybrid processing: combine historical offline data with real‑time streams, use LastValue logic to resolve conflicts, and share state between batch and stream phases.
Long fulfillment chain & state TTL – Extended state lifetimes, added data versioning and processing timestamps to handle late‑arriving corrections.
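For the deduplication step under "Massive data sources & state size", the idea of keeping only the latest record per key can be illustrated with a minimal plain-Python model. This is a sketch and not Flink code: the record schema (order_id, event_time, status) and the function name dedup_latest are hypothetical, and a dictionary stands in for keyed state.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical schema: (order_id, event_time, status)
Record = Tuple[str, int, str]


def dedup_latest(records: Iterable[Record]) -> Iterator[Record]:
    """Emit a record only if it is newer than the one held for its key."""
    latest: Dict[str, Record] = {}  # stands in for keyed state: one entry per order_id
    for rec in records:
        key, event_time, _ = rec
        held = latest.get(key)
        if held is None or event_time > held[1]:
            latest[key] = rec  # overwrite, so state holds one record per key
            yield rec
        # exact duplicates and older records are dropped silently


stream = [
    ("o1", 1, "CREATED"),
    ("o1", 1, "CREATED"),  # duplicate delivery
    ("o1", 3, "SHIPPED"),
    ("o1", 2, "PAID"),     # arrives late, older than what is held
    ("o2", 5, "CREATED"),
]
result = list(dedup_latest(stream))
```

State size is bounded by the number of distinct keys, not by the number of messages received, which is the property the item above relies on.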
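The UNION ALL plus GROUP BY approach from "Complex multi-stream joins" can be modeled in a few lines of Python. The sketch below assumes three hypothetical sources (orders, customs, delivery) that each contribute one column to a wide row; the column names and the helpers widen and merge_by_key are illustrative only. Each event is padded to the shared schema (the UNION ALL step), then rows are folded per key, keeping the latest non-null value per column (the GROUP BY step).

```python
from typing import Dict, List, Optional, Tuple

COLUMNS = ("paid_at", "cleared_at", "signed_at")  # hypothetical wide-table columns
Row = Dict[str, Optional[str]]


def widen(order_id: str, column: str, value: str) -> Tuple[str, Row]:
    """Pad a single-source event to the shared schema (the UNION ALL step)."""
    row: Row = {c: None for c in COLUMNS}
    row[column] = value
    return order_id, row


def merge_by_key(rows: List[Tuple[str, Row]]) -> Dict[str, Row]:
    """Fold rows per order_id, keeping the latest non-null value per column."""
    state: Dict[str, Row] = {}
    for order_id, row in rows:
        acc = state.setdefault(order_id, {c: None for c in COLUMNS})
        for column, value in row.items():
            if value is not None:
                acc[column] = value
    return state


orders = [widen("o1", "paid_at", "2020-06-01 10:00")]
customs = [widen("o1", "cleared_at", "2020-06-03 08:30")]
delivery = [
    widen("o1", "signed_at", "2020-06-05 17:45"),
    widen("o2", "signed_at", "2020-06-06 09:00"),
]
wide = merge_by_key(orders + customs + delivery)
```

Only one accumulated row per key is held, whereas a chain of stream joins would typically retain state for both sides of every join.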
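The timeout events mentioned under "Out-of-order processing" can be sketched as a small simulation of per-key timers. This is a plain-Python model with a logical clock, not the Flink Timer Service API itself; the node names OUTBOUND and SIGNED, the TIMEOUT value, and the function detect_timeouts are assumptions made for illustration. In Flink, the equivalent logic would normally live in a keyed process function that registers a timer and reacts when it fires.

```python
import heapq
from typing import Dict, List, Tuple

TIMEOUT = 10  # hypothetical allowed time between OUTBOUND and SIGNED


def detect_timeouts(events: List[Tuple[int, str, str]], end_time: int) -> List[Tuple[int, str]]:
    """events are (time, order_id, node), sorted by time.

    Returns (fire_time, order_id) for every order not signed before its deadline.
    """
    timers: List[Tuple[int, str]] = []  # min-heap of (deadline, order_id)
    pending: Dict[str, int] = {}        # order_id -> active deadline
    fired: List[Tuple[int, str]] = []

    def advance(now: int) -> None:
        # Fire every timer whose deadline has passed and is still active.
        while timers and timers[0][0] <= now:
            deadline, order_id = heapq.heappop(timers)
            if pending.get(order_id) == deadline:
                del pending[order_id]
                fired.append((deadline, order_id))

    for time, order_id, node in events:
        advance(time)
        if node == "OUTBOUND":
            pending[order_id] = time + TIMEOUT
            heapq.heappush(timers, (time + TIMEOUT, order_id))
        elif node == "SIGNED":
            pending.pop(order_id, None)  # cancels the timer logically
    advance(end_time)
    return fired


events = [(0, "o1", "OUTBOUND"), (2, "o2", "OUTBOUND"), (5, "o1", "SIGNED")]
timeouts = detect_timeouts(events, end_time=20)
```

Here o1 is signed in time and produces nothing, while o2 yields a timeout at its deadline. The timeout is generated inside the job from its own state, with no external scheduler or delay queue.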
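For the batch-stream hybrid approach under "Heavy metric calculations", one way to picture the LastValue conflict rule is the sketch below. It assumes every record carries an update timestamp and that the most recent update wins; the tuple layout (order_id, updated_at, status) and the function last_value_merge are hypothetical, and the shared state is again modeled as a dictionary.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical layout: (order_id, updated_at, status)
Update = Tuple[str, int, str]


def last_value_merge(historical: Iterable[Update], realtime: Iterable[Update]) -> Dict[str, str]:
    """Seed state from offline history, then apply stream updates.

    On conflict, the record with the larger updated_at wins.
    """
    state: Dict[str, Tuple[int, str]] = {}
    for order_id, updated_at, status in list(historical) + list(realtime):
        held = state.get(order_id)
        if held is None or updated_at >= held[0]:
            state[order_id] = (updated_at, status)
    return {order_id: status for order_id, (_, status) in state.items()}


historical = [("o1", 100, "IN_TRANSIT"), ("o2", 90, "CLEARED")]
realtime = [("o1", 120, "SIGNED"), ("o2", 80, "DECLARED")]  # o2's stream record is stale
merged = last_value_merge(historical, realtime)
```

The newer stream record replaces the offline value for o1, while the stale stream record for o2 loses to the offline snapshot.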
Practical Tips :
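A message that has already been sent downstream cannot be retracted, so mark invalid records with a flag instead of filtering them out; otherwise downstream tables keep the stale, still-valid version.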
Introduce data version and processing time fields to support accurate deduplication.
Use real-time data reconciliation to monitor data quality.
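The first two tips can be illustrated together with a small model of an upsert sink. The sketch assumes a downstream table keyed by order_id and change records carrying version, proc_time, and is_valid fields; these field names and the function apply_upserts are hypothetical. It compares what the sink ends up holding when invalid records are filtered out and when they are passed through with a flag.

```python
from typing import Dict, List


def apply_upserts(changes: List[dict], drop_invalid: bool) -> Dict[str, dict]:
    """Apply changes to a table keyed by order_id.

    A change replaces the held row only if (version, proc_time) is larger.
    """
    table: Dict[str, dict] = {}
    for change in changes:
        if drop_invalid and not change["is_valid"]:
            continue  # filtered: the sink never learns the order became invalid
        held = table.get(change["order_id"])
        if held is None or (change["version"], change["proc_time"]) > (
            held["version"],
            held["proc_time"],
        ):
            table[change["order_id"]] = change
    return table


changes = [
    {"order_id": "o1", "version": 1, "proc_time": 1, "is_valid": True},
    {"order_id": "o1", "version": 2, "proc_time": 2, "is_valid": False},  # order cancelled
    {"order_id": "o1", "version": 1, "proc_time": 3, "is_valid": True},   # late duplicate
]
filtered = apply_upserts(changes, drop_invalid=True)
marked = apply_upserts(changes, drop_invalid=False)
```

With filtering, the sink still reports o1 as valid because the cancellation never reached it. With marking, version 2 overwrites the row and the late version 1 duplicate is rejected, so a query that counts valid orders sees the correct result.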
Summary & Outlook :
Good data models and a sound architecture solve most problems. Accurate requirements assessment avoids over-engineering, and proper use of Flink state and checkpointing is essential. Future work includes unified real-time/batch processing, automatic resource tuning, and advanced data quality monitoring.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
