Real-Time Data Warehouse Construction: Background, Objectives, Architecture, and Case Studies
This article explains the growing demand for real‑time data warehouses, outlines their objectives and layered architecture, and presents detailed case studies from Didi, Kuaishou, Tencent, Youzan and others, illustrating design choices, implementation challenges, and best practices for building scalable streaming data platforms.
1. Background of Real-Time Data Warehouse
Companies increasingly require real‑time data to support rapid decision‑making, as traditional offline warehouses with T+1 latency cannot meet these needs. Real‑time processing frameworks have matured through three generations—Storm, Spark Streaming, and Flink—allowing SQL‑based development and reducing operational complexity.
2. Objectives of Real-Time Data Warehouse
The main goals are to overcome the latency limitations of offline warehouses, provide high‑quality real‑time data for business decisions, and standardize data availability. Key motivations include the urgent need for real‑time data, lack of existing standards, and improved tool support that lowers development costs.
Support real‑time decision making for business operations.
Improve data usability and reduce resource waste.
Leverage mature development platforms for lower cost.
Application Scenarios
Real‑time OLAP analysis.
Real‑time dashboards.
Real‑time business monitoring.
Real‑time data interface services.
3. Architecture and Layered Design
The architecture follows a layered model similar to offline warehouses but with fewer layers to reduce latency.
3.1 ODS (Source Layer)
Data from binlog, public logs, and traffic logs are ingested into Kafka or DDQ, with naming conventions such as cn-binlog‑database‑table for automatically generated topics and realtime_ods_binlog_{source} for custom topics.
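As a rough sketch (not taken from the source), an ODS binlog topic can be exposed to Stream SQL as a Kafka-backed table; the topic name follows the convention above, while the fields, broker address, and format are illustrative assumptions.
-- Hypothetical ODS source table over a binlog topic; all names and options are placeholders.
CREATE TABLE realtime_ods_binlog_order (
    order_id     BIGINT,
    user_id      BIGINT,
    driver_id    BIGINT,
    order_status INT,
    update_time  TIMESTAMP(3),
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'realtime_ods_binlog_order',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'ods_order_reader',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);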
3.2 DWD (Detail Layer)
Follows a modeling‑driven approach to build fine‑grained fact tables, optionally denormalizing key dimensions for wide tables. Data is processed with Stream SQL, cleaned, and stored in Kafka and Druid for downstream queries.
Table naming rule: realtime_dwd_{business}_{domain}_{process}[_{tag}] (e.g., realtime_dwd_trip_trd_order_base).
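A minimal DWD sketch, assuming the ODS source table above: the stream is filtered and reshaped into a detail-layer fact table named per the rule; the sink topic, extra columns, and cleaning predicates are invented for illustration.
-- Hypothetical DWD fact table and cleaning job; status codes and filters are placeholders.
CREATE TABLE realtime_dwd_trip_trd_order_base (
    order_id     BIGINT,
    user_id      BIGINT,
    driver_id    BIGINT,
    order_status INT,
    update_time  TIMESTAMP(3),
    proctime AS PROCTIME(),
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'realtime_dwd_trip_trd_order_base',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

INSERT INTO realtime_dwd_trip_trd_order_base
SELECT order_id, user_id, driver_id, order_status, update_time
FROM realtime_ods_binlog_order
WHERE order_id IS NOT NULL            -- drop dirty records
  AND order_status IN (1, 2, 3);      -- keep only meaningful states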
3.3 DIM (Dimension Layer)
Provides consistent dimension data using HBase, MySQL, or the internal KV store (fusion). Naming rule: dim_{business}_{definition}[_{tag}] (e.g., dim_trip_dri_base).
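How such a dimension table is consumed can be sketched with a Flink SQL lookup join against HBase; the connector version, column family, join key, and the processing-time attribute on the fact stream are all assumptions for illustration.
-- Hypothetical HBase-backed dimension table and a lookup (temporal) join.
CREATE TABLE dim_trip_dri_base (
    rowkey STRING,
    cf     ROW<driver_name STRING, city_id INT>,
    PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
    'connector' = 'hbase-2.2',
    'table-name' = 'dim_trip_dri_base',
    'zookeeper.quorum' = 'zk:2181'
);

-- o.proctime is the processing-time attribute (proctime AS PROCTIME()) declared on the fact stream.
SELECT o.order_id, d.cf.driver_name, d.cf.city_id
FROM realtime_dwd_trip_trd_order_base AS o
JOIN dim_trip_dri_base FOR SYSTEM_TIME AS OF o.proctime AS d
    ON CAST(o.driver_id AS STRING) = d.rowkey;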
3.4 DWM (Summary Layer)
Aggregates data for common metrics (PV, UV, order statistics) with unified calculations. Naming rule:
realtime_dwm_{business}_{domain}_{granularity}[_{tag}]_{interval} (e.g., realtime_dwm_trip_trd_pas_bus_accum_1min).
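For example, a per-minute PV/UV rollup in this layer can be written with the window TVF syntax (Flink 1.13+); the sink table is assumed to be declared separately, and all names are illustrative.
-- Hypothetical 1-minute PV/UV summary feeding a table named per the rule above.
INSERT INTO realtime_dwm_trip_trd_order_1min
SELECT
    window_start,
    window_end,
    COUNT(*)                AS pv,
    COUNT(DISTINCT user_id) AS uv
FROM TABLE(
    TUMBLE(TABLE realtime_dwd_trip_trd_order_base, DESCRIPTOR(update_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end;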
3.5 APP (Application Layer)
Writes summarized data to downstream stores such as Druid for dashboards, HBase/MySQL for services, and ClickHouse for real‑time OLAP.
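As an illustration of this last hop, the summary above can be pushed to a MySQL serving table through a JDBC sink; the connection options and names are placeholders.
-- Hypothetical APP-layer sink table in MySQL, written via the JDBC connector.
CREATE TABLE app_trip_order_1min (
    window_start TIMESTAMP(3),
    window_end   TIMESTAMP(3),
    pv           BIGINT,
    uv           BIGINT,
    PRIMARY KEY (window_start, window_end) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://mysql:3306/app',
    'table-name' = 'app_trip_order_1min',
    'username' = 'app_user',
    'password' = '******'
);

INSERT INTO app_trip_order_1min
SELECT window_start, window_end, pv, uv
FROM realtime_dwm_trip_trd_order_1min;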
4. Case Studies
4.1 Didi Real‑Time Warehouse
Implemented ODS → DWD → DIM → DWM → APP layers, reducing data duplication and improving resource utilization. Highlighted differences from offline warehouses, such as fewer layers and use of Kafka, HBase, and Druid.
4.2 Kuaishou Real‑Time Warehouse
Targeted sub‑1% deviation between real‑time and offline metrics, 5‑minute SLA for core reports, and stability across massive traffic (trillions of events per day). Described challenges of data volume, component dependencies, and job count, and presented solutions using Flink SQL Early Fire, Cumulate Window, and state‑size optimizations.
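The two Flink SQL techniques mentioned can be sketched as follows; the early-fire keys are experimental, version-dependent configuration, and all table, column, and interval choices are illustrative.
-- Early fire lets a long group-window aggregation emit partial results periodically
-- instead of only at window close (experimental configuration keys).
SET 'table.exec.emit.early-fire.enabled' = 'true';
SET 'table.exec.emit.early-fire.delay' = '60s';

-- A cumulate window is the window-TVF alternative for "so far today" metrics:
-- one day of data, advanced and emitted every minute.
SELECT window_start, window_end, COUNT(DISTINCT user_id) AS uv
FROM TABLE(
    CUMULATE(TABLE realtime_dwd_trip_trd_order_base, DESCRIPTOR(update_time),
             INTERVAL '1' MINUTE, INTERVAL '1' DAY))
GROUP BY window_start, window_end;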
4.3 Tencent Lookpoint
Adopted Lambda architecture with Flink as the streaming engine and ClickHouse as the real‑time storage. Discussed high‑performance dimension joins, caching strategies with Redis, and fault‑tolerant checkpointing.
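The checkpointing side of such a job can be expressed directly as SQL configuration; the values below are illustrative, and the Redis cache in front of the dimension store described in the talk is not reproduced here.
-- Hypothetical fault-tolerance settings for a streaming SQL job.
SET 'execution.checkpointing.interval' = '60s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
SET 'execution.checkpointing.min-pause' = '10s';
SET 'execution.checkpointing.timeout' = '10min';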
4.4 Youzan Real‑Time Warehouse
Followed a simplified layered design (ODS, DWS, DIM, DWA, APP) with naming conventions like deptname.appname.ods_subjectname_tablename. Emphasized real‑time ETL components, idempotent processing, and data validation methods.
4.5 Tencent Full‑Scenario Real‑Time Warehouse
Analyzed the limitations of Lambda and Kappa architectures, then introduced a Flink + Iceberg solution that provides near‑real‑time ingestion, streaming reads, and batch‑compatible storage, enabling low‑latency queries and efficient data lake management.
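A minimal sketch of this pattern, assuming a Hive-backed Iceberg catalog; the catalog options and the streaming-read hint depend on the iceberg-flink runtime version, and every name and URI below is a placeholder.
-- Register an Iceberg catalog, ingest the DWD stream, and read it back incrementally.
CREATE CATALOG iceberg_catalog WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hive',
    'uri' = 'thrift://metastore:9083',
    'warehouse' = 'hdfs://namenode:8020/warehouse/iceberg'
);

-- Near-real-time ingestion: continuously append the detail stream into an Iceberg table.
INSERT INTO iceberg_catalog.rt_dw.dwd_order_base
SELECT order_id, user_id, driver_id, order_status, update_time
FROM realtime_dwd_trip_trd_order_base;

-- Streaming (incremental) read of the same table for downstream, batch-compatible jobs.
SELECT * FROM iceberg_catalog.rt_dw.dwd_order_base
/*+ OPTIONS('streaming'='true', 'monitor-interval'='60s') */;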
5. Quality, Timeliness, and Stability Guarantees
Quality is ensured through source‑level out‑of‑order data monitoring, benchmark comparisons, and offline‑online consistency checks. Timeliness is achieved through pressure testing, performance evaluation, and checkpoint (CP) recovery strategies. Stability is addressed with multi‑level redundancy, hot/cold standby data centers, and automated failover mechanisms.
6. Scaling and Storage Optimizations
ClickHouse is used with Zookeeper‑based replication, batch writes to reduce QPS pressure, and sharding to avoid hot‑spot issues. Sparse indexes and materialized views improve query performance, while routing ensures that queries hit only the relevant shard.
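An illustrative ClickHouse layout matching this description, with placeholder cluster, database, and column names.
-- Replicated local shards plus a Distributed table for routing; the sort key doubles as the sparse primary index.
CREATE TABLE rt.app_order_local ON CLUSTER rt_cluster (
    event_date   Date,
    event_minute DateTime,
    city_id      UInt32,
    pv           UInt64,
    uv           UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/app_order_local', '{replica}')
PARTITION BY event_date
ORDER BY (city_id, event_minute);

-- Distributed table sharded by city_id so a query for one city hits a single shard.
CREATE TABLE rt.app_order ON CLUSTER rt_cluster AS rt.app_order_local
ENGINE = Distributed(rt_cluster, rt, app_order_local, city_id);

-- Optional rollup materialized view for dashboard queries (uv omitted: distinct counts do not sum).
CREATE MATERIALIZED VIEW rt.app_order_daily_mv ON CLUSTER rt_cluster
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{shard}/app_order_daily_mv', '{replica}')
PARTITION BY event_date
ORDER BY (city_id, event_date)
AS SELECT event_date, city_id, sum(pv) AS pv
   FROM rt.app_order_local
   GROUP BY event_date, city_id;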
7. Code Snippets
-- Register the UDFs used by the detail-layer jobs (implementation classes elided in the source).
create function call_dubbo as 'XXXXXXX';
create function get_json_object as 'XXXXXXX';

-- Backfill a missing ID by calling a remote Dubbo service and parsing the JSON response.
case
  when cast(b.column as bigint) is not null
    then cast(b.column as bigint)
  else cast(coalesce(
         cast(get_json_object(
                call_dubbo('clusterUrl',
                           'serviceName',
                           'methodName',
                           cast(concat('[', cast(a.column as varchar), ']') as varchar),
                           'key'),
                'rootId') as bigint),
         a.column) as bigint)
end

-- Idempotent processing: keep only the first occurrence of each order number.
create function idempotenc as 'XXXXXXX';

insert into table
select order_no
from (
  select a.orderNo as order_no,
         idempotenc('XXXXXXX', coalesce(order_no, '')) as rid
  from table1
) t
where t.rid = 0;

Conclusion
The presented designs and practices demonstrate how to build a robust, low‑latency real‑time data warehouse that integrates streaming computation, efficient storage, and reliable delivery to downstream applications, while addressing challenges of data volume, disorder, and operational stability.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.