Real-Time Data Warehouse Construction: Background, Objectives, Architecture, and Case Studies
This article explains the growing demand for real‑time data warehouses, outlines their objectives and layered architecture, and presents detailed case studies from Didi, Kuaishou, Tencent, Youzan and others, illustrating design choices, implementation challenges, and best practices for building scalable streaming data platforms.
1. Background of Real-Time Data Warehouse
Companies increasingly require real‑time data to support rapid decision‑making, as traditional offline warehouses with T+1 latency cannot meet these needs. Real‑time processing frameworks have matured through three generations—Storm, Spark Streaming, and Flink—allowing SQL‑based development and reducing operational complexity.
2. Objectives of Real-Time Data Warehouse
The main goals are to overcome the latency limitations of offline warehouses, provide high‑quality real‑time data for business decisions, and standardize data availability. Key motivations include the urgent need for real‑time data, lack of existing standards, and improved tool support that lowers development costs.
Support real‑time decision making for business operations.
Improve data usability and reduce resource waste.
Leverage mature development platforms for lower cost.
Application Scenarios
Real‑time OLAP analysis.
Real‑time dashboards.
Real‑time business monitoring.
Real‑time data interface services.
3. Architecture and Layered Design
The architecture follows a layered model similar to offline warehouses but with fewer layers to reduce latency.
3.1 ODS (Source Layer)
Data from binlog, public logs, and traffic logs are ingested into Kafka or DDQ, with naming conventions such as cn-binlog‑database‑table for automatically generated topics and realtime_ods_binlog_{source} for custom topics.
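As a rough sketch (not taken from the source), an ODS binlog topic can be exposed to Stream SQL as a Kafka-backed table; the topic name follows the convention above, while the fields, broker address, and format are illustrative assumptions.
-- Hypothetical ODS source table over a binlog topic; all names and options are placeholders.
CREATE TABLE realtime_ods_binlog_order (
    order_id     BIGINT,
    user_id      BIGINT,
    driver_id    BIGINT,
    order_status INT,
    update_time  TIMESTAMP(3),
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'realtime_ods_binlog_order',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'ods_order_reader',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);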
3.2 DWD (Detail Layer)
Follows a modeling‑driven approach to build fine‑grained fact tables, optionally denormalizing key dimensions for wide tables. Data is processed with Stream SQL, cleaned, and stored in Kafka and Druid for downstream queries.
Table naming rule: realtime_dwd_{business}_{domain}_{process}[_{tag}] (e.g., realtime_dwd_trip_trd_order_base).
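A minimal DWD sketch, assuming the ODS source table above: the stream is filtered and reshaped into a detail-layer fact table named per the rule; the sink topic, extra columns, and cleaning predicates are invented for illustration.
-- Hypothetical DWD fact table and cleaning job; status codes and filters are placeholders.
CREATE TABLE realtime_dwd_trip_trd_order_base (
    order_id     BIGINT,
    user_id      BIGINT,
    driver_id    BIGINT,
    order_status INT,
    update_time  TIMESTAMP(3),
    proctime AS PROCTIME(),
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'realtime_dwd_trip_trd_order_base',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

INSERT INTO realtime_dwd_trip_trd_order_base
SELECT order_id, user_id, driver_id, order_status, update_time
FROM realtime_ods_binlog_order
WHERE order_id IS NOT NULL            -- drop dirty records
  AND order_status IN (1, 2, 3);      -- keep only meaningful states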
3.3 DIM (Dimension Layer)
Provides consistent dimension data using HBase, MySQL, or the internal KV store (fusion). Naming rule: dim_{business}_{definition}[_{tag}] (e.g., dim_trip_dri_base).
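How such a dimension table is consumed can be sketched with a Flink SQL lookup join against HBase; the connector version, column family, join key, and the processing-time attribute on the fact stream are all assumptions for illustration.
-- Hypothetical HBase-backed dimension table and a lookup (temporal) join.
CREATE TABLE dim_trip_dri_base (
    rowkey STRING,
    cf     ROW<driver_name STRING, city_id INT>,
    PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
    'connector' = 'hbase-2.2',
    'table-name' = 'dim_trip_dri_base',
    'zookeeper.quorum' = 'zk:2181'
);

-- o.proctime is the processing-time attribute (proctime AS PROCTIME()) declared on the fact stream.
SELECT o.order_id, d.cf.driver_name, d.cf.city_id
FROM realtime_dwd_trip_trd_order_base AS o
JOIN dim_trip_dri_base FOR SYSTEM_TIME AS OF o.proctime AS d
    ON CAST(o.driver_id AS STRING) = d.rowkey;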
3.4 DWM (Summary Layer)
Aggregates data for common metrics (PV, UV, order statistics) with unified calculations. Naming rule:
realtime_dwm_{business}_{domain}_{granularity}[_{tag}]_{interval} (e.g., realtime_dwm_trip_trd_pas_bus_accum_1min).
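For example, a per-minute PV/UV rollup in this layer can be written with the window TVF syntax (Flink 1.13+); the sink table is assumed to be declared separately, and all names are illustrative.
-- Hypothetical 1-minute PV/UV summary feeding a table named per the rule above.
INSERT INTO realtime_dwm_trip_trd_order_1min
SELECT
    window_start,
    window_end,
    COUNT(*)                AS pv,
    COUNT(DISTINCT user_id) AS uv
FROM TABLE(
    TUMBLE(TABLE realtime_dwd_trip_trd_order_base, DESCRIPTOR(update_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end;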
3.5 APP (Application Layer)
Writes summarized data to downstream stores such as Druid for dashboards, HBase/MySQL for services, and ClickHouse for real‑time OLAP.
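As an illustration of this last hop, the summary above can be pushed to a MySQL serving table through a JDBC sink; the connection options and names are placeholders.
-- Hypothetical APP-layer sink table in MySQL, written via the JDBC connector.
CREATE TABLE app_trip_order_1min (
    window_start TIMESTAMP(3),
    window_end   TIMESTAMP(3),
    pv           BIGINT,
    uv           BIGINT,
    PRIMARY KEY (window_start, window_end) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://mysql:3306/app',
    'table-name' = 'app_trip_order_1min',
    'username' = 'app_user',
    'password' = '******'
);

INSERT INTO app_trip_order_1min
SELECT window_start, window_end, pv, uv
FROM realtime_dwm_trip_trd_order_1min;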
4. Case Studies
4.1 Didi Real‑Time Warehouse
Implemented ODS → DWD → DIM → DWM → APP layers, reducing data duplication and improving resource utilization. Highlighted differences from offline warehouses, such as fewer layers and use of Kafka, HBase, and Druid.
4.2 Kuaishou Real‑Time Warehouse
Targeted sub‑1% deviation between real‑time and offline metrics, 5‑minute SLA for core reports, and stability across massive traffic (trillions of events per day). Described challenges of data volume, component dependencies, and job count, and presented solutions using Flink SQL Early Fire, Cumulate Window, and state‑size optimizations.
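The two Flink SQL techniques mentioned can be sketched as follows; the early-fire keys are experimental, version-dependent configuration, and all table, column, and interval choices are illustrative.
-- Early fire lets a long group-window aggregation emit partial results periodically
-- instead of only at window close (experimental configuration keys).
SET 'table.exec.emit.early-fire.enabled' = 'true';
SET 'table.exec.emit.early-fire.delay' = '60s';

-- A cumulate window is the window-TVF alternative for "so far today" metrics:
-- one day of data, advanced and emitted every minute.
SELECT window_start, window_end, COUNT(DISTINCT user_id) AS uv
FROM TABLE(
    CUMULATE(TABLE realtime_dwd_trip_trd_order_base, DESCRIPTOR(update_time),
             INTERVAL '1' MINUTE, INTERVAL '1' DAY))
GROUP BY window_start, window_end;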
4.3 Tencent Lookpoint
Adopted Lambda architecture with Flink as the streaming engine and ClickHouse as the real‑time storage. Discussed high‑performance dimension joins, caching strategies with Redis, and fault‑tolerant checkpointing.
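The checkpointing side of such a job can be expressed directly as SQL configuration; the values below are illustrative, and the Redis cache in front of the dimension store described in the talk is not reproduced here.
-- Hypothetical fault-tolerance settings for a streaming SQL job.
SET 'execution.checkpointing.interval' = '60s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
SET 'execution.checkpointing.min-pause' = '10s';
SET 'execution.checkpointing.timeout' = '10min';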
4.4 Youzan Real‑Time Warehouse
Followed a simplified layered design (ODS, DWS, DIM, DWA, APP) with naming conventions like deptname.appname.ods_subjectname_tablename. Emphasized real‑time ETL components, idempotent processing, and data validation methods.
4.5 Tencent Full‑Scenario Real‑Time Warehouse
Analyzed the limitations of Lambda and Kappa architectures, then introduced a Flink + Iceberg solution that provides near‑real‑time ingestion, streaming reads, and batch‑compatible storage, enabling low‑latency queries and efficient data lake management.
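A minimal sketch of this pattern, assuming a Hive-backed Iceberg catalog; the catalog options and the streaming-read hint depend on the iceberg-flink runtime version, and every name and URI below is a placeholder.
-- Register an Iceberg catalog, ingest the DWD stream, and read it back incrementally.
CREATE CATALOG iceberg_catalog WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hive',
    'uri' = 'thrift://metastore:9083',
    'warehouse' = 'hdfs://namenode:8020/warehouse/iceberg'
);

-- Near-real-time ingestion: continuously append the detail stream into an Iceberg table.
INSERT INTO iceberg_catalog.rt_dw.dwd_order_base
SELECT order_id, user_id, driver_id, order_status, update_time
FROM realtime_dwd_trip_trd_order_base;

-- Streaming (incremental) read of the same table for downstream, batch-compatible jobs.
SELECT * FROM iceberg_catalog.rt_dw.dwd_order_base
/*+ OPTIONS('streaming'='true', 'monitor-interval'='60s') */;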
5. Quality, Timeliness, and Stability Guarantees
Quality is ensured through source‑level out‑of‑order data monitoring, benchmark comparisons, and offline‑online consistency checks. Timeliness is achieved through pressure testing, performance evaluation, and checkpoint (CP) recovery strategies. Stability is addressed with multi‑level redundancy, hot/cold standby data centers, and automated failover mechanisms.
6. Scaling and Storage Optimizations
ClickHouse is used with Zookeeper‑based replication, batch writes to reduce QPS pressure, and sharding to avoid hot‑spot issues. Sparse indexes and materialized views improve query performance, while routing ensures that queries hit only the relevant shard.
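An illustrative ClickHouse layout matching this description, with placeholder cluster, database, and column names.
-- Replicated local shards plus a Distributed table for routing; the sort key doubles as the sparse primary index.
CREATE TABLE rt.app_order_local ON CLUSTER rt_cluster (
    event_date   Date,
    event_minute DateTime,
    city_id      UInt32,
    pv           UInt64,
    uv           UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/app_order_local', '{replica}')
PARTITION BY event_date
ORDER BY (city_id, event_minute);

-- Distributed table sharded by city_id so a query for one city hits a single shard.
CREATE TABLE rt.app_order ON CLUSTER rt_cluster AS rt.app_order_local
ENGINE = Distributed(rt_cluster, rt, app_order_local, city_id);

-- Optional rollup materialized view for dashboard queries (uv omitted: distinct counts do not sum).
CREATE MATERIALIZED VIEW rt.app_order_daily_mv ON CLUSTER rt_cluster
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{shard}/app_order_daily_mv', '{replica}')
PARTITION BY event_date
ORDER BY (city_id, event_date)
AS SELECT event_date, city_id, sum(pv) AS pv
   FROM rt.app_order_local
   GROUP BY event_date, city_id;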
7. Code Snippets
-- Register the UDFs used by the detail-layer jobs (implementation classes elided in the source).
create function call_dubbo as 'XXXXXXX';
create function get_json_object as 'XXXXXXX';

-- Backfill a missing ID by calling a remote Dubbo service and parsing the JSON response.
case
  when cast(b.column as bigint) is not null
    then cast(b.column as bigint)
  else cast(coalesce(
         cast(get_json_object(
                call_dubbo('clusterUrl',
                           'serviceName',
                           'methodName',
                           cast(concat('[', cast(a.column as varchar), ']') as varchar),
                           'key'),
                'rootId') as bigint),
         a.column) as bigint)
end

-- Idempotent processing: keep only the first occurrence of each order number.
create function idempotenc as 'XXXXXXX';

insert into table
select order_no
from (
  select a.orderNo as order_no,
         idempotenc('XXXXXXX', coalesce(order_no, '')) as rid
  from table1
) t
where t.rid = 0;

Conclusion
The presented designs and practices demonstrate how to build a robust, low‑latency real‑time data warehouse that integrates streaming computation, efficient storage, and reliable delivery to downstream applications, while addressing challenges of data volume, disorder, and operational stability.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.