
How Real-Time Data Warehouses Power Modern Business: Architecture, Cases, and Best Practices

This article explains why real‑time data warehouses are becoming essential, outlines their goals, compares them with traditional offline warehouses, and presents detailed design patterns, naming conventions, and case studies from Didi, Kuaishou, Tencent, Youzan and other enterprises, highlighting challenges and solutions for streaming, storage, and query layers.

Data Thinking Notes

Dec 23, 2022


1. Background of Real‑Time Data Warehouse Construction

1.1 Growing Real‑Time Demand

Companies increasingly require data with near‑zero latency for product features and internal decision‑making; traditional offline warehouses operate on a T+1 schedule with daily batch jobs, which cannot satisfy high‑frequency, low‑latency scenarios.

1.2 Maturing Real‑Time Technologies

Real‑time computation frameworks have evolved through three generations—Storm, Spark Streaming, and Flink—becoming more stable. Development can now be expressed in SQL, inheriting offline warehouse design principles, while online data platforms provide better development, debugging, and operation support, reducing overall cost.

2. Objectives of Real‑Time Data Warehouse

2.1 Solving Traditional Warehouse Issues

Traditional warehouses accumulate historical data from the day a product launches, whereas real‑time stream processing focuses on the current state of the data. The goal is to combine warehouse theory with real‑time technology so that the low timeliness of offline data stops being a bottleneck.

Current motivations include:

Business needs real‑time data to support rapid decision‑making.

Real‑time data lacks standards, resulting in poor usability and resource waste.

Platform tools increasingly support end‑to‑end real‑time development, lowering costs.

2.2 Application Scenarios

Real‑time OLAP analysis.

Real‑time dashboards.

Real‑time business monitoring.

Real‑time data service APIs.

3. Real‑Time Data Warehouse Design

We analyze several representative cases to provide inspiration for building a robust real‑time warehouse.

3.1 Didi Ride‑Sharing Real‑Time Warehouse Case

Didi built a real‑time warehouse that satisfies the ride‑sharing business’s diverse real‑time needs, establishing layered data (detail and aggregate) and unifying the DWD layer, which reduces big‑data resource consumption and improves data reuse.

Warehouse architecture, compared with an offline warehouse:

Fewer layers than offline warehouses – the real‑time warehouse removes some intermediate layers, reducing latency.

Different storage for real‑time data – detail data may reside in Kafka, while dimension data uses HBase, MySQL, or other KV stores.

3.1.1 ODS (Source Layer) Construction

Data sources include order binlog, safety logs, and traffic logs. Some data are written directly to Kafka or DD‑MQ, while others are collected via internal sync tools and stored in Kafka topics. Topic names follow conventions such as cn-binlog-ihap_fangyuan-ihap_fangyuan or, for custom topics:

realtime_ods_binlog_ihap_fangyuan

3.1.2 DWD (Detail Layer) Construction

Based on business processes, the most granular fact tables are built. Important dimensions are denormalized to create wide tables. Data are processed with Stream SQL, handling binlog cleaning, drift, out‑of‑order data, and joins. Fact tables are stored in Kafka and also written to Druid for fast query.

Naming rule example:

realtime_dwd_trip_trd_order_base
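The binlog cleaning and out‑of‑order handling mentioned in 3.1.2 can be sketched in a few lines. This is not Didi's Stream SQL; it is plain Python with hypothetical field names (order_id, update_time, status). It only illustrates one common idea: for each key, keep the record with the newest update time, so a duplicate or late‑arriving older record cannot overwrite newer state.

```python
from typing import Dict, Iterable, List


def latest_per_key(events: Iterable[dict], key: str = "order_id",
                   ts: str = "update_time") -> List[dict]:
    """Keep, for every key, the record with the largest timestamp.

    Field names are illustrative assumptions, not taken from the article.
    """
    latest: Dict[object, dict] = {}
    for event in events:
        current = latest.get(event[key])
        # An older (out-of-order) record never replaces a newer one.
        if current is None or event[ts] >= current[ts]:
            latest[event[key]] = event
    return list(latest.values())


binlog = [
    {"order_id": 1, "status": "created", "update_time": 100},
    {"order_id": 1, "status": "paid", "update_time": 130},
    {"order_id": 1, "status": "created", "update_time": 100},   # duplicate delivery
    {"order_id": 2, "status": "created", "update_time": 110},
    {"order_id": 1, "status": "accepted", "update_time": 120},  # arrives late
]
cleaned = latest_per_key(binlog)
```

In a streaming job the same logic would live in keyed state rather than a local dictionary, and the state would need an expiry policy; both are left out here.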

3.1.3 DIM (Dimension Layer)

Public dimension tables are built using dimension‑modeling principles, sourced from Flink‑processed ODS data and offline jobs. Storage engines include MySQL, HBase, and a proprietary KV store (fusion), chosen based on query volume and data size.

Naming rule example:

dim_trip_dri_base

3.1.4 DWM (Aggregate Layer) Construction

Aggregations are performed per business theme (e.g., PV, UV, order metrics). Stream SQL produces 1‑minute aggregates, which are further accumulated to hourly or daily granularity. ClickHouse is used for high‑performance OLAP storage.

Naming rule example:

realtime_dwm_trip_trd_pas_bus_accum_1min
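A minimal sketch of the two‑step aggregation described in 3.1.4, written in plain Python rather than Stream SQL. The event shape (epoch seconds, user ID) and the function names are assumptions made for illustration. The point it shows is that PV can simply be summed when 1‑minute results are rolled up, while UV cannot: distinct users must be merged (here with a set union) before counting.

```python
from collections import defaultdict


def minute_aggregates(events):
    """Group (epoch_seconds, user_id) page views into 1-minute buckets."""
    buckets = defaultdict(lambda: {"pv": 0, "users": set()})
    for ts, user_id in events:
        minute_start = ts - ts % 60
        buckets[minute_start]["pv"] += 1
        buckets[minute_start]["users"].add(user_id)
    return dict(buckets)


def roll_up_to_hour(minute_buckets):
    """Accumulate 1-minute buckets into hourly PV and UV."""
    hours = defaultdict(lambda: {"pv": 0, "users": set()})
    for minute_start, agg in minute_buckets.items():
        hour_start = minute_start - minute_start % 3600
        hours[hour_start]["pv"] += agg["pv"]        # PV is additive
        hours[hour_start]["users"] |= agg["users"]  # UV needs a union, not a sum
    return {h: {"pv": a["pv"], "uv": len(a["users"])} for h, a in hours.items()}


events = [(0, "u1"), (30, "u2"), (65, "u1"), (3599, "u3"), (3600, "u1")]
per_minute = minute_aggregates(events)
per_hour = roll_up_to_hour(per_minute)
```

Keeping exact user sets is only practical at small scale; a production pipeline would more likely carry a mergeable sketch or bitmap per minute, which is outside what the article describes.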

3.1.5 APP (Application Layer)

Aggregated data are written to downstream systems such as Druid (for dashboards), HBase (for real‑time services), and MySQL/Redis (for product services), enabling low‑latency consumption.
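The naming examples in 3.1.1 through 3.1.4 suggest that a name's prefix identifies its layer. The sketch below checks that in Python. The patterns are inferred only from the four example names in this article, so the real rules may well be stricter, and LAYER_PATTERNS and layer_of are hypothetical names.

```python
import re
from typing import Optional

# Prefix patterns inferred from the article's example names only (an assumption).
LAYER_PATTERNS = {
    "ods": re.compile(r"^realtime_ods_[a-z0-9_]+$"),
    "dwd": re.compile(r"^realtime_dwd_[a-z0-9_]+$"),
    "dim": re.compile(r"^dim_[a-z0-9_]+$"),
    "dwm": re.compile(r"^realtime_dwm_[a-z0-9_]+$"),
}


def layer_of(name: str) -> Optional[str]:
    """Return the layer a table or topic name appears to belong to, or None."""
    for layer, pattern in LAYER_PATTERNS.items():
        if pattern.match(name):
            return layer
    return None
```

A check like this could run when a table or topic is registered, so that names which fit no layer are rejected early.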

3.2 Kuaishou Real‑Time Warehouse Scenario

Goals include keeping real‑time metrics within 1% of offline metrics and ensuring end‑to‑end latency under 5 minutes, even during large‑scale events. Challenges involve trillion‑level daily traffic, complex component dependencies, and thousands of core jobs.

Solution highlights:

Minute‑level deduplication on DID (device ID) plus dimension, which keeps state small.

Timestamp‑based deduplication with zero tolerance for out‑of‑order data, which is only correct when events arrive in timestamp order.

Ring‑buffer approach allowing 16 minutes of out‑of‑order tolerance.
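One way to picture how minute‑level deduplication and a 16‑minute ring buffer could fit together is sketched below. It is an illustrative assumption, not Kuaishou's implementation: the class name, the slot layout, and the choice to drop events older than the window are all hypothetical. Each slot holds the (device ID, dimension) keys seen in one minute. A slot is cleared and reused once its minute falls outside the 16 most recent minutes.

```python
class MinuteRingDeduper:
    """Per-minute deduplication over a fixed ring of minute slots (a sketch)."""

    def __init__(self, slots: int = 16):
        self.slots = slots
        self.minutes = [None] * slots              # minute currently held by each slot
        self.seen = [set() for _ in range(slots)]  # keys seen in that minute
        self.max_minute = None                     # newest minute observed so far

    def is_first_seen(self, event_ts: int, device_id: str, dimension: str) -> bool:
        minute = event_ts // 60
        if self.max_minute is None or minute > self.max_minute:
            self.max_minute = minute
        if minute <= self.max_minute - self.slots:
            return False  # older than the tolerated window: dropped in this sketch
        idx = minute % self.slots
        if self.minutes[idx] != minute:
            # The slot still holds an expired minute (or nothing): reuse it.
            self.minutes[idx] = minute
            self.seen[idx] = set()
        key = (device_id, dimension)
        if key in self.seen[idx]:
            return False
        self.seen[idx].add(key)
        return True


deduper = MinuteRingDeduper()
results = [
    deduper.is_first_seen(0, "d1", "page_a"),        # first event in minute 0
    deduper.is_first_seen(10, "d1", "page_a"),       # duplicate within minute 0
    deduper.is_first_seen(70, "d1", "page_a"),       # same device, minute 1
    deduper.is_first_seen(16 * 60, "d2", "page_a"),  # minute 16 moves the window forward
    deduper.is_first_seen(20, "d3", "page_a"),       # minute 0 is now outside the window
    deduper.is_first_seen(65, "d1", "page_a"),       # late, in window, already seen
    deduper.is_first_seen(65, "d9", "page_a"),       # late, in window, new device
]
```

State is bounded by the number of slots times the keys per minute, which is the property that makes this kind of structure attractive at very high traffic.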

3.3 Tencent Kankan Real‑Time Warehouse Case

Due to massive daily event volume (tens of trillions) and messy reporting formats, Tencent built a Lambda‑style architecture with Flink as the real‑time engine and ClickHouse as the storage engine, providing high‑throughput, low‑latency OLAP queries.

Key components:

Real‑time computation engine: Flink (exactly‑once semantics, checkpointing).

Real‑time storage engine: ClickHouse (MPP, columnar, supports high‑concurrency writes).

Application layer: APIs for consumer‑facing (C‑side) queries, OLAP dashboards, and data services.

Optimization techniques include minute‑level window aggregation, Redis caching before HBase lookups, and filtering non‑existent content IDs to avoid cache penetration.
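The cache‑before‑store lookup and the filtering of non‑existent content IDs can be sketched as follows. Plain dictionaries and a set stand in for Redis, HBase, and the ID filter, so every name here is an assumption rather than the team's actual code. The sketch shows that an unknown ID is rejected before it reaches either the cache or the backing store, which is what prevents cache penetration.

```python
class ContentLookup:
    """Cache-aside lookup with a guard against non-existent IDs (a sketch)."""

    def __init__(self, backing_store: dict, valid_ids: set):
        self.backing_store = backing_store  # stands in for HBase
        self.valid_ids = valid_ids          # stands in for a filter of known content IDs
        self.cache = {}                     # stands in for Redis
        self.backend_reads = 0              # counts reads that reach the backing store

    def get(self, content_id):
        if content_id not in self.valid_ids:
            return None  # rejected before touching the cache or the store
        if content_id in self.cache:
            return self.cache[content_id]
        self.backend_reads += 1
        value = self.backing_store.get(content_id)
        self.cache[content_id] = value
        return value


lookup = ContentLookup({"c1": {"title": "a"}}, {"c1"})
first = lookup.get("c1")        # miss: read from the store, then cached
second = lookup.get("c1")       # hit: served from the cache
missing = lookup.get("ghost")   # unknown ID: filtered out
```

In practice the set of valid IDs would often be a Bloom filter or similar compact structure, and cached entries would carry an expiry time; neither detail is stated in the article.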

3.4 Youzan Real‑Time Warehouse Case

Youzan follows the classic offline layered design (ODS, DWS, DIM, DWA, APP) but simplifies layers for real‑time needs. Naming conventions are strictly defined, e.g., deptname.appname.ods_subject_tablename for ODS tables and deptname.appname.dws_subject_tablename_eventA for DWS tables.

3.5 Tencent Full‑Scenario Real‑Time Warehouse

Traditional Lambda architectures suffer from duplicated pipelines and high operational cost. Tencent proposes a Flink + Iceberg solution that unifies batch and streaming, enabling near‑real‑time visibility via Iceberg snapshots, supporting both streaming reads and writes, and allowing OLAP optimizations such as predicate push‑down.

Advantages of replacing Kafka with Iceberg:

Unified stream‑batch storage.

OLAP queries directly on intermediate layers.

Efficient back‑tracking.

Lower storage cost.

Remaining drawbacks are slightly higher latency (near‑real‑time instead of sub‑second) and the need for additional integration work.

4. Common Guarantees and Operations

Quality, timeliness, and stability are ensured through multi‑layer monitoring, benchmark‑driven offline validation, checkpoint‑based recovery, and dual‑datacenter hot‑/cold‑backup strategies.
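Validating real‑time output against the offline benchmark can be as simple as a relative‑difference check. The sketch below reuses the 1% tolerance from the Kuaishou goal in 3.2 as its default; the metric names and function names are hypothetical.

```python
def relative_diff(realtime: float, offline: float) -> float:
    """Relative difference of a real-time metric against its offline benchmark."""
    if offline == 0:
        return 0.0 if realtime == 0 else float("inf")
    return abs(realtime - offline) / abs(offline)


def failing_metrics(realtime: dict, offline: dict, tolerance: float = 0.01) -> list:
    """Names of metrics whose real-time value deviates by more than the tolerance."""
    return sorted(
        name for name, base in offline.items()
        if relative_diff(realtime.get(name, 0.0), base) > tolerance
    )


offline_metrics = {"pv": 1000, "uv": 500}
realtime_metrics = {"pv": 1005, "uv": 480}
failed = failing_metrics(realtime_metrics, offline_metrics)
```

Here pv is off by 0.5% and passes, while uv is off by 4% and is reported. A metric missing from the real‑time side is treated as zero and therefore fails.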

5. Conclusion

Real‑time data warehouses bridge the gap between massive data ingestion and low‑latency business insights. By carefully designing layer abstractions, naming conventions, storage choices, and fault‑tolerance mechanisms, enterprises can achieve high‑performance, scalable, and reliable real‑time analytics.
