
How Real‑Time Data Warehouses Power Modern Business: Architecture, Cases, and Best Practices

This article explores the growing demand for real‑time data warehouses, compares them with traditional offline warehouses, and presents detailed architectures, layer designs, naming conventions, and case studies from companies like Didi, Kuaishou, Tencent, and Youzan, highlighting challenges, solutions, and performance optimizations.

dbaplus Community

Mar 2, 2022


Real‑time data warehouses have become essential as businesses increasingly need up‑to‑the‑second data for decision‑making, which the T+1 latency of traditional offline warehouses cannot provide.

Why Real‑Time Warehouses?

Business needs for immediate data are intensifying.

Streaming frameworks (Storm, Spark Streaming, Flink) have matured to the point where real‑time jobs can be developed in SQL and can reuse the design principles of offline warehouses.

Goals of Real‑Time Warehouses

Address the low timeliness of offline warehouses, improve data usability, and reduce resource waste by standardising real‑time data pipelines.

Support rapid business decisions.

Eliminate unstandardised real‑time data sources.

Leverage mature development platforms to lower costs.

Typical Application Scenarios

Real‑time OLAP analysis.

Live dashboards.

Real‑time business monitoring.

Real‑time data‑service APIs.

Architecture Overview

The architecture follows a layered model:

ODS (Source Layer): Ingest raw logs (binlog, event, traffic) into Kafka.

DWD (Detail Layer): Clean, de‑duplicate, and enrich data; store in Kafka and optionally Druid.

DIM (Dimension Layer): Store dimension tables in MySQL, HBase, or ClickHouse‑based KV stores.

DWM (Summary Layer): Perform multi‑dimensional aggregation using Flink SQL, early‑fire windows, and cumulate windows.

APP Layer: Write aggregated results to downstream stores (Druid, MySQL, Redis, ClickHouse) for dashboards, APIs, and analytics.
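The flow through these layers can be sketched in a few lines of plain Python. This is only an in-memory illustration of the ODS to DWD to DWM hand-off, not a Flink job: the field names (`event_id`, `city_id`, `amount`), the `DIM_CITY` table, and the function names are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical dimension table (DIM layer): city_id -> city name.
DIM_CITY = {1: "Beijing", 2: "Shanghai"}


def to_dwd(ods_events):
    """DWD step: drop malformed rows, de-duplicate by event_id, enrich from DIM."""
    seen = set()  # in a real stream this state would need a TTL or external store
    for event in ods_events:
        if event.get("event_id") is None or event.get("city_id") not in DIM_CITY:
            continue  # cleaning: skip malformed rows and unknown dimension keys
        if event["event_id"] in seen:
            continue  # de-duplication: the same event was delivered twice
        seen.add(event["event_id"])
        yield {**event, "city": DIM_CITY[event["city_id"]]}


def to_dwm(dwd_rows):
    """DWM step: aggregate the order amount per city."""
    totals = defaultdict(float)
    for row in dwd_rows:
        totals[row["city"]] += row["amount"]
    return dict(totals)


# ODS: raw events as they might arrive from a message queue.
ods = [
    {"event_id": "e1", "city_id": 1, "amount": 10.0},
    {"event_id": "e1", "city_id": 1, "amount": 10.0},  # duplicate delivery
    {"event_id": "e2", "city_id": 2, "amount": 5.0},
    {"event_id": None, "city_id": 1, "amount": 3.0},   # malformed row
    {"event_id": "e3", "city_id": 1, "amount": 2.5},
]

# APP: the aggregated result that would be written to a serving store.
app_result = to_dwm(to_dwd(ods))
print(app_result)
```

In a production pipeline each step would be a separate streaming job reading from and writing to Kafka topics, so that several downstream consumers can share the same DWD output.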

Case Study: Didi Ride‑Sharing

Didi built a real‑time warehouse for its ride‑sharing product, achieving unified DWD layers, reduced data duplication, and lower resource consumption. Key differences from offline warehouses include fewer layers, omission of the application layer within the warehouse, and a focus on real‑time freshness.

Key naming convention:

realtime_dwd_{biz}_{domain}_{process}_{tag}

Example:

realtime_dwd_trip_trd_order_base
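A naming convention like this is easy to enforce with a small helper. The sketch below builds and parses DWD table names; it assumes that each segment is lowercase alphanumeric with no underscores (otherwise a name could not be split unambiguously), and the helper names `build_dwd_name` and `parse_dwd_name` are hypothetical. The mapping of the example to segments (`trip`, `trd`, `order`, `base`) is one reading of the pattern, not something the source spells out.

```python
import re

# Assumption: every segment is lowercase letters/digits without underscores.
NAME_PATTERN = re.compile(
    r"^realtime_dwd_(?P<biz>[a-z0-9]+)_(?P<domain>[a-z0-9]+)"
    r"_(?P<process>[a-z0-9]+)_(?P<tag>[a-z0-9]+)$"
)


def build_dwd_name(biz, domain, process, tag):
    """Assemble a DWD table name and reject segments that break the pattern."""
    name = f"realtime_dwd_{biz}_{domain}_{process}_{tag}"
    if NAME_PATTERN.match(name) is None:
        raise ValueError(f"invalid DWD table name: {name!r}")
    return name


def parse_dwd_name(name):
    """Split a DWD table name back into its four segments."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a DWD table name: {name!r}")
    return match.groupdict()


print(build_dwd_name("trip", "trd", "order", "base"))
print(parse_dwd_name("realtime_dwd_trip_trd_order_base"))
```

A check like this could run in code review or in the table-creation tooling so that nonconforming names never reach the warehouse.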

Case Study: Kuaishou

Kuaishou faced trillion‑level daily traffic, requiring sub‑5‑minute latency and high stability. Solutions included:

Early‑fire Flink SQL with DID bucketing.

Cumulate windows to handle out‑of‑order data.

Three deduplication strategies for DAU/UV calculations, balancing state size and tolerance for disorder.
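To make the cumulate-window idea concrete, here is a small batch-style sketch of cumulative UV counting with device-ID bucketing. It is not Flink code and not Kuaishou's implementation: the function names, the use of CRC32 for bucketing, and the assumption that the window size is a multiple of the step are all choices made for illustration. Because events are assigned by event time, the order in which they arrive does not change the result.

```python
import zlib
from collections import defaultdict


def bucket_of(device_id, num_buckets):
    """Deterministically assign a device to a bucket (assumed hashing scheme)."""
    return zlib.crc32(device_id.encode("utf-8")) % num_buckets


def cumulative_uv(events, window_size, step, num_buckets=4):
    """Count distinct devices per cumulate window.

    events: iterable of (event_time_seconds, device_id), in any order.
    Returns {(window_start, window_end): distinct_device_count}, where the
    windows share a start and their end grows by `step` up to `window_size`.
    Windows that contain no device are omitted.
    """
    # (window_start, bucket) -> {device_id: earliest event time}
    first_seen = defaultdict(dict)
    for ts, device_id in events:
        start = ts - ts % window_size
        state = first_seen[(start, bucket_of(device_id, num_buckets))]
        if device_id not in state or ts < state[device_id]:
            state[device_id] = ts

    result = defaultdict(int)
    for (start, _bucket), devices in first_seen.items():
        for ts in devices.values():
            # A device counts in every window that ends after its first event.
            first_end = start + ((ts - start) // step + 1) * step
            for end in range(first_end, start + window_size + 1, step):
                result[(start, end)] += 1
    return dict(result)


# Out-of-order input: device "b" at t=25 arrives before device "a" at t=5.
events = [(25, "b"), (5, "a"), (45, "c"), (12, "a"), (65, "a")]
uv = cumulative_uv(events, window_size=60, step=20)
print(uv)
```

Each device lands in exactly one bucket, so per-bucket distinct counts can be summed without double counting; in a distributed job this spreads the deduplication state across parallel tasks.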

Case Study: Tencent Kandian

Tencent adopted a Lambda architecture with Flink as the real‑time engine, ClickHouse for storage, and Redis caching for fast dimension lookups. Optimisations reduced data‑processing latency from hours to seconds and cut resource usage by up to 98% for downstream applications.
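The source does not describe how the dimension cache is implemented, so the following is only a generic cache-aside sketch of the idea: look in a fast cache first and fall back to the slower dimension store on a miss. A plain dict stands in for Redis, and the class name, the `load_from_store` callback, and the sample keys are hypothetical.

```python
class CachedDimensionLookup:
    """Cache-aside lookup: check the cache, then fall back to the backing store."""

    def __init__(self, load_from_store, cache=None):
        self._load = load_from_store          # slow path, e.g. a database query
        self._cache = cache if cache is not None else {}  # dict stands in for Redis
        self.store_reads = 0                  # how often the slow path was taken

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.store_reads += 1
        value = self._load(key)
        self._cache[key] = value              # populate the cache for later lookups
        return value


# Hypothetical dimension data: content id -> category.
dimension_store = {"v1": "sports", "v2": "news"}
lookup = CachedDimensionLookup(dimension_store.get)

first = lookup.get("v1")   # miss: reads the backing store
second = lookup.get("v1")  # hit: served from the cache
print(first, second, lookup.store_reads)
```

A real deployment would also need an expiry or invalidation policy so that updated dimension rows eventually replace cached values.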

Case Study: Youzan

Youzan designed a streamlined real‑time warehouse with ODS, DWS, DIM, DWA, and APP layers, using Kafka for ingestion, Flink for ETL, and ClickHouse for high‑performance OLAP queries. Naming conventions were standardised for each layer to ensure consistency.

Key Technical Practices

Real‑time ETL: Use Flink SQL for stream cleaning, dimension enrichment, and windowed aggregation.

Idempotence: Store processed keys in KV stores to avoid duplicate counting after task restarts.

Data Validation: Apply sampling (persisting streams to TiDB) and full‑volume validation (syncing HBase to Hive) to ensure accuracy.

Recovery: Follow a strict bug‑fix workflow with state replay and data back‑fill.
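The idempotence practice can be illustrated with a minimal sketch. A dict stands in for the external KV store, and the key layout (`seen:` and `total:` prefixes), the class name, and the sample metric are assumptions for the example. The point is that both the processed-key markers and the running totals live outside the task, so a restarted task that replays old events does not count them twice.

```python
class IdempotentCounter:
    """Accumulate a metric while skipping events that were already processed."""

    def __init__(self, kv):
        # kv stands in for an external KV store that survives task restarts.
        self.kv = kv

    def process(self, event_id, metric, value):
        seen_key = f"seen:{event_id}"
        if seen_key in self.kv:
            return False  # replayed event: already counted, ignore it
        total_key = f"total:{metric}"
        # Note: a real store would need these two writes to be atomic.
        self.kv[total_key] = self.kv.get(total_key, 0) + value
        self.kv[seen_key] = True
        return True


kv_store = {}
task = IdempotentCounter(kv_store)
task.process("e1", "gmv", 10)
task.process("e2", "gmv", 5)

# Simulate a restart: a new task instance replays e2 and then sees a new event.
restarted = IdempotentCounter(kv_store)
replayed = restarted.process("e2", "gmv", 5)
fresh = restarted.process("e3", "gmv", 1)
print(kv_store["total:gmv"], replayed, fresh)
```

The trade-off is one extra KV read per event and a growing set of processed keys, which in practice would be bounded with a TTL matched to the maximum replay window.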

Architectural Evolution

Traditional Lambda architectures separate batch and stream pipelines, leading to duplicated logic and consistency issues. Kappa architectures simplify this by running everything on a single streaming platform, but they suffer from ordering problems and limited OLAP capabilities.

Flink + Iceberg offers a hybrid solution: Iceberg provides near‑real‑time visibility via commit‑based snapshots, enabling both streaming reads and writes while supporting incremental queries, small‑file compaction, and cost‑effective storage on HDFS/S3.

Benefits of Iceberg over Kafka include unified batch/stream processing, native OLAP optimisation (predicate push‑down), efficient back‑tracking, and lower storage costs, though latency shifts from true real‑time to near‑real‑time.
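The commit-based visibility and incremental reads described here can be illustrated with a toy in-memory model. This is deliberately not the Iceberg API: the `SnapshotTable` class and its methods are hypothetical and only mimic the behaviour that rows become visible when a commit publishes a new snapshot, and that a reader can ask for just the rows added after a snapshot it has already consumed.

```python
class SnapshotTable:
    """Toy model of a table whose data becomes visible only on commit."""

    def __init__(self):
        self._snapshots = []  # each entry holds the rows added by one commit
        self._pending = []    # rows written but not yet committed

    def write(self, row):
        self._pending.append(row)

    def commit(self):
        """Publish pending rows as a new snapshot and return its id (1-based)."""
        self._snapshots.append(tuple(self._pending))
        self._pending = []
        return len(self._snapshots)

    def scan(self, snapshot_id=None):
        """Full read as of a snapshot (latest if omitted)."""
        end = len(self._snapshots) if snapshot_id is None else snapshot_id
        return [row for batch in self._snapshots[:end] for row in batch]

    def incremental_scan(self, after_snapshot_id, to_snapshot_id=None):
        """Rows committed after one snapshot, up to another (latest if omitted)."""
        end = len(self._snapshots) if to_snapshot_id is None else to_snapshot_id
        return [row for batch in self._snapshots[after_snapshot_id:end] for row in batch]


table = SnapshotTable()
table.write("order-1")
table.write("order-2")
before_commit = table.scan()      # uncommitted rows are not visible
s1 = table.commit()

table.write("order-3")
s2 = table.commit()

new_rows = table.incremental_scan(s1)  # only what arrived after snapshot s1
as_of_s1 = table.scan(s1)              # time travel to the first snapshot
print(before_commit, new_rows, as_of_s1)
```

The commit interval is what turns true real-time into near-real-time: data is only as fresh as the most recent snapshot.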

Future Directions

Integrating Alluxio caching with Iceberg aims to achieve sub‑second query latency for data‑lake analytics, further bridging the gap between real‑time and analytical workloads.


