How Youzan Built a Real‑Time Data Warehouse: Architecture, ETL, and Use Cases
This article details Youzan's end‑to‑end experience building a real‑time data warehouse, covering background, application scenarios, layered architecture, streaming ETL techniques, data validation, recovery processes, and three production projects that illustrate the solution in action.
Background
Increasing demand for real‑time analytics and the limitations of traditional offline warehouses (T+1 latency) drove Youzan to develop a real‑time data warehouse capable of sub‑second data freshness. The evolution of streaming frameworks (Storm → Spark Streaming → Flink) and the maturity of Youzan's online data development platform made SQL‑based real‑time development feasible.
Application Scenarios
Real‑time BI dashboards (Druid, MySQL)
Real‑time OLAP (Druid, ClickHouse)
Unified real‑time data services (OneServices) supporting MySQL, Kylin, Druid, ClickHouse, TiDB
Real‑time monitoring and alerting
Real‑time recommendation via user behavior features
Solution Design
Layered Architecture
The design mirrors offline warehouse layers but is streamlined for latency (a source‑table sketch follows the list):
ODS (Operational Data Store) : Ingests raw data from NSQ messages, Binlog, and application logs into Kafka. Naming convention: deptname.appname.ods_subjectname_tablename or deptname.appname.ods_subjectname_msgname.
DWS (Detail Layer) : Fine‑grained fact tables built per business event (order, payment, etc.) using FlinkSQL. Naming: deptname.appname.dws_subjectname_tablename_eventname.
DIM (Dimension Layer) : Stores dimension tables, primarily in HBase. Naming: appname_dim_tablename.
DWA (Aggregation Layer) : Multi‑dimensional aggregates for downstream consumption. Two approaches: FlinkSQL‑based aggregation (fixed granularity) or real‑time OLAP tools (flexible granularity). Naming: appname_dwa_subjectname_aggname_daily.
APP (Application Layer) : Final data exported to BI tools (Druid), services (MySQL, ClickHouse), or other applications.
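To make the layer and naming conventions concrete, here is a minimal sketch of registering an ODS Kafka topic as a Flink SQL source table (recent Flink DDL syntax; the topic, columns, and connector options are illustrative assumptions, not Youzan's actual configuration):
-- hypothetical ODS source following the deptname.appname.ods_* convention
create table ods_trade_order (
    order_no     string,
    buyer_id     bigint,
    pay_amount   decimal(18, 2),
    created_time timestamp(3)
) with (
    'connector' = 'kafka',
    'topic' = 'trade.order.ods_trade_order',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);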
Real‑time ETL
Key techniques include:
Dimension Completion : Enrich streams with a Dubbo‑based UDF where the dimension data is only reachable through a service call, falling back to ordinary table joins when it is available as a table (a lookup‑join sketch follows the UDF example). Example code:
create function call_dubbo as 'XXXXXXX';
create function get_json_object as 'XXXXXXX';
case
    when cast(b.column as bigint) is not null then cast(b.column as bigint)
    else cast(coalesce(
        cast(get_json_object(
            call_dubbo('clusterUrl', 'serviceName', 'methodName',
                cast(concat('[', cast(a.column as varchar), ']') as varchar), 'key'),
            'rootId') as bigint),
        a.column) as bigint)
end
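Where the dimension data already sits in a joinable store such as HBase, the service call can be skipped in favor of a lookup join. A minimal sketch in recent Flink SQL syntax, with all table and column names assumed for illustration:
-- dim_user: hypothetical HBase-backed dimension table;
-- dws_order is assumed to expose a processing-time attribute proctime
select o.order_no,
       o.buyer_id,
       u.user_level
from dws_order as o
join dim_user for system_time as of o.proctime as u
    on o.buyer_id = u.rowkey;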
Idempotent Processing : Store processed message keys in a KV store; skip already‑consumed messages on restart. Example code:
create function idempotenc as 'XXXXXXX';
insert into table
select order_no
from (
select a.orderNo as order_no,
idempotenc('XXXXXXX', coalesce(a.orderNo, '')) as rid
from table1 a
) t
where t.rid = 0;
Data Validation
Two validation methods address the unbounded nature of streaming data:
Sampling Validation : Persist Kafka messages to TiDB, run time‑bounded queries, and compare the results against offline or source data (a query sketch follows this list).
Full‑volume Validation : Sync real‑time dimension tables from online HBase to offline HBase, import to Hive, and use platform validation tools to ensure exact consistency.
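As an illustration of the sampling approach, a time‑bounded query against the TiDB copy might look like the sketch below (table and column names are assumptions); the same window is then computed on the offline or source side and the two results are compared:
-- hypothetical TiDB table holding the persisted Kafka messages
select count(*)        as msg_cnt,
       sum(pay_amount) as pay_amount
from rt_order_message
where created_time >= '2021-01-01 00:00:00'
  and created_time <  '2021-01-01 01:00:00';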
Data Recovery
When a real‑time job fails because of a code bug, abnormal data, or a cluster issue, a predefined recovery workflow restores service by replaying data from the last checkpoint. The original article illustrates this process with a flow diagram (omitted here).
Project Applications
1. Video‑Live‑Room Real‑time Statistics
Metrics such as payment orders, amounts, and new customers are aggregated per live‑room and product. A hybrid "real‑time + offline" model uses ClickHouse partitions (realtime_yyyymmdd, offline_yyyymmdd) to keep the full‑volume view up to date. The latest full volume is assembled from partitions as follows (a ClickHouse sketch follows):
LatestFullData = T_day_real_time + (T‑1)_day_offline (normal case during day T)
LatestFullData = T_day_real_time + (T+1)_day_real_time + (T‑1)_day_offline (after midnight, before day T's offline data has landed)
LatestFullData = (T+1)_day_real_time + T_day_offline (after day T's offline data has landed)
Dynamic customer‑type detection uses a real‑time dimension table in HBase to differentiate new vs. returning customers.
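A minimal ClickHouse sketch of this partitioning scheme, assuming a string partition key of the form realtime_yyyymmdd / offline_yyyymmdd (table, column, and partition names are illustrative assumptions):
-- table partitioned by a source-tagged date key
create table live_room_stats (
    part_key     String,
    live_room_id UInt64,
    pay_amount   Decimal(18, 2)
) engine = MergeTree
partition by part_key
order by live_room_id;
-- latest full volume during day T: today's real-time partition
-- plus yesterday's offline partition
select live_room_id, sum(pay_amount) as pay_amount
from live_room_stats
where part_key in ('realtime_20210102', 'offline_20210101')
group by live_room_id;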
2. Salesperson Real‑time Analysis
Key indicators (new salespeople, sales amount, commissions) are stored in MySQL for low‑latency access via OneServices APIs.
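A hypothetical MySQL layout for these indicators (all names are assumptions, not Youzan's actual schema); OneServices would then serve point lookups keyed by shop, salesperson, and date:
create table app_salesman_stats (
    shop_id      bigint         not null,
    salesman_id  bigint         not null,
    cur_date     date           not null,
    sales_amount decimal(18, 2) not null default 0,
    commission   decimal(18, 2) not null default 0,
    primary key (shop_id, salesman_id, cur_date)
);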
3. Consumer‑Robot Service Real‑time Statistics
Metrics include problem inflow, robot sessions, and robot‑to‑human handoffs. Data is collected via Binlog, joined using FlinkSQL Interval JOIN, and de‑duplicated with Druid's hyperUnique algorithm to balance accuracy and performance (a Druid sketch follows the join example).
-- outer select reconstructed for completeness; columns are illustrative
select a.key, a.column, a.created_time
from (
    select key, column, created_time from table1
) a
inner join (
    select key, created_time from table2
) b
on a.key = b.key
where a.created_time between b.created_time and b.created_time + INTERVAL '1' DAY
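For the de‑duplication step, an approximate distinct count in Druid SQL might look like the sketch below (datasource and column names are assumptions); APPROX_COUNT_DISTINCT resolves to an approximate sketch aggregator in the hyperUnique/HLL family rather than an exact distinct count:
select time_floor(__time, 'PT1H')        as hour_bucket,
       approx_count_distinct(session_id) as robot_sessions
from robot_service_stats
group by 1;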
Future Outlook
Expand real‑time warehouse domain coverage to support more business scenarios.
Establish a value‑assessment framework to quantify ROI of real‑time capabilities.
Enhance the online platform with better lineage tracking, simplified parameter configuration, and local debugging tools.
Youzan Coder
Official Youzan tech channel, sharing technical insights and updates from the Youzan tech team.
