How Youzan Built a Real‑Time Data Warehouse: Architecture, ETL, and Use Cases
This article details Youzan's end‑to‑end experience building a real‑time data warehouse, covering background, application scenarios, layered architecture, streaming ETL techniques, data validation, recovery processes, and three production projects that illustrate the solution in action.
Background
Increasing demand for real‑time analytics and the limitations of traditional offline warehouses (T+1 latency) drove Youzan to develop a real‑time data warehouse capable of sub‑second data freshness. The evolution of streaming frameworks (Storm → Spark Streaming → Flink) and the maturity of Youzan's online data development platform made SQL‑based real‑time development feasible.
Application Scenarios
Real‑time BI dashboards (Druid, MySQL)
Real‑time OLAP (Druid, ClickHouse)
Unified real‑time data services (OneServices) supporting MySQL, Kylin, Druid, ClickHouse, TiDB
Real‑time monitoring and alerting
Real‑time recommendation via user behavior features
Solution Design
Layered Architecture
The design mirrors offline warehouse layers but is streamlined for latency (a source‑table sketch follows the list):
ODS (Operational Data Store) : Ingests raw data from NSQ messages, Binlog, and application logs into Kafka. Naming convention: deptname.appname.ods_subjectname_tablename or deptname.appname.ods_subjectname_msgname.
DWS (Detail Layer) : Fine‑grained fact tables built per business event (order, payment, etc.) using FlinkSQL. Naming: deptname.appname.dws_subjectname_tablename_eventname.
DIM (Dimension Layer) : Stores dimension tables, primarily in HBase. Naming: appname_dim_tablename.
DWA (Aggregation Layer) : Multi‑dimensional aggregates for downstream consumption. Two approaches: FlinkSQL‑based aggregation (fixed granularity) or real‑time OLAP tools (flexible granularity). Naming: appname_dwa_subjectname_aggname_daily.
APP (Application Layer) : Final data exported to BI tools (Druid), services (MySQL, ClickHouse), or other applications.
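To make the layer and naming conventions concrete, here is a minimal sketch of registering an ODS Kafka topic as a Flink SQL source table (recent Flink DDL syntax; the topic, columns, and connector options are illustrative assumptions, not Youzan's actual configuration):
-- hypothetical ODS source following the deptname.appname.ods_* convention
create table ods_trade_order (
    order_no     string,
    buyer_id     bigint,
    pay_amount   decimal(18, 2),
    created_time timestamp(3)
) with (
    'connector' = 'kafka',
    'topic' = 'trade.order.ods_trade_order',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);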
Real‑time ETL
Key techniques include:
Dimension Completion : Enrich streams with a Dubbo‑based UDF where the dimension data is only reachable through a service call, falling back to ordinary table joins when it is available as a table (a lookup‑join sketch follows the UDF example). Example code:
create function call_dubbo as 'XXXXXXX';
create function get_json_object as 'XXXXXXX';
case
    when cast(b.column as bigint) is not null then cast(b.column as bigint)
    else cast(coalesce(
        cast(get_json_object(
            call_dubbo('clusterUrl', 'serviceName', 'methodName',
                cast(concat('[', cast(a.column as varchar), ']') as varchar), 'key'),
            'rootId') as bigint),
        a.column) as bigint)
end
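Where the dimension data already sits in a joinable store such as HBase, the service call can be skipped in favor of a lookup join. A minimal sketch in recent Flink SQL syntax, with all table and column names assumed for illustration:
-- dim_user: hypothetical HBase-backed dimension table;
-- dws_order is assumed to expose a processing-time attribute proctime
select o.order_no,
       o.buyer_id,
       u.user_level
from dws_order as o
join dim_user for system_time as of o.proctime as u
    on o.buyer_id = u.rowkey;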
Idempotent Processing : Store processed message keys in a KV store; skip already‑consumed messages on restart. Example code:
create function idempotenc as 'XXXXXXX';
insert into table
select order_no
from (
select a.orderNo as order_no,
idempotenc('XXXXXXX', coalesce(a.orderNo, '')) as rid
from table1 a
) t
where t.rid = 0;
Data Validation
Two validation methods address the unbounded nature of streaming data:
Sampling Validation : Persist Kafka messages to TiDB, run time‑bounded queries, and compare the results against offline or source data (a query sketch follows this list).
Full‑volume Validation : Sync real‑time dimension tables from online HBase to offline HBase, import to Hive, and use platform validation tools to ensure exact consistency.
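As an illustration of the sampling approach, a time‑bounded query against the TiDB copy might look like the sketch below (table and column names are assumptions); the same window is then computed on the offline or source side and the two results are compared:
-- hypothetical TiDB table holding the persisted Kafka messages
select count(*)        as msg_cnt,
       sum(pay_amount) as pay_amount
from rt_order_message
where created_time >= '2021-01-01 00:00:00'
  and created_time <  '2021-01-01 01:00:00';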
Data Recovery
When a real‑time job fails because of a code bug, abnormal data, or a cluster issue, a predefined recovery workflow restores service by replaying data from the last checkpoint. The original article illustrates this process with a flow diagram (omitted here).
Project Applications
1. Video‑Live‑Room Real‑time Statistics
Metrics such as payment orders, amounts, and new customers are aggregated per live‑room and product. A hybrid "real‑time + offline" model uses ClickHouse partitions (realtime_yyyymmdd, offline_yyyymmdd) to keep the full‑volume view up to date. The latest full volume is assembled from partitions as follows (a ClickHouse sketch follows):
LatestFullData = T_day_real_time + (T‑1)_day_offline (normal case during day T)
LatestFullData = T_day_real_time + (T+1)_day_real_time + (T‑1)_day_offline (after midnight, before day T's offline data has landed)
LatestFullData = (T+1)_day_real_time + T_day_offline (after day T's offline data has landed)
Dynamic customer‑type detection uses a real‑time dimension table in HBase to differentiate new vs. returning customers.
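A minimal ClickHouse sketch of this partitioning scheme, assuming a string partition key of the form realtime_yyyymmdd / offline_yyyymmdd (table, column, and partition names are illustrative assumptions):
-- table partitioned by a source-tagged date key
create table live_room_stats (
    part_key     String,
    live_room_id UInt64,
    pay_amount   Decimal(18, 2)
) engine = MergeTree
partition by part_key
order by live_room_id;
-- latest full volume during day T: today's real-time partition
-- plus yesterday's offline partition
select live_room_id, sum(pay_amount) as pay_amount
from live_room_stats
where part_key in ('realtime_20210102', 'offline_20210101')
group by live_room_id;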
2. Salesperson Real‑time Analysis
Key indicators (new salespeople, sales amount, commissions) are stored in MySQL for low‑latency access via OneServices APIs.
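A hypothetical MySQL layout for these indicators (all names are assumptions, not Youzan's actual schema); OneServices would then serve point lookups keyed by shop, salesperson, and date:
create table app_salesman_stats (
    shop_id      bigint         not null,
    salesman_id  bigint         not null,
    cur_date     date           not null,
    sales_amount decimal(18, 2) not null default 0,
    commission   decimal(18, 2) not null default 0,
    primary key (shop_id, salesman_id, cur_date)
);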
3. Consumer‑Robot Service Real‑time Statistics
Metrics include problem inflow, robot sessions, and robot‑to‑human handoffs. Data is collected via Binlog, joined using FlinkSQL Interval JOIN, and de‑duplicated with Druid's hyperUnique algorithm to balance accuracy and performance (a Druid sketch follows the join example).
-- outer select reconstructed for completeness; columns are illustrative
select a.key, a.column, a.created_time
from (
    select key, column, created_time from table1
) a
inner join (
    select key, created_time from table2
) b
on a.key = b.key
where a.created_time between b.created_time and b.created_time + INTERVAL '1' DAY
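For the de‑duplication step, an approximate distinct count in Druid SQL might look like the sketch below (datasource and column names are assumptions); APPROX_COUNT_DISTINCT resolves to an approximate sketch aggregator in the hyperUnique/HLL family rather than an exact distinct count:
select time_floor(__time, 'PT1H')        as hour_bucket,
       approx_count_distinct(session_id) as robot_sessions
from robot_service_stats
group by 1;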
Future Outlook
Expand real‑time warehouse domain coverage to support more business scenarios.
Establish a value‑assessment framework to quantify ROI of real‑time capabilities.
Enhance the online platform with better lineage tracking, simplified parameter configuration, and local debugging tools.
Youzan Coder
Official Youzan tech channel, sharing technical insights and updates from the Youzan tech team.
