Big Data 19 min read

How Zhihu Built a Real-Time Data Warehouse: From Spark Streaming to Flink

This article details Zhihu's evolution of its real-time data warehouse, covering the 1.0 version built on Spark Streaming, the 2.0 upgrade using Flink Streaming SQL, architectural layers, ETL processes, and future directions such as streaming SQL platformization and automated result validation.

dbaplus Community

Feb 28, 2019

How Zhihu Built a Real-Time Data Warehouse: From Spark Streaming to Flink

Real‑Time Warehouse 1.0 – Spark Streaming

The first version implements real‑time ETL of traffic logs using a Lambda architecture. Data flow: SDKs → Log Collector Server → Kafka → Spark Streaming (Direct mode) → Druid (real‑time and offline) → Web front‑end.

Streaming ETL : at‑least‑once delivery; downstream deduplication provides global exactly‑once. Spark Streaming was chosen over Storm for higher throughput.

Common ETL logic uses a shared ProtoBuf schema:

message LogEntry {
  optional BaseInfo   base   = 1;
  optional DetailInfo detail = 2;
  optional ExtraInfo  extra  = 3;
}

BaseInfo – user, client, timestamp, network.

DetailInfo – view hierarchy.

ExtraInfo – business‑specific fields.

Typical streaming scenarios :

Dynamic configuration via broadcast variables with TTL, allowing metadata updates without job restart.

UTM parameter parsing to attribute traffic sources, campaigns and social shares.

New‑old user identification using a two‑level cache (ThreadLocal → Redis) backed by HBase, achieving ~260k events/s with < 1 % HBase reads.

Stability Practices for Spark Streaming

Consume Kafka in Direct mode to avoid checkpoint‑related failures.

Allocate sufficient YARN resources to prevent executor loss.

Throttle Kafka consumption via spark.streaming.kafka.maxRatePerPartition (or equivalent) to protect the broker.

Run the driver under a supervisor (e.g., systemd, supervisord) to auto‑restart on failure.

Batch ETL (Lambda Architecture)

Batch jobs complement streaming by landing raw data, repairing lost records, and bulk loading into Druid.

Batch Loader : custom MapReduce job writes Kafka ProtoBuf data to HDFS with deduplication, supports partitioned paths (p_date/p_hour/...), replay and self‑dependency management.

Repair ETL : on‑demand offline job fixes missing or erroneous streaming data, reusing the same ETL library.

Druid ingestion : Tranquility streams data with fixed time windows; offline MapReduce jobs re‑import missed windows.

Real‑Time Warehouse 2.0 – Flink Streaming SQL

Version 2.0 adds a three‑layer data model (raw, detail, summary) and processes both traffic logs and MySQL binlog changes. Automatic traffic splitting per business reduces Kafka output by an order of magnitude.

Detail layer : unified streaming ETL for traffic and binlog, reusing the same ProtoBuf schema.

Summary layer : aggregates metrics per content and user dimensions; stores high‑frequency appends in HBase and high‑OPS counters in Redis.

Computation engine : Flink replaces Spark Streaming, offering lower latency, exactly‑once semantics, Streaming SQL, stateful processing and CEP.

Key Architectural Improvements

Business‑level binlog ingestion at the raw layer; each database or instance writes to a dedicated Kafka topic.

Streaming Proxy joins traffic with metadata to automatically route each event to a business‑specific Kafka topic, enabling independent consumption.

Metric aggregation performed in Flink SQL; results persisted to HBase (append‑only) or Redis (incremental updates).

Future Directions

Streaming SQL platformization : replace Maven‑packaged JAR submissions with SQL‑file jobs to lower development overhead.

Metadata management : build a centralized system for real‑time warehouse metadata to reduce operational cost.

Automated result validation : compare streaming outputs with offline Hive/SQL results to verify correctness automatically.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Flink ETL Real-Time Data Warehouse Spark Streaming Lambda architecture streaming SQL

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.