How Zhihu Built a Real-Time Data Warehouse: From Spark Streaming to Flink
This article details Zhihu's evolution of its real-time data warehouse, covering the 1.0 version built on Spark Streaming, the 2.0 upgrade using Flink Streaming SQL, architectural layers, ETL processes, and future directions such as streaming SQL platformization and automated result validation.
Real‑Time Warehouse 1.0 – Spark Streaming
The first version implements real‑time ETL of traffic logs using a Lambda architecture. Data flow: SDKs → Log Collector Server → Kafka → Spark Streaming (Direct mode) → Druid (real‑time and offline) → Web front‑end.
Streaming ETL : at‑least‑once delivery; downstream deduplication provides global exactly‑once. Spark Streaming was chosen over Storm for higher throughput.
Common ETL logic uses a shared ProtoBuf schema:
message LogEntry {
optional BaseInfo base = 1;
optional DetailInfo detail = 2;
optional ExtraInfo extra = 3;
}BaseInfo – user, client, timestamp, network.
DetailInfo – view hierarchy.
ExtraInfo – business‑specific fields.
Typical streaming scenarios :
Dynamic configuration via broadcast variables with TTL, allowing metadata updates without job restart.
UTM parameter parsing to attribute traffic sources, campaigns and social shares.
New‑old user identification using a two‑level cache (ThreadLocal → Redis) backed by HBase, achieving ~260k events/s with < 1 % HBase reads.
Stability Practices for Spark Streaming
Consume Kafka in Direct mode to avoid checkpoint‑related failures.
Allocate sufficient YARN resources to prevent executor loss.
Throttle Kafka consumption via spark.streaming.kafka.maxRatePerPartition (or equivalent) to protect the broker.
Run the driver under a supervisor (e.g., systemd, supervisord) to auto‑restart on failure.
Batch ETL (Lambda Architecture)
Batch jobs complement streaming by landing raw data, repairing lost records, and bulk loading into Druid.
Batch Loader : custom MapReduce job writes Kafka ProtoBuf data to HDFS with deduplication, supports partitioned paths (p_date/p_hour/...), replay and self‑dependency management.
Repair ETL : on‑demand offline job fixes missing or erroneous streaming data, reusing the same ETL library.
Druid ingestion : Tranquility streams data with fixed time windows; offline MapReduce jobs re‑import missed windows.
Real‑Time Warehouse 2.0 – Flink Streaming SQL
Version 2.0 adds a three‑layer data model (raw, detail, summary) and processes both traffic logs and MySQL binlog changes. Automatic traffic splitting per business reduces Kafka output by an order of magnitude.
Detail layer : unified streaming ETL for traffic and binlog, reusing the same ProtoBuf schema.
Summary layer : aggregates metrics per content and user dimensions; stores high‑frequency appends in HBase and high‑OPS counters in Redis.
Computation engine : Flink replaces Spark Streaming, offering lower latency, exactly‑once semantics, Streaming SQL, stateful processing and CEP.
Key Architectural Improvements
Business‑level binlog ingestion at the raw layer; each database or instance writes to a dedicated Kafka topic.
Streaming Proxy joins traffic with metadata to automatically route each event to a business‑specific Kafka topic, enabling independent consumption.
Metric aggregation performed in Flink SQL; results persisted to HBase (append‑only) or Redis (incremental updates).
Future Directions
Streaming SQL platformization : replace Maven‑packaged JAR submissions with SQL‑file jobs to lower development overhead.
Metadata management : build a centralized system for real‑time warehouse metadata to reduce operational cost.
Automated result validation : compare streaming outputs with offline Hive/SQL results to verify correctness automatically.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
