Evolution of Zhihu's Real-Time Data Warehouse: From Spark Streaming 1.0 to Flink‑Based 2.0

This article details Zhihu's real‑time data warehouse evolution, describing the 1.0 Spark Streaming architecture, its limitations, and the 2.0 redesign that introduces Flink, layered data models, streaming and batch ETL, metric storage choices, and future roadmap for scalable, low‑latency analytics.

Big Data Technology & Architecture

Sep 11, 2019

Zhihu's data engineering team shares the evolution of its real‑time data warehouse, emphasizing the importance of timely data feedback for product decisions and the need for a robust data‑intelligence pipeline.

Real‑time Warehouse 1.0 used Spark Streaming (micro‑batch) for ETL and processed traffic logs in a lambda architecture with separate streaming and batch paths. It had three parts: data collection (SDKs writing to Kafka), ETL (streaming and batch jobs loading into Druid), and visualization through a web front end.

Within the lambda architecture, ETL was split into Streaming ETL and Batch ETL. Streaming ETL ran the real‑time ETL logic, while Batch ETL provided exactly‑once results and repaired data that the streaming path had lost or processed incorrectly.

Streaming ETL

The team chose Spark Streaming as the streaming framework because, at the time, it offered higher throughput and a richer ecosystem than Storm, which suited Zhihu's log volume and latency requirements.

Data correctness relied on at‑least‑once delivery in Spark Streaming plus downstream deduplication, which together approximate exactly‑once results.
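One way to picture the downstream deduplication: if every log entry carries a unique identifier, a consumer can drop any entry whose identifier it has already seen. The sketch below assumes a hypothetical `log_id` field and an in‑memory set; the source does not say which key or store Zhihu actually used.

```python
from typing import Dict, Iterable, Iterator, Set


def deduplicate(entries: Iterable[Dict], seen: Set[str]) -> Iterator[Dict]:
    """Yield each entry once, keyed on the hypothetical `log_id` field."""
    for entry in entries:
        log_id = entry["log_id"]
        if log_id in seen:
            continue  # replayed by the at-least-once source, so drop it
        seen.add(log_id)
        yield entry


# A replayed batch: entry "a" is delivered twice.
batch = [
    {"log_id": "a", "page": "/question/1"},
    {"log_id": "b", "page": "/question/2"},
    {"log_id": "a", "page": "/question/1"},
]
seen_ids: Set[str] = set()
unique = list(deduplicate(batch, seen_ids))
```

In a real pipeline the set of seen identifiers would have to be bounded and shared across workers, for example by keeping it in an external store with an expiry.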

Common ETL logic relied on a shared Protocol Buffers schema:

message LogEntry {
  optional BaseInfo base = 1;
  optional DetailInfo detail = 2;
  optional ExtraInfo extra = 3;
}

Key reusable ETL components included dynamic streaming configuration, UTM parameter parsing, and new/old user identification, which used a two‑level cache (thread‑local, then Redis) in front of HBase.
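The tiered lookup for new/old user identification can be sketched as follows. This is only an illustration of the lookup order: plain Python sets stand in for the thread‑local cache, Redis, and HBase, and the class and method names are hypothetical.

```python
from typing import Set


class UserLookup:
    """Check a local cache, then a Redis stand-in, then an HBase stand-in."""

    def __init__(self, redis: Set[str], hbase: Set[str]) -> None:
        self.local: Set[str] = set()  # stand-in for a thread-local cache
        self.redis = redis            # stand-in for a Redis key set
        self.hbase = hbase            # stand-in for the HBase user table
        self.hbase_reads = 0          # counts lookups that reached HBase

    def is_new_user(self, user_id: str) -> bool:
        if user_id in self.local:
            return False
        if user_id in self.redis:
            self.local.add(user_id)
            return False
        self.hbase_reads += 1
        known = user_id in self.hbase
        # Record the user in every tier so the next lookup stops early.
        self.hbase.add(user_id)
        self.redis.add(user_id)
        self.local.add(user_id)
        return not known


lookup = UserLookup(redis=set(), hbase={"u1"})
first_seen = [lookup.is_new_user(u) for u in ["u1", "u1", "u2", "u2"]]
```

Only the first lookup of each user reaches the slowest tier; repeat lookups are answered from the caches.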

Spark Streaming Stability Practices

Prefer Direct Kafka consumption over Receiver mode to avoid checkpoint‑related instability.

Ensure sufficient YARN resources to prevent executor loss.

Throttle Kafka consumption to avoid overwhelming the upstream cluster.

Use a supervisor process to automatically restart failed drivers.
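The last practice, restarting a failed driver automatically, can be sketched as a small restart loop. This is a minimal illustration rather than Zhihu's tooling: the `supervise` function, the restart limit, and the `streaming_etl.py` script name are all assumptions, and the runner is injectable so the loop can be exercised without launching a real job.

```python
import subprocess
import time
from typing import Callable, List


def supervise(
    command: List[str],
    max_restarts: int,
    delay_seconds: float = 0.0,
    run: Callable[[List[str]], int] = subprocess.call,
) -> int:
    """Run `command` and restart it after every non-zero exit.

    Returns the number of restarts performed. Stops when the command exits
    cleanly or when `max_restarts` has been used up.
    """
    restarts = 0
    while True:
        exit_code = run(command)
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay_seconds)


# Simulated driver that fails twice and then exits cleanly.
exit_codes = iter([1, 1, 0])
restarts_used = supervise(
    ["spark-submit", "streaming_etl.py"],  # hypothetical job script
    max_restarts=5,
    run=lambda cmd: next(exit_codes),
)
```

With the default `run`, the command is executed through `subprocess.call`, which blocks until the process exits and returns its exit code.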

Batch ETL

Batch ETL covered data landing, offline ETL, and bulk loading into Druid. Custom MapReduce jobs (Batch Loader) landed protobuf data on HDFS and deduplicated it, while Repair ETL corrected streaming data that had been lost or processed incorrectly.

Shortcomings of 1.0

All traffic data shared a single Kafka topic, so each downstream consumer had to read the full stream and Kafka's outbound traffic grew with every consumer added.

Druid handled all metric calculations, leading to scalability issues as data grew.

Business lines had no data isolation or cost accounting.

Real‑time Warehouse 2.0

The 2.0 redesign adds a raw layer for traffic logs and MySQL binlog streams, a detail layer produced by unified streaming ETL, and a summary layer split into detail summaries and metric summaries.

Streaming Proxy automatically splits traffic by business using metadata joins, feeding each business its own Kafka topic.
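The splitting step can be pictured as a lookup from each record to its business line, followed by a write to that business's topic. The sketch below is an assumption‑laden illustration: the `app_id` field, the `traffic.<business>` topic naming, and the fallback topic are hypothetical, and a dict of lists stands in for Kafka.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

DEFAULT_TOPIC = "traffic.unassigned"  # hypothetical fallback topic


def route_by_business(
    records: Iterable[Dict],
    app_to_business: Dict[str, str],
) -> Dict[str, List[Dict]]:
    """Group records by target topic using a metadata lookup."""
    topics: Dict[str, List[Dict]] = defaultdict(list)
    for record in records:
        business = app_to_business.get(record["app_id"])
        topic = f"traffic.{business}" if business else DEFAULT_TOPIC
        topics[topic].append(record)
    return dict(topics)


# Hypothetical metadata: which business line owns which app.
metadata = {"app-1": "answers", "app-2": "video"}
records = [
    {"app_id": "app-1", "event": "view"},
    {"app_id": "app-2", "event": "play"},
    {"app_id": "app-9", "event": "view"},  # not in the metadata
    {"app_id": "app-1", "event": "click"},
]
routed = route_by_business(records, metadata)
```

Each business then consumes only its own topic instead of filtering the full traffic stream.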

Metric aggregation now uses Flink Streaming SQL, which offers low latency, exactly‑once semantics, and rich windowing. Metrics that are updated at high frequency are stored in Redis; metrics that are mostly appended and read infrequently are stored in HBase.
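To make the windowing idea concrete, the sketch below computes what a tumbling‑window page‑view count would produce. It is plain Python, not Flink: it ignores state management, late data, and watermarks, and the event shape (timestamp in seconds, page name) is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]],
    window_seconds: int,
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, page) for fixed-size windows."""
    counts: Counter = Counter()
    for timestamp, page in events:
        # Each timestamp falls into exactly one non-overlapping window.
        window_start = timestamp - timestamp % window_seconds
        counts[(window_start, page)] += 1
    return dict(counts)


events = [(0, "home"), (59, "home"), (60, "home"), (61, "feed"), (125, "home")]
page_views = tumbling_window_counts(events, window_seconds=60)
```

Each resulting (window, page) pair corresponds to one metric value that would then be written to the metric store.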

Integration with visualization and recommendation systems is achieved through a unified data source creation workflow involving HBase tables, metric columns, and dimension tables.

Future Outlook

Turn Streaming SQL into a platform so that jobs are submitted as SQL files instead of JARs.

Systematize real‑time metadata management.

Automate real‑time result validation by comparing streaming outputs with offline benchmarks.
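The validation item can be illustrated with a small comparison routine that checks streaming metrics against offline values. This is a sketch under stated assumptions: the metric names, the 1% relative tolerance, and the `find_mismatches` function are hypothetical and do not come from the source.

```python
from typing import Dict, List


def find_mismatches(
    streaming: Dict[str, float],
    offline: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return metrics that are missing on one side or differ too much."""
    mismatched = []
    for metric in sorted(set(streaming) | set(offline)):
        if metric not in streaming or metric not in offline:
            mismatched.append(metric)
            continue
        expected = offline[metric]
        allowed = abs(expected) * tolerance  # relative tolerance
        if abs(streaming[metric] - expected) > allowed:
            mismatched.append(metric)
    return mismatched


streaming_metrics = {"pv": 1005, "uv": 400, "clicks": 90}
offline_metrics = {"pv": 1000, "uv": 450, "shares": 7}
mismatches = find_mismatches(streaming_metrics, offline_metrics)
```

Here the offline result is treated as the benchmark, so the tolerance is measured relative to the offline value.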

Overall, the transition from 1.0 to 2.0 demonstrates deeper architectural maturity, broader technology adoption, and a roadmap toward more scalable, maintainable real‑time analytics.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
