Real-Time Data Warehouse Practices: Case Studies from Meituan, NetEase, Zhihu, and OPPO
This article reviews the evolution of data warehouses from traditional offline models to modern real‑time architectures, presents detailed case studies of Meituan, NetEase, Zhihu, and OPPO, and discusses layer designs, technology choices such as Flink, Kafka, and storage options, as well as key lessons for building scalable real‑time warehouses.
The data warehouse concept dates back decades. Traditional warehouses evolved into offline warehouses built on Hive/HDFS and, more recently, into real‑time warehouses powered by streaming frameworks such as Flink, Storm, and Spark Streaming.
Four representative real‑time warehouse implementations are highlighted:
Meituan: A Flink‑based real‑time platform that collects data from binlog, service logs, and IoT devices into Kafka, stores state in HDFS and HBase, and provides job configuration, publishing, and monitoring functions across collection, storage, engine, platform, and application layers.
NetEase (Yanxuan): An architecture that follows the classic ODS‑DWD‑DWS‑DM layering, using Kafka for ODS and DWD, Redis for dimension data, HBase for high‑concurrency queries, and MySQL/Greenplum for aggregated metrics, with Flink handling real‑time ETL and aggregation.
Zhihu: An evolution from version 1.0 (Spark Streaming, with Druid for metrics) to version 2.0 (Flink Streaming, SQL‑based processing). The upgrade addresses Kafka traffic overload, Druid stability, and the lack of data isolation, and introduces Streaming SQL, metadata management, and automated result verification.
OPPO: A smooth migration from offline to real‑time warehouses, using Kafka as the central message bus, Flink SQL for cleaning and aggregation, and a layered design (ODS → DWD → ADS) that ultimately feeds downstream systems such as Elasticsearch, MySQL, and Hive.
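The layered flow that these cases share can be illustrated with a minimal Python sketch. It is not taken from any of the four companies: the record fields, the `SKU_DIM` table, and the `to_dwd` / `to_dws` helpers are hypothetical, a plain dict stands in for a Redis dimension lookup, and a list stands in for a Kafka topic. In a real deployment these steps would typically be Flink jobs.

```python
from collections import defaultdict

# Hypothetical dimension table; a dict stands in for a Redis lookup.
SKU_DIM = {"sku-1": "books", "sku-2": "toys"}


def to_dwd(ods_records, dim=SKU_DIM):
    """ODS -> DWD: drop malformed records and attach the category dimension."""
    for rec in ods_records:
        # Cleaning step: skip records with missing keys or negative amounts.
        if rec.get("sku") is None or rec.get("amount") is None:
            continue
        if rec["amount"] < 0:
            continue
        yield {
            "order_id": rec["order_id"],
            "sku": rec["sku"],
            "amount": float(rec["amount"]),
            # Dimension join: unknown SKUs fall back to a default bucket.
            "category": dim.get(rec["sku"], "unknown"),
        }


def to_dws(dwd_records):
    """DWD -> DWS: aggregate detail records into a per-category total."""
    totals = defaultdict(float)
    for rec in dwd_records:
        totals[rec["category"]] += rec["amount"]
    return dict(totals)


# A list stands in for the raw ODS topic.
ods = [
    {"order_id": 1, "sku": "sku-1", "amount": 10},
    {"order_id": 2, "sku": "sku-2", "amount": 5},
    {"order_id": 3, "sku": None, "amount": 7},      # malformed, dropped
    {"order_id": 4, "sku": "sku-1", "amount": 2.5},
    {"order_id": 5, "sku": "sku-9", "amount": 1},   # SKU missing from dim
]

dws = to_dws(to_dwd(ods))
print(dws)
```

Keeping the cleaning and enrichment step separate from the aggregation step mirrors the reason for layering in the first place: the DWD output can be reused by several downstream aggregations without repeating the cleaning logic.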
Across these cases, common design principles emerge: a four‑layer data model (ODS, DWD, DWS, ADS/DM), Kafka as the real‑time storage backbone, and Flink as the preferred streaming engine because of its low latency, exactly‑once semantics, and rich SQL support.
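To make the exactly‑once point concrete, the sketch below shows the general idea behind checkpoint‑based recovery: the input offset and the operator state are snapshotted together, so that after a failure the job rewinds to the snapshot and replays, and no event is counted twice. This is a deliberately simplified single‑process model, not Flink's actual distributed snapshot mechanism; the `run` function and its `checkpoint_every` and `fail_at` parameters are hypothetical names chosen for the illustration.

```python
import copy


def run(events, checkpoint_every=2, fail_at=None):
    """Count events per key, snapshotting (offset, state) together."""
    state, offset = {}, 0
    checkpoint = (0, {})  # last consistent (offset, state) snapshot
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            # Simulated crash: discard in-memory progress and restore
            # both the offset and the state from the last snapshot.
            failed = True
            offset, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(state))
    return state


events = ["a", "b", "a", "c", "a"]
clean = run(events)                 # no failure
recovered = run(events, fail_at=3)  # crash before the fourth event
print(clean, recovered)
```

Because the state is rolled back along with the offset, the replayed events do not inflate the counts. If only the offset were rewound, the event between the snapshot and the crash would be counted twice, which is the at‑least‑once behavior.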
The article concludes that building a real‑time data warehouse requires careful layer planning, appropriate technology selection (Flink, Kafka, HBase, Redis, MySQL, Hive), and continuous iteration to balance latency, scalability, and operational stability.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
