How NetEase Cloud Music Built a Real‑Time Data Warehouse with Flink & Calcite
This article details NetEase Cloud Music's evolution of a real‑time data warehouse built on Flink 1.9 and Calcite, covering platform scale, architectural design, metadata management, SDK simplifications, monitoring improvements, and concrete use cases such as AB‑testing, live reporting, and feature serving.
Background
By 2020, NetEase Cloud Music's real‑time computing platform ran on more than 150 machines, hosted over 700 tasks, and handled a peak QPS of 4 million, with roughly 180 developers using the system. Launched in early 2018, the platform went through two major version upgrades, and its task count grew by nearly 200% by mid‑2020.
Limitations of the First Version (Flink 1.7)
The initial version was built on Flink 1.7, which lacked native SQL DDL support. To compensate, the team wrote a custom ANTLR‑based SQL grammar that added DDL and dimension‑table JOIN capabilities. However, the platform still lacked critical features such as data‑lineage tracking, metadata governance, and comprehensive task monitoring, which made troubleshooting difficult.
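A dimension‑table JOIN enriches each stream record with attributes looked up by key. The sketch below illustrates the idea in plain Python; it is not the platform's ANTLR grammar or any Flink API, and `DimTableCache`, `lookup_join`, and the sample fields are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


class DimTableCache:
    """Caches dimension rows so repeated keys do not hit the backing store."""

    def __init__(self, loader: Callable[[str], Optional[dict]]) -> None:
        self._loader = loader  # hypothetical: fetches one dimension row by key
        self._cache: Dict[str, Optional[dict]] = {}
        self.loads = 0         # number of reads against the backing store

    def get(self, key: str) -> Optional[dict]:
        if key not in self._cache:
            self._cache[key] = self._loader(key)
            self.loads += 1
        return self._cache[key]


def lookup_join(events: Iterable[dict], dim: DimTableCache, key: str) -> Iterator[dict]:
    """Left-joins each event with its dimension row; unmatched events pass through."""
    for event in events:
        row = dim.get(event[key]) or {}
        yield {**event, **row}


# Assumed sample data: a tiny song dimension and three play events.
songs = {"s1": {"artist": "A"}, "s2": {"artist": "B"}}
dim = DimTableCache(songs.get)
plays = [{"song_id": "s1"}, {"song_id": "s1"}, {"song_id": "s9"}]
joined = list(lookup_join(plays, dim, "song_id"))
```

A real job would also need cache expiry so that updated dimension rows are eventually picked up; that is left out here to keep the sketch short.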
Real‑Time Data Warehouse Construction (Flink 1.9)
The next generation, based on Flink 1.9, introduced several key enhancements:
Integration with a centralized metadata hub, allowing users to avoid manually defining data formats.
Provision of both SQL and a Java/Scala SDK for developers.
End‑to‑end data‑lineage collection.
Rich source‑ and task‑level monitoring, including MQ data‑volume metrics.
Architecture Overview
Jobs enter the system as SQL statements or SDK calls, which are parsed by the Planner. The Planner works with a Catalog that injects metadata from MetaHub, the metadata center. MetaHub manages all metadata and offers plug‑in modules for MQ metadata, unified data types, and metadata search.
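The Planner, Catalog, and MetaHub interaction can be pictured with a small sketch. Everything here is an assumption made for illustration: `MetaHub`, `Catalog`, `TableMeta`, and the table `music.ods.play_log` are hypothetical stand‑ins, not the platform's real classes or Flink's catalog interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TableMeta:
    columns: Tuple[Tuple[str, str], ...]  # (column name, type) pairs
    connector: str                        # storage behind the table, e.g. an MQ


class MetaHub:
    """Hypothetical in-memory stand-in for the metadata center."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableMeta] = {}

    def register(self, name: str, meta: TableMeta) -> None:
        self._tables[name] = meta

    def describe(self, name: str) -> TableMeta:
        return self._tables[name]  # raises KeyError for unknown tables


class Catalog:
    """Answers the planner's schema questions from MetaHub."""

    def __init__(self, hub: MetaHub) -> None:
        self._hub = hub

    def unknown_columns(self, table: str, columns: List[str]) -> List[str]:
        known = {name for name, _ in self._hub.describe(table).columns}
        return [c for c in columns if c not in known]


hub = MetaHub()
hub.register(
    "music.ods.play_log",
    TableMeta(columns=(("user_id", "BIGINT"), ("song_id", "BIGINT")), connector="kafka"),
)
catalog = Catalog(hub)
missing = catalog.unknown_columns("music.ods.play_log", ["user_id", "device"])
```

Because the schema comes from the hub, the query author never declares column types; a planner built this way can reject `device` before the job is submitted.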
Data Warehouse Layers
The warehouse design rests on three elements: a unified table naming convention (catalog.db.table), layered storage (offline → real‑time), and table‑level permission management. The real‑time warehouse mirrors the offline model, replicating tables to provide low‑latency access.
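A minimal sketch of how a `catalog.db.table` name and table‑level permissions might fit together follows. `TableName`, `TableAcl`, and the users and table shown are hypothetical; the article does not describe the platform's actual permission model.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class TableName:
    catalog: str
    db: str
    table: str

    @classmethod
    def parse(cls, qualified: str) -> "TableName":
        """Splits 'catalog.db.table' and rejects anything with missing parts."""
        parts = qualified.strip().split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected catalog.db.table, got {qualified!r}")
        return cls(*parts)


class TableAcl:
    """Hypothetical table-level read permissions keyed by the full table name."""

    def __init__(self) -> None:
        self._readers: Dict[TableName, Set[str]] = {}

    def grant_read(self, user: str, table: TableName) -> None:
        self._readers.setdefault(table, set()).add(user)

    def can_read(self, user: str, table: TableName) -> bool:
        return user in self._readers.get(table, set())


name = TableName.parse("music.dwd.play_log")
acl = TableAcl()
acl.grant_read("alice", name)
```

Keying permissions on the full three‑part name means the same check works for offline and real‑time tables alike.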
SDK Simplification
The SDK encapsulates internal SQL execution, exposing a concise API and automatic lineage capture. A real‑world demo reduced the implementation from over 190 lines of code to just a dozen lines, dramatically improving developer productivity.
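The article does not show the SDK itself, so the following is only a sketch of how a fluent wrapper could record lineage as a side effect of declaring sources and sinks. `StreamJob` and its methods are hypothetical names, and the SQL text is just carried along as a string, not executed.

```python
from typing import List, Tuple


class StreamJob:
    """Hypothetical fluent job builder that records lineage while the job is declared."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sources: List[str] = []
        self.sinks: List[str] = []
        self.statements: List[str] = []

    def read(self, table: str) -> "StreamJob":
        self.sources.append(table)
        return self

    def sql(self, statement: str) -> "StreamJob":
        self.statements.append(statement)
        return self

    def write(self, table: str) -> "StreamJob":
        self.sinks.append(table)
        return self

    def lineage(self) -> List[Tuple[str, str, str]]:
        """Returns (source table, job name, sink table) edges."""
        return [(src, self.name, dst) for src in self.sources for dst in self.sinks]


job = (
    StreamJob("play_count")
    .read("music.ods.play_log")
    .sql("SELECT song_id, COUNT(*) AS plays FROM play_log GROUP BY song_id")
    .write("music.dws.song_plays")
)
edges = job.lineage()
```

Since every table passes through `read` or `write`, lineage needs no extra work from the developer, which is the property the SDK description emphasizes.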
Monitoring Enhancements
Fine‑grained metrics are collected at the task level, and MQ data volumes are tracked. According to the contributor, robust cluster‑level monitoring becomes indispensable once the platform reaches a certain scale.
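As a rough illustration of task‑level MQ volume tracking, the sketch below counts records per task, topic, and time bucket. `VolumeMetrics`, the one‑minute bucket size, and the task and topic names are assumptions; the article does not describe the platform's metric schema.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class VolumeMetrics:
    """Counts records per (task, topic) in fixed time buckets."""

    def __init__(self, bucket_seconds: int = 60) -> None:
        self.bucket_seconds = bucket_seconds
        self._counts: DefaultDict[Tuple[str, str, int], int] = defaultdict(int)

    def record(self, task: str, topic: str, timestamp: float, n: int = 1) -> None:
        # Align the timestamp to the start of its bucket.
        bucket = int(timestamp // self.bucket_seconds) * self.bucket_seconds
        self._counts[(task, topic, bucket)] += n

    def volume(self, task: str, topic: str, bucket: int) -> int:
        return self._counts.get((task, topic, bucket), 0)

    def task_total(self, task: str) -> int:
        return sum(n for (t, _, _), n in self._counts.items() if t == task)


metrics = VolumeMetrics()
metrics.record("ab_test", "play_log", timestamp=5, n=10)
metrics.record("ab_test", "play_log", timestamp=59, n=5)
metrics.record("ab_test", "play_log", timestamp=61, n=1)
```

A sudden drop in one bucket compared with its neighbors is the kind of signal that helps separate an upstream MQ problem from a fault in the task itself.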
Practical Use Cases
AB‑Testing
In the previous pipeline, raw data was first stored in Hive and then cleaned and aggregated with Spark in batch. The new real‑time AB‑test pipeline writes the data to real‑time tables instead, eliminating the Hive + Spark batch process and delivering faster feedback and better resource utilization.
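The difference between batch and real‑time AB‑test metrics comes down to updating aggregates per event instead of recomputing them per batch. The sketch below shows that incremental style in plain Python; `AbTestAggregator`, the conversion metric, and the experiment names are hypothetical and not taken from the platform.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple

Key = Tuple[str, str]  # (experiment, group)


class AbTestAggregator:
    """Keeps running exposure and conversion counts per experiment group."""

    def __init__(self) -> None:
        self._exposures: DefaultDict[Key, int] = defaultdict(int)
        self._conversions: DefaultDict[Key, int] = defaultdict(int)

    def on_event(self, experiment: str, group: str, converted: bool) -> None:
        key = (experiment, group)
        self._exposures[key] += 1
        if converted:
            self._conversions[key] += 1

    def conversion_rate(self, experiment: str, group: str) -> float:
        key = (experiment, group)
        exposures = self._exposures.get(key, 0)
        if exposures == 0:
            return 0.0  # no traffic yet for this group
        return self._conversions.get(key, 0) / exposures


agg = AbTestAggregator()
agg.on_event("new_homepage", "A", converted=True)
agg.on_event("new_homepage", "A", converted=False)
agg.on_event("new_homepage", "B", converted=True)
```

Each event updates the result immediately, so a dashboard reading `conversion_rate` sees fresh numbers without waiting for a batch run.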
Real‑Time Reporting
Live dashboards, such as the real‑time playback count for NetEase Cloud Music live streams, are built on the warehouse. The streamlined task creation and clearer data‑issue tracing simplify operations.
Real‑Time Feature Serving
The platform supports feature reuse and feature lineage. An analysis of feature generation across algorithm teams found significant duplication, which wasted resources. Features are now layered, isolated by business domain, and searchable, so teams can discover and reuse existing features efficiently.
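One way to picture a layered, domain‑isolated, searchable feature catalog is the sketch below. `Feature`, `FeatureRegistry`, the layer labels, and the sample features are all hypothetical; the article does not describe the real registry.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Feature:
    name: str
    domain: str       # business domain that owns the feature
    layer: str        # warehouse layer label, assumed for illustration
    description: str


class FeatureRegistry:
    """Hypothetical catalog that refuses duplicates and supports keyword search."""

    def __init__(self) -> None:
        self._features: Dict[Tuple[str, str], Feature] = {}

    def register(self, feature: Feature) -> bool:
        """Returns False when the domain already has a feature with this name."""
        key = (feature.domain, feature.name)
        if key in self._features:
            return False
        self._features[key] = feature
        return True

    def search(self, keyword: str, domain: Optional[str] = None) -> List[Feature]:
        kw = keyword.lower()
        return [
            f for f in self._features.values()
            if (domain is None or f.domain == domain)
            and (kw in f.name.lower() or kw in f.description.lower())
        ]


registry = FeatureRegistry()
first = registry.register(Feature("user_play_7d", "recommend", "dws", "Plays per user, last 7 days"))
second = registry.register(Feature("user_play_7d", "recommend", "dws", "Duplicate attempt"))
registry.register(Feature("room_viewers", "live", "dws", "Current viewers per live room"))
hits = registry.search("play")
```

Rejecting a second registration under the same domain and name nudges teams toward searching first, which is how duplication gets caught before it consumes compute.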
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
