How NetEase Cloud Music Scaled Its Data Warehouse for Billion‑User Traffic
This article details how NetEase Cloud Music redesigned its data warehouse and governance processes to support more than 100 million monthly active users. It covers the pain points, standardization, shared services, self‑service tools, and the resulting improvements in data quality, latency, and operational efficiency.
Warehouse Pain Points and Objectives
Rapid growth to over 100 million monthly active users (MAU) left business units fragmented, data definitions inconsistent, development costs high, and new services hard to integrate. The main objectives were to lower the data‑usage barrier, improve data utility, and enable data‑driven business growth.
Data consumers include:
Analysts – need consistent metrics, rich dimensions, and cross‑dimensional analysis.
Algorithm teams – require stable, near‑real‑time data and standardized APIs for model iteration.
Product operations – need self‑service access for A/B testing and rapid feature validation.
Standardization, Sharing, and Self‑service
Three coordinated initiatives were launched:
Standardization: define common data models, metric definitions, and naming conventions; enforce through a model design center.
Sharing: expose data via service‑oriented APIs (easyFetch) and rule‑based validation (easyTracker).
Self‑service: allow analysts and product teams to query assets directly through easyFetch.
Traffic Data Governance
Problem: chaotic event tracking with scattered definitions, missing documentation, and low data quality.
Solution workflow:
Pre‑development – abstract and standardize event definitions using an e‑commerce‑inspired three‑element schema (event name, key‑value pairs, resource‑ID naming).
During development – enforce an event review checkpoint.
Post‑deployment – conduct a gray‑scale event audit before full release.
The easyTracker system ingests event definitions, runs rule‑based validation, and produces clean DWD traffic tables.
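As a rough illustration of rule‑based validation over the three‑element schema, the sketch below checks an event's name, required key‑value pairs, and resource ID. It is not easyTracker's implementation: the `TrackedEvent` class, the `validate_event` function, and both naming patterns are assumptions made for this example, since the article does not publish the real rules.

```python
import re
from dataclasses import dataclass, field

# Hypothetical naming rules; the article does not publish the real ones.
EVENT_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")
RESOURCE_ID_RE = re.compile(r"^[a-z]+:[A-Za-z0-9_\-]+$")


@dataclass
class TrackedEvent:
    """One tracking event in three-element form: name, params, resource ID."""
    name: str
    params: dict = field(default_factory=dict)
    resource_id: str = ""


def validate_event(event, required_params):
    """Return a list of rule violations; an empty list means the event is clean."""
    problems = []
    if not EVENT_NAME_RE.match(event.name):
        problems.append(f"bad event name: {event.name!r}")
    missing = sorted(set(required_params) - set(event.params))
    if missing:
        problems.append(f"missing params: {missing}")
    if not RESOURCE_ID_RE.match(event.resource_id):
        problems.append(f"bad resource id: {event.resource_id!r}")
    return problems


# Illustrative events only; field names and values are made up.
good = TrackedEvent("song_play", {"source": "daily_recommend", "duration_ms": 183000}, "song:12345")
bad = TrackedEvent("SongPlay", {"duration_ms": 183000}, "12345")

print(validate_event(good, ["source", "duration_ms"]))
print(validate_event(bad, ["source", "duration_ms"]))
```

Returning a list of violations rather than raising on the first one lets a review checkpoint show every problem with an event in a single pass.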
Legacy versions (6.0/7.0) contained ~8,000 distinct events. Scripts reconciled and normalized them to ~3,000 standardized points, achieving >90 % coverage for traffic queries.
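The reconciliation step can be pictured as a lookup from legacy event names to standardized points, plus a coverage figure. This is only a sketch: the mapping entries, the event names, and the definition of coverage used here (share of logged events that resolve to a standardized point) are assumptions, because the article does not say how its >90 % figure was computed.

```python
from collections import Counter

# Hypothetical mapping from legacy event names to standardized tracking points.
LEGACY_TO_STANDARD = {
    "playSong": "song_play",
    "play_song_v2": "song_play",
    "clickLike": "song_like",
    "like_btn": "song_like",
}


def normalize(legacy_name):
    """Map a legacy event name to its standardized point, or None if unmapped."""
    return LEGACY_TO_STANDARD.get(legacy_name)


def coverage(log_events):
    """Fraction of logged events that resolve to a standardized point."""
    if not log_events:
        return 0.0
    mapped = sum(1 for name in log_events if normalize(name) is not None)
    return mapped / len(log_events)


# Made-up log sample: four mappable events and one unmapped legacy event.
log = ["playSong", "play_song_v2", "clickLike", "like_btn", "old_banner_tap"]
standard_counts = Counter(n for n in map(normalize, log) if n is not None)
print(standard_counts)
print(f"coverage: {coverage(log):.0%}")
```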
New events are captured weekly; a semi‑automated ETL extracts definitions from logs, validates them, and updates DWD, cutting traffic‑data processing time by ~4 hours per day.
Key metrics: Android event bug rate dropped from 9.10 % to 4.07 %; overall data delivery latency improved by three hours.
Data Asset Consolidation – “OneData” Model
Goal: eliminate data silos and provide a single, consistent definition per metric across all business units.
Architecture follows the industry‑standard ODS → DWD → DWS → ADS layers:
ODS: raw source tables and logs.
DWD: detailed fact tables preserving all granular attributes.
DWS: lightweight aggregations. Two primary domains: content (songs, videos, comments) and user behavior. Light aggregations retain key dimensions; heavy aggregations are built on top of them.
ADS: wide tables for analyst and product consumption.
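To make the light versus heavy aggregation distinction concrete, here is a small in‑memory sketch. The row fields (`user_id`, `song_id`, `platform`, `play_ms`) and function names are hypothetical, and a production warehouse would express this in SQL or Spark rather than plain Python. The point is only that the heavy rollup reads the light aggregation, not the detailed rows.

```python
from collections import defaultdict

# DWD: detailed fact rows (hypothetical fields), one per play event.
dwd_plays = [
    {"user_id": 1, "song_id": "s1", "platform": "android", "play_ms": 200_000},
    {"user_id": 1, "song_id": "s1", "platform": "android", "play_ms": 100_000},
    {"user_id": 2, "song_id": "s1", "platform": "ios", "play_ms": 50_000},
    {"user_id": 2, "song_id": "s2", "platform": "ios", "play_ms": 30_000},
]


def light_aggregate(rows):
    """DWS light aggregation: keep song and platform as dimensions."""
    out = defaultdict(lambda: {"plays": 0, "play_ms": 0})
    for r in rows:
        key = (r["song_id"], r["platform"])
        out[key]["plays"] += 1
        out[key]["play_ms"] += r["play_ms"]
    return dict(out)


def heavy_aggregate(light):
    """Heavier rollup built on the light aggregation: totals per song."""
    out = defaultdict(lambda: {"plays": 0, "play_ms": 0})
    for (song_id, _platform), metrics in light.items():
        out[song_id]["plays"] += metrics["plays"]
        out[song_id]["play_ms"] += metrics["play_ms"]
    return dict(out)


dws_light = light_aggregate(dwd_plays)
dws_heavy = heavy_aggregate(dws_light)
print(dws_light)
print(dws_heavy)
```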
Incremental vs. historical data handling:
Daily partitions (1‑day, 7‑day, 28‑day) are processed in parallel flows.
Cumulative tables are processed serially to avoid resource contention.
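One way to picture the split between parallel incremental flows and serial cumulative flows is the scheduling sketch below. The job functions are stand‑ins that only return a label, the table names and date are invented, and the article's workflow center is not shown; a real job would launch Spark instead.

```python
from concurrent.futures import ThreadPoolExecutor


def run_window(process_date, days):
    """Stand-in for one incremental job (for example a spark-submit call)."""
    return f"{process_date}:window={days}d"


def run_cumulative(process_date, table):
    """Stand-in for one cumulative job."""
    return f"{process_date}:cumulative={table}"


def run_daily(process_date, windows=(1, 7, 28),
              cumulative_tables=("user_total", "song_total")):
    # Incremental windows are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=len(windows)) as pool:
        incremental = list(pool.map(lambda d: run_window(process_date, d), windows))
    # Cumulative tables run one at a time to limit resource contention.
    cumulative = [run_cumulative(process_date, t) for t in cumulative_tables]
    return incremental, cumulative


# Hypothetical processing date and table names.
inc, cum = run_daily("2024-01-01")
print(inc)
print(cum)
```

`pool.map` returns results in input order, so the incremental results line up with the 1‑day, 7‑day, and 28‑day windows even though the jobs may finish in a different order.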
Model governance is enforced by the NetEase Data Model Center: every new model must be registered, reviewed, and pass automated validation (≈80 % rule coverage) before production deployment.
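A registration gate of this kind might look like the following sketch. Everything in it is an assumption made for illustration: the `ModelSubmission` fields, the layer‑prefix naming rule, and the three checks are not taken from the NetEase Data Model Center, whose actual rules the article does not list.

```python
from dataclasses import dataclass

LAYER_PREFIXES = ("ods_", "dwd_", "dws_", "ads_")  # assumed naming convention


@dataclass
class ModelSubmission:
    """A model registered for review (hypothetical fields)."""
    table_name: str
    owner: str
    reviewed: bool
    column_comments: dict  # column name -> description


def validation_errors(model):
    """Run the automated rules and return every violation found."""
    errors = []
    if not model.table_name.startswith(LAYER_PREFIXES):
        errors.append("table name lacks a layer prefix")
    if not model.owner:
        errors.append("owner is required")
    undocumented = sorted(c for c, desc in model.column_comments.items() if not desc)
    if undocumented:
        errors.append(f"undocumented columns: {undocumented}")
    return errors


def can_deploy(model):
    """A model ships only if it was reviewed and passes every automated rule."""
    return model.reviewed and not validation_errors(model)


ready = ModelSubmission("dws_user_play_1d", "data-team", True,
                        {"user_id": "user key", "plays": "daily play count"})
draft = ModelSubmission("tmp_plays", "", False, {"user_id": ""})

print(can_deploy(ready), validation_errors(ready))
print(can_deploy(draft), validation_errors(draft))
```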
Automated testing with easyTest checks null ratios, value ranges, and custom lineage scripts, reducing manual QA effort.
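The null‑ratio and value‑range checks mentioned above could be sketched as follows. This is not easyTest's API: the function names, the column names, the 25 % null threshold, and the one‑day upper bound on play duration are all assumptions chosen for the example.

```python
def null_ratio(rows, column):
    """Share of rows where the column is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows)


def out_of_range(rows, column, low, high):
    """Rows whose non-null value falls outside [low, high]."""
    return [r for r in rows
            if r.get(column) is not None and not (low <= r[column] <= high)]


def check_table(rows, max_null_ratio=0.25):
    """Run both checks and return a dict of failures (empty means pass)."""
    failures = {}
    ratio = null_ratio(rows, "user_id")
    if ratio > max_null_ratio:
        failures["user_id_null_ratio"] = ratio
    # Assumed bound: a single play cannot be negative or longer than one day.
    bad = out_of_range(rows, "play_ms", 0, 24 * 60 * 60 * 1000)
    if bad:
        failures["play_ms_out_of_range"] = len(bad)
    return failures


# Made-up sample: one null user_id and one negative duration.
sample = [
    {"user_id": 1, "play_ms": 180_000},
    {"user_id": None, "play_ms": 95_000},
    {"user_id": 3, "play_ms": -5},
    {"user_id": 4, "play_ms": 60_000},
]
print(check_table(sample))
print(check_table(sample, max_null_ratio=0.1))
```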
Social interaction data (likes, shares, comments, plays) are consolidated into a single fact table keyed by playback source, simplifying downstream attribution.
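With every interaction in one table that carries the playback source, attribution reduces to a single grouped count. The sketch below assumes a row layout (`action`, `play_source`, and so on) and source names that are invented for illustration; the article does not describe the table's actual columns.

```python
from collections import Counter

# Hypothetical rows of a unified interaction fact table.
interactions = [
    {"user_id": 1, "song_id": "s1", "action": "play", "play_source": "daily_recommend"},
    {"user_id": 1, "song_id": "s1", "action": "like", "play_source": "daily_recommend"},
    {"user_id": 2, "song_id": "s1", "action": "play", "play_source": "search"},
    {"user_id": 2, "song_id": "s1", "action": "share", "play_source": "search"},
    {"user_id": 3, "song_id": "s2", "action": "play", "play_source": "daily_recommend"},
]


def attribute(rows):
    """Count each action type per playback source."""
    counts = Counter()
    for r in rows:
        counts[(r["play_source"], r["action"])] += 1
    return counts


by_source = attribute(interactions)
print(by_source[("daily_recommend", "play")])
print(by_source[("search", "share")])
```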
Results: data production latency fell by three hours, metric definitions were unified, and easyFetch reached more than 400 active users (180 weekly active) running more than 7,000 queries per week.
Operational Practices
A workflow center isolates testing from production; only models that pass review are scheduled.
Example pseudo‑workflow for incremental tables:
# 1‑day flow (parallel)
spark-submit --class com.netease.etl.DailyJob \
--conf spark.sql.shuffle.partitions=200 \
--args --date ${process_date} --partition 1
# 7‑day flow (parallel)
spark-submit ... --args --date ${process_date} --partition 7
# Cumulative flow (serial)
spark-submit ... --args --date ${process_date} --mode cumulative
Future Directions
The next step is to shift from daily batch to minute‑level streaming to support algorithmic model training (recommendation, search). Real‑time stream processing will be co‑built with the data platform to provide low‑latency, high‑frequency data feeds.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
