
Reconstructing Cloud Music Membership Automation Data Warehouse: Practices, Standards, and Performance Optimizations

This article details the background, challenges, and step‑by‑step reconstruction of Cloud Music's membership automation data warehouse, covering modeling standards, task decoupling, Spark performance tuning, and data quality checks to achieve higher efficiency and reliability.

DataFunTalk

The Cloud Music data warehouse evolved from a chaotic early stage into a more structured system, yet many legacy tasks remain poorly maintained. This article uses the reconstruction of the membership automation operations model as a case study in practical data task refactoring.

Background: Automated membership operations package audiences, resource slots, and rules into various strategies to drive precise in‑app resource delivery, improving conversion, penetration, and renewal rates. Evaluation relies on numerous metrics (PV/UV, click counts, funnel conversions) across dimensions such as strategy, resource, position, audience, OS, and SKU type.

Problems: Early warehouse development lacked methodology, resulting in siloed, tightly coupled tables without layering, domain separation, or standards, causing stability and usability issues.

Reconstruction focuses on three aspects:

1. Standards: Adopt high cohesion and low coupling, define clear business domains (e.g., cloud‑music‑fact‑transaction‑revenue), and implement layered tables: dwd (detail), dws (light aggregation), ads (highly aggregated). Table naming follows the task name, and tasks are organized by business ownership and layer within the data platform.

2. Efficiency: Migrate all workflows from Hive to Spark; apply dynamic allocation, adjust CPU/memory ratios, enable broadcast joins, tune parallelism and repartitioning, enable Parquet conversion, optimize lateral view explode, control output file size, and leverage Spark 3 AQE. These changes yielded a 5× per-node speedup, a 3-hour reduction in overall runtime, and an 80% drop in storage usage.

3. Quality: Ensure data accuracy and consistency through validation (counts, distinct counts, null checks, range checks, min/max comparisons) and Data Quality Center (DQC) rules at table and column levels, with configurable actions on rule violations.
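The tuning knobs in point 2 map onto standard Spark configuration properties. A spark-defaults.conf-style sketch follows; the property names are real Spark settings, but the values are illustrative assumptions, not the values used in the project:

```
# Dynamic executor allocation
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.minExecutors             2
spark.dynamicAllocation.maxExecutors             50
# CPU/memory ratio per executor
spark.executor.cores                             4
spark.executor.memory                            8g
# Broadcast join threshold (64 MB here)
spark.sql.autoBroadcastJoinThreshold             67108864
# Shuffle parallelism
spark.sql.shuffle.partitions                     400
# Spark 3 adaptive query execution, incl. small-partition coalescing
spark.sql.adaptive.enabled                       true
spark.sql.adaptive.coalescePartitions.enabled    true
```

AQE in particular subsumes some manual repartition tuning, since it re-plans shuffle partitioning at runtime based on observed statistics.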
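The layering convention in point 1 (dwd/dws/ads prefixes) lends itself to automated enforcement. A minimal sketch in plain Python, assuming a hypothetical naming rule of layer prefix plus lowercase snake_case (this helper is illustrative, not part of the platform):

```python
import re

# Hypothetical naming check: layered tables carry a layer prefix
# (dwd = detail, dws = light aggregation, ads = highly aggregated).
LAYER_PREFIXES = ("dwd_", "dws_", "ads_")

def check_table_name(name: str) -> bool:
    """Return True if the table name starts with a known layer prefix
    and contains only lowercase letters, digits, and underscores."""
    return name.startswith(LAYER_PREFIXES) and re.fullmatch(r"[a-z0-9_]+", name) is not None

print(check_table_name("ads_act_vip_stgy_di"))  # True
print(check_table_name("TmpVipTable"))          # False
```

A check like this can run in code review or as a pre-deploy gate so that non-conforming table names never reach the scheduler.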
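The table- and column-level rules in point 3 can be sketched in plain Python. This is a hypothetical illustration of the rule-plus-action pattern, not the actual DQC Center API; the rule names, sample rows, and thresholds are invented:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    check: Callable[[list], bool]  # returns True when the rule passes

def run_rules(rows: list, rules: list, on_fail=print) -> bool:
    """Evaluate each rule against the rows; invoke the configurable
    action on_fail for every violation. Return True only if all pass."""
    ok = True
    for rule in rules:
        if not rule.check(rows):
            on_fail(f"DQC violation: {rule.name}")
            ok = False
    return ok

rows = [
    {"os": "ios", "vipbuy_amt": 12.0},
    {"os": "android", "vipbuy_amt": 8.0},
]
rules = [
    Rule("row count > 0", lambda r: len(r) > 0),
    Rule("no null os", lambda r: all(x["os"] is not None for x in r)),
    Rule("amount in range", lambda r: all(0 <= x["vipbuy_amt"] <= 10_000 for x in r)),
]
print(run_rules(rows, rules))  # True
```

In a real DQC setup the action on violation would be configurable per rule, e.g. alert only, or block the downstream task.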

Example validation query:

-- partition prefixes represent different optimization strategies
select dt,
    count(1) as c,
    count(distinct os) as c_os,
    count(distinct positionid) as c_pos,
    sum(vipbuy_amt) as s_amt,
    max(trigger_impress_cnt) as max_c,
    min(trigger_impress_cnt) as min_c
from music_new_dm.ads_act_vip_stgy_di
where dt like '%2021-06-09%'
group by 1
order by 1;
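During a refactor, the same summary query can be run against both the legacy and the rebuilt table and the metric values compared, per the consistency checks described above. A minimal Python sketch, with hypothetical metric values and a small tolerance for floating-point sums:

```python
def compare_metrics(old: dict, new: dict, tol: float = 1e-6) -> list:
    """Return the names of metrics that differ between the two runs."""
    diffs = []
    for key in old:
        if abs(old[key] - new[key]) > tol:
            diffs.append(key)
    return diffs

# Hypothetical results of the validation query on the legacy and
# rebuilt tables for one partition.
legacy  = {"c": 1000, "c_os": 2, "c_pos": 35, "s_amt": 52310.5}
rebuilt = {"c": 1000, "c_os": 2, "c_pos": 34, "s_amt": 52310.5}
print(compare_metrics(legacy, rebuilt))  # ['c_pos']
```

Any non-empty diff list flags a metric that needs investigation before the rebuilt table replaces the legacy one.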

Operational recommendations include using views or WITH clauses for small datasets, persisting temporary tables with lifecycle management for large datasets, avoiding temporary syntax the platform does not support, and standardizing naming conventions with dynamic date partitions.

Development follows a two‑mode approach (development and online) with thorough testing before release, followed by approval workflows and scheduling configurations (period, dependencies, concurrency). Alerts are set up for failures and delays.

The author, a senior data engineer at NetEase Cloud Music, concludes by thanking readers and inviting them to join the DataFunTalk community for further big‑data and AI discussions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Data Quality, Data Modeling, Data Warehouse, Spark Optimization
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
