
How NetEase Yanxuan Built a Robust Data Task Governance System in 2020

This article details NetEase Yanxuan's 2020 initiative to improve data task governance, describing the identified pain points, the pre-, in-, and post-stage framework for model, baseline, and incident handling, and the resulting products, processes, and future plans for a more reliable data warehouse.

Yanxuan Tech Team

Background

In 2020, NetEase Yanxuan identified several urgent improvement points in data task governance. At the beginning of the year, under the leadership of department heads, a team was formed with Hangzhou Research Data Science to co-build solutions. After a year of joint development, the project solved practical warehouse problems, produced several useful products and construction ideas, and earned the 2020 “Technology Sharing Co‑construction Award”.

What improvement points were identified?

Model: model design and development need more standards, procedures, and accumulated knowledge.

Task Operations: tasks should be produced on time, accurately, and stably; when problems occur, they should be quickly located, their impact assessed, and their handling assisted.

Alarm Optimization: reduce night-shift alarms caused by task errors and provide intervention measures.

Link Awareness: complex task lineage requires awareness of downstream impact, in order to reduce asset-loss incidents.

Testing Bottlenecks: there is no dedicated testing environment or test tooling; a testing environment and QA checkpoints are needed.

Rapid Recovery of Major Incidents: failures of core tasks or source data could take 1-2 days to recover; technical assistance is needed for faster resolution.

1 Pre‑stage – Model‑level safeguards

The first line of defense before model launch consists of five guarantees: process guarantee, model design guarantee, data quality guarantee, testing guarantee, and link‑awareness guarantee.

1.1 Process guarantee

Yanxuan’s data‑warehouse development follows a three‑stage process: requirement stage, development stage, and production stage.

Requirement stage: requirements come from business teams, analysts, and product managers; they are recorded in JIRA, clarified for background, content, value, and timeline, then reviewed and entered into the metric management system (Cangjie).

Development stage: after requirement review, developers design models, hold design-review meetings, and document the design in the Model Design Center; tasks are developed following conventions (one task per model, task name matching model name, separate sync tasks) and tested with tooling, and finally the requester accepts the result in JIRA.

Production stage: focuses on data-quality audit configuration and task operations. Audits are configured in the Data Quality Center; task operations are handled by the Task Operations Center, covering alarm handling, data backfill, reruns, problem handling, incident grading, and stability monitoring.

1.2 Model design guarantee

Yanxuan follows “design before development”. A hierarchical warehouse (rdb → dwd → dws → dm) is defined. Model design captures dimensions, measures, granularity, etc., stored in the Model Design Center. Model cross‑layer dependency rate and model heat are used to evaluate design quality.

Layer definitions: rdb (raw source data), dwd (business-process detail, modeled independently of specific requirements), dws (requirement-driven summaries of core metrics), dm (data marts for non-core metrics).
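
As a rough illustration, a cross-layer dependency rate like the one used to evaluate design quality could be computed as the share of lineage edges that skip a layer; the table names and edge format below are hypothetical, not Yanxuan's actual implementation.

```python
# Hypothetical sketch: computing a cross-layer dependency rate.
# A "cross-layer" dependency skips a layer, e.g. a dm model reading dwd
# directly instead of going through dws. Names and formats are assumptions.

LAYER_ORDER = {"rdb": 0, "dwd": 1, "dws": 2, "dm": 3}

def layer_of(table: str) -> int:
    # Assumes tables are named like "dwd.trade_order_detail".
    return LAYER_ORDER[table.split(".")[0]]

def cross_layer_rate(edges: list) -> float:
    """edges: (upstream_table, downstream_table) lineage pairs."""
    cross = sum(1 for up, down in edges if layer_of(down) - layer_of(up) > 1)
    return cross / len(edges) if edges else 0.0

edges = [
    ("rdb.orders", "dwd.trade_order_detail"),
    ("dwd.trade_order_detail", "dws.trade_order_1d"),
    ("dwd.trade_order_detail", "dm.gmv_report"),  # skips dws: cross-layer
]
print(f"cross-layer dependency rate: {cross_layer_rate(edges):.0%}")  # 33%
```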

1.3 Data quality guarantee

Data quality is enforced via the Data Quality Center, covering completeness, uniqueness, validity, consistency, and accuracy through audit rules at the table and field levels. Strong rules block task execution on failure; weak rules let the task continue but send notifications. Rules can be defined in SQL. The center also provides dashboards, rankings, and scores.
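
A minimal sketch of how strong and weak rules might be modeled, assuming a simple rule runner; the rule names, SQL, and execute_sql/notify hooks are illustrative, not the Data Quality Center's real API.

```python
# Minimal sketch of strong vs. weak audit rules; definitions and hooks are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AuditRule:
    name: str
    sql: str      # should return a single violation count
    strong: bool  # strong rules block the task; weak rules only notify

RULES = [
    AuditRule("pk_unique",   # uniqueness: duplicate primary keys
              "SELECT COUNT(*) - COUNT(DISTINCT order_id) "
              "FROM dwd.trade_order_detail WHERE ds = '${ds}'",
              strong=True),
    AuditRule("amount_non_negative",  # validity: negative pay amounts
              "SELECT COUNT(*) FROM dwd.trade_order_detail "
              "WHERE ds = '${ds}' AND pay_amount < 0",
              strong=False),
]

def run_audits(execute_sql, notify) -> None:
    for rule in RULES:
        violations = execute_sql(rule.sql)
        if violations == 0:
            continue
        if rule.strong:
            # A failed strong rule blocks downstream execution.
            raise RuntimeError(f"strong rule {rule.name}: {violations} violations")
        notify(f"weak rule {rule.name}: {violations} violations")
```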

1.4 Testing guarantee

The testing center was built as a product. Phase 1 built a testing environment that mirrors production metadata and creates *_dev databases for core layers. Phase 2 added two core functions: data comparison and shape inspection, enabling data consistency checks and model attribute analysis.
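A sketch of the data-comparison idea, assuming dev databases named with a _dev suffix as described above; the database, table, column, and execute_sql names are placeholders.

```python
# Sketch of data comparison: check a production table against its *_dev
# counterpart on row count and column aggregates. Names are placeholders.

def compare_tables(execute_sql, table: str, ds: str,
                   metrics=("COUNT(*)", "SUM(pay_amount)")) -> dict:
    diffs = {}
    for expr in metrics:
        prod = execute_sql(f"SELECT {expr} FROM dws.{table} WHERE ds = '{ds}'")
        dev = execute_sql(f"SELECT {expr} FROM dws_dev.{table} WHERE ds = '{ds}'")
        if prod != dev:
            diffs[expr] = {"prod": prod, "dev": dev}
    return diffs  # empty dict: the dev output is consistent with production
```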

1.5 Link‑awareness guarantee

Link awareness tracks lineage from source → rdb → dwd → dws → dm. When a model changes, its downstream impact is flagged and a QA approval workflow is triggered.
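
A sketch of link awareness as a breadth-first walk over a lineage graph, with a QA-approval hook fired when a change has downstream impact; the graph contents and request_qa_approval callback are assumptions for illustration.

```python
# Sketch: downstream-impact scan over a lineage graph; contents are assumed.
from collections import deque

LINEAGE = {  # model -> direct downstream models
    "rdb.orders": ["dwd.trade_order_detail"],
    "dwd.trade_order_detail": ["dws.trade_order_1d"],
    "dws.trade_order_1d": ["dm.gmv_report"],
}

def downstream_of(model: str) -> list:
    seen, queue, result = {model}, deque([model]), []
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                result.append(child)
                queue.append(child)
    return result

def on_model_change(model: str, request_qa_approval) -> None:
    impacted = downstream_of(model)
    if impacted:  # any downstream impact routes through QA approval
        request_qa_approval(model=model, impacted=impacted)
```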

2 In‑stage – Baseline‑based task operations

Key improvement points for task operations included alarm noise, unclear task priority levels, coarse assessment metrics, lack of early warning, difficulty locating problems in complex dependency chains, and the absence of a smooth incident workflow.

The solution was the Task Operations Center, built around the concept of “baseline”. Tasks are assigned to baselines (e.g., 02:30, 04:30, 07:30, 09:30, default). Baselines define priority, resource limits, and alarm policies.

2.1 Baseline division principles

Baselines are defined for daily scheduled tasks only. Core applications (vipapp, YouShu, etc.) and core metrics (GMV, UV) are attached to baselines. Each baseline carries two time concepts: a pre-warning time and a breach time (e.g., the 07:30 baseline has a 07:00 pre-warning).

Configured baselines: 02:30 (dwd layer), 04:30 (dws layer), 07:30 (core applications), 09:30 (all remaining tasks), and default (non-critical tasks).
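
The baseline attributes described above might be modeled roughly as follows; the field names and priority values are illustrative, not the Task Operations Center's actual schema.

```python
# Illustrative model of the baseline configuration; values are assumptions.
from dataclasses import dataclass
from datetime import time

@dataclass
class Baseline:
    name: str
    pre_warning: time  # warn if the baseline is still incomplete at this time
    breach_time: time  # latest acceptable completion time
    priority: int      # higher-priority baselines get resources first

BASELINES = [
    Baseline("dwd",       time(2, 0),  time(2, 30),  priority=4),
    Baseline("dws",       time(4, 0),  time(4, 30),  priority=3),
    Baseline("core-apps", time(7, 0),  time(7, 30),  priority=2),
    Baseline("all",       time(9, 0),  time(9, 30),  priority=1),
    Baseline("default",   time(23, 0), time(23, 59), priority=0),  # non-critical
]
```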

2.2 Alarm reduction measures

Baselines consolidate alarms, distinguishing pre-warning and breach alarms from failures of baseline-related tasks. Features include silent periods, intelligent cancellation of phone alarms, adjustable alarm intervals, and merging of repeated alarms.
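
A sketch of two of these measures, silent periods and repeated-alarm merging; the window values and function shape are assumptions.

```python
# Sketch of silent periods and repeated-alarm merging; values are assumed.
from datetime import datetime, time, timedelta

SILENT_WINDOWS = [(time(1, 0), time(2, 0))]  # no alarms inside these windows
MERGE_WINDOW = timedelta(minutes=30)         # same task alarms once per window
_last_sent = {}                              # task -> datetime of last alarm

def should_send(task: str, now: datetime) -> bool:
    # Suppress alarms that fall inside a configured silent period.
    if any(start <= now.time() < end for start, end in SILENT_WINDOWS):
        return False
    # Merge repeats: drop an alarm if the same task alarmed recently.
    last = _last_sent.get(task)
    if last is not None and now - last < MERGE_WINDOW:
        return False
    _last_sent[task] = now
    return True
```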

2.3 Key‑link diagnosis

Every 10 minutes, the system identifies the longest-running task chain on a baseline, compares its recent performance with a 14-day average, and highlights the task responsible for the delay.
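
One way such a diagnosis could work, sketched under the assumption of an acyclic dependency graph with per-task runtimes; the data shapes are hypothetical.

```python
# Sketch of key-link diagnosis: walk back from the baseline's terminal task
# along the slowest parent at each step, then flag the task that deviates
# most from its 14-day average runtime. Assumes an acyclic dependency graph.

def key_link(parents: dict, runtime: dict, terminal: str) -> list:
    """parents: task -> upstream tasks; runtime: task -> minutes today."""
    chain, node = [terminal], terminal
    while parents.get(node):
        node = max(parents[node], key=lambda t: runtime.get(t, 0.0))
        chain.append(node)
    return list(reversed(chain))  # slowest chain, source to terminal

def likely_culprit(chain: list, runtime: dict, avg_14d: dict) -> str:
    # The task whose runtime exceeds its 14-day average the most is the
    # most likely cause of a baseline delay.
    return max(chain, key=lambda t: runtime.get(t, 0.0) - avg_14d.get(t, 0.0))
```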

2.4 Impact assessment

When a task fails, the system lists affected downstream tasks and services. Future work aims to drill down to metric‑level impact.
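A sketch of this assessment, reusing the downstream_of traversal from the link-awareness sketch above; the task-to-service mapping is hypothetical.

```python
# Sketch of impact assessment, reusing downstream_of from the link-awareness
# sketch above; ATTACHED_SERVICES is a hypothetical mapping.
ATTACHED_SERVICES = {"dm.gmv_report": ["vipapp", "YouShu"]}

def assess_impact(failed_task: str) -> dict:
    tasks = downstream_of(failed_task)  # every task downstream of the failure
    services = sorted({s for t in tasks for s in ATTACHED_SERVICES.get(t, [])})
    return {"tasks": tasks, "services": services}
```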

3 Post‑stage – Intervention measures and regular mechanisms

3.1 Intervention measures

Kill default‑baseline tasks.

Set silent periods for alarms.

Key‑link diagnosis and impact analysis.

Dynamic cluster‑queue balancing.

Knowledge base of problem handling.

3.2 Rapid recovery of major incidents

Introduced a “freeze pool” concept: freeze the root task and its downstream running tasks, then thaw and rerun them with controlled parallelism, avoiding duplicate execution.
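
A sketch of the freeze-pool idea, reusing downstream_of from the lineage sketch above; the scheduler's freeze and thaw_and_rerun calls are hypothetical stand-ins for whatever the real scheduler exposes.

```python
# Sketch of freeze-and-thaw recovery; scheduler calls are assumptions.
from concurrent.futures import ThreadPoolExecutor

def recover(root: str, scheduler, max_parallelism: int = 8) -> None:
    frozen = [root] + downstream_of(root)
    for task in frozen:
        scheduler.freeze(task)  # park the task so it cannot start or re-run
    # Thaw and rerun with bounded parallelism; the scheduler is assumed to
    # still enforce upstream-before-downstream ordering for thawed tasks.
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        list(pool.map(scheduler.thaw_and_rerun, frozen))
```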

3.3 Regular practices

Cold‑task grading and deprecation.

Time‑consuming task ranking and optimization.

Engine switch from Hive to Spark.

Long‑chain task splitting.

Dimension table design optimization.

Task alarm fallback optimization.

3.4 Monitoring and retrospection

Metrics such as phone‑alarm count, effective response count, response rate, and average response time are displayed on the operations dashboard. Weekly reviews and BI reports track baseline completion trends and incident lists.
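A sketch of how these dashboard metrics might be derived from alarm records; the record shape is an assumption.

```python
# Sketch of the dashboard metrics named above; record shape is assumed.
def alarm_metrics(alarms: list) -> dict:
    """alarms: dicts with 'responded' (bool) and 'response_minutes' (float)."""
    responded = [a for a in alarms if a["responded"]]
    return {
        "phone_alarm_count": len(alarms),
        "effective_response_count": len(responded),
        "response_rate": len(responded) / len(alarms) if alarms else 1.0,
        "avg_response_minutes": (sum(a["response_minutes"] for a in responded)
                                 / len(responded)) if responded else 0.0,
    }
```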

4 Future considerations

Multi‑link diagnosis.

Joint baseline task locating.

Metric‑level impact analysis.

Alarm configuration optimization.

Data‑quality evaluation system.

Author bio

Jing Yuan, senior data‑development engineer at NetEase Yanxuan, responsible for supply‑chain and finance domain architecture and data‑application development, with extensive data‑warehouse and dimensional‑modeling experience.

Recruitment

NetEase Yanxuan data team is hiring senior big‑data development engineers for e‑commerce data‑warehouse construction, ETL development and data‑standard enforcement. Interested candidates can view the original article for details.

Tags: Data Quality, Data Warehouse, Baseline Management, Data Governance, Task Operations
Written by Yanxuan Tech Team

NetEase Yanxuan Tech Team shares e-commerce tech insights and quality finds for mindful living. This is the public portal for NetEase Yanxuan's technology and product teams, featuring weekly tech articles, team activities, and job postings.