
How NetEase Gaming Cut Data Warehouse Costs by 85%: A Data Governance Case Study

This case study details how NetEase Interactive Entertainment’s data team tackled massive log‑management chaos, storage bloat, and high overseas costs by standardizing logs, sharing a real‑time ODS layer, automating lifecycle management, tiering storage, and peak‑shaving compute scheduling, ultimately saving millions of yuan.


Project Background

Our team provides a wide range of game analytics services for NetEase Interactive Entertainment, supporting over 300 games with BI dashboards, custom metrics, and various analysis topics. The offline data warehouse executes up to 4 million SQL queries per month.

Data analysis overview

Project Challenges

Rapid business growth led to more than 3 million monthly offline warehouse jobs and highlighted several problems:

Chaotic log management with inconsistent standards, making parsing and maintenance difficult.

Long‑running jobs accumulated through continuous operation, with no clear data lifecycle.

Duplicate storage of raw logs and formatted tables, inflating costs.

Overseas market expansion caused storage costs abroad to far exceed domestic costs.

Challenge illustration

Project Implementation

We pursued optimization in two major areas: storage and compute.

Storage Optimizations

The ODS layer accounts for 75% of total storage, so we focused on pre‑storage, during‑storage, and post‑storage improvements.

Pre‑storage: Log Format Standards

We introduced three principles for log printing:

Standard: Follow a unified format to ensure downstream parsability and avoid wasted space.

On‑Demand: Emit only the logs downstream business actually requires, minimizing transmission volume.

Compact: For frequently repeated fields, print the full information only once per session and reference it in subsequent logs via a session ID or dimension table.

Using a log‑reporting module that maps abbreviated column names to full English names, we reduced log storage by roughly 10%‑20%.
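The abbreviated-name mapping can be sketched as a simple expansion step at ingestion time. The mapping table and field names below are illustrative assumptions, not NetEase's actual schema:

```python
# Hypothetical sketch of the abbreviated-column-name mapping described above.
# The abbreviations and full names are illustrative, not the real schema.
ABBREV_TO_FULL = {
    "uid": "user_id",
    "sid": "session_id",
    "ts": "event_timestamp",
    "lv": "player_level",
}

def expand_log_record(record: dict) -> dict:
    """Expand abbreviated keys emitted by the client into full column names."""
    return {ABBREV_TO_FULL.get(key, key): value for key, value in record.items()}

raw = {"uid": 42, "sid": "a1b2", "ts": 1719830400, "lv": 17}
expanded = expand_log_record(raw)
```

Clients transmit and store the short keys; only the warehouse-facing view pays for the full names, which is where the 10%‑20% storage saving comes from.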

Log format example

During‑storage: Shared Warehouse Solution

Previously, each department maintained its own ETL‑derived ODS, causing duplicate raw logs. We built a shared real‑time ODS layer where streams are ingested, ETL‑processed, and stored as separate tables with minute‑level freshness. Departments now query the shared ODS directly, eliminating redundant raw‑log storage and improving query efficiency.
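The key property of the shared ODS is that logs are parsed once and fanned out into per-event tables that all departments query. A minimal sketch, with an assumed `event|k=v|k=v` log format and in-memory buffers standing in for real tables:

```python
from collections import defaultdict

# Minimal sketch of a shared real-time ODS layer: one ingestion path parses
# each standardized log once and routes it to a per-event table. The log
# format and table names are illustrative assumptions.
ods_tables = defaultdict(list)  # table name -> buffered rows

def ingest(log_line: str) -> None:
    """Parse 'event|k1=v1|k2=v2' and append to the event's shared ODS table."""
    event, *pairs = log_line.strip().split("|")
    row = dict(pair.split("=", 1) for pair in pairs)
    ods_tables[f"ods_{event}"].append(row)

ingest("login|uid=42|region=eu")
ingest("purchase|uid=42|item=sword")
```

In production this routing would run inside a streaming ETL job writing minute-fresh partitions, but the shape is the same: one parse, many consumers, no duplicate raw-log copies.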

Shared warehouse architecture

Post‑storage: Dynamic Lifecycle Management

Manual ODS table lifecycle configuration caused conflicts, maintenance overhead, and legacy issues. We automated expiration by monitoring SQL audit logs, automatically dropping long‑unread partitions, and recreating them on demand. Between June and September 2022, we reclaimed 2,229.5 TB of overseas data, reducing total storage to 386.3 TB, an 85% reduction.
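The expiration rule reduces to: find partitions whose last recorded read is older than a TTL. The `last_read` structure and partition naming below are assumptions; a real system would derive them from Hive/HDFS SQL audit logs:

```python
from datetime import date, timedelta

# Sketch of audit-log-driven expiration: partitions unread for `ttl_days`
# become drop candidates (and can be re-materialized on demand later).
def expired_partitions(last_read, today, ttl_days=90):
    """Return partitions whose last read is older than the TTL."""
    cutoff = today - timedelta(days=ttl_days)
    return sorted(p for p, read_on in last_read.items() if read_on < cutoff)

audit = {
    "ods_login/dt=2022-01-01": date(2022, 2, 1),   # long unread -> drop candidate
    "ods_login/dt=2022-06-01": date(2022, 6, 20),  # recently read -> keep
}
stale = expired_partitions(audit, today=date(2022, 7, 1))
```

The output would feed an `ALTER TABLE ... DROP PARTITION` step; keeping the decision logic separate from the DDL makes the policy auditable before anything is deleted.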

Lifecycle management diagram

Post‑storage: Tiered and Compressed Storage

Based on data age and access frequency, we introduced three storage tiers:

Tier 1: Hot Hadoop cluster for high‑performance reads/writes.

Tier 2: Cold Hadoop cluster with slightly lower performance.

Tier 3: Cold S3 cluster, read‑only with the lowest performance.
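A tier-assignment policy over the two signals the article names (data age and access frequency) might look like the following; the thresholds are illustrative assumptions, not the actual policy:

```python
def choose_tier(age_days: int, reads_last_30d: int) -> str:
    """Map a partition to one of the three tiers by age and access frequency.
    Thresholds here are illustrative assumptions."""
    if age_days <= 30 or reads_last_30d >= 10:
        return "hot-hadoop"   # Tier 1: high-performance reads/writes
    if age_days <= 365 and reads_last_30d > 0:
        return "cold-hadoop"  # Tier 2: slightly lower performance
    return "cold-s3"          # Tier 3: read-only, lowest performance
```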

We also applied two levels of compression:

Level 1: ORC or TextFile with Snappy compression, achieving 30%‑48% size reduction.

Level 2: Erasure coding to reduce replication from 3× to 1.5×, saving about 50% of space.
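The two levels compound, since compression shrinks the logical bytes and erasure coding shrinks the on-disk multiplier. A back-of-envelope check, assuming a 35% compression ratio (inside the stated 30%‑48% range):

```python
# Combine the two levels: columnar compression (assume 35%) plus erasure
# coding that cuts the on-disk storage factor from 3x replication to 1.5x.
def on_disk_tb(raw_tb, compress_ratio, storage_factor):
    """On-disk footprint for raw_tb of logical data."""
    return raw_tb * (1 - compress_ratio) * storage_factor

before = on_disk_tb(100.0, 0.0, 3.0)   # uncompressed, 3x replication
after = on_disk_tb(100.0, 0.35, 1.5)   # compressed + erasure-coded
```

Under these assumptions, 100 TB of raw data drops from roughly 300 TB on disk to under 100 TB.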

Tiered storage diagram

Compute Optimizations

Pre‑compute: Efficiency Improvements

We prioritized long‑running, widely used P1 metrics. By redesigning data models to reduce cross‑layer references and reusing intermediate layers, we cut required compute resources for targeted jobs by 70%‑90% and lowered overall compute consumption for regular operational metrics by 30%‑50%.

Compute efficiency chart

During‑compute: Peak‑Shaving Scheduling

Previously, many jobs ran during the busy window (02:00‑08:00), when compute cost is four times that of idle periods. By classifying tasks, moving non‑critical analytics to idle hours, and adjusting priorities, we reduced overall compute cost.
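The peak-shaving rule reduces to: critical jobs keep the busy window, everything else defers to idle hours at roughly a quarter of the unit cost. A sketch, with the priority labels and window bounds as illustrative assumptions:

```python
# Illustrative peak-shaving rule: P1 (critical) jobs keep the busy window,
# non-critical analytics shift to idle hours at ~1/4 of the busy-hour price.
BUSY_COST, IDLE_COST = 4.0, 1.0

def schedule(jobs):
    """jobs: iterable of (name, priority); return name -> (window, unit_cost)."""
    plan = {}
    for name, priority in jobs:
        if priority == "P1":
            plan[name] = ("02:00-08:00", BUSY_COST)  # busy window, 4x cost
        else:
            plan[name] = ("10:00-22:00", IDLE_COST)  # idle window, 1x cost
    return plan

plan = schedule([("core_revenue_daily", "P1"), ("adhoc_funnel_report", "P3")])
```

A real scheduler would also respect upstream dependencies and SLAs, but the cost lever is exactly this window assignment.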

Scheduling diagram

Project Results

After optimization began in September 2021, storage savings reached up to 65% and compute savings up to 71%, for an estimated annual cost reduction of about 9.6 million CNY. From June 2022 onward, with the focus on overseas projects, storage and compute improvements contributed 45% and 55% of the savings respectively, for an estimated annual saving of over 15 million CNY.

2021 cost reduction chart
2022 cost reduction chart
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Log Management, Storage Tiering
Written by Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.