Big Data 24 min read

Cost Governance Practices in Youzan's Data Middle Platform

Youzan's data middle platform saw costs grow faster than the business, driven by low resource utilization and inefficient storage. The team responded with utilization standards, containerization, migration of cold data to COS, offline task optimization, and fine-grained cost billing. Reported results: a 12% gain in compute performance, 17% compute savings from batch optimization, an 80% cut in cold-data storage cost, and more than 25% cost reduction from self-driven efforts.

Tencent Cloud Developer

Background

With the rapid growth of live e-commerce, Youzan's business has expanded quickly, but the data middle platform's storage and compute consumption has risen even faster, at times exceeding the overall business growth rate. This unsustainable trend prompted a cost-governance effort led by Shen Kan, Youzan's Vice President of Technology and a Tencent Cloud TVP expert.

1. Data Platform Machine Resource Situation

The platform operates roughly 1,500 machines (mostly physical, some virtual), hosts about 100 applications, provides around 40,000 CPU cores and stores about 15 PB of data.

Offline compute accounts for nearly 50 % of the total cost, while other workloads (real‑time, online services, platform applications) consume the remaining share.

2. Cost Growth Outpacing Business Growth

The cost of the data platform grew faster than the business overall, which made cost governance a critical priority.

In the first half of the year the effort focused on offline compute; cost reduction for real-time workloads is still being planned.

3. Problem Analysis

The main issues identified were:

Low resource utilization (average CPU 11 %, memory 30 %; offline CPU ~25 %, memory ~40 %).

High cost of scaling up and down – procurement runs on month-level cycles, so capacity added for promotions is retained long after the peak.

Storage as a bottleneck – dedicated physical machines for cold‑backup data lead to low compute utilization.

Offline compute waste – batch jobs were not fully optimized.

Lack of cost awareness – developers pay far more attention to supporting the business than to what their workloads cost.

4. Comprehensive Governance Measures

4.1 Improve Resource Utilization

Set utilization standards for different environments (QA, pre‑release, production) and workloads (offline, online, real‑time). Standardize machine types (compute‑optimized, memory‑optimized, storage‑optimized) and replace legacy heterogeneous servers with uniform hardware to simplify scheduling and containerization.

Implement workload‑level throttling and peak‑shaving by splitting batch tasks.
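Peak-shaving by splitting batch tasks can be pictured as placing chunks of work into time slots under a CPU cap. The following is a minimal sketch under stated assumptions, not Youzan's scheduler: the hourly slot model, the `cap` parameter, the task tuples and the function name `shave_peak` are all hypothetical.

```python
from typing import Dict, List, Tuple


def shave_peak(
    tasks: List[Tuple[str, int, int]], cap: int, slots: int
) -> Dict[int, List[Tuple[str, int]]]:
    """Greedily place batch work into hourly slots without exceeding a CPU cap.

    tasks: (name, total_core_hours, earliest_slot) tuples (hypothetical inputs).
    cap:   maximum core-hours that may be scheduled in any single slot.
    A task that does not fit in one slot is split and spills into later slots.
    """
    load = [0] * slots
    plan: Dict[int, List[Tuple[str, int]]] = {s: [] for s in range(slots)}
    for name, core_hours, earliest in tasks:
        remaining = core_hours
        for slot in range(earliest, slots):
            if remaining == 0:
                break
            free = cap - load[slot]
            if free <= 0:
                continue  # slot already at the cap, try the next one
            chunk = min(free, remaining)
            load[slot] += chunk
            plan[slot].append((name, chunk))
            remaining -= chunk
        if remaining:
            raise ValueError(f"not enough capacity to place task {name!r}")
    return plan


if __name__ == "__main__":
    demo = shave_peak([("a", 150, 0), ("b", 100, 0)], cap=100, slots=3)
    for slot, chunks in demo.items():
        print(slot, chunks)
```

Tasks are placed in list order, so listing higher-priority tasks first gives them the earliest slots; a real scheduler would also need to respect task dependencies, which this sketch ignores.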

4.2 Containerization Refactor

Containerize offline compute to achieve minute‑level scaling via API‑driven procurement with Tencent Cloud. This also enables compute‑storage separation, allowing idle online resources to be repurposed for night‑time batch jobs.

Figure: Containerization Benefits

4.3 Storage Optimization

Move cold‑backup data to Tencent COS (standard storage) – cost reduced by ~80 % compared with dedicated clusters.

Clean up Hive partition tables (over 90 % of partitions cleared) saving >20 % storage.

Migrate Hadoop from 2.x to 3.x to leverage built‑in compression, potentially halving storage for cold data.
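The Hive partition cleanup above could be driven by a simple retention rule. The sketch below generates Hive `ALTER TABLE ... DROP IF EXISTS PARTITION` statements for date partitions older than a retention window. The source does not say how partitions were selected, so the `dt` partition key, the ISO date format, the retention period and the function name are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expired_partition_ddl(
    table: str, partitions: Iterable[str], today: date, keep_days: int
) -> List[str]:
    """Return DROP PARTITION statements for partitions older than keep_days.

    partitions: values of a hypothetical `dt` partition key, as YYYY-MM-DD.
    The statements are only generated here, not executed.
    """
    cutoff = today - timedelta(days=keep_days)
    statements = []
    for dt in sorted(partitions):
        if date.fromisoformat(dt) < cutoff:
            statements.append(
                f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{dt}');"
            )
    return statements


if __name__ == "__main__":
    for stmt in expired_partition_ddl(
        "dw.orders_snapshot",  # hypothetical table name
        ["2024-01-01", "2024-03-01", "2024-03-20"],
        today=date(2024, 3, 31),
        keep_days=30,
    ):
        print(stmt)
```

Generating the statements separately from running them leaves room for a review step before any data is actually dropped.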

4.4 Offline Task Optimization ("Six Sword" Strategy)

Offline data decommission – identify and drop unused tables using lineage analysis.

Task scheduling – eliminate duplicate computations and adjust job frequencies.

High‑to‑low frequency – reduce hourly jobs to daily/tri‑daily where possible.

Task replacement – consolidate legacy jobs with newer implementations.

Small-file merging – merge the many small files produced by Spark jobs to improve performance.

Delayed start – prioritize high‑priority jobs, then run lower‑priority jobs in off‑peak windows.
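The first item, decommissioning via lineage analysis, can be sketched as a set computation: a table that some task produces but no task consumes is a candidate for removal. This is a minimal illustration, not the platform's actual tooling; the lineage dictionary shape, the `externally_read` allow-list and the table names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# task name -> (input tables, output tables); a hypothetical lineage format
Lineage = Dict[str, Tuple[Iterable[str], Iterable[str]]]


def unused_tables(lineage: Lineage, externally_read: Set[str]) -> Set[str]:
    """Return tables that are produced by some task but never read.

    externally_read: tables consumed outside the lineage graph (reports,
    exports, ad-hoc queries), which must not be flagged as unused.
    """
    produced: Set[str] = set()
    consumed: Set[str] = set()
    for inputs, outputs in lineage.values():
        consumed.update(inputs)
        produced.update(outputs)
    return produced - consumed - externally_read


if __name__ == "__main__":
    lineage = {
        "load_b": (["ods.a"], ["dw.b"]),
        "build_c": (["dw.b"], ["dm.c"]),
        "legacy_tmp": (["ods.a"], ["dw.tmp"]),
    }
    print(unused_tables(lineage, externally_read={"dm.c"}))
```

Removing a leaf table and the task that writes it can expose new leaves upstream, so in practice such a check would be repeated until no further candidates appear.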

4.5 Cost Operation Mechanism

Introduce a cost‑billing model that breaks down machine cost into CPU, memory, storage and time components. Collect fine‑grained usage data via Spark monitor and STS, apply a 10 % loss factor for pre‑warm and release phases, and calculate per‑task cost.
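A per-task cost under such a model could be computed roughly as follows. This is a sketch under stated assumptions: the article gives the components (CPU, memory, storage, time) and the 10 % loss factor, but not the formula, so the unit prices, the field names and the choice to apply the loss factor only to the compute portion are hypothetical.

```python
from dataclasses import dataclass

LOSS_FACTOR = 0.10  # from the article: overhead for pre-warm and release phases


@dataclass(frozen=True)
class UnitPrices:
    """Hypothetical unit prices derived from machine cost."""
    cpu_core_hour: float
    mem_gb_hour: float
    storage_gb_day: float


@dataclass(frozen=True)
class TaskUsage:
    """Usage of one task, already multiplied by time."""
    cpu_core_hours: float
    mem_gb_hours: float
    storage_gb_days: float


def task_cost(usage: TaskUsage, prices: UnitPrices) -> float:
    """Cost of one task: compute (CPU + memory) with loss factor, plus storage.

    Assumption: the loss factor is applied to compute only, because pre-warm
    and release overhead concerns compute containers rather than stored data.
    """
    compute = (
        usage.cpu_core_hours * prices.cpu_core_hour
        + usage.mem_gb_hours * prices.mem_gb_hour
    )
    storage = usage.storage_gb_days * prices.storage_gb_day
    return compute * (1 + LOSS_FACTOR) + storage


if __name__ == "__main__":
    prices = UnitPrices(cpu_core_hour=0.5, mem_gb_hour=0.25, storage_gb_day=0.125)
    usage = TaskUsage(cpu_core_hours=100, mem_gb_hours=200, storage_gb_days=80)
    print(round(task_cost(usage, prices), 2))
```

With these illustrative numbers the compute portion is 100, the loss factor raises it to 110, and storage adds 10.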

Publish cost bills at task, user, team and platform levels to increase cost visibility and drive self‑service cost reduction.
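Rolling per-task costs up into the four bill levels is a straightforward aggregation. The sketch below assumes each billing row carries the task, its owner and the owner's team; that row shape and the function name `roll_up` are hypothetical, since the article does not describe the billing data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (task, user, team, cost): a hypothetical billing row
BillRow = Tuple[str, str, str, float]


def roll_up(rows: Iterable[BillRow]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-task costs into task, user, team and platform bills."""
    by_task: Dict[str, float] = defaultdict(float)
    by_user: Dict[str, float] = defaultdict(float)
    by_team: Dict[str, float] = defaultdict(float)
    total = 0.0
    for task, user, team, cost in rows:
        by_task[task] += cost
        by_user[user] += cost
        by_team[team] += cost
        total += cost
    return {
        "task": dict(by_task),
        "user": dict(by_user),
        "team": dict(by_team),
        "platform": {"total": total},
    }


if __name__ == "__main__":
    bills = roll_up([
        ("daily_orders", "alice", "trade", 12.0),
        ("hourly_uv", "alice", "trade", 3.0),
        ("goods_index", "bob", "goods", 5.0),
    ])
    for level, bill in bills.items():
        print(level, bill)
```

Sorting the user or team bill by cost would give the kind of ranking that a red/black list could be built on.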

Establish a cost‑operation loop: cost awareness → self‑service reduction → team‑level nudging → reward/punishment (e.g., red/black lists, "Youzan Coins").

5. Results and Outlook

Resource utilization improved modestly (CPU & memory). Key achievements:

Compute performance up 12 % via uniform hardware.

Batch optimization saved 17 % compute.

COS cold‑data storage cut storage cost by 80 %.

Self‑driven cost reduction exceeded 25 %.

Going forward, Youzan plans to build larger mixed-mode (online/offline) clusters, expand on-demand procurement, extend cost billing to business units, and continue refining the ROI-driven cost model.

6. Q&A Highlights

Continuous operation requires ~10 % of a person’s effort to maintain cost savings.

The data platform is fully on‑cloud (Tencent Cloud).

Cost awareness is primarily driven by incentives rather than strict penalties.

Data lineage is captured via Hive/SparkSQL parsing and manual task‑dependency registration.

Cold‑data backup uses a Hadoop distcp‑based tool with encryption to copy data to COS.

Real‑time analytics use Kafka + Flink; offline analytics use Hive + SparkSQL; Presto is used for ad‑hoc insights.

Structured data is stored as JSON/Avro; unstructured data is stored as compressed text.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
