How Baidu MEG Cut Data Costs: Inside a Big Data Governance Playbook
This article details Baidu's MEG data cost governance practice, covering background challenges, a unified governance framework, health‑score metrics, platform and engine capabilities, concrete compute and storage optimization techniques, achieved results, and future plans for continuous cost reduction.
Background
Rapid growth of Baidu's products has led to exploding offline data volumes and rising storage and compute costs. An analysis of resource, management, and cost status revealed scattered resources, low utilization, and lack of unified governance standards across product lines.
Data Cost Governance Overview
The practice covers the current problems, the optimization schemes applied to compute and storage, the outcomes achieved, and future directions, offering a reference for the industry.
Overall Framework
A unified governance framework was built around three pillars: data asset measurement, platform capabilities, and engine empowerment. It creates unified views for compute resources, storage resources, tasks, and costs, enabling systematic cost reduction.
Data Asset Health Metrics
Health scores are used to evaluate assets. Compute health score combines queue usage average, usage balance, and weighted compute‑governance items. Storage health score combines storage account usage average, peak usage, cold‑data proportion, and weighted storage‑governance items.
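The article does not publish the exact scoring formula, but the weighted combination it describes can be sketched as follows. The metric names, weights, and sample values below are illustrative assumptions, not Baidu's actual schema.

```python
# Illustrative sketch of a weighted health score. Metric names, weights,
# and values are hypothetical; the real formula is not published.
def health_score(metrics: dict, weights: dict) -> float:
    """Combine normalized metrics (each in [0, 1]) into a 0-100 score."""
    total_weight = sum(weights.values())
    score = sum(metrics[name] * w for name, w in weights.items())
    return round(100 * score / total_weight, 1)

compute_metrics = {
    "queue_usage_avg": 0.72,    # average queue utilization
    "usage_balance": 0.85,      # 1 - dispersion of usage across queues
    "governance_items": 0.60,   # share of governance items resolved
}
compute_weights = {
    "queue_usage_avg": 0.4,
    "usage_balance": 0.3,
    "governance_items": 0.3,
}

print(health_score(compute_metrics, compute_weights))  # → 72.3
```

Normalizing each metric to [0, 1] before weighting keeps the score comparable across product lines regardless of absolute queue or account size.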
Platform Capabilities
Compute view – overview of queue usage and detailed governance items; supports task registration, control, scheduling, and full‑lifecycle management.
Storage view – detailed usage of each storage account and governance items; provides tools for directory cleaning, migration, and cold‑data mining.
Cost view – aggregates compute and storage costs per product line for intuitive governance results.
Engine Empowerment
Compute scenario – applies machine‑learning models to historical task data for intelligent parameter tuning, so that tasks run with near‑optimal settings.
Storage scenario – performs intelligent compression of massive storage data without affecting read/write performance.
Compute Cost Governance
Management and Control
For thousands of EMR queues and tens of thousands of Hadoop/Spark tasks, the platform registers resources, collects metadata, extracts governance items (e.g., uneven queue usage, long‑running high‑resource tasks, data skew, invalid tasks), and applies health scores to guide remediation.
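Extracting governance items from collected metadata amounts to applying rule checks over task metrics. The sketch below shows two such checks; the field names and thresholds are assumptions for illustration, not Baidu's actual rules.

```python
# Hypothetical sketch: flag governance items from task metadata.
# Field names and thresholds are illustrative, not Baidu's actual schema.
def find_governance_items(tasks, runtime_hours=6, cpu_util_floor=0.2):
    items = []
    for t in tasks:
        # long-running task that also holds a large resource footprint
        if t["runtime_h"] > runtime_hours and t["vcores"] >= 100:
            items.append((t["name"], "long-running high-resource task"))
        # consistently idle task: candidate for the "invalid task" item
        if t["cpu_util"] < cpu_util_floor:
            items.append((t["name"], "low utilization / possibly invalid task"))
    return items

tasks = [
    {"name": "etl_daily", "runtime_h": 9.5, "vcores": 400, "cpu_util": 0.65},
    {"name": "tmp_backfill", "runtime_h": 1.0, "vcores": 8, "cpu_util": 0.05},
]
print(find_governance_items(tasks))
```

Each flagged item can then feed the health score and the remediation workflow described above.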
Mixed Scheduling
A hybrid scheduler selects the optimal queue for each Hadoop or Spark job based on priority, submission time, and a chain of >20 policies (resource balance, locality, peak usage, etc.). This reduces queue fragmentation and improves overall utilization.
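A policy chain of this kind can be modeled as a list of scoring functions whose weighted sum ranks candidate queues. The two policies and weights below are a minimal sketch (the real scheduler chains more than 20 policies); names and scores are assumptions.

```python
# Minimal sketch of a policy-chain queue selector. The real scheduler
# chains 20+ policies; the two below and their weights are illustrative.
def balance_policy(job, queue):
    return 1.0 - queue["used_ratio"]            # prefer emptier queues

def locality_policy(job, queue):
    return 1.0 if queue["dc"] == job["data_dc"] else 0.0  # prefer data locality

def pick_queue(job, queues, policies, weights):
    def score(q):
        return sum(w * p(job, q) for p, w in zip(policies, weights))
    return max(queues, key=score)

queues = [
    {"name": "q1", "used_ratio": 0.9, "dc": "bj"},
    {"name": "q2", "used_ratio": 0.4, "dc": "bj"},
]
job = {"data_dc": "bj", "priority": 1}
best = pick_queue(job, queues, [balance_policy, locality_policy], [0.7, 0.3])
print(best["name"])  # → q2 (less loaded, same data center)
```

Keeping each policy as an independent function makes the chain easy to extend with peak-usage, priority, or fragmentation policies without touching the selector itself.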
Intelligent Tuning
Two tuning flows are implemented:
Basic parameter tuning – a closed‑loop of task submission, result reporting, model training, and SLA protection automatically adjusts spark.executor.instances, spark.executor.cores, and spark.executor.memory to minimize resource waste.
History‑Based Optimization (HBO) – collects Spark task history, then during planning and submission adjusts join algorithms, data‑skew handling, aggregation strategies, shuffle partitions, and enables features like Kryo serialization for complex parameter scenarios.
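One step of the basic-tuning loop, right-sizing executor memory toward observed peak usage plus headroom, can be sketched as below. The headroom factor, floor, and reported peak value are illustrative assumptions; the production system derives its suggestions from trained models rather than a fixed rule.

```python
# Hedged sketch of one basic-tuning step: shrink executor memory toward
# the observed peak plus headroom. Headroom/floor values are illustrative.
def tune_executor_memory(requested_gb, observed_peak_gb, headroom=1.25, floor_gb=2):
    suggested = max(floor_gb, int(observed_peak_gb * headroom + 0.999))  # round up
    return min(suggested, requested_gb)    # never suggest more than requested

conf = {
    "spark.executor.instances": 50,
    "spark.executor.cores": 4,
    "spark.executor.memory": 16,   # GB, as requested by the task owner
}
peak_gb = 6.2                      # peak usage reported from task history
conf["spark.executor.memory"] = tune_executor_memory(16, peak_gb)
print(conf["spark.executor.memory"])  # → 8
```

The SLA-protection step mentioned above would then verify that runs with the reduced setting stay within their deadlines before the suggestion is locked in.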
Storage Cost Governance
Lifecycle Management
The platform defines a five‑layer lifecycle (access, service, storage, execution, user) and builds tools for account onboarding, quota enforcement, cold‑data handling, automated cleaning, compression, and monitoring.
Basic Governance
By parsing AFS quota data and directory metadata, the system provides trend analysis, cost calculation, anomaly detection, and actionable recommendations for storage paths, migration, and compression.
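Anomaly detection over parsed quota snapshots can be as simple as comparing the latest day against a trailing average. The window, growth factor, and data shape below are assumptions for illustration.

```python
# Illustrative sketch: flag directories whose storage grew abnormally fast,
# based on parsed daily quota snapshots. Window/factor values are assumed.
def growth_anomalies(daily_sizes, window=7, factor=2.0):
    """Return True if the latest day exceeds factor x the trailing average."""
    if len(daily_sizes) <= window:
        return False                      # not enough history to judge
    trailing = daily_sizes[-window - 1:-1]
    avg = sum(trailing) / len(trailing)
    return daily_sizes[-1] > factor * avg

# seven quiet days, then a sudden jump (sizes in TB)
history = [10, 10, 11, 10, 10, 11, 10, 30]
print(growth_anomalies(history))  # → True
```

A directory flagged this way would surface in the recommendations feed alongside its cost estimate and suggested action (cleanup, migration, or compression).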
Intelligent Compression
Two scenarios are addressed:
Data‑warehouse tables – automatic profiling, partition‑level ZSTD compression, page‑size tuning, and re‑writing to achieve high compression without impacting query performance.
Non‑warehouse AFS data – cold‑warm‑hot classification, selective compression parameter tuning for hot writes, and periodic offline compression for warm/cold data.
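The cold/warm/hot split for non-warehouse data can be driven by last-access age. The 7-day and 30-day cutoffs below are illustrative assumptions, not Baidu's actual policy.

```python
# Sketch of cold/warm/hot classification by last-access age; the 7/30-day
# cutoffs are illustrative, not the actual policy.
def classify(last_access_days):
    if last_access_days <= 7:
        return "hot"      # leave on the hot write path, tune lightly
    if last_access_days <= 30:
        return "warm"     # candidate for periodic offline compression
    return "cold"         # compress aggressively or migrate

paths = {"/logs/today": 1, "/logs/last_month": 20, "/logs/2022": 400}
print({p: classify(d) for p, d in paths.items()})
```

The classification then selects the compression strategy: hot data gets only write-side parameter tuning, while warm and cold data are re-compressed offline where a stronger codec is worth the CPU cost.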
Governance Results
Data development efficiency: Full‑lifecycle management reduced resource provisioning from weeks to days, halved queue waiting times, and doubled data delivery speed.
Compute cost reduction: Balanced usage across thousands of queues raised average EMR utilization by >30%, saving tens of millions of RMB annually.
Storage operation efficiency: Managed thousands of AFS accounts, cleaned unused data, and introduced monitoring, greatly improving operational control.
Storage cost reduction: Increased overall storage utilization by >20% across thousands of PB, saving tens of millions of RMB per year.
Governance assets: Established standards for resource delivery, task development, data quality, and security; created comprehensive cost and asset dashboards.
Future Planning
The team will continue to refine standards, automate more governance scenarios, and enhance intelligent optimization to achieve even more precise and automated cost control.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.