How Baidu MEG Cut Data Costs: Inside a Big Data Governance Playbook
This article details Baidu's MEG data cost governance practice, covering background challenges, a unified governance framework, health‑score metrics, platform and engine capabilities, concrete compute and storage optimization techniques, achieved results, and future plans for continuous cost reduction.
Background
Rapid growth of Baidu's products has led to exploding offline data volumes and rising storage and compute costs. An analysis of resource, management, and cost status revealed scattered resources, low utilization, and lack of unified governance standards across product lines.
Data Cost Governance Overview
The practice covers the current problems, the optimization schemes applied to compute and storage, the outcomes achieved, and future directions, offering a reference for the industry.
Overall Framework
A unified governance framework was built around three pillars: data asset measurement, platform capabilities, and engine empowerment. It creates unified views for compute resources, storage resources, tasks, and costs, enabling systematic cost reduction.
Data Asset Health Metrics
Health scores are used to evaluate assets. Compute health score combines queue usage average, usage balance, and weighted compute‑governance items. Storage health score combines storage account usage average, peak usage, cold‑data proportion, and weighted storage‑governance items.
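The article does not publish the exact scoring formula, but the weighted combination it describes can be sketched as follows. The metric names, weights, and sample values below are illustrative assumptions, not Baidu's actual schema.

```python
# Illustrative sketch of a weighted health score. Metric names, weights,
# and values are hypothetical; the real formula is not published.
def health_score(metrics: dict, weights: dict) -> float:
    """Combine normalized metrics (each in [0, 1]) into a 0-100 score."""
    total_weight = sum(weights.values())
    score = sum(metrics[name] * w for name, w in weights.items())
    return round(100 * score / total_weight, 1)

compute_metrics = {
    "queue_usage_avg": 0.72,    # average queue utilization
    "usage_balance": 0.85,      # 1 - dispersion of usage across queues
    "governance_items": 0.60,   # share of governance items resolved
}
compute_weights = {
    "queue_usage_avg": 0.4,
    "usage_balance": 0.3,
    "governance_items": 0.3,
}

print(health_score(compute_metrics, compute_weights))  # → 72.3
```

Normalizing each metric to [0, 1] before weighting keeps the score comparable across product lines regardless of absolute queue or account size.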
Platform Capabilities
Compute view – overview of queue usage and detailed governance items; supports task registration, control, scheduling, and full‑lifecycle management.
Storage view – detailed usage of each storage account and governance items; provides tools for directory cleaning, migration, and cold‑data mining.
Cost view – aggregates compute and storage costs per product line for intuitive governance results.
Engine Empowerment
Compute scenario – applies machine‑learning models to historical task data for intelligent parameter tuning, so that tasks run with near‑optimal settings.
Storage scenario – performs intelligent compression of massive storage data without affecting read/write performance.
Compute Cost Governance
Management and Control
For thousands of EMR queues and tens of thousands of Hadoop/Spark tasks, the platform registers resources, collects metadata, extracts governance items (e.g., uneven queue usage, long‑running high‑resource tasks, data skew, invalid tasks), and applies health scores to guide remediation.
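Extracting governance items from collected metadata amounts to applying rule checks over task metrics. The sketch below shows two such checks; the field names and thresholds are assumptions for illustration, not Baidu's actual rules.

```python
# Hypothetical sketch: flag governance items from task metadata.
# Field names and thresholds are illustrative, not Baidu's actual schema.
def find_governance_items(tasks, runtime_hours=6, cpu_util_floor=0.2):
    items = []
    for t in tasks:
        # long-running task that also holds a large resource footprint
        if t["runtime_h"] > runtime_hours and t["vcores"] >= 100:
            items.append((t["name"], "long-running high-resource task"))
        # consistently idle task: candidate for the "invalid task" item
        if t["cpu_util"] < cpu_util_floor:
            items.append((t["name"], "low utilization / possibly invalid task"))
    return items

tasks = [
    {"name": "etl_daily", "runtime_h": 9.5, "vcores": 400, "cpu_util": 0.65},
    {"name": "tmp_backfill", "runtime_h": 1.0, "vcores": 8, "cpu_util": 0.05},
]
print(find_governance_items(tasks))
```

Each flagged item can then feed the health score and the remediation workflow described above.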
Mixed Scheduling
A hybrid scheduler selects the optimal queue for each Hadoop or Spark job based on priority, submission time, and a chain of >20 policies (resource balance, locality, peak usage, etc.). This reduces queue fragmentation and improves overall utilization.
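A policy chain of this kind can be modeled as a list of scoring functions whose weighted sum ranks candidate queues. The two policies and weights below are a minimal sketch (the real scheduler chains more than 20 policies); names and scores are assumptions.

```python
# Minimal sketch of a policy-chain queue selector. The real scheduler
# chains 20+ policies; the two below and their weights are illustrative.
def balance_policy(job, queue):
    return 1.0 - queue["used_ratio"]            # prefer emptier queues

def locality_policy(job, queue):
    return 1.0 if queue["dc"] == job["data_dc"] else 0.0  # prefer data locality

def pick_queue(job, queues, policies, weights):
    def score(q):
        return sum(w * p(job, q) for p, w in zip(policies, weights))
    return max(queues, key=score)

queues = [
    {"name": "q1", "used_ratio": 0.9, "dc": "bj"},
    {"name": "q2", "used_ratio": 0.4, "dc": "bj"},
]
job = {"data_dc": "bj", "priority": 1}
best = pick_queue(job, queues, [balance_policy, locality_policy], [0.7, 0.3])
print(best["name"])  # → q2 (less loaded, same data center)
```

Keeping each policy as an independent function makes the chain easy to extend with peak-usage, priority, or fragmentation policies without touching the selector itself.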
Intelligent Tuning
Two tuning flows are implemented:
Basic parameter tuning – a closed‑loop of task submission, result reporting, model training, and SLA protection automatically adjusts spark.executor.instances, spark.executor.cores, and spark.executor.memory to minimize resource waste.
History‑Based Optimization (HBO) – collects Spark task history, then during planning and submission adjusts join algorithms, data‑skew handling, aggregation strategies, shuffle partitions, and enables features like Kryo serialization for complex parameter scenarios.
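One step of the basic-tuning loop, right-sizing executor memory toward observed peak usage plus headroom, can be sketched as below. The headroom factor, floor, and reported peak value are illustrative assumptions; the production system derives its suggestions from trained models rather than a fixed rule.

```python
# Hedged sketch of one basic-tuning step: shrink executor memory toward
# the observed peak plus headroom. Headroom/floor values are illustrative.
def tune_executor_memory(requested_gb, observed_peak_gb, headroom=1.25, floor_gb=2):
    suggested = max(floor_gb, int(observed_peak_gb * headroom + 0.999))  # round up
    return min(suggested, requested_gb)    # never suggest more than requested

conf = {
    "spark.executor.instances": 50,
    "spark.executor.cores": 4,
    "spark.executor.memory": 16,   # GB, as requested by the task owner
}
peak_gb = 6.2                      # peak usage reported from task history
conf["spark.executor.memory"] = tune_executor_memory(16, peak_gb)
print(conf["spark.executor.memory"])  # → 8
```

The SLA-protection step mentioned above would then verify that runs with the reduced setting stay within their deadlines before the suggestion is locked in.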
Storage Cost Governance
Lifecycle Management
The platform defines a five‑layer lifecycle (access, service, storage, execution, user) and builds tools for account onboarding, quota enforcement, cold‑data handling, automated cleaning, compression, and monitoring.
Basic Governance
By parsing AFS quota data and directory metadata, the system provides trend analysis, cost calculation, anomaly detection, and actionable recommendations for storage paths, migration, and compression.
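Anomaly detection over parsed quota snapshots can be as simple as comparing the latest day against a trailing average. The window, growth factor, and data shape below are assumptions for illustration.

```python
# Illustrative sketch: flag directories whose storage grew abnormally fast,
# based on parsed daily quota snapshots. Window/factor values are assumed.
def growth_anomalies(daily_sizes, window=7, factor=2.0):
    """Return True if the latest day exceeds factor x the trailing average."""
    if len(daily_sizes) <= window:
        return False                      # not enough history to judge
    trailing = daily_sizes[-window - 1:-1]
    avg = sum(trailing) / len(trailing)
    return daily_sizes[-1] > factor * avg

# seven quiet days, then a sudden jump (sizes in TB)
history = [10, 10, 11, 10, 10, 11, 10, 30]
print(growth_anomalies(history))  # → True
```

A directory flagged this way would surface in the recommendations feed alongside its cost estimate and suggested action (cleanup, migration, or compression).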
Intelligent Compression
Two scenarios are addressed:
Data‑warehouse tables – automatic profiling, partition‑level ZSTD compression, page‑size tuning, and re‑writing to achieve high compression without impacting query performance.
Non‑warehouse AFS data – cold‑warm‑hot classification, selective compression parameter tuning for hot writes, and periodic offline compression for warm/cold data.
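The cold/warm/hot split for non-warehouse data can be driven by last-access age. The 7-day and 30-day cutoffs below are illustrative assumptions, not Baidu's actual policy.

```python
# Sketch of cold/warm/hot classification by last-access age; the 7/30-day
# cutoffs are illustrative, not the actual policy.
def classify(last_access_days):
    if last_access_days <= 7:
        return "hot"      # leave on the hot write path, tune lightly
    if last_access_days <= 30:
        return "warm"     # candidate for periodic offline compression
    return "cold"         # compress aggressively or migrate

paths = {"/logs/today": 1, "/logs/last_month": 20, "/logs/2022": 400}
print({p: classify(d) for p, d in paths.items()})
```

The classification then selects the compression strategy: hot data gets only write-side parameter tuning, while warm and cold data are re-compressed offline where a stronger codec is worth the CPU cost.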
Governance Results
Data development efficiency: Full‑lifecycle management reduced resource provisioning from weeks to days, halved queue waiting times, and doubled data delivery speed.
Compute cost reduction: Balanced usage across thousands of queues raised average EMR utilization by >30%, saving tens of millions of RMB annually.
Storage operation efficiency: Managed thousands of AFS accounts, cleaned unused data, and introduced monitoring, greatly improving operational control.
Storage cost reduction: Increased overall storage utilization by >20% across thousands of PB, saving tens of millions of RMB per year.
Governance assets: Established standards for resource delivery, task development, data quality, and security; created comprehensive cost and asset dashboards.
Future Planning
The team will continue to refine standards, automate more governance scenarios, and enhance intelligent optimization to achieve even more precise and automated cost control.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.