Optimization Practices for Offline Big Data Computing and Storage at Baidu MEG
Baidu MEG’s offline big-data platform cut costs and boosted efficiency by applying intelligent scheduling, storage-separation, tide-power workload profiling, a remote shuffle service, and dynamic quota resizing. Compute utilization rose from 55% to 80% and storage utilization from 63% to 78%, cutting annual expenses by roughly ¥70 million and reducing task duration by about 30%.
Background: The rapid growth of Baidu App's daily active users has driven increasing demand for offline computing and storage, leading to high costs; the goal is to minimize resource expenses while supporting business growth.
Challenges include management chaos (uncontrolled queues and tasks), low resource utilization (millions of cores and exabyte‑scale storage with low usage), and inefficiency (queue congestion, tasks unable to run).
Optimization approaches: Intelligent scheduling uses a Python-based client to package jobs, ranks them by multiple features (priority, wait time, and concurrency), places them with locality awareness, and applies filtering and degradation policies. Storage-separation technology relies on pre-created UGIs with temporary storage and compute rights to provide transparent pooling. Tide power uses night-time resources through workload profiling and a time-acceleration model that predicts whether a task can finish before the tide window ends. RSS (Remote Shuffle Service) stores shuffle data remotely so that reduce tasks can resume from a pre-empted state without re-reading map output. Quota resize allocates quota dynamically based on sliding-window usage, minute-level resource awareness, a buffer pool, and tiered shrinkage.
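To make the multi-feature sorting step concrete, here is a minimal sketch that filters jobs by a concurrency cap and then ranks them by priority, wait time, and concurrency. The Job fields, the WEIGHTS values, MAX_CONCURRENCY, and the linear scoring formula are all hypothetical; the article does not say how Baidu combines these features.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int        # higher means more important (assumed convention)
    wait_seconds: float  # time the job has already spent queued
    running_tasks: int   # current concurrency of the job's owner


# Hypothetical weights and cap; a real scheduler would tune these.
WEIGHTS = {"priority": 10.0, "wait": 0.01, "concurrency": 0.5}
MAX_CONCURRENCY = 100


def score(job: Job) -> float:
    """Linear score: reward priority and waiting, penalize high concurrency."""
    return (WEIGHTS["priority"] * job.priority
            + WEIGHTS["wait"] * job.wait_seconds
            - WEIGHTS["concurrency"] * job.running_tasks)


def schedule_order(jobs):
    """Filter out owners at or above the concurrency cap, then sort by score."""
    eligible = [j for j in jobs if j.running_tasks < MAX_CONCURRENCY]
    return sorted(eligible, key=score, reverse=True)


jobs = [
    Job("a", priority=1, wait_seconds=0, running_tasks=0),
    Job("b", priority=1, wait_seconds=3600, running_tasks=0),
    Job("c", priority=5, wait_seconds=0, running_tasks=200),
    Job("d", priority=2, wait_seconds=0, running_tasks=10),
]
order = [j.name for j in schedule_order(jobs)]
```

In this toy run, job "c" is filtered out despite its high priority because its owner already exceeds the cap, and job "b" moves ahead of higher-priority work because it has waited an hour. That is the kind of starvation avoidance a wait-time feature is usually meant to provide.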
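The tide-power admission decision can be sketched as a simple deadline check. This is an illustration only: the article does not describe the time-acceleration model itself, so the function name fits_tide_window, the baseline_minutes input, and the single speedup factor are assumptions standing in for whatever the profiling step actually produces.

```python
from datetime import datetime, timedelta


def fits_tide_window(now, window_end, baseline_minutes, speedup):
    """Predict whether a task finishes before the tide window closes.

    baseline_minutes: historical runtime from workload profiling (assumed input).
    speedup: assumed ratio of tide-time speed to baseline speed
             (greater than 1 means faster, less than 1 means slower).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    predicted = timedelta(minutes=baseline_minutes / speedup)
    return now + predicted <= window_end


now = datetime(2024, 1, 1, 23, 0)
window_end = datetime(2024, 1, 2, 6, 0)  # a 7-hour window, chosen for illustration

fast_enough = fits_tide_window(now, window_end, baseline_minutes=600, speedup=1.5)
too_slow = fits_tide_window(now, window_end, baseline_minutes=600, speedup=1.0)
```

A task rejected by this check would stay on regular resources rather than risk being cut off when daytime workloads reclaim the capacity.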
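The quota-resize idea (sliding-window usage, a buffer pool, and tiered shrinkage) can be sketched as follows. The QuotaResizer class, the tier thresholds, the 90% growth trigger, and the 10% growth step are all hypothetical values chosen for illustration, not figures from the article.

```python
from collections import deque

# Hypothetical tiers: (peak utilization below this percent, keep this percent of quota).
SHRINK_TIERS = ((30, 70), (60, 90))
GROW_ABOVE_PCT = 90  # hypothetical: grow when peak utilization exceeds 90%
GROW_STEP_PCT = 10   # hypothetical: grow by at most 10% of current quota


class QuotaResizer:
    """Toy sliding-window quota resizer for a single queue."""

    def __init__(self, quota, buffer_pool, window=3):
        self.quota = quota
        self.buffer_pool = buffer_pool       # spare cores shared across queues
        self.samples = deque(maxlen=window)  # most recent usage samples

    def observe(self, used):
        self.samples.append(used)

    def resize(self):
        if len(self.samples) < self.samples.maxlen:
            return self.quota                # not enough history yet
        peak = max(self.samples)
        new_quota = self.quota
        for below_pct, keep_pct in SHRINK_TIERS:
            if peak * 100 < below_pct * self.quota:
                new_quota = self.quota * keep_pct // 100
                break
        else:
            if peak * 100 > GROW_ABOVE_PCT * self.quota:
                grant = min(self.buffer_pool, self.quota * GROW_STEP_PCT // 100)
                new_quota = self.quota + grant
        # Released cores flow into the pool; granted cores flow out of it.
        self.buffer_pool += self.quota - new_quota
        self.quota = new_quota
        return self.quota


r = QuotaResizer(quota=1000, buffer_pool=0, window=3)
for used in (200, 250, 220):
    r.observe(used)
after_shrink = r.resize()  # peak is 25% of quota, so the deepest tier applies

for used in (690, 690, 690):
    r.observe(used)
after_grow = r.resize()    # peak is above 90% of the shrunken quota
```

Sizing against the window's peak rather than its mean is a conservative choice: a queue shrinks only if it stayed under the threshold for the whole window.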
Results: compute utilization rose from 55% to 80%, saving hundreds of thousands of cores and tens of millions of yuan annually; storage utilization increased from 63% to 78%, saving tens of millions of yuan; task duration dropped by about 30%; delivery cycles shortened from weekly or monthly to daily.
Overall impact: the optimizations cover over 80% of MEG’s offline resources and cut annual compute costs by roughly ¥40 million and storage costs by about ¥30 million while supporting growing business needs; continued innovation remains necessary.
Baidu Geek Talk