Baidu's Cloud-Native Cost Optimization: Overselling and Workload Co-location Practices
Baidu's cloud‑native cost‑optimization platform combines dynamic resource overselling and online‑offline workload co‑location, leveraging visual tracking, profiling, and eBPF‑based quality monitoring to boost CPU utilization from 20% to 40% and deliver up to 30% cost savings while preserving service performance.
According to Gartner, global enterprise cloud infrastructure spending reached approximately $333 billion by the end of 2022. McKinsey's research found that in 2020, for lack of cost-optimization measures, 80% of enterprises significantly exceeded their cloud budgets. Among enterprises that migrated directly to the cloud, 45% over-purchased resources by 55% and overspent by 70% in the first 18 months.
This article introduces Baidu's cloud-native cost optimization system, focusing on two key technologies: resource overselling and online-offline workload co-location (混部).
Cost Optimization Implementation Path:
1. Cost Insight: Visual resource tracking at cluster and node levels; application-level usage analysis with Pod-level allocation accounting; utilization statistics for CPU, Memory, GPU, and disk across multiple resource dimensions.
2. Cost Optimization: Resource optimization includes reserved instance conversion, elastic scaling, spot instances, and Serverless instances. Application optimization follows a three-step path: resource profiling for quota recommendation (20% cost savings), online overselling for capacity increase (30% savings), and online-offline co-location (CPU utilization improved from 20% to 40%, 30% cost reduction).
3. Cost Operations: Quality-based billing is divided into three modes: exclusive, shared, and preemptible.
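The Cost Insight step can be pictured with a small accounting sketch. It rolls Pod-level CPU requests and usage up to node level and reports allocation rate next to utilization. The record layout and the `node_accounting` helper are assumptions made for illustration, not part of Baidu's platform.

```python
from collections import defaultdict

def node_accounting(pods, capacity_millicores):
    """Aggregate Pod-level CPU requests and usage into per-node percentages."""
    requested = defaultdict(int)
    used = defaultdict(int)
    for pod in pods:
        requested[pod["node"]] += pod["cpu_request"]
        used[pod["node"]] += pod["cpu_used"]
    report = {}
    for node, capacity in capacity_millicores.items():
        report[node] = {
            # Share of the node that has been handed out to Pods.
            "allocation_pct": 100 * requested[node] / capacity,
            # Share of the node that is actually being consumed.
            "utilization_pct": 100 * used[node] / capacity,
        }
    return report

# Hypothetical sample data: a heavily allocated but lightly used node.
pods = [
    {"node": "node-a", "app": "web", "cpu_request": 4000, "cpu_used": 1000},
    {"node": "node-a", "app": "api", "cpu_request": 4000, "cpu_used": 1000},
]
report = node_accounting(pods, {"node-a": 10000})
```

In this sample, `node-a` is 80% allocated but only 20% utilized, which is the kind of gap that quota recommendation, overselling, and co-location aim to close.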
Core Technology 1: Online Overselling
Online resources typically have high allocation rates but low actual utilization. The solution is dynamic overselling: operators set a node-level overselling coefficient, and a webhook intercepts the resource figures reported by the kubelet so that the node advertises more allocatable capacity than it physically has. Key advantages include coefficients that adapt dynamically based on resource profiling, applicability to clusters of all sizes, and significant cost reduction. In practice, a large EKS cluster reached a 125% allocation rate with a 0.3% hotspot rate through dynamic-coefficient overselling, reducing compute node costs by 20%.
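A minimal sketch of how a profiling-driven coefficient might be derived and applied to the capacity a node reports. The sizing rule, the `target_peak_ratio` and `max_coeff` parameters, and all names are assumptions for illustration. The source does not describe the actual formula or the webhook code.

```python
from dataclasses import dataclass

@dataclass
class NodeProfile:
    capacity_millicores: int    # physical CPU capacity of the node
    requested_millicores: int   # sum of Pod CPU requests on the node
    peak_used_millicores: int   # profiled peak CPU usage (hypothetical input)

def oversell_coefficient(profile, target_peak_ratio=0.6, max_coeff=2.0):
    """Choose a coefficient so that projected peak usage stays near the target.

    Assumed rule: if new Pods use the same fraction of their requests as the
    existing ones, filling capacity * coeff keeps peak usage at roughly
    target_peak_ratio of the physical capacity.
    """
    if profile.requested_millicores == 0 or profile.peak_used_millicores == 0:
        return 1.0  # no profile data yet, so do not oversell
    peak_ratio = profile.peak_used_millicores / profile.requested_millicores
    coeff = target_peak_ratio / peak_ratio
    # Never go below 1.0 (no underselling) or above the operator's cap.
    return round(max(1.0, min(max_coeff, coeff)), 2)

def reported_allocatable(profile, coeff):
    """Capacity a webhook would substitute into the node's reported status."""
    return round(profile.capacity_millicores * coeff)

profile = NodeProfile(capacity_millicores=32000,
                      requested_millicores=30000,
                      peak_used_millicores=12000)
coeff = oversell_coefficient(profile)
allocatable = reported_allocatable(profile, coeff)
```

Here the Pods peak at 40% of what they request, so the sketch settles on a coefficient of 1.5 and the node would advertise 48000 millicores against 32000 physical ones.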
Core Technology 2: Online-Offline Co-location
Co-location runs online services and offline tasks on the same physical resources, relying on resource isolation and scheduling controls to keep them from interfering. The architecture uses Prometheus for metrics collection and resource profiling (SRP) for modeling, and it supports multiple resource types: stable used/request (Burstable/Guaranteed), normal used (BestEffort), and BE used. Quality monitoring employs eBPF to collect kernel-level metrics, including CPU CPI (cycles per instruction) and scheduling latency.
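Two calculations sit behind this description: how much capacity can be lent to offline (BE) tasks, and how CPI can signal that online services are suffering. The sketch below illustrates both under stated assumptions. The 80% safety ceiling, the 30% CPI tolerance, and the function names are hypothetical. In a real deployment the cycle and instruction counts would come from eBPF or hardware performance counters rather than literals.

```python
def offline_allocatable(capacity_millicores, online_used_millicores, safe_pct=80):
    """CPU that can be offered to offline tasks without crossing the ceiling."""
    ceiling = capacity_millicores * safe_pct // 100
    return max(0, ceiling - online_used_millicores)

def cpi(cycles, instructions):
    """Cycles per instruction: higher values mean each instruction is slower."""
    return cycles / instructions

def is_interfered(current_cpi, baseline_cpi, tolerance=0.3):
    """Flag an online service whose CPI rose well above its profiled baseline."""
    return current_cpi > baseline_cpi * (1 + tolerance)

# Hypothetical node: 32 cores, online services currently using 8 cores.
spare = offline_allocatable(32000, 8000)

# Hypothetical counter readings for one online container.
online_cpi = cpi(3_000_000, 2_000_000)
degraded = is_interfered(online_cpi, baseline_cpi=1.0)
```

A controller built on checks like these could throttle or evict offline tasks on a node when `degraded` becomes true, and shrink `spare` as online usage grows.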
Quality Assurance:
Once resource utilization has been raised, quality is maintained in three ways: node hotspot management (a node counts as a hotspot when CPU or memory utilization exceeds 80%), fine-grained scheduling driven by resource profiling to avoid creating hotspots, and hotspot governance that migrates workloads in order according to each application's migration level.
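Hotspot detection and ordered migration can be sketched as follows. The 80% threshold comes from the text. The convention that a lower `migration_level` is migrated first, the tie-break on CPU usage, and the data classes are assumptions for illustration. The plan only relieves CPU pressure, and memory is used for detection only.

```python
from dataclasses import dataclass, field

HOTSPOT_PCT = 80  # threshold stated in the text

@dataclass
class Pod:
    name: str
    migration_level: int        # hypothetical: lower level is migrated first
    cpu_used_millicores: int

@dataclass
class Node:
    name: str
    cpu_capacity_millicores: int
    mem_pct: int                # memory utilization in percent
    pods: list = field(default_factory=list)

    @property
    def cpu_used_millicores(self):
        return sum(p.cpu_used_millicores for p in self.pods)

def cpu_is_hot(used, capacity):
    # Integer comparison avoids floating-point edge cases at exactly 80%.
    return used * 100 > capacity * HOTSPOT_PCT

def is_hotspot(node):
    return (cpu_is_hot(node.cpu_used_millicores, node.cpu_capacity_millicores)
            or node.mem_pct > HOTSPOT_PCT)

def migration_plan(node):
    """Pick Pods to move, cheapest-to-migrate first, until CPU is no longer hot."""
    used = node.cpu_used_millicores
    plan = []
    ordered = sorted(node.pods,
                     key=lambda p: (p.migration_level, -p.cpu_used_millicores))
    for pod in ordered:
        if not cpu_is_hot(used, node.cpu_capacity_millicores):
            break
        plan.append(pod.name)
        used -= pod.cpu_used_millicores
    return plan

node = Node("node-a", 10000, 50, [
    Pod("core", 3, 5000),
    Pod("batch-sync", 1, 1000),
    Pod("report", 1, 500),
    Pod("cache", 2, 2500),
])
plan = migration_plan(node)
```

With this sample node at 90% CPU, moving `batch-sync` alone brings it back to the 80% line, so the plan stops there and leaves the higher-level applications in place.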
