vivo's Online-Offline Co-location Technology Practice: Data Center Resource Optimization
vivo's online-offline co-location platform consolidates latency-sensitive online services and batch offline workloads on shared Kubernetes nodes. Using differentiated resource views, priority-based QoS, and safety watermarks, it raised CPU utilization from 13% to roughly 25% and freed up an additional 20,000 cores and 50 TB of memory for offline tasks during peak hours.
This article introduces vivo's practice and exploration of online-offline co-location technology in data centers. With the rapid development of vivo's internet business, data center scale continues to expand, making cost optimization increasingly critical. Co-location technology can significantly improve data center resource utilization while ensuring service quality.
Background: Data center services are divided into online services (long-running, latency-sensitive like e-commerce and gaming) and offline tasks (short-running, fault-tolerant like data transformation and model training). Online services have low average resource utilization with peak-valley patterns, while offline tasks have high resource utilization. Before co-location, these were deployed separately without resource sharing, leading to waste.
Co-location Platform Practice: The platform requires two key capabilities: powerful scheduling and isolation, plus comprehensive monitoring and operations. The system implements differentiated resource views where online services see full machine resources while offline tasks see total resources minus online usage. A safety watermark is set to regulate available offline resources.
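The resource-view arithmetic above can be sketched in a few lines. This is an illustrative model only: the function name and the 65% safety watermark are assumptions for the sketch, not vivo's actual values.

```python
def offline_allocatable(node_capacity_cores: float,
                        online_usage_cores: float,
                        watermark: float = 0.65) -> float:
    """Cores the offline resource view may claim on one node.

    Online pods see the full machine capacity; offline pods only see what
    remains below the safety watermark after real-time online usage.
    """
    usable = node_capacity_cores * watermark      # never schedule offline work above the watermark
    return max(0.0, usable - online_usage_cores)  # subtract live online consumption, floor at zero

# Example: a 96-core node where online services currently use 30 cores
# leaves roughly 32.4 cores visible to the offline view.
remaining = offline_allocatable(96, 30)
```

When online usage alone exceeds the watermark, the offline view collapses to zero, which is what triggers suppression of already-running offline tasks.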
QoS Levels: Services are categorized into high, medium, and low priority levels. High-priority services support CPU binding for latency-sensitive online services. This hierarchy enables effective resource suppression and isolation, ensuring online service quality.
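One common way to realize such a three-tier hierarchy on Linux is to map each tier to cgroup-style settings: pinned cores for the high tier, shared cores with descending CPU weights below it. The tier names, core ranges, and share values here are assumptions for the sketch, not vivo's production configuration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QosPolicy:
    cpuset: Optional[str]  # pinned cores for latency-sensitive pods; None = shared pool
    cpu_shares: int        # relative CPU weight when cores are contended

# Hypothetical tier-to-policy table (values illustrative only).
POLICIES = {
    "high":   QosPolicy(cpuset="0-7", cpu_shares=4096),  # CPU binding + highest weight
    "medium": QosPolicy(cpuset=None,  cpu_shares=1024),
    "low":    QosPolicy(cpuset=None,  cpu_shares=2),     # offline: first to be suppressed
}

def policy_for(tier: str) -> QosPolicy:
    return POLICIES[tier]
```

The key design point is that low-priority offline tasks get a near-zero weight, so under contention the kernel scheduler starves them first, protecting online latency without any userspace intervention.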
Architecture: All co-location components run as non-intrusive plugins, leaving native Kubernetes unmodified. A single unified co-location scheduler handles both online and offline tasks, avoiding the resource-accounting conflicts that arise when multiple schedulers maintain separate ledgers for the same nodes. Each physical machine runs a co-location agent that collects real-time container resource data and suppresses offline tasks when usage approaches the safety watermarks. The nodes run Anolis OS for its strong resource-isolation capabilities.
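The agent's suppression decision is typically a watermark check with hysteresis, so quota is not thrashed back and forth around a single threshold. This is a minimal sketch; the function name and the 70%/55% thresholds are hypothetical.

```python
def suppression_action(node_usage_ratio: float,
                       high_watermark: float = 0.70,
                       low_watermark: float = 0.55) -> str:
    """Decide what the co-location agent does on one sampling tick."""
    if node_usage_ratio >= high_watermark:
        return "throttle_offline"   # shrink offline CPU quota to protect online QoS
    if node_usage_ratio <= low_watermark:
        return "restore_offline"    # pressure is gone: return quota gradually
    return "hold"                   # hysteresis band: change nothing this tick
```

Keeping a gap between the throttle and restore thresholds means a node hovering near the watermark does not oscillate between suppressing and restoring offline tasks every sample.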
Spark on K8s Practice: The team adopted Spark on K8s over YARN on K8s due to better compatibility with Spark 3.X (vivo's mainstream offline engine) and lower transformation costs. Implementation followed a three-phase strategy: getting tasks running smoothly, ensuring tasks run stably and accurately, and achieving intelligent task execution.
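For context, submitting Spark 3.x natively to Kubernetes looks like the sketch below, built as a command list. The `spark.kubernetes.*` keys are standard Spark-on-K8s configuration; the namespace, image tag, and executor count are placeholders, not vivo's settings.

```python
def build_spark_submit(app_jar: str, master: str) -> list:
    """Assemble a spark-submit invocation targeting a Kubernetes master."""
    return [
        "spark-submit",
        "--master", master,            # e.g. k8s://https://<apiserver>:6443
        "--deploy-mode", "cluster",    # driver itself runs as a pod in the cluster
        "--conf", "spark.kubernetes.namespace=offline",          # placeholder namespace
        "--conf", "spark.kubernetes.container.image=spark:3.3",  # placeholder image
        "--conf", "spark.executor.instances=4",                  # placeholder count
        app_jar,
    ]
```

Because executors become ordinary pods, the co-location scheduler and per-node agent can apply the same priority, watermark, and suppression machinery to them as to any other offline workload, which is a large part of why Spark on K8s carried lower transformation cost here than YARN on K8s.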
Results: After optimization, CPU utilization in one co-location cluster increased from 13% to approximately 25%. The platform now supports nearly 20,000 schedulable tasks with over 40,000 daily scheduling operations, providing an additional 20,000 cores and 50TB memory for offline tasks during peak hours.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.