
How 360’s AI Platform Boosted GPU Utilization with Volcano Scheduler

360’s AI platform migrated its GPU clusters to a cloud‑native architecture and adopted the Volcano scheduler, achieving over 45% GPU utilization, under 7% fragmentation, and more than 1,000,000 scheduled Pods, while leveraging flexible plugins, hierarchical queues, and resource pooling to optimize AI and big‑data workloads.

360 Zhihui Cloud Developer

Background

360 Group operates multiple self‑built GPU compute clusters with over ten thousand GPUs. After migrating from a YARN‑based resource management model to a cloud‑native architecture, the AI platform selected Volcano as the sole scheduler for GPU tasks. Continuous optimization has increased resource utilization, reduced fragmentation, and improved training performance, scheduling over 1,000,000 Pods with fragmentation below 7%, allocation above 85%, and utilization above 45%.

Practice

2.1 Action and Plugin Mechanism

Volcano’s core includes an Action and Plugin design pattern. During each scheduling cycle, registered Actions execute sequentially, defining key points and providing standardized function hooks for Plugins. Plugins implement these interfaces to embed various scheduling algorithms such as DRF, Gang, and Binpack.

The architecture is highly modular; most Actions and Plugins are configurable, allowing users to combine different algorithms for diverse scheduling scenarios and offering standardized extension points for custom logic.

Key plugins used in 360’s clusters include:

capacity: provides Capability, Deserved, and Guarantee strategies for multi‑tenant resource allocation and idle‑time sharing.

gang: ensures all‑or‑nothing scheduling of a job’s tasks, avoiding deadlocks and wasted resources from partially scheduled pods.

drf: implements fair sharing based on the DRF algorithm, balancing CPU/GPU usage across tenants.

priority & preempt: defines task priority and enables preemption to meet SLA for high‑value jobs.

task‑topology: topology‑aware scheduling that optimizes pod placement for NVLink‑accelerated training.

conformance: protects critical namespaces such as kube‑system and volcano‑system.

nodeorder: compatible with Kubernetes default scheduler, supporting node affinity, taints, tolerations, and image cache awareness.
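Actions and plugins are wired together in the scheduler’s configuration file. The sketch below shows one way the plugins listed above could be combined; the action order and tier layout are illustrative, not 360’s actual configuration:

```yaml
# volcano-scheduler.conf
# Actions run in this order on every scheduling cycle; plugins are
# grouped into tiers, with earlier tiers taking precedence.
actions: "enqueue, allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: capacity
  - name: nodeorder
  - name: binpack
  - name: task-topology
```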

360 has enhanced some plugins, for example extending the priority plugin and preempt action to allow queue‑specific preemption rules.

Custom plugins developed by 360’s AI platform include a network‑topology‑aware plugin that maps workloads onto the SuperPod SU architecture (200 A800 GPUs and 4 leaf switches per SU), reducing cross‑switch communication latency and improving training performance by 15‑20%.

Volcano 1.11 later added native network‑topology‑aware scheduling, automatically placing communication‑intensive pods on the same switch to cut AllReduce overhead.
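With the native feature, a job opts in through the `networkTopology` field of the Volcano Job spec. A minimal sketch, with illustrative job name, replica count, and image:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-training        # illustrative name
spec:
  schedulerName: volcano
  minAvailable: 4
  networkTopology:
    mode: hard               # placement must satisfy the tier constraint
    highestTierAllowed: 1    # e.g. keep all pods under one leaf switch
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        containers:
        - name: trainer
          image: pytorch/pytorch:latest   # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 8
        restartPolicy: Never
```

In `hard` mode, pods that cannot be placed within the allowed topology tier stay pending rather than spilling across switches.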

A high‑priority large‑task starvation‑avoidance plugin delays the scheduling of low‑priority tasks while large‑scale jobs are waiting for resources, preventing long‑term starvation of big jobs.

Additional native Job Controller plugins enhance batch jobs: env (injects common environment variables), ssh (sets up password‑less SSH between pods), svc (creates headless services for stable DNS), and pytorch (configures PyTorch master/worker settings).
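In a Volcano Job manifest, these plugins are enabled under `spec.plugins`. A sketch of a two‑role PyTorch job; the job name and images are illustrative, and the pytorch arguments follow the pattern used in Volcano’s examples:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-ddp          # illustrative name
spec:
  schedulerName: volcano
  minAvailable: 3
  plugins:
    env: []                  # inject common environment variables
    ssh: []                  # password-less SSH between pods
    svc: []                  # headless service for stable pod DNS
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  tasks:
  - name: master
    replicas: 1
    template:
      spec:
        containers:
        - name: master
          image: pytorch/pytorch:latest   # illustrative image
        restartPolicy: OnFailure
  - name: worker
    replicas: 2
    template:
      spec:
        containers:
        - name: worker
          image: pytorch/pytorch:latest
        restartPolicy: OnFailure
```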

2.2 Tenant Isolation Journey

Initially, all departments submitted jobs to a single Volcano queue, leading to resource contention and management challenges. As cluster count and usage grew, 360 introduced queue partitioning by department, enabling fine‑grained control and priority‑based preemption.

360 contributed to the design and code review of Volcano 1.11’s hierarchical queue feature, and used it to establish a three‑level hierarchy: ROOT (all cluster resources), resource‑group queues (collections of projects), and project‑level queues. With the capacity plugin, idle resources can be shared across projects, dramatically improving overall utilization.
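Concretely, hierarchical queues are declared through the Queue CRD’s `parent` field. A sketch of one branch of such a hierarchy; queue names and GPU quantities are illustrative:

```yaml
# Resource-group queue under the root queue.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: group-nlp            # illustrative resource group
spec:
  parent: root
  deserved:
    nvidia.com/gpu: 64
---
# Project-level queue nested under the resource group.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: project-chatbot      # illustrative project
spec:
  parent: group-nlp
  deserved:
    nvidia.com/gpu: 16
```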

2.3 Resource Pooling and Idle Sharing

360 AI platform distinguishes three resource types: project‑hosted nodes, self‑owned public nodes, and exclusive nodes. Early on, node‑affinity scheduling ensured strong isolation but caused waste when one project’s nodes were idle while another lacked resources.

To address this, 360 re‑engineered the system: the device plugin now reports specific GPU models, so quotas can be expressed in GPU cards rather than whole nodes, and the capacity plugin’s deserved and capability modes let idle exclusive nodes be borrowed by other projects and reclaimed when needed, boosting utilization without sacrificing isolation.
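Under the capacity plugin, the Guarantee, Deserved, and Capability strategies map onto three fields of a project’s Queue. A sketch with illustrative GPU quantities:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: project-a            # illustrative project queue
spec:
  guarantee:
    resource:
      nvidia.com/gpu: 8      # always reserved for this project
  deserved:
    nvidia.com/gpu: 32       # fair share; lent out when idle, reclaimable on demand
  capability:
    nvidia.com/gpu: 64       # hard ceiling, including borrowed cards
```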

Future Plans

360’s AI platform will continue investing in scheduler enhancements to improve execution efficiency and hardware utilization and to reduce cost. In big‑data scenarios, Volcano will integrate with Spark Operator to migrate Spark workloads from YARN to Kubernetes, unifying scheduling for general compute, AI training, large‑model tasks, and data analytics, while maintaining active collaboration with the open‑source community.

Reference Links

Volcano Official Site (Chinese)

Volcano GitHub Repository

Network Topology Plugin Repository

Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
